1. Introduction
Video data now constitutes the single largest category of global internet traffic, with recent industry analyses estimating that video accounts for approximately 82% of all consumer IP traffic and that global internet traffic volume has grown at a compound annual rate exceeding 20% over the past five years. The proliferation of streaming platforms, user-generated content on social media, real-time videoconferencing, cloud gaming, and emerging immersive media formats, including 360-degree video and volumetric content, has placed extraordinary pressure on compression pipelines to deliver ever-higher visual quality at lower bitrates. This demand is further compounded by the steady migration toward higher spatial resolutions, from 1080p Full HD to 4K Ultra HD and now 8K, alongside the adoption of high dynamic range (HDR) and wide colour gamut (WCG) content, all of which dramatically increase the raw data volume that must be transmitted or stored. In this landscape, the efficiency of the video codec is no longer merely a technical concern; it is an economic and infrastructural imperative that directly affects network provisioning costs, energy consumption in data centres, and the quality of experience delivered to billions of end users worldwide.
The evolution of standardised video coding has followed a remarkably consistent trajectory of incremental, block-based refinement. H.264/AVC, introduced in 2003, established the dominant hybrid coding paradigm of block-based motion estimation and compensation, discrete cosine transform (DCT) residual coding, and context-adaptive entropy coding, and it remains one of the most widely deployed codecs to this day. Its successor, H.265/HEVC, finalised in 2013, introduced flexible quadtree block partitioning, advanced intra prediction modes, and sample-adaptive offset filtering, achieving approximately 50% bitrate reduction over H.264 at equivalent perceptual quality. Most recently, H.266/Versatile Video Coding (VVC), completed in July 2020 by the Joint Video Experts Team (JVET), has extended this trajectory further through quadtree plus multi-type tree (QTMT) partitioning, affine motion models, geometric partitioning, cross-component prediction, and more sophisticated in-loop filtering, delivering a further 40–50% bitrate reduction over HEVC [1,2]. However, each successive generation of standards has come with substantially increased encoding complexity; VTM, the VVC reference software, requires encoding times up to an order of magnitude greater than those of HM, the HEVC reference encoder. More fundamentally, the hand-engineered, modular architecture of these codecs imposes inherent limitations: each component, such as the motion estimator, the transform, the quantiser, and the entropy coder, is designed and optimised in relative isolation, which prevents joint global optimisation toward a unified rate-distortion objective and limits the system’s ability to adapt its coding strategy to local content characteristics [3,4].
These fundamental limitations of the traditional hybrid coding paradigm have motivated the emergence of neural video compression (NVC) as a compelling alternative. Deep neural networks offer three structural advantages that are difficult to replicate within hand-engineered frameworks. First, end-to-end differentiable training enables the joint optimisation of all coding components, from the analysis and synthesis transforms through motion modelling to entropy coding, toward a single rate-distortion (or rate-distortion-perception) loss function, ensuring that every module cooperates to maximise overall coding efficiency [5,6]. Second, the powerful nonlinear representation capacity of deep networks allows them to learn content-adaptive transforms and probability models that go far beyond the fixed basis functions and parametric models of traditional codecs. Third, the flexibility of end-to-end learning enables rapid adaptation to new content domains, quality metrics, or deployment constraints, for instance, by changing the distortion term from the mean squared error (MSE) to a perceptual metric, such as multi-scale structural similarity (MS-SSIM) or a learned perceptual loss, without redesigning the entire pipeline.
The foundations of neural compression were laid in the domain of still-image coding. Ballé et al. [7] proposed the first end-to-end optimised image compression framework based on nonlinear transform coding, employing convolutional neural networks for the analysis and synthesis transforms, together with a factorised entropy model, and demonstrated rate-distortion performance surpassing JPEG and JPEG 2000. The subsequent introduction of hyperprior-based entropy models [8] was a critical advance, as it enabled the encoder to transmit a compact summary of the latent statistics as side information, thereby capturing spatial dependencies and significantly improving probability estimation for arithmetic coding. Cheng et al. [9] further refined this line by proposing discretised Gaussian mixture likelihoods (GMM) and attention modules, achieving the first learned image codec to match the rate-distortion performance of VVC intra coding under the PSNR metric, a result that has since been corroborated and extended by the JPEG AI standardisation effort [10]. These results in still-image compression have conclusively demonstrated that learned approaches can reach and even surpass the compression efficiency of the most advanced hand-designed intra coding tools, providing a solid foundation upon which neural video compression is built.
The extension from images to video introduces the central challenge of temporal redundancy exploitation. The pioneering work of Lu et al. [11] proposed DVC, the first end-to-end deep video compression framework, which adopted the classical predictive coding structure of optical flow estimation followed by residual coding, but replaced each component with learnable networks. This established the dominant NVC paradigm, but it also exposed its limitations: explicit motion estimation and pixel-domain residual subtraction are fundamentally suboptimal because a simple linear difference operation cannot fully capture the complex, nonlinear inter-frame redundancy present in natural video. Subsequent work addressed various facets of this problem. Recent work on multimodal feature fusion for video analysis by Sheng et al. [12] has demonstrated that combining spatial, frequency, and optical flow features provides complementary information for understanding video content, a principle conceptually aligned with our dual-path entropy model’s combination of channel-wise and spatio-temporal prediction paths. Agustsson et al. [13] introduced scale-space flow to handle disocclusions and fast motion more gracefully by augmenting optical flow with a scale parameter. Hu et al. [14] proposed FVC, which relocated all major operations, motion estimation, compression, and compensation, into a learned feature space, demonstrating significant gains from feature-domain processing. Habibian et al. [15] explored a fundamentally different direction with 3D autoregressive autoencoders over discrete latent spaces, bypassing explicit motion modelling entirely and showing that generative models can effectively capture spatio-temporal redundancy. These diverse architectural explorations underscored a growing consensus that the most promising path forward lies not in refining individual modules of the predictive coding pipeline but in rethinking the inter-frame coding paradigm itself.
A pivotal shift came with the introduction of conditional coding, which replaces the traditional residual signal with feature-domain contextual conditioning. Li et al. [16] proposed DCVC, the first deep contextual video compression framework, arguing that feature-domain context carries richer information than pixel-domain residuals for both the encoder and the decoder, and demonstrating substantially improved coding efficiency. This line of work advanced rapidly: DCVC-DC [17] introduced hierarchical quality structures and group-based offset diversity; DCVC-FM [18] addressed the practical need for variable-rate operation through feature modulation with learnable quantisation scalers, enabling a single model to span an 11.4 dB PSNR range; and, most recently, DCMVC [19] proposed context modulation through flow-oriented temporal context and context compensation, achieving state-of-the-art results with an average 22.7% BD-rate reduction over VVC. In parallel, Sheng et al. [20] advanced temporal context mining by propagating both reconstructed frames and pre-reconstruction features, and Qi et al. [21] demonstrated that bidirectional information flow between motion coding and frame coding yields a further 12.9% bitrate saving. Jiang et al. [22] recently proposed ECVC, which exploits non-local correlations across multiple reference frames, together with a partial cascaded fine-tuning strategy to mitigate error accumulation, achieving 10–11% additional bitrate savings over DCVC-FM.
Alongside these advances in temporal modelling, the accuracy of the entropy model has emerged as arguably the single most important determinant of compression efficiency. The entropy model governs the probability estimates used in arithmetic coding; any improvement in probability prediction translates directly and measurably into bitrate savings. Li et al. [23] introduced hybrid spatial–temporal entropy modelling that captures both intra-frame spatial and inter-frame temporal correlations, together with content-adaptive quantisation for dynamic bit allocation, achieving a landmark 18.2% BD-rate saving over VVC on the UVG dataset. Qian et al. [24] proposed Entroformer, a transformer-based entropy model for learned image compression with top-K self-attention and diamond relative position encoding, demonstrating that transformers can overcome the limited receptive field of convolutional entropy models and capture long-range spatial dependencies more effectively. The transformer architecture, originally developed for natural language processing [25,26] and subsequently adapted to computer vision through models such as the Vision Transformer (ViT) [27] and Swin Transformer [28], is particularly well suited to video compression because its self-attention mechanism can model dependencies across arbitrary spatial and temporal distances, making it a natural fit for exploiting the long-range correlations inherent in video sequences [29,30].
In the specific context of video compression, Mentzer et al. [31] proposed the Video Compression Transformer (VCT), a landmark work that demonstrated that a transformer operating directly on frame representations can achieve competitive compression performance without any explicit motion prediction or warping, thereby vastly simplifying the neural video codec architecture. However, VCT’s reliance on patch-based processing introduces two significant architectural flaws, as analysed by Kopte and Kaup [32]: non-uniform receptive fields caused by patch boundaries, and computationally redundant overlapping windows required for temporal autoregressive modelling. Kopte and Kaup addressed these issues with 3D Sliding Window Attention (SWA), a patchless local attention mechanism that provides a uniform receptive field and reduces overall decoder complexity by a factor of 2.8 while achieving up to 18.6% BD-rate savings over VCT. Zhu et al. [33] further demonstrated that Swin Transformer-based nonlinear transforms can achieve better compression efficiency than convolutional transforms with fewer parameters and faster decoding. Yang et al. [34] provided a unifying perspective by viewing neural video codecs through the lens of deep generative modelling, proposing improved temporal autoregressive transforms and structured entropy models with temporal dependencies. Despite these encouraging advances, fundamental challenges remain. Existing methods still struggle to balance the conflicting requirements of exploiting long-range temporal dependencies for accurate probability estimation while maintaining computationally tractable models, managing error propagation across long prediction chains, particularly under intra-period constraints, and adapting reference frame selection to the widely varying temporal dynamics of natural video content [35,36,37].
The analysis above reveals a precise knowledge deficit: no existing neural video codec simultaneously addresses (i) content-adaptive temporal reference selection, (ii) uniform-receptive-field attention with learned spatio-temporal modulation, and (iii) multi-path probability estimation with mixture likelihoods. This leads to the following research question: Can a unified transformer-based framework that integrates adaptive reference selection, Enhanced Sliding Window Attention with learned bias and gating, and dual-path Gaussian mixture entropy modelling achieve state-of-the-art rate-distortion performance across diverse video content, colourspaces, and intra-period configurations while maintaining computational tractability? We hypothesise that these three components address complementary sources of coding inefficiency and will yield near-additive BD-rate improvements when combined.
To address these open challenges, we introduce STAC (Spatio-Temporal Adaptive Context), a transformer-based neural video compression framework that makes three principal contributions to the field. The first contribution is the Adaptive Context Selector (ACS), a learned module that dynamically evaluates and selects the most informative reference frames from a buffer of previously coded latents based on content-dependent relevance scores. Unlike existing approaches that rigidly condition on a fixed set of immediately preceding frames, ACS computes a relevance score for each candidate reference through a lightweight neural network with sigmoid activation, then selects the top-K references that maximise mutual information with the current frame. This content-adaptive selection is particularly beneficial for sequences with complex motion patterns, scene transitions, or periodic content where temporally distant frames may provide superior predictive information. The second contribution is the Enhanced Sliding Window Attention (ESWA) mechanism, which forms the core of the STAC entropy model. ESWA extends standard Sliding Window Attention with two novel components: a learnable local bias matrix that captures fine-grained relative position preferences within the spatio-temporal neighbourhood, and a learned gating mechanism with temporal decay that adaptively modulates attention weights based on spatio-temporal distance. Because each latent element attends only to a fixed-size local window, the attention cost grows linearly with the number of latent elements, making the design tractable for high-resolution video while preserving the ability to model both local texture correlations and medium-range temporal dependencies within a unified attention framework. Critically, ESWA eliminates the non-uniform receptive fields and redundant overlapping computations inherent in the patch-based attention of prior methods, such as VCT.
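The reference selection step of ACS described above can be illustrated with a minimal sketch. It assumes that the lightweight scoring network has already produced one raw logit per candidate in the latent buffer; the function name `select_references`, the example logits, and the buffer size are hypothetical and are not taken from the STAC implementation.

```python
import math


def select_references(logits, k):
    """Return the buffer indices of the k highest-scoring candidates and all scores.

    `logits` stands in for the output of the scoring network (an assumption);
    the sigmoid maps each logit to a relevance score in (0, 1).
    """
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Rank candidates by relevance, highest first. sorted() is stable,
    # so ties keep the lower buffer index first.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Return the selected indices in buffer (temporal) order.
    return sorted(ranked[:k]), scores


# Hypothetical logits for a buffer of five previously coded latents
# (index 0 = oldest, index 4 = immediately preceding frame).
buffer_logits = [2.1, -0.7, 0.3, -1.5, 1.4]
selected, scores = select_references(buffer_logits, k=2)
print(selected)
```

In this toy buffer the oldest latent receives the highest score, which mirrors the case of periodic content where a temporally distant frame is a better predictor than the immediately preceding one.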
The third contribution is a dual-path entropy model that combines a channel-wise autoregressive prediction path with a spatio-temporal prediction path through an adaptively learned fusion gate, whose gating weights are conditioned on both the path-specific features and the estimated bit cost of each path. This architecture enables the entropy model to capture complementary statistical dependencies: the autoregressive path models sequential channel correlations, while the spatio-temporal path captures cross-channel and cross-frame correlations that the autoregressive factorisation would otherwise miss. The fused probability estimates are used to parameterise a multi-component Gaussian mixture model for neural arithmetic encoding, yielding tighter probability estimates and consequently shorter codeword lengths.
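To make the link between the mixture parameters and the codeword length concrete, the following sketch evaluates a discretised Gaussian mixture likelihood for an integer-quantised latent and converts it to an ideal code length. It is a generic illustration of this class of entropy model, not the STAC implementation; the three-component mixture and all parameter values are assumptions chosen for the example.

```python
import math


def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gmm_symbol_probability(y, weights, means, scales):
    """Probability mass assigned to integer symbol y, i.e. to the bin [y - 0.5, y + 0.5]."""
    p = 0.0
    for w, mu, sigma in zip(weights, means, scales):
        upper = std_normal_cdf((y + 0.5 - mu) / sigma)
        lower = std_normal_cdf((y - 0.5 - mu) / sigma)
        p += w * (upper - lower)
    return p


def ideal_bits(y, weights, means, scales, floor=1e-9):
    """Ideal arithmetic-coding cost in bits; the floor avoids log(0)."""
    return -math.log2(max(gmm_symbol_probability(y, weights, means, scales), floor))


# Hypothetical mixture parameters for one latent element (weights sum to 1).
weights = [0.6, 0.3, 0.1]
means = [0.0, 2.0, -3.0]
scales = [0.5, 1.0, 2.0]

bits_likely = ideal_bits(0, weights, means, scales)    # symbol near the dominant mode
bits_unlikely = ideal_bits(6, weights, means, scales)  # symbol far in the tail
total_mass = sum(gmm_symbol_probability(y, weights, means, scales) for y in range(-40, 41))
print(round(bits_likely, 2), round(bits_unlikely, 2))
```

A symbol close to a high-weight mode costs about one bit here, while a tail symbol costs more than ten, which is why sharper predicted distributions translate directly into bitrate savings.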
Architecturally, STAC employs an encoder–decoder framework with a multi-scale feature transform comprising four strided convolutional layers with Generalised Divisive Normalisation (GDN) activations, producing hierarchical latent representations at 1/2×, 1/4×, 1/8×, and 1/16× spatial scales. The finest-scale latent is quantised and processed by the ACS to select top-K reference frames from the decoded latent buffer. The selected temporal contexts, together with the current frame’s latent and a hyperprior side information stream, are fed into the STAC entropy model, a stack of 20 transformer blocks with ESWA, which predicts the Gaussian mixture distribution parameters (mixture weights, means, and scales) for arithmetic coding, as well as Latent Residual Prediction (LRP) refinement offsets that adaptively correct the quantised latents without additional bitrate. The decoder mirrors this process: it applies the identical probability model to recover the latents from the bitstream via range-ANS decoding, applies LRP refinement, and generates the reconstructed frame through transposed convolutional layers. Parallel checkerboard decomposition and four-group channel parallelisation yield an effective 8× decoding speedup over sequential processing.
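The checkerboard decomposition mentioned above can be sketched as a two-pass spatial split of the latent grid. This is a generic illustration of checkerboard context modelling under the assumption of a simple parity rule; the function names are hypothetical, and the sketch does not reproduce the four-group channel parallelisation or the speedup figure reported for STAC.

```python
def checkerboard_passes(height, width):
    """Split a latent grid into anchors (pass 1) and non-anchors (pass 2) by parity."""
    anchors = [(r, c) for r in range(height) for c in range(width) if (r + c) % 2 == 0]
    non_anchors = [(r, c) for r in range(height) for c in range(width) if (r + c) % 2 == 1]
    return anchors, non_anchors


def decoded_neighbours(pos, decoded):
    """4-connected neighbours of pos that are already available as context."""
    r, c = pos
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [p for p in candidates if p in decoded]


# Example grid: 15 rows by 26 columns, the 1/16-scale latent of a 416 x 240 frame.
anchors, non_anchors = checkerboard_passes(15, 26)
anchor_set = set(anchors)

# Pass 1: all anchors can be decoded in parallel, since no anchor is a
# 4-neighbour of another anchor.
# Pass 2: all non-anchors can be decoded in parallel, each conditioned on
# the anchor neighbours decoded in pass 1.
print(len(anchors), len(non_anchors))
```

Under this split, every position in the second pass sees only first-pass positions among its immediate neighbours, so each pass is free of intra-pass spatial dependencies.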
STAC addresses three specific gaps relative to existing methods: (i) the prior methods (DVC, DCVC-DC, DCVC-FM, DCMVC) use fixed reference selection strategies, whereas STAC is, to our knowledge, the first to propose learned content-dependent reference selection via ACS; (ii) VCT’s patch-based attention creates non-uniform receptive fields, whereas STAC’s ESWA provides uniform receptive fields with learned local bias and temporal gating; and (iii) the prior methods use single-path entropy estimation, whereas STAC is, to our knowledge, the first to propose dual-path fusion with adaptive bit-cost-conditioned gating.
We evaluate STAC comprehensively across six standard benchmark datasets: UVG [38], MCL-JCV [39], and HEVC Classes B, C, D, and E, under both YUV 4:2:0 and RGB colourspaces, and with two intra-period configurations (IP = −1 and IP = 32). STAC achieves an average BD-rate saving of 32.20% over VTM under YUV IP = −1, outperforming the prior state-of-the-art DCMVC by 2.70 percentage points. Under the more challenging IP = 32 configuration, STAC achieves −27.01%, with only 5.19 pp degradation, compared to 6.42 pp for DCMVC, confirming the robustness of ESWA’s adaptive context windowing to temporal discontinuities introduced by periodic I-frame insertion. Performance generalises to the RGB colourspace (−31.23%, IP = −1) and scales consistently from low-resolution (HEVC D, 240p: −35.19%) to ultra-high-resolution content (UVG, 4K: −36.35%).
The remainder of this paper is organised as follows. Section 2 provides a detailed review of related work spanning neural image and video compression, conditional coding, transformer-based approaches, implicit neural representations, and perceptual quality methods. Section 3 presents the proposed STAC framework, including the system architecture, the Adaptive Context Selector, the ESWA mechanism, the dual-path entropy model, the neural arithmetic encoding pipeline, and the training methodology. Section 4 details the experimental setup, benchmark datasets, evaluation metrics, and comprehensive comparisons with state-of-the-art methods. Section 5 provides a critical discussion of the experimental results, analysing the contributions of individual architectural components and the technical mechanisms underlying the observed performance gains. Finally, Section 6 concludes this paper and identifies directions for future research.
4. Experimental Results
This section presents a comprehensive experimental evaluation of the proposed STAC (Spatio-Temporal Adaptive Context) neural video compression framework. The evaluation is designed to address three fundamental questions: (i) Does STAC achieve superior rate-distortion performance compared to the current state-of-the-art neural and traditional video codecs? (ii) How robust is STAC’s performance across diverse content characteristics, spatial resolutions, colourspaces, and intra-period configurations? (iii) Which architectural components contribute most significantly to the observed coding gains? We first describe the benchmark datasets and their characteristics, then detail the test conditions and evaluation metrics, and finally present a thorough comparative analysis against five state-of-the-art methods across 24 experimental configurations (six datasets × 2 colourspaces × 2 intra-period settings). The breadth of this evaluation matrix is deliberately chosen to ensure that the reported performance gains are not artifacts of favourable dataset selection or specific operating conditions but instead reflect genuine, systematic improvements in the underlying compression model. To our knowledge, this is among the most comprehensive evaluation frameworks reported in the recent neural video compression literature, covering the full range of content types from low-resolution videoconferencing to ultra-high-definition cinematic content.
We evaluate our proposed method on six widely used benchmark datasets that collectively cover a broad spectrum of video content characteristics, spatial resolutions, and temporal complexities:
UVG Dataset [38]: The Ultra Video Group dataset consists of 16 versatile 4K (3840 × 2160) test sequences captured at 50 or 120 frames per second. These sequences span diverse content categories, including natural outdoor scenes, fast-paced action, and detailed textures, making this dataset particularly valuable for evaluating compression performance on high-resolution content with varying spatial and temporal complexity. The 4K resolution places significant demands on the entropy model’s ability to capture long-range spatial correlations, and the high frame rates test the temporal modelling capacity of the codec.
MCL-JCV Dataset [39]: The MCL-JCV dataset is a just-noticeable-difference (JND)-based video quality assessment dataset containing 30 source videos covering diverse content categories. The sequences encompass a wide range of spatial and temporal information indices, from relatively static scenes with fine textures to high-motion sequences with complex camera movements. This diversity makes MCL-JCV an excellent benchmark for assessing the generalisability of compression methods across heterogeneous content.
HEVC Common Test Conditions: We adopt the standard HEVC test sequences comprising four classes that represent distinct application scenarios, as defined by the Joint Collaborative Team on Video Coding (JCT-VC). These sequences are the most widely used benchmarks in the video compression community, enabling direct comparison with the extensive body of prior work that reports results on these datasets.
Class B (1920 × 1080) contains five high-definition sequences (BQTerrace, BasketballDrive, Cactus, Kimono, ParkScene) with complex motion ranging from fast-paced sports action to detailed outdoor scenes with camera panning and zooming, representative of broadcast television and premium streaming applications. The diversity of motion types within this class, from the rapid and unpredictable ball trajectories in BasketballDrive to the slow global panning in Cactus, tests the codec’s ability to handle heterogeneous temporal dynamics within a single content category.
Class C (832 × 480) provides four medium-resolution sequences (BasketballDrill, BQMall, PartyScene, RaceHorsesC) with moderate spatial and temporal complexity, typical of standard-definition television and lower-bandwidth mobile streaming.
Class D (416 × 240) contains four low-resolution sequences (BasketballPass, BlowingBubbles, BQSquare, RaceHorses) that test the codec’s ability to operate efficiently when spatial information is severely limited and temporal modelling becomes the dominant factor for compression; at this resolution, the latent representation at 1/16× scale is only 26 × 15 elements, placing extreme demands on the entropy model’s ability to estimate probabilities from very sparse spatial context.
Class E (1280 × 720) comprises three videoconferencing sequences (FourPeople, Johnny, KristenAndSara) characterised by relatively static backgrounds, localised facial motion, and head-and-shoulder compositions. This class represents a practically important use case for real-time communication applications and has a distinctive temporal correlation structure: the background is nearly identical across many frames, while the foreground exhibits subtle, semantically meaningful changes.
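The latent grid sizes implied by the 1/16× scale can be checked with a short calculation. The sketch below assumes that frames whose dimensions are not multiples of 16 are padded up to the next multiple before encoding; this padding rule and the function name `latent_grid` are assumptions for illustration, as the padding strategy is not specified here.

```python
def latent_grid(width, height, factor=16):
    """Latent (width, height) at 1/factor scale, rounding up to model padding."""
    # -(-a // b) is integer ceiling division.
    return (-(-width // factor), -(-height // factor))


hevc_classes = {
    "B": (1920, 1080),
    "C": (832, 480),
    "D": (416, 240),
    "E": (1280, 720),
}

for name, (w, h) in hevc_classes.items():
    lw, lh = latent_grid(w, h)
    print(f"Class {name}: {w}x{h} -> latent {lw}x{lh} ({lw * lh} positions)")
```

Class D yields the 26 × 15 grid quoted above, only 390 spatial positions per latent channel, whereas Class B has roughly twenty times as many.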
4.1. Test Conditions and Metrics
We evaluate all methods under comprehensive test configurations to ensure fair and reproducible comparison. Experiments are conducted in both YUV 4:2:0 and RGB colourspaces, reflecting the two dominant representations used in video compression research. The YUV 4:2:0 colourspace is the standard representation for broadcast and streaming applications, exploiting the human visual system’s lower sensitivity to chrominance detail through 2× chroma subsampling. The RGB colourspace, while less efficient for compression due to higher inter-channel correlation, is increasingly relevant for applications that require direct pixel-domain processing, such as computer vision pipelines, graphics rendering, and high-fidelity archival.
Two intra-period settings are employed. The IP = −1 configuration uses a single I-frame at the beginning of the sequence, followed exclusively by P-frames, representing the most favourable scenario for temporal compression, as the model can build and maintain temporal context across the entire sequence without periodic resets. The IP = 32 configuration inserts an I-frame every 32 frames, creating periodic temporal discontinuities that disrupt context propagation and test the codec’s ability to recover quickly after each intra refresh. This configuration is more representative of practical deployment scenarios, where random access, channel switching, and error resilience requirements mandate periodic intra-frame insertion. The performance gap between these two settings provides a direct measure of the codec’s resilience to temporal context interruption and its ability to adapt its coding strategy to the available temporal context.
All sequences use 96 frames, providing a sufficiently long temporal span to evaluate steady-state compression behaviour while remaining tractable for comprehensive benchmarking. The choice of 96 frames ensures that each sequence contains at least three full GOPs under the IP = 32 setting, allowing the evaluation to capture both the transient quality drop immediately after each I-frame and the steady-state quality achieved as the temporal context buffer fills. VTM-17.0, the reference software for H.266/VVC, serves as the anchor codec for all BD-rate calculations, representing the most advanced standardised video codec available. VTM is configured in low-delay P mode with default encoding parameters to match the coding conditions of the neural methods, which operate in a strictly causal, forward-prediction mode without B-frames. We evaluate four rate points per method, spanning the range from low-bitrate operation (where aggressive compression is required, and the quality gap between methods is most pronounced) to high-bitrate operation (where all methods converge toward near-lossless quality). The four rate points are selected to cover a practically relevant range of approximately 12 dB in PSNR, from visually degraded quality around 30 dB to near-transparent quality above 40 dB, ensuring that the BD-rate calculations integrate over a representative portion of the rate-distortion curve, rather than being dominated by a single operating point.
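The two intra-period settings can be made concrete by listing the frame types they produce over a 96-frame sequence. The sketch below is a minimal illustration; the function name `frame_types` is hypothetical, and the convention that IP = −1 means "intra only at frame 0" follows the description above.

```python
def frame_types(num_frames, intra_period):
    """Return a list of 'I'/'P' labels; intra_period = -1 means only frame 0 is intra."""
    types = []
    for i in range(num_frames):
        if i == 0 or (intra_period > 0 and i % intra_period == 0):
            types.append("I")
        else:
            types.append("P")
    return types


ip_all = frame_types(96, -1)
ip_32 = frame_types(96, 32)

i_positions = [i for i, t in enumerate(ip_32) if t == "I"]
print(ip_all.count("I"), i_positions)
```

With IP = 32 the I-frames fall at frames 0, 32, and 64, so a 96-frame sequence contains three complete 32-frame GOPs.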
Performance is assessed using three complementary metrics. PSNR (Peak Signal-to-Noise Ratio) [57,58] provides a straightforward measure of pixel-level reconstruction fidelity in decibels and remains the most widely reported metric in the video compression literature, enabling direct comparison with prior work. MS-SSIM (Multi-Scale Structural Similarity Index) [59] evaluates perceptual quality by comparing luminance, contrast, and structural information across multiple spatial scales, providing a closer approximation to subjective visual quality than the PSNR. BD-rate (Bjøntegaard Delta-rate) [60] computes the average bitrate difference between two codecs at equivalent quality levels by integrating the area between their rate-distortion curves, providing a single scalar summary of relative compression efficiency; negative BD-rate values indicate bitrate savings over the anchor codec.
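As an illustration of how a BD-rate value is obtained from four rate points per codec, the sketch below follows the classical Bjøntegaard procedure: fit a cubic polynomial to log-bitrate as a function of PSNR for each codec, average each fit over the overlapping PSNR range, and convert the difference to a percentage. With exactly four points the cubic fit reduces to interpolation, and Simpson's rule integrates a cubic exactly. The rate points are made up for the example, and evaluation tools often use other interpolation schemes (for instance piecewise cubic), so this is not necessarily the exact script used for the reported numbers.

```python
import math


def lagrange_eval(xs, ys, x):
    """Evaluate the polynomial interpolating (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def simpson(f, a, b):
    """Simpson's rule on [a, b]; exact for polynomials up to degree three."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f(0.5 * (a + b)) + f(b))


def mean_log_rate(points, lo, hi):
    """Average log10(bitrate) over the PSNR interval [lo, hi]."""
    psnrs = [q for _, q in points]
    log_rates = [math.log10(r) for r, _ in points]
    return simpson(lambda q: lagrange_eval(psnrs, log_rates, q), lo, hi) / (hi - lo)


def bd_rate(anchor, test):
    """BD-rate in percent; anchor and test are lists of four (bitrate, psnr_db) pairs."""
    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    if hi <= lo:
        raise ValueError("PSNR ranges of the two codecs do not overlap")
    diff = mean_log_rate(test, lo, hi) - mean_log_rate(anchor, lo, hi)
    return (10.0 ** diff - 1.0) * 100.0


# Hypothetical rate points (bits per pixel, PSNR in dB); not measured data.
anchor_points = [(0.02, 32.0), (0.04, 35.0), (0.08, 38.0), (0.16, 41.0)]
# A codec needing 20% fewer bits at every quality level.
test_points = [(0.8 * r, q) for r, q in anchor_points]

print(round(bd_rate(anchor_points, test_points), 2))
```

The synthetic test codec uses 20% fewer bits than the anchor at every quality level, so the computed BD-rate is −20%, matching the sign convention stated above.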
4.2. Comparison to State of the Art
We compare our proposed STAC method against five representative state-of-the-art neural video compression approaches that span the major architectural paradigms in the field, from early predictive coding through modern conditional coding to transformer-based methods:
DVC [11]: The pioneering end-to-end deep video compression framework that established the learning-based predictive coding paradigm. DVC employs optical flow estimation for motion modelling and separate autoencoder networks for motion vector and residual compression. It serves as a baseline representing the first generation of neural video codecs.
VCT [31]: The Video Compression Transformer, which uses a transformer architecture to model temporal dependencies among frame representations without any explicit motion estimation or warping operations. VCT represents the pure transformer-based approach to video compression and serves as a direct comparison point for our ESWA mechanism.
DCVC-DC [17]: Deep Contextual Video Compression with Diverse Contexts, which introduces hierarchical quality patterns and group-based offset diversity for richer temporal context mining within the conditional coding paradigm.
DCVC-FM [18]: DCVC with feature modulation, which addresses the practical requirements of variable bitrate operation and long prediction chain resilience through learnable quantisation scalers and periodically refreshing temporal features.
DCMVC [19]: DCVC with context modulation, the most recent and best-performing prior method, which introduces flow-oriented context generation and context compensation to more effectively leverage reference information. DCMVC represents the current state of the art against which STAC’s improvements are most directly measured.
All compared methods are evaluated using their publicly released models and codebases, where available, or using results reported in their original publications. For methods where multiple model variants exist (e.g., MSE-optimised and MS-SSIM-optimised), we use the MSE-optimised variant for PSNR-based BD-rate calculations to ensure consistency across the comparison. We emphasise that all methods are evaluated under identical test conditions: the same input sequences, the same number of frames (96), the same colourspaces, and the same intra-period settings. The VTM anchor is encoded using the same quantisation parameter (QP) set for all evaluations, ensuring that the BD-rate curves are computed over comparable quality ranges. This rigorous adherence to common test conditions eliminates confounding variables and ensures that performance differences reflect genuine architectural advantages, rather than differences in evaluation methodology, test data, or operating point selection.
4.2.1. Rate-Distortion Performance
Figure 5 presents the rate-distortion curves for three representative datasets (UVG, MCL-JCV, and HEVC Class B) using both PSNR and MS-SSIM metrics. These six plots collectively illustrate the compression performance envelope of each method across the full bitrate range, from aggressive low-bitrate operation to near-lossless high-bitrate coding. Several important observations emerge from these curves. First, STAC consistently achieves the best rate-distortion performance across all six plots, with its curve lying above and to the left of all competing methods, indicating that it delivers higher quality at equivalent bitrates or, equivalently, achieves the same quality at lower bitrates. The gap between STAC and the next-best method, DCMVC, is visually apparent even at the scale of these plots, which indicates that the improvement is substantial rather than marginal. Second, the performance advantage of STAC is particularly pronounced in the low-to-medium bitrate region, which is the most practically relevant operating range for streaming and broadcasting applications where bandwidth is at a premium. At low bitrates, the entropy model’s accuracy becomes the dominant factor in compression performance because every bit saved through better probability estimation translates directly into quality improvement; the fact that STAC excels in this regime supports the effectiveness of our dual-path entropy model with Gaussian mixture estimation in providing tighter probability estimates. At these operating points, the dual-path fusion gate allocates more weight to whichever path, channel-wise autoregressive or spatio-temporal, provides the tighter probability estimate for each latent element, maximising information-theoretic efficiency when bits are scarce.
Third, the consistent improvement across both PSNR and MS-SSIM metrics indicates that STAC’s gains are not limited to pixel-fidelity optimisation but extend to perceptual quality, suggesting that our enhanced temporal modelling preserves structural and textural information that is important for human visual perception. This dual-metric consistency is important because methods that optimise aggressively for one metric sometimes sacrifice performance on the other; the fact that STAC improves on both simultaneously indicates that its coding gains arise from genuinely better probability modelling, rather than from metric-specific artifacts.
VMAF scores computed on the UVG and HEVC Class B datasets confirm that STAC achieves the highest VMAF scores at equivalent bitrates, with an average improvement of 3.2 VMAF points over DCMVC at low bitrates (below 0.05 bpp). This perceptual advantage is consistent with the architecture: the ESWA mechanism’s learnable local bias matrices preserve fine texture and edge information during entropy modelling, and the dual-path fusion gate preferentially selects the spatio-temporal path for textured regions where temporal references provide strong structural predictions. These architectural properties help preserve the structural information that perceptual metrics (MS-SSIM, VMAF) are designed to measure.
The rate-distortion curves also reveal the relative positioning of the competing methods. DVC, as the earliest neural video codec, falls substantially below VTM across all datasets, confirming that first-generation predictive coding approaches have been superseded by more advanced architectures. VCT shows mixed performance, demonstrating competitive results on some content types but struggling on others, which reflects its reliance on fixed-window patch-based attention that cannot adapt to diverse content characteristics. The DCVC family of methods (DCVC-DC, DCVC-FM, DCMVC) shows progressive improvement, with each generation achieving better rate-distortion performance through more sophisticated temporal context exploitation. STAC extends this trajectory further, with the gap between STAC and DCMVC being consistent across datasets, indicating a systematic architectural advantage, rather than content-specific tuning.
It is also instructive to observe the behaviour of the methods at the extreme ends of the bitrate range. At high bitrates, all modern methods (DCVC-DC and above) begin to converge toward VTM’s performance, because abundant bits allow even less efficient entropy models to represent the signal with high fidelity. The practical significance of a neural codec therefore lies primarily in the low-to-medium bitrate regime, which is precisely where STAC demonstrates its largest margins. This is a direct consequence of the dual-path entropy model’s ability to produce tighter probability bounds: when the rate budget is severely constrained, every fraction of a bit saved per symbol through better probability estimation accumulates into measurable quality improvements across millions of latent elements. Furthermore, the consistency of the improvement across both PSNR-based and MS-SSIM-based evaluation confirms that STAC does not sacrifice perceptual structure for pixel-level fidelity. The ESWA mechanism’s learnable local bias preserves fine texture and edge information during entropy modelling, ensuring that the compressed latents retain the structural details that MS-SSIM is specifically designed to measure.
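The link between probability estimation and rate discussed above can be illustrated numerically. An ideal entropy coder spends about −log2 p bits on a symbol to which the model assigns probability p, so a tighter estimate directly lowers the rate. The sketch below assumes integer-quantised latents with a discretised Gaussian likelihood and models the fusion gate as a convex combination of the two paths’ probabilities. This gate form, the function names, and all parameter values are assumptions made for illustration, not the exact STAC formulation.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def pmf(y, mu, sigma):
    """Probability mass of the integer-quantised latent y (bin width 1)."""
    return gaussian_cdf(y + 0.5, mu, sigma) - gaussian_cdf(y - 0.5, mu, sigma)

def bits(p):
    """Ideal code length in bits for a symbol of probability p."""
    return -math.log2(p)

def fused_bits(y, path_a, path_b, gate):
    """Code length when two path estimates are mixed with weight `gate` (assumed form)."""
    p = gate * pmf(y, *path_a) + (1.0 - gate) * pmf(y, *path_b)
    return bits(p)

y = 2                      # hypothetical quantised latent value
tight_path = (2.1, 0.6)    # (mean, std): accurate, confident prediction
loose_path = (0.0, 3.0)    # (mean, std): vague prediction

bits_tight = bits(pmf(y, *tight_path))
bits_loose = bits(pmf(y, *loose_path))
bits_fused = fused_bits(y, tight_path, loose_path, gate=0.9)
print(f"tight: {bits_tight:.2f} b, loose: {bits_loose:.2f} b, fused: {bits_fused:.2f} b")
```

In this toy setting, weighting the gate towards the path with the tighter estimate brings the fused code length close to that of the better path, and the per-symbol saving is multiplied by the number of latent elements in a sequence.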
4.2.2. BD-Rate Analysis Under YUV Colourspace
Table 3 presents the BD-rate comparison under the YUV colourspace with the IP = −1 configuration, which represents the most favourable scenario for temporal compression as the entire sequence is coded as a single prediction chain without periodic intra resets. Our STAC method achieves an average BD-rate saving of 32.20% over VTM, outperforming all compared methods. This result is notable because VTM itself represents the culmination of decades of hand-engineered video coding optimisation, and achieving a 32% bitrate reduction over this anchor codec demonstrates the substantial potential of learned compression approaches when equipped with effective spatio-temporal modelling. To contextualise this achievement, VVC was designed to deliver approximately 40–50% bitrate savings over its predecessor HEVC; under these test conditions, STAC’s 32% saving over VTM is therefore comparable in magnitude to a large part of a full generational step in standardised video coding.
Examining the per-method comparison, STAC outperforms the nearest competitor, DCMVC, by 2.70 percentage points on average (−32.20% vs. −29.50%). While this may appear modest in absolute terms, it represents a meaningful advance given that the field is in a regime of diminishing returns, where each additional percentage point of BD-rate improvement requires increasingly sophisticated modelling. To put this margin in perspective, under this configuration DCVC-FM trails DCVC-DC by 0.95 pp (−18.43% vs. −19.38%), DCMVC improves on DCVC-FM by 11.07 pp, and STAC improves on DCMVC by a further 2.70 pp; the latter represents a consistent forward step in a field where the rate of progress between successive state-of-the-art methods has fluctuated. The improvement is consistent across all six datasets, ranging from 1.93 pp on UVG (where both methods already achieve large gains) to 3.48 pp on HEVC Class E (where the unique temporal characteristics of videoconferencing content benefit particularly from our adaptive context selection). The largest absolute BD-rate savings are observed on UVG (−36.35%), HEVC Class B (−35.23%), and HEVC Class D (−35.19%); UVG and Class D span opposite ends of the resolution spectrum (4K and 240p, respectively). This indicates that STAC’s multi-scale feature transform and adaptive context selection are effective across a wide range of spatial resolutions, with the temporal modelling compensating for reduced spatial information at low resolutions while fully exploiting the rich spatial detail available at high resolutions.
The dramatic performance gap between STAC (−32.20%) and VCT (+16.70%) merits particular attention. Both methods use transformer-based entropy modelling, but VCT relies on patch-based processing with fixed temporal windows, while STAC employs the Adaptive Context Selector and Enhanced Sliding Window Attention. The 48.90 pp gap between these two transformer-based methods underscores the critical importance of (i) content-adaptive reference frame selection and (ii) patchless attention with learned local biases and temporal gating. These are precisely the architectural innovations that distinguish STAC from prior transformer-based approaches. Furthermore, DVC’s poor performance (+67.93%) illustrates the fundamental inadequacy of simple optical-flow-plus-residual coding in the neural setting; the 100.13 pp gap between DVC and STAC represents the total accumulated progress of the field from the first end-to-end neural video codec to the current state of the art, a gap that has been bridged through the progressive adoption of conditional coding, transformer architectures, and adaptive mechanisms.
Table 4 presents the results under the more challenging IP = 32 configuration. The periodic insertion of I-frames every 32 frames creates temporal discontinuities that reset the temporal prediction chain and disrupt the accumulated temporal context. Each I-frame is coded independently using the image entropy model without temporal conditioning, consuming substantially more bits than a P-frame, and the frames immediately following each I-frame must rebuild temporal context from scratch, with only a single high-quality reference available. This configuration therefore tests two distinct capabilities: (i) the codec’s ability to rapidly build effective temporal models from a cold start after each I-frame, achieving strong performance within the first few P-frames of each new GOP, and (ii) its ability to maintain compression efficiency when the available temporal context is inherently limited to at most 31 frames, rather than the full sequence length. The IP = 32 setting is substantially more representative of practical deployment scenarios than IP = −1, because random access requirements, channel switching latency targets, and error resilience constraints in broadcasting, streaming, and real-time communication all mandate periodic intra-frame insertion at intervals ranging from 0.5 to 2 s. Despite these challenges, STAC maintains robust performance with an average BD-rate saving of 27.01% over VTM, once again outperforming all compared methods by a comfortable margin.
The degradation from IP = −1 to IP = 32 provides a particularly informative measure of each method’s resilience to temporal context disruption. For STAC, this degradation is 5.19 percentage points (from −32.20% to −27.01%), which is smaller than that of DCMVC, its closest competitor; DCVC-DC and DCVC-FM degrade less in absolute terms, but from substantially weaker baselines:
DCMVC: 6.42 pp degradation (−29.50% to −23.08%), indicating that DCMVC’s context modulation mechanism is more sensitive to temporal discontinuities.
DCVC-DC: 2.42 pp degradation (−19.38% to −16.96%), a smaller absolute degradation but from a substantially lower baseline performance.
DCVC-FM: 0.37 pp degradation (−18.43% to −18.06%), the smallest degradation, which reflects DCVC-FM’s explicitly designed periodic refresh mechanism, but again from a lower baseline.
DVC: 111.70 pp degradation (+67.93% to +179.63%), a catastrophic degradation that illustrates the vulnerability of early predictive codecs to temporal context loss.
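The degradation figures listed above follow directly from the average BD-rates quoted for the two intra-period settings. The short script below reproduces that arithmetic from the values stated in the text; the dictionary names are ours and carry no meaning beyond this illustration.

```python
# Average BD-rate (%) vs. VTM under YUV, as quoted in the text for Tables 3 and 4.
bd_rate_ip_inf = {"STAC": -32.20, "DCMVC": -29.50, "DCVC-DC": -19.38,
                  "DCVC-FM": -18.43, "DVC": 67.93}
bd_rate_ip_32 = {"STAC": -27.01, "DCMVC": -23.08, "DCVC-DC": -16.96,
                 "DCVC-FM": -18.06, "DVC": 179.63}

# Degradation in percentage points when moving from IP = -1 to IP = 32.
degradation = {m: round(bd_rate_ip_32[m] - bd_rate_ip_inf[m], 2)
               for m in bd_rate_ip_inf}

# List methods from smallest to largest degradation, with their IP = 32 result.
for method, pp in sorted(degradation.items(), key=lambda kv: kv[1]):
    print(f"{method:8s} {pp:7.2f} pp  (IP = 32: {bd_rate_ip_32[method]:+.2f}%)")
```

Reading the degradation alongside the IP = 32 result makes the trade-off explicit: the methods that degrade least in absolute terms also retain smaller savings over VTM than STAC does.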
This resilience to intra-period settings is a direct consequence of two architectural features of STAC. First, the Adaptive Context Selector dynamically adjusts its reference frame selection based on the available temporal context: immediately after an I-frame, ACS recognises that fewer high-quality references are available and concentrates its attention on the most recently decoded frames, whereas, deeper into a prediction chain, it can select from a richer buffer of temporally diverse references. This dynamic behaviour contrasts sharply with methods that use a fixed reference set regardless of position within the GOP; such methods either waste capacity on unavailable references or fail to exploit the growing context buffer as more frames are decoded. Second, the Enhanced Sliding Window Attention mechanism adaptively modulates its temporal gating based on the quality and relevance of the available context, effectively reducing reliance on unavailable or stale references, rather than propagating errors from an interrupted prediction chain. The learned temporal decay in ESWA’s gating matrix allows each attention head to independently control how much weight it assigns to temporally distant context, and when context is interrupted by an I-frame, the decay mechanism naturally suppresses long-range dependencies and focuses on the most reliable local information. The combination of these mechanisms enables STAC to degrade gracefully under IP = 32 constraints while maintaining a significant performance advantage over all competing methods.
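To make the temporal gating behaviour described above concrete, the following sketch applies a per-head decay to attention logits before normalisation and masks references that are unavailable after an I-frame. The additive decay term, the function name `gated_temporal_weights`, and the numeric values are assumptions chosen for illustration; they are not the exact ESWA gating formulation.

```python
import math

def gated_temporal_weights(scores, distances, decay, available):
    """Softmax over reference frames with a per-head temporal decay (assumed form).

    scores:    raw attention logits, one per candidate reference
    distances: temporal distance (in frames) of each reference
    decay:     non-negative decay rate for this attention head
    available: False for references cut off by an I-frame boundary
    """
    if not any(available):
        raise ValueError("at least one reference must be available")
    # Penalise distant references; unavailable ones get zero weight.
    logits = [s - decay * d if ok else -math.inf
              for s, d, ok in zip(scores, distances, available)]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 1.0, 1.0]   # equally relevant content in all three references
distances = [1, 2, 3]      # frames back from the current frame

# Deep into a prediction chain: every reference is available.
deep_in_chain = gated_temporal_weights(scores, distances, 0.5, [True, True, True])
# Right after an I-frame: only the nearest reference exists.
after_intra = gated_temporal_weights(scores, distances, 0.5, [True, False, False])
print(deep_in_chain, after_intra)
```

With equal content scores, the decay shifts weight towards the nearest reference, and after an I-frame all weight falls on the single remaining reference instead of being spread over missing context.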
The practical significance of this resilience cannot be overstated. In real-world deployment, intra-period settings are dictated not by the codec’s preferences but by application requirements: live broadcasting typically uses IP = 32 or shorter for channel switching latency, adaptive streaming mandates periodic random access points for seek functionality, and error-prone transmission channels require frequent intra refreshes for error containment. A codec whose performance collapses under IP = 32 offers limited practical value regardless of its IP = −1 numbers. STAC’s ability to maintain a 27.01% BD-rate advantage over VTM even under IP = 32, outperforming all competing neural codecs, therefore represents not merely an academic improvement but a genuine step toward practical deployment of neural video compression.
4.2.3. BD-Rate Analysis Under RGB Colourspace
Table 5 and Table 6 present the BD-rate results under the RGB colourspace, providing an important complementary evaluation to the YUV results. The RGB colourspace evaluation is significant for several reasons. First, it tests whether the entropy model’s learned probability distributions generalise beyond the decorrelated YUV representation that is inherently more compressible. In the YCbCr transform, the luminance channel concentrates most of the signal energy while the chrominance channels contain relatively sparse, low-variance data; this decorrelation simplifies the entropy model’s task. In contrast, the RGB channels exhibit strong inter-channel correlation (the red, green, and blue components of natural images are highly correlated because they all reflect the same underlying scene illumination), and the entropy model must learn to exploit these cross-channel dependencies without the benefit of an explicit decorrelating transform. The fact that STAC maintains strong performance under RGB evaluation therefore provides evidence that the STAC transformer learns to model cross-channel dependencies directly through its attention mechanism. Second, an increasing number of practical applications, including computer vision pipelines, graphics-intensive gaming, and high-fidelity content creation, operate natively in the RGB domain, making RGB compression performance directly relevant to deployment. Third, emerging standards, such as JPEG AI, are being developed with native RGB support, further increasing the importance of RGB-domain evaluation for learned codecs.
As expected, the neural codecs exhibit slightly reduced BD-rate performance in RGB compared with the YUV colourspace. This reduction is attributable to the luminance–chrominance decorrelation inherent in the YCbCr colour transform, which reduces inter-channel redundancy and facilitates efficient chroma subsampling. However, STAC consistently maintains its performance advantage across both colourspaces. For the recent methods the reduction is modest and nearly uniform, indicating that the YUV-to-RGB gap is largely a property of the representation itself, rather than a weakness of any particular codec architecture.
Under IP = −1 (Table 5), STAC achieves 31.23% average BD-rate savings over VTM, outperforming DCMVC by 2.61 percentage points (−31.23% vs. −28.62%). Examining the per-dataset results, STAC achieves −35.26% on UVG, −34.17% on HEVC Class B, and −34.13% on HEVC Class D, all of which represent substantial improvements in the RGB domain and closely track the corresponding YUV results. Under IP = 32 (Table 6), STAC achieves 26.20% BD-rate savings, maintaining a 3.81 pp advantage over DCMVC (−26.20% vs. −22.39%). Notably, STAC’s advantage over DCMVC is larger under IP = 32 in RGB (3.81 pp) than under IP = −1 in RGB (2.61 pp), further confirming the robustness of our adaptive context mechanisms under the most challenging operating conditions. This widening gap under IP = 32 suggests that STAC’s ACS and ESWA mechanisms provide proportionally greater benefit when both the colourspace and the intra-period configuration create more difficult coding conditions; in other words, the harder the problem, the more STAC’s adaptive architecture distinguishes itself from methods with fixed coding strategies.
The performance gap between YUV and RGB colourspaces is remarkably small for STAC: 0.97 pp under IP = −1 (−32.20% vs. −31.23%) and 0.81 pp under IP = 32 (−27.01% vs. −26.20%). A similarly narrow gap is observed for the most recent methods, indicating that it largely reflects a property of the colourspace representation, rather than a method-specific limitation. For comparison, DCMVC shows gaps of 0.88 pp and 0.69 pp, respectively, whereas DVC exhibits a gap of 3.40 pp under IP = −1 and 8.98 pp under IP = 32. The fact that more recent and sophisticated methods exhibit smaller colourspace gaps suggests that advanced entropy models increasingly learn to compensate for the lack of explicit decorrelation. Importantly, STAC’s relative improvement over competing methods remains stable across both colourspaces, demonstrating that our entropy model learns general-purpose probability estimation that is not dependent on colourspace-specific decorrelation properties. This colourspace-agnostic performance suggests that the STAC transformer’s multi-head attention mechanism captures statistical dependencies in the latent space that transcend the specific signal representation, a desirable property for deployment in heterogeneous coding environments where the input format may vary between applications.
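The colourspace gaps and the margins over DCMVC quoted in this subsection can be recomputed from the four pairs of average BD-rates stated in the text. The snippet below is a small arithmetic check; the variable names are illustrative.

```python
# Average BD-rate (%) vs. VTM, as quoted in the text for Tables 3 to 6.
reported = {
    ("YUV", "IP=-1"): {"STAC": -32.20, "DCMVC": -29.50},
    ("YUV", "IP=32"): {"STAC": -27.01, "DCMVC": -23.08},
    ("RGB", "IP=-1"): {"STAC": -31.23, "DCMVC": -28.62},
    ("RGB", "IP=32"): {"STAC": -26.20, "DCMVC": -22.39},
}

# YUV-to-RGB gap (pp) per method and intra-period setting.
colourspace_gap = {
    (method, ip): round(reported[("RGB", ip)][method] - reported[("YUV", ip)][method], 2)
    for method in ("STAC", "DCMVC")
    for ip in ("IP=-1", "IP=32")
}

# STAC's average margin over DCMVC (pp) in each of the four configurations.
margin_over_dcmvc = {cfg: round(v["DCMVC"] - v["STAC"], 2)
                     for cfg, v in reported.items()}

print(colourspace_gap)
print(margin_over_dcmvc)
```

The four average margins span 2.61 pp (RGB, IP = −1) to 3.93 pp (YUV, IP = 32), and in both colourspaces the margin is larger under IP = 32 than under IP = −1.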
4.2.4. Analysis of Individual Dataset Performance
A granular examination of the per-dataset results under YUV IP = −1 (Table 3) reveals instructive patterns that illuminate how STAC’s architectural components interact with different content characteristics:
High-Resolution Content (UVG, 3840 × 2160): STAC achieves its largest BD-rate saving of −36.35% on the UVG dataset, outperforming DCMVC by 1.93 pp. The 4K sequences in UVG contain rich spatial detail, complex camera motion (including panning, zooming, and tracking shots), and diverse scene content ranging from natural landscapes to urban environments. At this resolution, the STAC entropy model’s 20-block transformer with ESWA attention has access to a large number of latent tokens, enabling it to exploit fine-grained spatio-temporal correlations that shallower or less expressive models miss. Furthermore, the multi-scale feature transform with four hierarchical levels at 1/16× resolution provides a highly compact yet informative latent representation for 4K content, and the Adaptive Context Selector benefits from the rich motion diversity in these sequences by dynamically selecting references that best predict the current frame’s content.
High-Definition Content (HEVC Class B, 1920 × 1080): On HEVC Class B sequences, STAC achieves −35.23%, closely tracking the UVG performance and outperforming DCMVC by 2.14 pp. The Class B sequences (BQTerrace, BasketballDrive, Cactus, Kimono, ParkScene) feature high-motion content with complex textures, making them a demanding test for temporal prediction. BasketballDrive, in particular, contains rapid and unpredictable object motion that challenges motion-based approaches, while ParkScene features fine repetitive textures (grass, foliage) that benefit from accurate spatial entropy modelling. The strong performance of STAC on this dataset confirms that the ESWA mechanism’s learned local biases effectively capture the diverse motion patterns present in broadcast-quality high-definition content. The near-identical performance between UVG (4K) and HEVC Class B (1080p) is particularly noteworthy: it suggests that STAC’s multi-scale feature transform produces latent representations at 1/16× resolution that are similarly informative regardless of the original spatial resolution, indicating effective hierarchical feature learning that abstracts away resolution-specific characteristics.
Medium-Resolution Content (MCL-JCV and HEVC Class C): On both MCL-JCV and HEVC Class C, STAC achieves −29.20% BD-rate savings. The MCL-JCV dataset’s 30 diverse sequences, which were originally selected to span a wide range of just-noticeable-difference thresholds, provide a comprehensive test of codec generalisability across content types with varying perceptual characteristics. The consistent performance across this heterogeneous collection validates the robustness of STAC’s content-adaptive mechanisms, demonstrating that the Adaptive Context Selector and dual-path entropy model can handle content diversity without requiring manual tuning or content-specific model selection. The matching performance on HEVC Class C (832 × 480), despite its substantially lower resolution, confirms that STAC’s adaptive approach scales well across the resolution spectrum, with the temporal modelling effectively compensating for reduced spatial information. The improvement over DCMVC on these datasets is 3.26 pp, which is notably larger than the gap on high-resolution content (1.93 pp on UVG and 2.14 pp on HEVC Class B). This suggests that STAC’s adaptive context selection provides proportionally greater benefit when the spatial resolution limits the discriminative power of spatial features alone, because the ACS can compensate by selecting temporally richer references that provide the missing spatial context through temporal correlation.
Low-Resolution Content (HEVC Class D, 416 × 240): Despite the severely limited spatial resolution, STAC achieves an excellent −35.19% BD-rate saving, closely rivalling its performance on 4K content. This counterintuitive result can be understood through the interplay between spatial and temporal compression. At low spatial resolutions, each latent element corresponds to a larger spatial region of the original frame, and the temporal correlations between these coarse-grained representations are more predictable. The latent representation at 1/16× resolution is only 26 × 15 elements per frame, which means the entire frame context fits within a modest attention window, and the ESWA mechanism can effectively attend to the full spatial extent of each latent frame without the locality constraints that become necessary at higher resolutions. STAC’s temporal modelling via ACS and ESWA captures these strong temporal dependencies effectively, while the dual-path entropy model provides accurate probability estimates even when the spatial context is limited. The channel-wise autoregressive path is particularly valuable at low resolutions because the limited spatial extent means that spatial neighbours provide less diverse conditioning information, making channel-wise dependencies proportionally more important for accurate probability estimation. The 2.15 pp improvement over DCMVC (−35.19% vs. −33.04%) on this dataset demonstrates that adaptive reference selection offers consistent benefits regardless of spatial resolution, and that STAC’s multi-scale architecture does not suffer from resolution mismatch at the extremes of its designed operating range.
Videoconferencing Content (HEVC Class E, 1280 × 720): The HEVC Class E sequences (FourPeople, Johnny, KristenAndSara) present a distinctive coding challenge: the content is characterised by relatively static backgrounds with highly localised motion confined to the head-and-shoulder region, and the temporal dynamics are repetitive (subtle head movements, facial expressions, lip synchronisation). Traditional codecs handle this content well through skip modes and large block sizes for the static background, but neural codecs must learn these efficiency strategies implicitly from data. STAC achieves −28.02% BD-rate savings, which, while the lowest among the six datasets in absolute terms, still represents a substantial 3.48 pp improvement over DCMVC (−28.02% vs. −24.54%). Notably, this is the largest per dataset improvement margin across all six benchmarks, suggesting that STAC’s Adaptive Context Selector is particularly effective for videoconferencing content. The reason for this disproportionate advantage lies in the temporal structure of videoconferencing: head poses, expressions, and gestures are quasi-periodic, meaning that a frame from several seconds earlier with a similar head pose may provide better predictive context than the immediately preceding frame with a different expression. ACS can identify these semantically similar but temporally distant references through its learned relevance scoring mechanism, a capability that fixed reference selection strategies fundamentally lack. Furthermore, the gating mechanism in ESWA learns to suppress attention to the largely static background regions and concentrate temporal modelling capacity on the active foreground, where the entropy reduction from accurate temporal prediction is greatest. The combination of these content-adaptive mechanisms explains why STAC achieves its largest margin over DCMVC precisely on the content type where adaptivity matters most.
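As a consistency check, the per-dataset figures discussed above can be averaged and compared with the reported means. The snippet assumes that the −29.20% saving and the 3.26 pp margin apply individually to both MCL-JCV and HEVC Class C, as the text indicates; the dictionary names are illustrative.

```python
# Per-dataset BD-rate (%) of STAC vs. VTM under YUV, IP = -1, as quoted above.
stac_bd_rate = {"UVG": -36.35, "MCL-JCV": -29.20, "HEVC B": -35.23,
                "HEVC C": -29.20, "HEVC D": -35.19, "HEVC E": -28.02}
# Per-dataset margin of STAC over DCMVC in percentage points, as quoted above.
margin_over_dcmvc = {"UVG": 1.93, "MCL-JCV": 3.26, "HEVC B": 2.14,
                     "HEVC C": 3.26, "HEVC D": 2.15, "HEVC E": 3.48}

# Simple (unweighted) means over the six datasets.
mean_bd_rate = sum(stac_bd_rate.values()) / len(stac_bd_rate)
mean_margin = sum(margin_over_dcmvc.values()) / len(margin_over_dcmvc)
largest_margin = max(margin_over_dcmvc, key=margin_over_dcmvc.get)

print(f"mean BD-rate: {mean_bd_rate:.2f}%")
print(f"mean margin over DCMVC: {mean_margin:.2f} pp")
print(f"largest margin on: {largest_margin}")
```

The unweighted means agree with the reported averages of −32.20% and 2.70 pp to two decimal places, and the largest margin falls on HEVC Class E, consistent with the discussion of videoconferencing content.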
4.2.5. Summary of Experimental Findings
Across all 24 experimental configurations (six datasets × 2 colourspaces × 2 intra-period settings), STAC achieves the best BD-rate performance in every single configuration without exception. The average BD-rate savings over VTM range from −26.20% (RGB, IP = 32, the most challenging configuration) to −32.20% (YUV, IP = −1, the most favourable configuration). The consistent outperformance of DCMVC across all configurations, with average margins ranging from 2.61 pp (RGB, IP = −1) to 3.93 pp (YUV, IP = 32), demonstrates that STAC’s improvements are systematic and architecture-driven, rather than the result of content-specific optimisation or favourable evaluation conditions.
Several cross-cutting observations reinforce the strength of these results. First, the small and consistent YUV-to-RGB performance gap (under 1 pp for STAC in both IP settings) demonstrates that our transformer-based entropy model captures general statistical dependencies that are not tied to a particular colourspace representation. Second, the graceful degradation under IP = 32 (5.19 pp for STAC versus 6.42 pp for DCMVC) validates the design of both the Adaptive Context Selector and the temporal gating in ESWA, confirming that these components provide meaningful resilience to temporal context disruption, rather than merely exploiting favourable long-chain prediction conditions. Third, the robust performance across resolutions from 240p (HEVC Class D: −35.19%) to 4K (UVG: −36.35%) demonstrates that STAC’s multi-scale feature transform and adaptive context mechanisms scale effectively across the entire resolution spectrum, with neither extremely low nor extremely high resolutions presenting a disproportionate challenge.
From a practical deployment perspective, these results indicate that STAC can deliver substantial bandwidth savings over VTM across the full range of conditions encountered in real-world video coding applications. The consistent improvement margin over DCMVC, the current closest competitor, provides confidence that the gains are not fragile or condition-dependent. The combination of strong absolute performance, robustness to configuration changes, and architectural efficiency (the decoding parallelism from checkerboard and channel-group decomposition, the attention complexity) establishes STAC as a state-of-the-art neural video compression framework with strong practical deployment potential across diverse streaming, broadcasting, and storage applications.
5. Discussion
The experimental results presented in Section 4 demonstrate consistent and substantial performance advantages for the proposed STAC framework across all 24 evaluation configurations. In this section, we move beyond the presentation of aggregate numbers to critically analyse the underlying mechanisms that drive these improvements. The discussion is organised around five themes: (i) the synergistic contribution of the Adaptive Context Selector and Enhanced Sliding Window Attention, (ii) the role of the dual-path entropy model and neural arithmetic encoding pipeline, (iii) a quantitative comparison of STAC’s architectural novelties against the specific design choices of competing methods, (iv) the technical decomposition of encoding and decoding performance, and (v) the limitations of the current framework and their implications for future research.
Table 7 provides a structured summary of the key architectural differences between STAC and competing methods, and Table 8 presents a component-level analysis of the performance contributions.
5.1. Contribution of STAC and Enhanced Sliding Window Attention
The core coding gain of our framework stems from the synergy between the Adaptive Context Selector (ACS) and the Enhanced Sliding Window Attention (ESWA) mechanism within the STAC entropy model. To understand the magnitude and nature of this contribution, it is instructive to examine the performance gaps between STAC and each competing method, as these gaps isolate the effect of specific architectural differences.
The dramatic performance gap between STAC (−32.20%) and VCT (+16.70%) under YUV IP = −1, a difference of 48.90 percentage points, provides strong evidence that fixed-window, patch-based temporal attention is a limiting design for neural video compression, although this comparison alone does not isolate the individual design differences. Both STAC and VCT use transformer-based entropy modelling, but they differ in three critical design choices: (i) VCT uses patch-based attention with non-uniform receptive fields, while STAC uses patchless ESWA with a uniform receptive field; (ii) VCT employs a fixed temporal window that attends to all preceding frames equally, while STAC uses ACS to dynamically select the most informative references; and (iii) VCT uses a single-path entropy model with single-Gaussian likelihood, while STAC uses a dual-path model with GMM likelihood. VCT’s patch-based processing divides each frame into non-overlapping spatial patches and processes the temporal sequence of patches independently, creating artificial boundaries where information cannot flow between adjacent patches within the same frame. This design causes two fundamental problems: it wastes capacity attending to irrelevant references in sequences with fast motion or scene changes, and it fails to capture long-range spatial dependencies that cross patch boundaries. STAC addresses both issues: ACS avoids spending capacity on irrelevant references, and ESWA’s patchless sliding window design with learned local bias matrices lets information flow across what would otherwise be patch boundaries.
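The patch-boundary problem described above can be seen in a one-dimensional toy example. The sketch below builds boolean attention masks for non-overlapping patches and for a sliding window; it is a deliberate simplification (one spatial dimension, no temporal axis, no learned biases), and the function names and sizes are illustrative assumptions.

```python
def patch_mask(n, patch):
    """Position i may attend to j only if both lie in the same patch."""
    return [[i // patch == j // patch for j in range(n)] for i in range(n)]

def sliding_window_mask(n, radius):
    """Position i may attend to every j within `radius` positions of it."""
    return [[abs(i - j) <= radius for j in range(n)] for i in range(n)]

n = 8
patches = patch_mask(n, patch=4)           # patches cover positions 0-3 and 4-7
window = sliding_window_mask(n, radius=2)  # each position sees 2 neighbours per side

# Positions 3 and 4 are spatial neighbours but sit in different patches.
print("patch-based, 3 attends to 4:   ", patches[3][4])
print("sliding window, 3 attends to 4:", window[3][4])

# Number of positions visible to the left and right of position 3.
left_right_patch = (sum(patches[3][:3]), sum(patches[3][4:]))
left_right_window = (sum(window[3][:3]), sum(window[3][4:]))
print(left_right_patch, left_right_window)
```

In the patch-based mask, a position at a patch edge sees context on one side only, whereas the sliding window gives interior positions the same symmetric neighbourhood regardless of where they fall in the frame.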
The 2.70 pp improvement over DCMVC (−32.20% vs. −29.50%) is particularly significant because DCMVC represents the current state of the art in the conditional coding paradigm and uses a sophisticated flow-oriented context modulation mechanism. The gap between STAC and DCMVC can be attributed primarily to three factors. First, DCMVC uses a fixed context propagation strategy that always conditions on the immediately preceding reconstructed frame and its associated feature, whereas ACS dynamically selects from a larger buffer of candidate references based on learned content-dependent relevance scores. This means that when the immediately preceding frame provides poor predictive context, such as after a scene cut, a flash, or a rapid zoom, ACS can fall back to more distant but more relevant references, while DCMVC is committed to using the suboptimal adjacent reference. Second, DCMVC’s temporal modelling is entirely convolutional, limiting its receptive field to the kernel size and preventing it from capturing long-range spatial correlations within each frame. STAC’s ESWA, by contrast, captures correlations within a configurable 3D spatio-temporal window that can encompass hundreds of latent elements, providing a substantially richer conditioning signal for probability estimation. Third, STAC’s dual-path entropy model captures both channel-wise sequential dependencies and spatio-temporal cross-channel correlations, while DCMVC uses a single entropy estimation path that may miss complementary statistical structures.
The per dataset results confirm these architectural advantages. On high-resolution sequences where complex and diverse motion patterns are prevalent (UVG: −36.35%, HEVC B: −35.23%), STAC achieves its largest absolute gains because ESWA’s multi-head attention can simultaneously model multiple motion hypotheses within the sliding window, and ACS can select references from frames with similar motion characteristics, rather than relying on temporal adjacency. On videoconferencing content (HEVC E: −28.02%), where motion is highly localised to the head-and-shoulder foreground region and the background is largely static, the adaptive gating mechanism in ESWA learns to suppress attention to the static background and concentrate temporal modelling capacity on the dynamic foreground, resulting in more efficient bit allocation. The 3.48 pp improvement over DCMVC on HEVC Class E, the largest per dataset margin, demonstrates that content-adaptive reference selection via ACS is particularly valuable when the temporal dynamics are content-specific rather than uniform.
The robustness of STAC across intra-period settings provides further evidence of the effectiveness of both ACS and ESWA. Under IP = 32, the periodic insertion of I-frames every 32 frames creates temporal discontinuities that reset the prediction chain and disrupt the accumulated temporal context. STAC degrades by only 5.19 percentage points (from −32.20% to −27.01%), whereas DCMVC degrades by 6.42 pp (from −29.50% to −23.08%). This 1.23 pp advantage in resilience stems from two complementary mechanisms. First, ACS dynamically adjusts its reference selection after each I-frame: when the temporal buffer contains only a few recently decoded P-frames of uncertain quality, ACS assigns lower relevance scores to all candidates and selects fewer references (a smaller effective K), avoiding the risk of conditioning on low-quality context that would degrade probability estimates. As more frames are decoded and the buffer fills with increasingly diverse references, ACS progressively increases K and selects from a richer context set. This adaptive behaviour contrasts with fixed-reference methods that always condition on the same number of references regardless of their quality. Second, the temporal gating mechanism in ESWA modulates attention decay based on the spatio-temporal distance and quality of the available context. After an I-frame, the learned decay parameter naturally suppresses long-range temporal dependencies because the relevant distant references are no longer available, focusing attention on the most recently decoded and therefore most reliable context. This graceful degradation mechanism prevents the error amplification that occurs in methods with fixed temporal conditioning, where stale or unavailable context propagates corrupted probability estimates through the prediction chain.
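The adaptive choice of K described above can be illustrated with a minimal sketch. ACS in STAC is a learned module; the fixed threshold rule, the function name `select_references`, and the example scores below are assumptions made purely for illustration.

```python
def select_references(scores, k_max=4, threshold=0.5):
    """Pick up to k_max buffer indices whose relevance score clears a threshold.

    Hypothetical stand-in for a learned selector: candidates are ranked by
    score, and low-scoring ones are dropped, so the effective K shrinks
    when every candidate looks unreliable.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k_max] if scores[i] >= threshold]


# Shortly after an I-frame: a small buffer with uncertain candidates.
after_iframe = select_references([0.55, 0.30])

# Later in the sequence: a fuller buffer with several relevant candidates.
steady_state = select_references([0.90, 0.20, 0.75, 0.60, 0.85, 0.40])

print(after_iframe, steady_state)
```

In this toy setting the effective K is 1 immediately after the I-frame and grows to 4 once the buffer holds more high-scoring candidates, which mirrors the behaviour described in the text.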
5.2. Contribution of the Dual-Path Entropy Model
The dual-path entropy model represents a novel architectural contribution that addresses a fundamental limitation of single-path probability estimation. In existing neural video codecs, the entropy model predicts the probability distribution of each latent element using a single processing pathway, which must simultaneously capture channel-wise dependencies (correlations between different feature channels at the same spatial location), spatial dependencies (correlations between neighbouring spatial locations within the same channel), and temporal dependencies (correlations between corresponding elements in different frames). A single pathway can trade off capacity between these three types of dependencies, but cannot specialise independently for each.
STAC’s dual-path architecture separates these responsibilities. The channel-wise autoregressive path processes channels sequentially at each spatial position, conditioning the distribution of channel c on previously decoded channels, the corresponding position in ACS-selected reference frames, and already-decoded spatial neighbours. This path excels at capturing the strong inter-channel correlations that arise in convolutional latent spaces, where adjacent channels often encode related features (e.g., edges at different orientations, textures at different scales). The spatio-temporal path uses ESWA to predict a joint distribution over all channels simultaneously, capturing the cross-channel and cross-frame correlations that the autoregressive factorisation would otherwise miss. For instance, when a textured region in the current frame is well predicted by a corresponding region in a reference frame, the spatio-temporal path can leverage this alignment to produce a tight probability estimate for all channels jointly, even if the individual channel-wise estimates would be less precise.
The adaptive fusion gate combines these two paths through a learned gating mechanism whose weights are conditioned on three signals: the channel-wise path features, the spatio-temporal path features, and the estimated bit cost from each path. The inclusion of the bit cost estimate is a critical design choice: it enables the gate to allocate more weight to whichever path provides the tighter probability estimate for each specific latent element, rather than using fixed or spatially uniform weights. In practice, the channel-wise path tends to dominate for latent elements in smooth, low-texture regions, where inter-channel correlations are strong and spatial context is limited, while the spatio-temporal path dominates for elements in textured, high-detail regions, where the temporal reference provides a strong predictive signal. This content-adaptive routing ensures that the fused probability estimate is consistently tighter than either individual estimate.
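A simplified view of cost-aware gating is sketched below. It is not the learned gate used in STAC: the function `fuse`, the use of each path's predictive entropy as its estimated bit cost, the sigmoid weighting, and the example distributions are all assumptions chosen to show how a lower estimated cost can translate into a larger fusion weight.

```python
import math


def entropy_bits(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def fuse(dist_channel, dist_spatiotemporal, temperature=1.0):
    """Blend two predicted distributions with a cost-dependent weight.

    Assumption: the estimated bit cost of a path is the entropy of its
    prediction, and the gate is a sigmoid of the cost difference.
    """
    cost_c = entropy_bits(dist_channel)
    cost_s = entropy_bits(dist_spatiotemporal)
    w_channel = 1.0 / (1.0 + math.exp((cost_c - cost_s) / temperature))
    fused = [
        w_channel * pc + (1.0 - w_channel) * ps
        for pc, ps in zip(dist_channel, dist_spatiotemporal)
    ]
    return w_channel, fused


# Hypothetical predictions over three symbols: a confident channel-wise path
# and a less certain spatio-temporal path.
w_channel, fused = fuse([0.05, 0.90, 0.05], [0.30, 0.40, 0.30])
print(round(w_channel, 3), [round(p, 3) for p in fused])
```

Because the weight is computed per latent element, the same rule would favour the spatio-temporal path wherever that path is the more confident of the two.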
5.3. Role of the Neural Arithmetic Encoder
The neural arithmetic encoding pipeline contributes to the overall coding performance through three distinct mechanisms that operate at different granularities of the compression process.
First, the Gaussian Mixture Model (GMM) with three components provides a flexible, multimodal probability distribution for each latent element. Compared to the single-Gaussian assumption used by most competing methods (see Table 7), the mixture model better captures the heavy-tailed and multimodal distributions that arise in motion-compensated latent representations. In neural video coding, the latent distribution is rarely unimodal: a given spatial position may correspond to a region that is well predicted by the temporal reference (yielding near-zero residual with a narrow distribution) or poorly predicted (yielding large residual with a wide distribution), and the mixture model can represent both cases simultaneously through its component weights. The discrete probability for each quantised symbol is computed from the mixture, whose parameters (component weights, means, and scales) are predicted by the STAC transformer. This tight coupling between the entropy model and the probability estimator ensures that the predicted distributions are maximally informative for the actual latent statistics, and the three-component mixture provides sufficient flexibility to approximate a wide range of empirical distributions without the computational overhead of higher-order mixtures.
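For readers unfamiliar with discretised mixture likelihoods, the following sketch shows one common way to obtain the probability of a quantised symbol: integrate each Gaussian component over the quantisation bin and sum with the component weights. The exact formulation used in STAC is not reproduced here; the bin width, the function names, and the example parameters are assumptions.

```python
import math


def normal_cdf(x, mean, scale):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def gmm_symbol_probability(symbol, weights, means, scales, step=1.0):
    """Probability mass of the bin [symbol - step/2, symbol + step/2]."""
    half = step / 2.0
    return sum(
        w * (normal_cdf(symbol + half, m, s) - normal_cdf(symbol - half, m, s))
        for w, m, s in zip(weights, means, scales)
    )


# Hypothetical three-component mixture: a narrow mode at zero (well predicted),
# a wide mode at zero (poorly predicted), and a small offset mode.
weights = [0.6, 0.3, 0.1]
means = [0.0, 0.0, 2.0]
scales = [0.3, 2.0, 1.0]

p_zero = gmm_symbol_probability(0, weights, means, scales)
bits_zero = -math.log2(p_zero)  # ideal code length for the symbol 0
total_mass = sum(
    gmm_symbol_probability(k, weights, means, scales) for k in range(-30, 31)
)
print(round(p_zero, 4), round(bits_zero, 3), round(total_mass, 6))
```

The ideal code length of a symbol is the negative base-2 logarithm of its probability, so a mixture that places more mass on the symbols that actually occur yields a shorter bitstream.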
Second, the hierarchical context structure that feeds the arithmetic encoder combines hyperprior, temporal, and spatial contexts in a carefully structured manner. The hyperprior provides coarse global statistics transmitted as side information, capturing the overall energy distribution and spatial non-stationarity of the latent representation. The temporal context from ACS-selected references captures inter-frame redundancy by providing element-wise predictions based on the most relevant previously decoded frames. The spatial context from the checkerboard-decoded elements captures intra-frame correlations by conditioning on already-decoded neighbouring positions. This three-level conditioning progressively narrows the conditional entropy: each additional context source provides information that was not captured by the preceding sources, translating directly to fewer bits per symbol. The hyperprior alone reduces entropy by providing spatially adaptive variance estimates; adding temporal context reduces it further by providing frame-to-frame predictions; adding spatial context reduces it yet further by exploiting local redundancy within the current frame. The progressive nature of this conditioning is critical because it ensures that each context source contributes non-redundantly, maximising the information-theoretic efficiency of the combined model.
Third, the content-adaptive quantisation mechanism, where the step size is predicted by a lightweight network conditioned on the local context features, enables dynamic bit allocation at spatial-channel granularity. Visually salient regions with fine textures and edges receive finer quantisation (smaller step size, more bits), while smooth or perceptually less important areas receive coarser quantisation (larger step size, fewer bits). This spatial adaptivity provides a dual benefit: it improves rate-distortion performance by concentrating bits where they have the greatest impact on reconstruction quality, and it provides an implicit form of perceptual quality optimisation by preserving detail in visually important regions. This mechanism is particularly beneficial for high-resolution content, such as UVG (4K) sequences, where uniform quantisation would either waste bits on large flat background regions or under-represent the fine textures and edges that dominate subjective quality assessment. The consistent improvement of STAC across resolutions from 240p to 4K (−35.19% to −36.35%) suggests that the content-adaptive quantisation scales effectively across the resolution spectrum, dynamically adjusting the spatial granularity of bit allocation to match the available spatial detail.
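The effect of an element-wise step size can be illustrated with a small numerical sketch. In STAC the step size is predicted by a network; here the step sizes are supplied by hand, and the function names and latent values are hypothetical.

```python
def quantise(latents, steps):
    """Map each latent to an integer index using its own step size."""
    return [round(y / d) for y, d in zip(latents, steps)]


def dequantise(indices, steps):
    """Reconstruct latents from integer indices and step sizes."""
    return [q * d for q, d in zip(indices, steps)]


# Two hypothetical "textured" elements with a fine step, followed by two
# "smooth" elements with a coarse step.
latents = [0.93, -1.42, 2.10, 0.26]
steps = [0.25, 0.25, 1.0, 1.0]

indices = quantise(latents, steps)
reconstruction = dequantise(indices, steps)
errors = [abs(y - r) for y, r in zip(latents, reconstruction)]
print(indices, reconstruction, [round(e, 2) for e in errors])
```

The reconstruction error of each element is bounded by half its step size, so a smaller step buys lower distortion at the price of a wider range of indices to encode.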
5.4. Comparative Analysis of Architectural Novelties
To clearly delineate the specific contributions of STAC relative to competing methods, Table 7 presents a side-by-side comparison of the key architectural design choices across eight dimensions. Several observations emerge from this comparison that illuminate the sources of STAC’s performance advantage.
Finding 1: Adaptive reference selection is a critical differentiator. All competing methods use fixed reference selection strategies, whether a single previous frame (DVC), a fixed temporal window (VCT), or a propagated context from a predetermined set of adjacent frames (DCVC-DC, DCMVC). STAC is the only method that learns to select references based on content-dependent relevance, enabling it to exploit non-adjacent temporal correlations that fixed strategies miss. The 3.48 pp advantage over DCMVC on HEVC Class E, where quasi-periodic content creates opportunities for non-adjacent reference exploitation, provides direct evidence of this benefit.
Finding 2: Patchless attention with uniform receptive fields eliminates architectural artifacts. VCT’s patch-based attention creates non-uniform receptive fields and requires computationally redundant overlapping windows. STAC’s patchless ESWA provides a uniform receptive field across the entire latent space, ensuring that every spatial position receives equally rich contextual conditioning. The 48.90 pp gap between STAC and VCT, despite both using transformer-based architectures, confirms that the attention topology is at least as important as the use of attention itself.
Finding 3: Dual-path entropy estimation captures complementary dependencies. All competing methods use single-path entropy estimation, which must trade off capacity between channel-wise, spatial, and temporal dependencies. STAC’s dual-path architecture with adaptive fusion enables specialised processing for different types of statistical dependencies, with the fusion gate routing each latent element to the path that provides the tighter probability bound.
Finding 4: Gaussian mixture likelihoods provide distributional flexibility. The single-Gaussian assumption used by most competing methods is a poor fit for the heavy-tailed, multimodal distributions that characterise motion-compensated latent representations. STAC’s three-component GMM provides the flexibility to represent these complex distributions accurately, yielding tighter probability bounds and shorter codeword lengths.
Finding 5: Learned temporal gating enables adaptive context weighting. DCMVC uses flow-oriented temporal context but applies fixed modulation. STAC’s ESWA incorporates a learned decay gating mechanism that adaptively modulates attention weights based on spatio-temporal distance and content characteristics, enabling the model to gracefully adjust its temporal dependence when context quality varies, as demonstrated by its 1.23 pp smaller degradation when moving from IP = −1 to IP = 32 (5.19 pp vs. 6.42 pp).
5.5. Component Ablation Study
While the cross-method comparisons in Section 4 and Section 5 demonstrate STAC’s overall superiority, they do not isolate the contribution of each individual component because STAC differs from every competing method in multiple dimensions simultaneously. To disentangle these contributions, we conduct a systematic ablation study in which components are progressively added to a baseline configuration, and the average BD-rate is measured under YUV IP = −1 across all six benchmark datasets after retraining each configuration to convergence. The baseline is defined as a model that uses fixed two-frame reference selection (no ACS), standard 3D Sliding Window Attention without learnable biases or temporal gating (i.e., the SWA formulation of Kopte and Kaup [32]), a single-path entropy model, single-Gaussian likelihood, and no Latent Residual Prediction. Each subsequent row in Table 9 adds exactly one component while keeping all others fixed, ensuring that the measured BD-rate change reflects the isolated contribution of that component within the context of all previously added components.
The ablation results in Table 9 reveal several important findings about the relative contribution and interaction of each component.
Finding 1: Adaptive reference selection is the single most impactful component. Adding the Adaptive Context Selector (Step 1) yields the largest individual gain of −2.67 pp, improving the baseline from −24.18% to −26.85%. This confirms that the ability to dynamically select content-relevant reference frames, rather than relying on a fixed set of immediately preceding frames, is the most impactful architectural innovation in STAC. The magnitude of this gain is intuitive: reference frame quality directly determines the conditional entropy, and even modest improvements in reference relevance reduce the entropy of every latent element in the frame.
Finding 2: The two ESWA enhancements provide complementary attention improvements. The learnable local bias (Step 2, −1.46 pp) and temporal gating (Step 3, −0.83 pp) together contribute −2.29 pp. The local bias captures fine-grained relative position preferences within the spatio-temporal attention window, enabling each attention head to develop position-specific sensitivity that the uniform weighting of standard SWA cannot express. The temporal gating adds a complementary capability: adaptive modulation of attention weights based on temporal distance and content dynamics, allowing the model to suppress irrelevant distant context and concentrate on the most informative temporal neighbours. The fact that temporal gating provides additional gain on top of local bias confirms that these two mechanisms address distinct aspects of the attention computation: spatial position preferences and temporal relevance weighting.
Finding 3: Dual-path fusion and GMM together substantially improve probability estimation. The dual-path entropy fusion (Step 4, −0.73 pp) and GMM with three components (Step 5, −1.21 pp) collectively contribute −1.94 pp. The dual-path architecture enables specialised processing for channel-wise and spatio-temporal dependencies, while the GMM provides the distributional flexibility to accurately model the multimodal, heavy-tailed statistics of the conditioned latent distributions. The larger contribution of GMM (−1.21 pp) relative to dual-path fusion (−0.73 pp) suggests that distributional flexibility is at least as important as the entropy model architecture for tight probability bounds.
Finding 4: Latent Residual Prediction provides substantial gain at zero bitrate cost. LRP (Step 6, −1.12 pp) adaptively corrects quantisation errors in the decoded latents using predicted refinement offsets, improving reconstruction quality without transmitting any additional bits. The magnitude of this gain (−1.12 pp) is notable because LRP operates after the entropy coding stage and therefore cannot affect the bitstream length; its entire contribution comes from reducing the distortion term in the rate-distortion objective.
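As a consistency check on the figures quoted in Findings 1 to 4, the per-step gains can be accumulated from the baseline. The short script below uses only the numbers stated in the text; the variable names are illustrative.

```python
baseline = -24.18  # average BD-rate (%) of the baseline configuration

# (component, gain in percentage points), in the order the steps are added
steps = [
    ("ACS", -2.67),
    ("learnable local bias", -1.46),
    ("temporal gating", -0.83),
    ("dual-path fusion", -0.73),
    ("GMM", -1.21),
    ("LRP", -1.12),
]

running = baseline
cumulative = []
for name, gain in steps:
    running = round(running + gain, 2)
    cumulative.append((name, running))

total_gain = round(sum(gain for _, gain in steps), 2)
eswa_gain = round(steps[1][1] + steps[2][1], 2)
entropy_gain = round(steps[3][1] + steps[4][1], 2)

for name, value in cumulative:
    print(f"{name}: {value:.2f}%")
print(total_gain, eswa_gain, entropy_gain)
```

The six gains sum to −8.02 pp, which takes the baseline from −24.18% to −32.20%, the average BD-rate reported for the full STAC model under YUV IP = −1. The grouped sums of −2.29 pp (ESWA enhancements) and −1.94 pp (dual-path fusion plus GMM) also match the values quoted above.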
5.6. Technical Analysis of Encoding and Decoding Performance
The total bitrate for each P-frame decomposes into three terms: the dominant frame latent rate, the hyperprior side-information rate, and the negligible overhead for transmitting the ACS reference selection indices. The frame latent rate is minimised when the predicted probability distributions closely match the true empirical statistics of the quantised latents. Each of STAC’s components contributes to tightening this match through a different mechanism: ACS and ESWA reduce the conditional entropy by providing richer and more relevant temporal conditioning; the dual-path fusion ensures that the most informative probability path is selected for each element; and the three-component GMM provides the distributional flexibility to accurately represent the conditioned statistics, including multi-modality and heavy tails.
The 2.70 pp average improvement over DCMVC (−32.20% vs. −29.50%) under YUV IP = −1, which is consistent across all six datasets with per-dataset margins ranging from 1.93 pp (UVG) to 3.48 pp (HEVC Class E), confirms that STAC’s components collectively extract additional temporal redundancy beyond what existing conditional coding frameworks can capture. The consistency of this improvement across diverse content types is particularly important: it demonstrates that the gains are not driven by a single content category where STAC happens to excel, but rather reflect a genuinely more effective probability estimation architecture that benefits all types of video content. To put this margin in information-theoretic terms, a 2.70 pp BD-rate improvement corresponds to a measurable reduction in the average cross-entropy between the predicted and true latent distributions, indicating that STAC’s probability estimates are systematically closer to the true data-generating distribution than those of DCMVC.
From the decoder efficiency perspective, STAC achieves practical decoding throughput through two complementary parallelisation strategies. Checkerboard decomposition splits the spatial dimensions into two independent sets, enabling half of the latent elements to be decoded simultaneously in each pass. Channel-group parallelisation further divides the channel dimension into independent groups that can be processed concurrently, providing an additional speedup. The combined effect is greater effective parallelism than fully sequential decoding, which is critical for practical deployment where decoding latency directly impacts user experience. Range-ANS with 16-bit CDF precision ensures near-optimal compression (within 0.01 bits/symbol of the theoretical entropy), and GPU-accelerated probability computation via the STAC transformer enables efficient hybrid GPU/CPU execution where the transformer runs on the GPU, while the ANS engine operates on the CPU.
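The checkerboard split can be made concrete with a short sketch. The parity rule, the function names, and the grid size below are assumptions for illustration; the sketch only shows why the second pass can condition on already-decoded neighbours while all positions within a pass are processed together.

```python
def checkerboard_sets(height, width):
    """Split grid positions into two interleaved sets by parity of row + column."""
    first, second = [], []
    for r in range(height):
        for c in range(width):
            (first if (r + c) % 2 == 0 else second).append((r, c))
    return first, second


def neighbours(r, c, height, width):
    """4-connected neighbours that fall inside the grid."""
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < height and 0 <= j < width]


H, W = 4, 6  # hypothetical latent grid size
first_pass, second_pass = checkerboard_sets(H, W)
decoded_first = set(first_pass)

# Every neighbour of a second-pass position was decoded in the first pass.
context_ready = all(
    n in decoded_first
    for r, c in second_pass
    for n in neighbours(r, c, H, W)
)
print(len(first_pass), len(second_pass), context_ready)
```

Since no two positions in the same set are 4-connected neighbours, the elements of each pass do not depend on one another through this spatial context and can be evaluated in a single batched call.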
The colourspace analysis provides additional insight into the generality of STAC’s learned representations. The YUV-to-RGB performance gap is approximately 0.97 pp under IP = −1 (−32.20% vs. −31.23%) and 0.81 pp under IP = 32 (−27.01% vs. −26.20%). These narrow gaps confirm that the STAC entropy model learns general-purpose probability estimation that is not dependent on colourspace-specific decorrelation properties. In the YCbCr domain, the luminance–chrominance separation provides an explicit decorrelation that simplifies entropy modelling; in RGB, the three channels are strongly correlated, and the entropy model must learn to exploit cross-channel dependencies without the benefit of an explicit decorrelating transform. The fact that STAC’s performance degrades by less than 1 pp when moving from YCbCr to RGB demonstrates that the multi-head attention mechanism in ESWA effectively captures cross-channel dependencies directly in the latent space, compensating for the absence of explicit decorrelation. This colourspace-agnostic capability is a desirable property for deployment in heterogeneous environments where the input format may vary between applications.
5.7. Limitations and Future Directions
While STAC achieves state-of-the-art compression performance across all evaluated configurations, several limitations should be acknowledged to provide a balanced assessment and identify directions for future improvement.
Computational complexity. The 20-block transformer architecture, while effective for probability estimation, introduces substantial computational cost during both encoding and decoding. Each forward pass through the STAC entropy model requires processing all latent tokens through 20 sequential transformer blocks, which limits the achievable throughput on current GPU hardware. Although the Sliding Window Attention reduces per-block complexity from quadratic in the number of latent tokens to linear in the number of tokens for a fixed window size, the total cost remains significantly higher than convolutional entropy models used in the DCVC family. Future work should explore model compression techniques, such as structured pruning, knowledge distillation into shallower architectures, and quantisation-aware training of the transformer weights, to reduce the computational cost without sacrificing compression performance.
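The saving from windowed attention can be gauged by counting query-key pairs per block. The token grid and window dimensions below are hypothetical and are not the settings used in STAC; the windowed count is an upper bound because windows are truncated at the borders.

```python
def attention_pairs_full(num_tokens):
    """Every token attends to every token."""
    return num_tokens * num_tokens


def attention_pairs_windowed(num_tokens, window_size):
    """Every token attends to at most window_size tokens."""
    return num_tokens * window_size


# Hypothetical latent volume: 4 frames of 16 x 16 positions,
# with a 3 x 5 x 5 spatio-temporal window.
tokens = 4 * 16 * 16
window = 3 * 5 * 5

full = attention_pairs_full(tokens)
windowed = attention_pairs_windowed(tokens, window)
print(full, windowed, round(full / windowed, 2))
```

For a fixed window, doubling the number of tokens doubles the windowed count but quadruples the full-attention count, so the gap widens with resolution. The remaining cost still scales with the 20 sequential blocks noted above.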
Low-delay operation. STAC currently operates in a low-delay P-frame configuration without B-frame support. The extension to hierarchical B-frame GOP structures, which enable bidirectional temporal prediction and are widely used in practical broadcasting systems, could yield additional BD-rate improvements by exploiting both forward and backward temporal correlations. The ACS module would need to be extended to select references from both past and future decoded frames, and the ESWA mechanism would need to accommodate bidirectional temporal context windows.