Article

STAC: A Spatio-Temporal Transformer with Adaptive Context for Video Compression

by
Reka Sandaruwan Gallena Watthage
* and
Anil Fernando
Department of Computer & Information Sciences, University of Strathclyde, Glasgow G1 1XH, UK
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(9), 4568; https://doi.org/10.3390/app16094568
Submission received: 2 April 2026 / Revised: 24 April 2026 / Accepted: 29 April 2026 / Published: 6 May 2026
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The rapid growth of video content demands compression solutions more effective than traditional codecs. Although neural video compression has advanced considerably, current methods struggle to model long-range temporal dependencies effectively and to adapt to varying content properties. We introduce STAC (Spatio-Temporal Adaptive Context), a transformer-based neural video compression scheme that addresses these limitations through three contributions. First, the Adaptive Context Selector (ACS) dynamically evaluates and selects the most informative reference frames using learned relevance scoring, in place of the traditional use of predetermined adjacent frame sets. Second, Enhanced Sliding Window Attention (ESWA) models spatio-temporal correlations efficiently by integrating a learnable local bias and temporal gating information into a computationally adjustable attention model. Third, a dual-path entropy model uses an adaptively learned fusion gate to combine channel-wise autoregressive prediction with spatio-temporal prediction, producing better probability estimates for entropy coding. STAC is trained on the Vimeo-90k dataset using a four-phase curriculum with the Adam optimiser over approximately 2.2 M total steps. We tested STAC on six benchmark datasets (UVG, MCL-JCV, and HEVC Classes B, C, D, and E) under varying test settings. The experimental results show that STAC achieves an average BD-rate saving of 32.20% in the YUV colourspace with an intra-period of −1. The consistent improvement across both PSNR and MS-SSIM metrics confirms that STAC’s coding gains arise from genuinely improved probability modelling, rather than metric-specific optimisation.
Evaluations were performed on six standard benchmarks (UVG, MCL-JCV, and HEVC Classes B, C, D, and E) under 24 experimental configurations (six datasets × two colourspaces × two intra-period settings), with all methods tested under identical conditions using the same sequences, frames (96 per sequence), and VTM-17.0 anchor codec. STAC achieves 32.20% average BD-rate savings over VTM under YUV IP = −1, outperforming the prior state-of-the-art DCMVC by 2.70 percentage points. Under IP = 32, STAC achieves −27.01%, with only 5.19 pp degradation versus 6.42 pp for DCMVC. The results generalise to the RGB colourspace (−31.23%) and scale from 240p (−35.19%) to 4K (−36.35%).

1. Introduction

Video data now constitutes the single largest category of global internet traffic, with recent industry analyses estimating that video accounts for approximately 82% of all consumer IP traffic and that global internet traffic volume has grown at a compound annual rate exceeding 20% over the past five years. The proliferation of streaming platforms, user-generated content on social media, real-time videoconferencing, cloud gaming, and emerging immersive media formats, including 360-degree video and volumetric content, has placed extraordinary pressure on compression pipelines to deliver ever-higher visual quality at lower bitrates. This demand is further compounded by the steady migration toward higher spatial resolutions, from 1080p Full HD to 4K Ultra HD and now 8K, alongside the adoption of high dynamic range (HDR) and wide colour gamut (WCG) content, all of which dramatically increase the raw data volume that must be transmitted or stored. In this landscape, the efficiency of the video codec is no longer merely a technical concern; it is an economic and infrastructural imperative that directly affects network provisioning costs, energy consumption in data centres, and the quality of experience delivered to billions of end users worldwide.
The evolution of standardised video coding has followed a remarkably consistent trajectory of incremental, block-based refinement. H.264/AVC, introduced in 2003, established the dominant hybrid coding paradigm of block-based motion estimation and compensation, discrete cosine transform (DCT) residual coding, and context-adaptive entropy coding, and it remains one of the most widely deployed codecs to this day. Its successor, H.265/HEVC, finalised in 2013, introduced flexible quadtree block partitioning, advanced intra prediction modes, and sample-adaptive offset filtering, achieving approximately 50% bitrate reduction over H.264 at equivalent perceptual quality. Most recently, H.266/Versatile Video Coding (VVC), completed in July 2020 by the Joint Video Experts Team (JVET), has extended this trajectory further through quadtree plus multi-type tree (QTMT) partitioning, affine motion models, geometric partitioning, cross-component prediction, and more sophisticated in-loop filtering, delivering a further 40–50% bitrate reduction over HEVC [1,2]. However, each successive generation of standards has come with substantially increased encoding complexity; VTM, the VVC reference software, requires encoding times up to an order of magnitude greater than those of HM, the HEVC reference encoder. More fundamentally, the hand-engineered, modular architecture of these codecs imposes inherent limitations: each component, such as the motion estimator, the transform, the quantiser, and the entropy coder, is designed and optimised in relative isolation, which prevents joint global optimisation toward a unified rate-distortion objective and limits the system’s ability to adapt its coding strategy to local content characteristics [3,4].
These fundamental limitations of the traditional hybrid coding paradigm have motivated the emergence of neural video compression (NVC) as a compelling alternative. Deep neural networks offer three structural advantages that are difficult to replicate within hand-engineered frameworks. First, end-to-end differentiable training enables the joint optimisation of all coding components, from the analysis and synthesis transforms through motion modelling to entropy coding, toward a single rate-distortion (or rate-distortion-perception) loss function, ensuring that every module cooperates to maximise overall coding efficiency [5,6]. Second, the powerful nonlinear representation capacity of deep networks allows them to learn content-adaptive transforms and probability models that go far beyond the fixed basis functions and parametric models of traditional codecs. Third, the flexibility of end-to-end learning enables rapid adaptation to new content domains, quality metrics, or deployment constraints, for instance, by changing the distortion term from the mean squared error (MSE) to a perceptual metric, such as multi-scale structural similarity (MS-SSIM) or a learned perceptual loss, without redesigning the entire pipeline.
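As a point of reference, the joint objective described above is commonly written in a generic Lagrangian form. The symbols $R$, $D$, $\lambda$, $x$, $\hat{x}$, $\hat{y}$, and $d$ are notation assumed here for illustration; this section does not state the exact loss used for STAC.

```latex
\mathcal{L} \;=\; R + \lambda D
  \;=\; \mathbb{E}\bigl[-\log_2 p_{\hat{y}}(\hat{y})\bigr]
  \;+\; \lambda\, \mathbb{E}\bigl[d(x, \hat{x})\bigr]
```

Here $d$ is the distortion measure. Choosing $d = \mathrm{MSE}$ or $d = 1 - \text{MS-SSIM}$ changes only the second term, which is the kind of swap the paragraph above refers to.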
The foundations of neural compression were laid in the domain of still-image coding. Ballé et al. [7] proposed the first end-to-end optimised image compression framework based on nonlinear transform coding, employing convolutional neural networks for the analysis and synthesis transforms, together with a factorised entropy model, and demonstrated rate-distortion performance surpassing JPEG and JPEG 2000. The subsequent introduction of hyperprior-based entropy models [8] was a critical advance, as it enabled the encoder to transmit a compact summary of the latent statistics as side information, thereby capturing spatial dependencies and significantly improving probability estimation for arithmetic coding. Cheng et al. [9] further refined this line by proposing discretised Gaussian mixture likelihoods (GMM) and attention modules, achieving the first learned image codec to match the rate-distortion performance of VVC intra coding under the PSNR metric, a result that has since been corroborated and extended by the JPEG AI standardisation effort [10]. These results in still-image compression have conclusively demonstrated that learned approaches can reach and even surpass the compression efficiency of the most advanced hand-designed intra coding tools, providing a solid foundation upon which neural video compression is built.
The extension from images to video introduces the central challenge of temporal redundancy exploitation. The pioneering work of Lu et al. [11] proposed DVC, the first end-to-end deep video compression framework, which adopted the classical predictive coding structure of optical flow estimation followed by residual coding, but replaced each component with learnable networks. This established the dominant NVC paradigm, but it also exposed its limitations: explicit motion estimation and pixel-domain residual subtraction are fundamentally suboptimal because a simple linear difference operation cannot fully capture the complex, nonlinear inter-frame redundancy present in natural video. Subsequent work addressed various facets of this problem. Recent work on multimodal feature fusion for video analysis by Sheng et al. [12] has demonstrated that combining spatial, frequency, and optical flow features provides complementary information for understanding video content, a principle conceptually aligned with our dual-path entropy model’s combination of channel-wise and spatio-temporal prediction paths. Agustsson et al. [13] introduced scale-space flow to handle disocclusions and fast motion more gracefully by augmenting optical flow with a scale parameter. Hu et al. [14] proposed FVC, which relocated all major operations, motion estimation, compression, and compensation, into a learned feature space, demonstrating significant gains from feature-domain processing. Habibian et al. [15] explored a fundamentally different direction with 3D autoregressive autoencoders over discrete latent spaces, bypassing explicit motion modelling entirely and showing that generative models can effectively capture spatio-temporal redundancy. These diverse architectural explorations underscored a growing consensus that the most promising path forward lies not in refining individual modules of the predictive coding pipeline but in rethinking the inter-frame coding paradigm itself.
A pivotal shift came with the introduction of conditional coding, which replaces the traditional residual signal with feature-domain contextual conditioning. Li et al. [16] proposed DCVC, the first deep contextual video compression framework, arguing that feature-domain context carries richer information than pixel-domain residuals for both the encoder and the decoder, and demonstrating substantially improved coding efficiency. This line of work advanced rapidly: DCVC-DC [17] introduced hierarchical quality structures and group-based offset diversity; DCVC-FM [18] addressed the practical need for variable-rate operation through feature modulation with learnable quantisation scalers, enabling a single model to span an 11.4 dB PSNR range; and, most recently, DCMVC [19] proposed context modulation through flow-oriented temporal context and context compensation, achieving state-of-the-art results with an average 22.7% BD-rate reduction over VVC. In parallel, Sheng et al. [20] advanced temporal context mining by propagating both reconstructed frames and pre-reconstruction features, and Qi et al. [21] demonstrated that bidirectional information flow between motion coding and frame coding yields a further 12.9% bitrate saving. Jiang et al. [22] recently proposed ECVC, which exploits non-local correlations across multiple reference frames, together with a partial cascaded fine-tuning strategy to mitigate error accumulation, achieving 10–11% additional bitrate savings over DCVC-FM.
Alongside these advances in temporal modelling, the accuracy of the entropy model has emerged as arguably the single most important determinant of compression efficiency. The entropy model governs the probability estimates used in arithmetic coding; any improvement in probability prediction translates directly and measurably into bitrate savings. Li et al. [23] introduced hybrid spatial–temporal entropy modelling that captures both intra-frame spatial and inter-frame temporal correlations, together with content-adaptive quantisation for dynamic bit allocation, achieving a landmark 18.2% BD-rate saving over VVC on the UVG dataset. Qian et al. [24] proposed Entroformer, a transformer-based entropy model for learned image compression with top-K self-attention and diamond relative position encoding, demonstrating that transformers can overcome the limited receptive field of convolutional entropy models and capture long-range spatial dependencies more effectively. The transformer architecture, originally developed for natural language processing [25,26] and subsequently adapted to computer vision through models such as the Vision Transformer (ViT) [27] and Swin Transformer [28], is particularly well suited to video compression because its self-attention mechanism can model dependencies across arbitrary spatial and temporal distances, making it a natural fit for exploiting the long-range correlations inherent in video sequences [29,30].
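The link between probability estimation and bitrate can be made concrete with the ideal code length of a symbol, $-\log_2 p$. The sketch below is illustrative only: the symbol sequence and both probability tables are hypothetical values chosen for the example, not data from the paper.

```python
import math

def ideal_code_length_bits(symbols, pmf):
    """Total ideal code length, in bits, of `symbols` under the model `pmf`."""
    return sum(-math.log2(pmf[s]) for s in symbols)

# Hypothetical quantised latent symbols, mostly zeros.
symbols = [0, 0, 0, 1, 0, 0, -1, 0]

# A flat model spreads probability evenly over the alphabet.
flat_model = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}
# A sharper model that matches the empirical symbol frequencies.
sharp_model = {-1: 0.125, 0: 0.75, 1: 0.125}

bits_flat = ideal_code_length_bits(symbols, flat_model)
bits_sharp = ideal_code_length_bits(symbols, sharp_model)
print(f"flat: {bits_flat:.2f} bits, sharp: {bits_sharp:.2f} bits")
```

An arithmetic coder driven by the sharper model approaches the smaller of the two totals, which is why entropy-model accuracy maps so directly onto bitrate.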
In the specific context of video compression, Mentzer et al. [31] proposed the Video Compression Transformer (VCT), a landmark work that demonstrated that a transformer operating directly on frame representations can achieve competitive compression performance without any explicit motion prediction or warping, thereby vastly simplifying the neural video codec architecture. However, VCT’s reliance on patch-based processing introduces two significant architectural flaws, as analysed by Kopte and Kaup [32]: non-uniform receptive fields caused by patch boundaries, and computationally redundant overlapping windows required for temporal autoregressive modelling. Kopte and Kaup addressed these issues with 3D Sliding Window Attention (SWA), a patchless local attention mechanism that provides a uniform receptive field and reduces overall decoder complexity by a factor of 2.8 while achieving up to 18.6% BD-rate savings over VCT. Zhu et al. [33] further demonstrated that Swin Transformer-based nonlinear transforms can achieve better compression efficiency than convolutional transforms with fewer parameters and faster decoding. Yang et al. [34] provided a unifying perspective by viewing neural video codecs through the lens of deep generative modelling, proposing improved temporal autoregressive transforms and structured entropy models with temporal dependencies. Despite these encouraging advances, fundamental challenges remain. Existing methods still struggle to balance the conflicting requirements of exploiting long-range temporal dependencies for accurate probability estimation while maintaining computationally tractable models, managing error propagation across long prediction chains, particularly under intra-period constraints, and adapting reference frame selection to the widely varying temporal dynamics of natural video content [35,36,37].
The analysis above reveals a precise knowledge deficit: no existing neural video codec simultaneously addresses (i) content-adaptive temporal reference selection, (ii) uniform-receptive-field attention with learned spatio-temporal modulation, and (iii) multi-path probability estimation with mixture likelihoods. This leads to the following research question: Can a unified transformer-based framework that integrates adaptive reference selection, Enhanced Sliding Window Attention with learned bias and gating, and dual-path Gaussian mixture entropy modelling achieve state-of-the-art rate-distortion performance across diverse video content, colourspaces, and intra-period configurations while maintaining computational tractability? We hypothesise that these three components address complementary sources of coding inefficiency and will yield near-additive BD-rate improvements when combined.
To address these open challenges, we introduce STAC (Spatio-Temporal Adaptive Context), a transformer-based neural video compression framework that makes three principal contributions to the field. The first contribution is the Adaptive Context Selector (ACS), a learned module that dynamically evaluates and selects the most informative reference frames from a buffer of previously coded latents based on content-dependent relevance scores. Unlike existing approaches that rigidly condition on a fixed set of immediately preceding frames, ACS computes a relevance score for each candidate reference through a lightweight neural network with sigmoid activation, then selects the top-K references that maximise mutual information with the current frame. This content-adaptive selection is particularly beneficial for sequences with complex motion patterns, scene transitions, or periodic content where temporally distant frames may provide superior predictive information. The second contribution is the Enhanced Sliding Window Attention (ESWA) mechanism, which forms the core of the STAC entropy model. ESWA extends standard Sliding Window Attention with two novel components: a learnable local bias matrix that captures fine-grained relative position preferences within the spatio-temporal neighbourhood, and a learned gating mechanism with temporal decay that adaptively modulates attention weights based on spatio-temporal distance. This design achieves $O(N \cdot w_s^2 \cdot w_t)$ complexity, making it tractable for high-resolution video while preserving the ability to model both local texture correlations and medium-range temporal dependencies within a unified attention framework. Critically, ESWA eliminates the non-uniform receptive fields and redundant overlapping computations inherent in the patch-based attention of prior methods, such as VCT.
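The two mechanisms described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the single linear scorer standing in for the lightweight ACS network, the gate form `exp(-decay * |dt|)`, the renormalisation after gating, and all names (`select_references`, `eswa_weights`, `local_bias`, `decay`) are hypothetical. Because each of the N positions attends only to a window of w_s × w_s × w_t neighbours, the cost grows as N · w_s² · w_t.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def select_references(current, buffer, weights, k):
    """ACS sketch: score each buffered latent against the current one, keep the top k.

    The scorer is one linear layer over the element-wise product of the two
    latents (an assumed stand-in for the learned relevance network).
    Returns (index, score) pairs sorted by descending score.
    """
    scored = []
    for idx, ref in enumerate(buffer):
        logit = sum(w * c * r for w, c, r in zip(weights, current, ref))
        scored.append((idx, sigmoid(logit)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def eswa_weights(logits, offsets, local_bias, decay):
    """ESWA sketch: attention weights for one query over its local window.

    logits     : raw query-key scores, one per window position
    offsets    : (dt, dy, dx) offset of each window position from the query
    local_bias : dict mapping offset -> bias value (hypothetical values)
    decay      : temporal decay rate of the assumed gate exp(-decay * |dt|)
    """
    biased = [score + local_bias.get(off, 0.0) for score, off in zip(logits, offsets)]
    peak = max(biased)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    soft = [e / total for e in exps]
    gated = [s * math.exp(-decay * abs(off[0])) for s, off in zip(soft, offsets)]
    norm = sum(gated)
    return [g / norm for g in gated]

# Toy latents: references 0 and 2 resemble the current frame, reference 1 does not.
current = [1.0, 0.5, -0.5]
buffer = [[0.9, 0.6, -0.4], [-1.0, 0.2, 0.5], [1.1, 0.4, -0.6]]
top = select_references(current, buffer, weights=[1.0, 1.0, 1.0], k=2)

# Equal raw scores at three temporal offsets: the gate favours nearer frames.
window = eswa_weights(
    logits=[0.0, 0.0, 0.0],
    offsets=[(0, 0, 0), (-1, 0, 0), (-2, 0, 0)],
    local_bias={},
    decay=0.5,
)
```

In the toy example the selector keeps references 2 and 0, and the gated weights decrease with temporal distance while still summing to one.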
The third contribution is a dual-path entropy model that combines a channel-wise autoregressive prediction path with a spatio-temporal prediction path through an adaptively learned fusion gate, whose gating weights are conditioned on both the path-specific features and the estimated bit cost of each path. This architecture enables the entropy model to capture complementary statistical dependencies: the autoregressive path models sequential channel correlations, while the spatio-temporal path captures cross-channel and cross-frame correlations that the autoregressive factorisation would otherwise miss. The fused probability estimates are used to parameterise a Gaussian Mixture Model with K = 3 components for neural arithmetic encoding, yielding tighter probability estimates and consequently shorter codeword lengths.
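A rough sketch of the dual-path idea follows, assuming a scalar gate and a simple convex blend of the two paths' mixture parameters. The gate formula, the parameter values, and the function names are illustrative assumptions; the text above states only that the gate is conditioned on path features and estimated bit costs, and that the fused estimates parameterise a three-component Gaussian mixture.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def gmm_bin_probability(y, mus, sigmas, pis):
    """Probability mass of integer symbol y under a discretised Gaussian mixture."""
    return sum(
        pi * (normal_cdf(y + 0.5, mu, s) - normal_cdf(y - 0.5, mu, s))
        for mu, s, pi in zip(mus, sigmas, pis)
    )

def fusion_gate(bits_ar, bits_st, feature_logit=0.0):
    """Hypothetical gate in (0, 1): leans toward the path with the lower bit cost."""
    return 1.0 / (1.0 + math.exp(-(feature_logit + bits_st - bits_ar)))

def fuse(params_ar, params_st, gate):
    """Element-wise convex blend: gate * autoregressive + (1 - gate) * spatio-temporal."""
    return [gate * a + (1.0 - gate) * b for a, b in zip(params_ar, params_st)]

# Hypothetical (means, scales, weights) predicted by each path, K = 3 components.
ar = ([0.0, 1.0, -1.0], [0.5, 1.0, 1.0], [0.6, 0.2, 0.2])
st = ([0.2, 0.8, -1.2], [0.7, 1.2, 0.9], [0.5, 0.3, 0.2])

symbol = 0
bits_ar = -math.log2(gmm_bin_probability(symbol, *ar))
bits_st = -math.log2(gmm_bin_probability(symbol, *st))
gate = fusion_gate(bits_ar, bits_st)
fused = tuple(fuse(a, b, gate) for a, b in zip(ar, st))
bits_fused = -math.log2(gmm_bin_probability(symbol, *fused))
```

A convex blend keeps the mixture weights summing to one and the scales positive, so the fused parameters still define a valid distribution for the arithmetic coder.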
Architecturally, STAC employs an encoder–decoder framework with a multi-scale feature transform comprising four strided convolutional layers with Generalised Divisive Normalisation (GDN) activations, producing hierarchical latent representations at 1/2×, 1/4×, 1/8×, and 1/16× spatial scales. The finest-scale latent is quantised and processed by the ACS to select top-K reference frames from the decoded latent buffer. The selected temporal contexts, together with the current frame’s latent and a hyperprior side information stream, are fed into the STAC entropy model, a stack of 20 transformer blocks with ESWA, which predicts the Gaussian mixture distribution parameters $(\mu, \sigma, \pi)$ for arithmetic coding, as well as Latent Residual Prediction (LRP) refinement offsets $(\Delta)$ that adaptively correct the quantised latents without additional bitrate. The decoder mirrors this process: it applies the identical probability model to recover the latents from the bitstream via range-ANS decoding, applies LRP refinement, and generates the reconstructed frame through transposed convolutional layers. Parallel checkerboard decomposition and four-group channel parallelisation yield an effective 8× decoding speedup over sequential processing.
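The spatial bookkeeping implied by this architecture can be sketched as follows. The rounding rule (output size equal to the ceiling of half the input), the parity convention of the checkerboard, and the contiguous channel split are assumptions made for illustration; the paper's exact padding and grouping are not given in this section.

```python
def latent_shapes(height, width, num_stages=4):
    """Spatial size after each stride-2 stage, assuming out = ceil(in / 2)."""
    shapes = []
    h, w = height, width
    for _ in range(num_stages):
        h, w = (h + 1) // 2, (w + 1) // 2
        shapes.append((h, w))
    return shapes

def checkerboard_mask(height, width):
    """Two interleaved position sets: parity 0 and parity 1 of (row + column)."""
    return [[(y + x) % 2 for x in range(width)] for y in range(height)]

def channel_groups(num_channels, groups=4):
    """Split channel indices into contiguous, equally sized groups (assumes divisibility)."""
    size = num_channels // groups
    return [list(range(g * size, (g + 1) * size)) for g in range(groups)]

# A 1080p frame passes through the four stages at 1/2, 1/4, 1/8 and 1/16 scale.
print(latent_shapes(1080, 1920))
print(checkerboard_mask(2, 4))
print(channel_groups(8))
```

All positions sharing a checkerboard parity, and all channels within one group, can be processed together, which is the source of the parallelism the paragraph describes.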
STAC addresses three specific gaps relative to existing methods: (i) all prior methods (DVC, DCVC-DC, DCVC-FM, DCMVC) use fixed reference selection strategies, and STAC is the first to propose learned content-dependent reference selection via ACS; (ii) VCT’s patch-based attention creates non-uniform receptive fields, and STAC’s ESWA provides uniform receptive fields with learned local bias and temporal gating; and (iii) all prior methods use single-path entropy estimation, and STAC is the first to propose dual-path fusion with adaptive bit-cost conditioned gating.
We evaluate STAC comprehensively across six standard benchmark datasets: UVG [38], MCL-JCV [39], and HEVC Classes B, C, D, and E, under both YUV 4:2:0 and RGB colourspaces, and with two intra-period configurations (IP = −1 and IP = 32). STAC achieves an average BD-rate saving of 32.20% over VTM under YUV IP = −1, outperforming the prior state-of-the-art DCMVC by 2.70 percentage points. Under the more challenging IP = 32 configuration, STAC achieves −27.01%, with only 5.19 pp degradation, compared to 6.42 pp for DCMVC, confirming the robustness of ESWA’s adaptive context windowing to temporal discontinuities introduced by periodic I-frame insertion. Performance generalises to the RGB colourspace (−31.23%, IP = −1) and scales consistently from low-resolution (HEVC D, 240p: −35.19%) to ultra-high-resolution content (UVG, 4K: −36.35%).
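For readers unfamiliar with the metric, BD-rate compares two rate-distortion curves by fitting log-bitrate as a cubic function of quality and averaging the gap over the shared quality range. The sketch below assumes exactly four rate points per curve, so the cubic fit reduces to Lagrange interpolation and Simpson's rule integrates it exactly. The rate-distortion values are hypothetical and the helper names are illustrative; this is not the evaluation script used in the paper.

```python
import math

def _cubic_through(points):
    """Return the Lagrange cubic through four (x, y) points as a callable."""
    def evaluate(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return evaluate

def _integrate_cubic(f, lo, hi):
    """Single-panel Simpson's rule, exact for polynomials up to degree three."""
    mid = 0.5 * (lo + hi)
    return (hi - lo) / 6.0 * (f(lo) + 4.0 * f(mid) + f(hi))

def bd_rate(anchor, test):
    """BD-rate in percent from four (bitrate, psnr) pairs per curve; negative means savings."""
    fit_anchor = _cubic_through([(q, math.log(r)) for r, q in anchor])
    fit_test = _cubic_through([(q, math.log(r)) for r, q in test])
    # Integrate only over the PSNR interval covered by both curves.
    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    gap = _integrate_cubic(fit_test, lo, hi) - _integrate_cubic(fit_anchor, lo, hi)
    return (math.exp(gap / (hi - lo)) - 1.0) * 100.0

# Hypothetical anchor curve, and a test codec needing 30% fewer bits at every quality.
anchor = [(1000.0, 32.0), (2000.0, 35.0), (4000.0, 38.0), (8000.0, 41.0)]
test = [(0.7 * rate, psnr) for rate, psnr in anchor]
print(f"BD-rate: {bd_rate(anchor, test):.2f}%")
```

With the test curve sitting at 70% of the anchor's bitrate for every PSNR value, the result is approximately −30%, matching the sign convention in which negative BD-rate values denote bitrate savings.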
The remainder of this paper is organised as follows. Section 2 provides a detailed review of related work spanning neural image and video compression, conditional coding, transformer-based approaches, implicit neural representations, and perceptual quality methods. Section 3 presents the proposed STAC framework, including the system architecture, the Adaptive Context Selector, the ESWA mechanism, the dual-path entropy model, the neural arithmetic encoding pipeline, and the training methodology. Section 4 details the experimental setup, benchmark datasets, evaluation metrics, and comprehensive comparisons with state-of-the-art methods. Section 5 provides a critical discussion of the experimental results, analysing the contributions of individual architectural components and the technical mechanisms underlying the observed performance gains. Finally, Section 6 concludes this paper and identifies directions for future research.

2. Related Work

This section provides a comprehensive review of the research landscape underlying neural video compression. We organise the discussion into six thematic subsections: foundational neural image compression, early predictive neural video codecs, the paradigm shift to conditional coding, transformer-based architectures, implicit neural representations, and perceptual quality with computational efficiency. For each theme, we trace the chronological development of key ideas, identify the technical limitations that motivated subsequent work, and position the contributions of STAC within this broader context. Figure 1 presents a taxonomic overview of the major research directions, and Table 1 provides a structured comparison of representative methods across key architectural and performance dimensions.

2.1. Foundations of Neural Image Compression

The theoretical foundations of learned compression were established well before the recent surge of interest. Neural networks were first applied to data compression in the 1990s [40], but the limited capacity and training methods of that era prevented competitive performance. The modern era of neural compression began with the seminal work of Ballé et al. [7], who proposed the first end-to-end optimised image compression framework based on nonlinear transform coding. Their system employed convolutional neural networks for both the analysis (encoder) and synthesis (decoder) transforms, coupled with a factorised entropy model and a continuous relaxation of quantisation that enabled gradient-based training of the entire pipeline. This work was foundational because it demonstrated that a single, jointly optimised system could outperform the decades-old JPEG standard and approach the performance of JPEG 2000, despite having no hand-engineered coding tools.
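The continuous relaxation of quantisation mentioned above is commonly realised by adding uniform noise during training and rounding at inference. A minimal sketch of that idea follows; the function name and example values are illustrative.

```python
import random

def quantise(latent, training, rng=random):
    """Additive uniform noise in [-0.5, 0.5] while training; hard rounding at inference."""
    if training:
        return [v + rng.uniform(-0.5, 0.5) for v in latent]
    # Python's round() sends exact .5 ties to the nearest even integer.
    return [float(round(v)) for v in latent]

latent = [0.2, 1.7, -2.4]
hard = quantise(latent, training=False)
noisy = quantise(latent, training=True, rng=random.Random(0))
```

The noisy surrogate has the same error range as rounding but remains differentiable with respect to the latent, which is what allows the whole pipeline to be trained by gradient descent.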
A critical subsequent advance was the introduction of hyperprior-based entropy models by Ballé et al. [8]. The hyperprior framework augments the main latent representation with a second, coarser latent variable that captures the spatial statistics of the primary latents. This side information is transmitted to the decoder, enabling the entropy model to adapt its probability estimates to the local content characteristics of each image region. The hyperprior reduces the gap between the factorised model’s assumption of element-wise independence and the actual spatial correlations present in the latent space, yielding substantial bitrate savings. Cheng et al. [9] further advanced this line by replacing the single-Gaussian likelihood with a discretised Gaussian Mixture Model (GMM) and incorporating attention modules into both the transform and the hyperprior networks. Their system achieved the landmark result of matching the rate-distortion performance of VVC intra coding under the PSNR metric, demonstrating for the first time that a learned image codec could compete with the most advanced hand-designed standard. The JPEG AI standardisation effort [10] has since adopted neural compression principles, further validating the maturity and practical viability of these techniques. These foundational image compression results are directly relevant to video coding because the intra-frame coding pathway in any video codec is essentially an image compression problem, and the entropy modelling techniques developed for images, including hyperpriors and GMM likelihoods, form the basis of the probability estimation machinery in neural video codecs.
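In the notation usually adopted for hyperprior models, which is assumed here and not defined in this section, the quantised latents $\hat{y}$ are coded conditionally on the quantised side information $\hat{z}$, with scales $\sigma$ predicted by a hyper-synthesis network $h_s$:

```latex
p_{\hat{y}\mid\hat{z}}(\hat{y}\mid\hat{z})
  = \prod_i \Bigl(\mathcal{N}\bigl(0,\sigma_i^2\bigr)
      * \mathcal{U}\bigl(-\tfrac{1}{2},\tfrac{1}{2}\bigr)\Bigr)(\hat{y}_i),
  \qquad \sigma = h_s(\hat{z}),

R = \mathbb{E}\bigl[-\log_2 p_{\hat{y}\mid\hat{z}}(\hat{y}\mid\hat{z})\bigr]
  + \mathbb{E}\bigl[-\log_2 p_{\hat{z}}(\hat{z})\bigr]
```

The mixture extension replaces the single zero-mean Gaussian with $\sum_{k} \pi_{i,k}\,\mathcal{N}\bigl(\mu_{i,k},\sigma_{i,k}^2\bigr)$, again convolved with the unit uniform density. The second term of $R$ is the cost of transmitting the side information.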

2.2. Early Neural Video Compression: Predictive Coding

The extension of learned compression from still images to video sequences required addressing the fundamental challenge of temporal redundancy. The first end-to-end deep video compressor, DVC, was proposed by Lu et al. [11]. DVC adopted the classical predictive coding structure of traditional codecs: it estimated optical flow between the current frame and a reference frame using a learned flow network, warped the reference to generate a prediction, computed the pixel-domain residual, and compressed both the motion vectors and the residual using separate autoencoder-style networks. All components were jointly trained end-to-end toward a rate-distortion loss, and DVC achieved performance on par with H.265/HEVC under the MS-SSIM metric. Despite this encouraging result, DVC exposed several fundamental limitations of the predictive residual coding paradigm in the neural setting. The reliance on explicit optical flow estimation means that the motion model is limited to translational motion, which cannot adequately represent complex deformations, such as rotation, scaling, or non-rigid motion. Furthermore, the use of simple pixel-domain subtraction to compute residuals is a linear operation that cannot fully exploit the complex, nonlinear statistical dependencies between frames.
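The warp-then-subtract structure described above can be illustrated with a toy example. This sketch uses integer flow and nearest-neighbour sampling with border clamping, which is a deliberate simplification: DVC itself uses a learned flow network and learned compression networks, none of which are modelled here.

```python
def warp(reference, flow):
    """Backward-warp `reference` using integer (dy, dx) flow per pixel, clamped at borders."""
    height, width = len(reference), len(reference[0])
    predicted = []
    for y in range(height):
        row = []
        for x in range(width):
            dy, dx = flow[y][x]
            src_y = min(max(y + dy, 0), height - 1)
            src_x = min(max(x + dx, 0), width - 1)
            row.append(reference[src_y][src_x])
        predicted.append(row)
    return predicted

def residual(current, predicted):
    """Pixel-domain residual: a plain element-wise difference."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current, predicted)]

reference = [[1, 2, 3], [4, 5, 6]]
# The current frame is the reference shifted one pixel to the right.
current = [[1, 1, 2], [4, 4, 5]]
# Each pixel therefore fetches its value from one column to the left.
flow = [[(0, -1)] * 3 for _ in range(2)]

predicted = warp(reference, flow)
leftover = residual(current, predicted)
```

When the motion is a pure translation, as here, the residual vanishes. The subtraction step is exactly the linear operation that the paragraph identifies as a limitation for more complex inter-frame dependencies.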
Subsequent work addressed these limitations from multiple angles. Lu et al. [35] tackled the critical problems of content adaptivity and error propagation by proposing an online encoder updating scheme that adapts the encoder parameters to the statistics of each test sequence, while keeping the decoder fixed to maintain compatibility. This approach effectively reduced the domain gap between training and test data and demonstrated that content-adaptive processing is essential for competitive video compression performance. Agustsson et al. [13] proposed scale-space flow, which generalises optical flow by adding a scale parameter to each flow vector. This allows the network to model uncertainty in the motion field: when the flow estimate is unreliable, the scale parameter increases, effectively blurring the warped reference and allowing the residual coder to compensate. This elegantly handles disocclusions and fast motion without requiring complex inpainting or multi-hypothesis prediction. Guo et al. [41] introduced a motion compensation enhancement network that acts as a post-processing stage after the warping operation, reducing the artifacts in the predicted frame and thereby breaking the error propagation chain that degrades quality across a group of pictures.
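The scale parameter in scale-space flow can be pictured as an index into a stack of progressively blurred copies of the reference. The sketch below shows only that blur-selection step, with spatial displacement omitted; the repeated [1, 2, 1]/4 kernel and linear blending between levels are assumptions chosen for brevity, not the published design.

```python
def blur(image):
    """One pass of a separable [1, 2, 1] / 4 blur with edge replication."""
    def smooth_rows(img):
        out = []
        for row in img:
            n = len(row)
            out.append([
                0.25 * row[max(i - 1, 0)] + 0.5 * row[i] + 0.25 * row[min(i + 1, n - 1)]
                for i in range(n)
            ])
        return out

    def transpose(img):
        return [list(col) for col in zip(*img)]

    return transpose(smooth_rows(transpose(smooth_rows(image))))

def scale_space_volume(reference, levels):
    """Level 0 is the sharp reference; each further level is blurred once more."""
    volume = [[[float(v) for v in row] for row in reference]]
    for _ in range(levels - 1):
        volume.append(blur(volume[-1]))
    return volume

def sample(volume, scale):
    """Blend adjacent blur levels per pixel; needs at least two levels."""
    top = len(volume) - 1
    out = []
    for y, row in enumerate(scale):
        out_row = []
        for x, s in enumerate(row):
            s = min(max(s, 0.0), float(top))
            lower = min(int(s), top - 1)
            frac = s - lower
            out_row.append((1.0 - frac) * volume[lower][y][x] + frac * volume[lower + 1][y][x])
        out.append(out_row)
    return out

reference = [[0, 0, 0], [0, 4, 0], [0, 0, 0]]
volume = scale_space_volume(reference, levels=3)
sharp = sample(volume, [[0.0] * 3 for _ in range(3)])
soft = sample(volume, [[1.0] * 3 for _ in range(3)])
```

A scale of zero reproduces the reference exactly, while larger scales return a smoother prediction, which is how an unreliable flow estimate can be softened instead of producing a sharp but wrong prediction.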
A significant advance came with the relocation of coding operations from the pixel domain to learned feature spaces. Hu et al. [14] proposed FVC, which performed motion estimation, motion compression, and motion compensation entirely in a learned feature space using deformable convolutions [42]. By operating on high-dimensional feature maps, rather than raw pixels, FVC could capture richer inter-frame correlations and compensate for complex motion patterns that defeat simple warping. Furthermore, FVC incorporated a non-local attention mechanism for multi-frame feature fusion, enabling it to aggregate information from multiple previously reconstructed frames. Hu et al. [43] further proposed a coarse-to-fine architecture that performs motion operations at multiple spatial scales, using hyperprior information to guide mode prediction for both motion coding and residual coding. Habibian et al. [15] explored a fundamentally different direction by proposing a 3D autoregressive autoencoder with a discrete latent space, which bypassed explicit motion estimation entirely and instead learned to model spatio-temporal redundancy purely from data through an autoregressive prior. These diverse approaches collectively established the landscape of predictive neural video coding and highlighted a growing recognition that the most substantial gains would come not from refining individual components of the predictive pipeline but from rethinking the inter-frame coding paradigm at a more fundamental level.

2.3. Conditional Coding: From Residuals to Contextual Features

The conditional coding paradigm represents a fundamental shift from the subtract-and-encode philosophy of predictive residual coding toward a more expressive encode-given-context approach. The core insight, first articulated by Li et al. [16] in their deep contextual video compression (DCVC) framework, is that simple pixel-domain subtraction is a suboptimal mechanism for removing inter-frame redundancy. Instead of subtracting a predicted frame from the current frame to produce a residual, DCVC provides a high-dimensional feature-domain context to the encoder and decoder as a conditioning signal. This enables the encoder to adapt its latent representation to the information already available at the decoder, and enables the decoder to leverage rich contextual features when reconstructing the current frame. The result is a more effective removal of temporal redundancy, particularly for high-frequency content and complex textures that are poorly represented by simple difference signals. DCVC demonstrated a substantial improvement over contemporary predictive codecs and initiated a highly productive line of research.
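The advantage of conditioning over subtraction can be stated as an entropy inequality. Writing $x_t$ for the current frame and $\tilde{x}_t$ for its prediction (notation assumed here for illustration):

```latex
H\bigl(x_t - \tilde{x}_t\bigr)
  \;\ge\; H\bigl(x_t - \tilde{x}_t \mid \tilde{x}_t\bigr)
  \;=\; H\bigl(x_t \mid \tilde{x}_t\bigr)
```

The inequality holds because conditioning cannot increase entropy, and the equality holds because, once $\tilde{x}_t$ is known, subtracting it is an invertible map. Residual coding is therefore at best as efficient as an ideal conditional coder given the same prediction.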
Li et al. [17] extended this framework significantly with DCVC-DC, introducing hierarchical quality patterns across frames and group-based offset diversity for richer context mining. The hierarchical quality structure assigns different quality levels to frames within a group of pictures, concentrating more bits on key reference frames to provide higher-quality temporal contexts for dependent frames. The group-based offset diversity employs multiple learned offset groups with cross-group interaction, enabling the context extraction module to capture diverse motion patterns simultaneously. DCVC-DC achieved a landmark 23.5% bitrate saving over previous state-of-the-art NVC methods and, notably, surpassed the under-development next-generation traditional codec ECM in both RGB and YUV colourspaces.
Li et al. [18] subsequently addressed two critical practical limitations with DCVC-FM. The first is variable-rate operation: previous neural video codecs required training separate models for each target bitrate, which is impractical for deployment. DCVC-FM introduced feature modulation with learnable quantisation scalers, enabling a single model to cover an 11.4 dB PSNR range through a carefully designed uniform quantisation parameter sampling mechanism during training. The second limitation is quality degradation under long prediction chains. Previous methods exhibited significant quality drops when using large intra-period settings (e.g., IP = 32 or IP = −1), because errors accumulate over many frames of temporal prediction without the periodic reset provided by I-frames. DCVC-FM proposed a periodically refreshing temporal feature modulation mechanism that mitigates this degradation, achieving a 29.7% bitrate saving over the previous state of the art, with a 16% MAC reduction. Most recently, Tang et al. [19] proposed DCMVC, which introduces flow-oriented context generation and context compensation to more effectively modulate the propagated temporal context, achieving an average 22.7% BD-rate reduction over VVC.
In parallel, Sheng et al. [20] proposed temporal context mining, which propagates not only the reconstructed frame but also the feature representation before reconstruction, providing richer temporal information for subsequent frame coding. This dual propagation strategy enables the model to exploit both pixel-level and feature-level temporal correlations. Qi et al. [21] introduced bidirectional information flow between motion coding and frame coding through motion information propagation. Conventionally, motion coding provides motion vectors to frame coding in a unidirectional manner. Qi et al. demonstrated that feeding information from frame coding back to motion coding creates a beneficial cycle that strengthens the exploitation of long-range temporal correlations, achieving 12.9% bitrate savings over prior methods. Jiang et al. [22] recently proposed ECVC, which exploits non-local correlations across multiple frames for contextual compression and introduces a partial cascaded fine-tuning strategy that supports training on full-length sequences with constrained computational resources, reducing the train–test mismatch in sequence lengths.
Alongside these temporal modelling advances, entropy model accuracy has emerged as a critical determinant of overall coding efficiency. Li et al. [23] introduced hybrid spatial–temporal entropy modelling, which captures both intra-frame spatial correlations and inter-frame temporal dependencies within a unified probability estimation framework. Their model also incorporated content-adaptive quantisation with spatially varying step sizes, enabling dynamic bit allocation that concentrates bits on visually important regions. This codec achieved an 18.2% BD-rate saving over VVC on the UVG dataset, establishing a landmark in neural video compression. Qian et al. [24] proposed Entroformer, a transformer-based entropy model for learned image compression that employs top-K self-attention and diamond relative position encoding. Entroformer demonstrated that transformer architectures can overcome the inherently limited receptive field of convolutional entropy models and capture long-range spatial dependencies more effectively, motivating the application of transformers to video entropy modelling. The research trajectory from hybrid entropy models to transformer-based probability estimation provided direct motivation for our STAC entropy model, which extends these ideas into the spatio-temporal domain with Enhanced Sliding Window Attention and dual-path fusion.

2.4. Transformer-Based Video Compression

The transformer architecture, originally developed for sequence-to-sequence modelling in natural language processing [25,26], has fundamentally transformed computer vision since the introduction of the Vision Transformer (ViT) by Dosovitskiy et al. [27]. The core mechanism of self-attention enables transformers to model dependencies between arbitrary positions in a sequence with $O(N^2)$ complexity, making them inherently well suited for capturing long-range spatial and temporal correlations. In computer vision, hierarchical variants, such as the Swin Transformer [28], which employs shifted-window local attention with cross-window connectivity, have demonstrated that transformers can efficiently process high-resolution visual data while maintaining linear complexity with respect to image size. Multi-scale Vision Transformers (MViT) [29] and Video Transformer Networks (VTNs) [30] have further shown that transformers can effectively model temporal dependencies in video understanding tasks. These developments in vision transformers have created a strong foundation for their application to video compression, where the ability to model long-range spatio-temporal correlations is precisely what is needed for accurate entropy modelling.
Mentzer et al. [31] proposed the Video Compression Transformer (VCT), a landmark work that demonstrated that transformers can vastly simplify neural video compression. VCT independently maps each input frame to a latent representation using a convolutional autoencoder, then uses a transformer to model the dependencies among frame representations and predict the probability distribution of future latents given past ones. This eliminates the need for explicit motion estimation, optical flow networks, and warping operations, resulting in a cleaner and more unified architecture. However, VCT’s patch-based processing introduces significant architectural limitations. Each frame is divided into non-overlapping patches, and the transformer operates over the concatenated temporal sequence of patches. This creates non-uniform receptive fields at patch boundaries and necessitates computationally redundant overlapping windows for temporal autoregressive modelling.
Kopte and Kaup [32] directly addressed these shortcomings by proposing 3D Sliding Window Attention (SWA), a patchless form of local attention that provides a uniform receptive field across the entire latent space. By enabling a decoder-only architecture that unifies spatial and temporal context processing, SWA achieved BD-rate savings of up to 18.6% over VCT while simultaneously reducing overall decoder complexity by a factor of 2.8 and entropy model complexity by nearly 3.5×. Notably, Kopte and Kaup also showed that while the model benefits from long-range temporal context, excessive context can degrade performance, highlighting the importance of adaptive context selection, a finding that directly motivates our Adaptive Context Selector (ACS). In a complementary direction, Zhu et al. [33] demonstrated that Swin Transformer-based nonlinear transforms can outperform convolutional transforms in terms of compression efficiency while requiring fewer parameters and shorter decoding times, achieving a 3.68% BD-rate improvement over VTM on Kodak and 12.35% on UVG in P-frame coding. Yang et al. [34] provided an important theoretical perspective by viewing neural video codecs through the lens of deep generative modelling, proposing improved temporal autoregressive transforms and structured entropy models with temporal dependencies that yielded state-of-the-art compression performance. The concept of parallel sparse memory for efficient spatio-temporal processing, as explored by Dang et al. [44], in the context of video object segmentation, shares architectural similarities with our ESWA mechanism’s sliding window approach to local attention with temporal gating, and its temporo-spatial parallelism connects to our checkerboard-based parallel decoding strategy.
Our proposed STAC framework builds directly upon these transformer-based foundations while addressing their remaining limitations. Specifically, STAC combines the patchless sliding window approach of Kopte and Kaup with two novel extensions: (i) Enhanced Sliding Window Attention (ESWA) that incorporates learnable local biases and temporal gating for adaptive receptive field control, and (ii) the Adaptive Context Selector (ACS) that dynamically chooses the most informative reference frames, rather than relying on a fixed temporal window. These innovations enable STAC to achieve superior rate-distortion performance while maintaining the computational efficiency of local attention mechanisms.

2.5. Implicit Neural Representations

Implicit neural representations (INRs) offer a fundamentally different approach to video compression by encoding the entire video as the weights of a neural network. Rather than the traditional encode–decode pipeline, INR-based methods overfit a neural network to the video content during encoding, and the decoding process is simply a forward pass through the network. Chen et al. [45] proposed NeRV, which takes frame indices as input and outputs the corresponding RGB frames. NeRV demonstrated that image-wise implicit representations are substantially more efficient than pixel-wise alternatives (e.g., NeRF-based methods), improving encoding speed by 25–70× and decoding speed by 38–132×, and showed that standard neural network compression techniques (pruning, quantisation, entropy coding of weights) can serve as a proxy for video compression.
Chen et al. [46] introduced HNeRV, which addresses a key limitation of NeRV: its content-agnostic input embeddings. HNeRV replaces fixed positional embeddings with a learnable encoder that generates content-adaptive embeddings, and redesigns the network architecture to ensure balanced parameter distribution across layers, allowing higher layers to store more high-resolution detail. HNeRV outperformed NeRV by 4.7 dB in reconstruction PSNR and converged 16× faster. Kwan et al. [47] proposed HiNeRV, which integrates lightweight depth-wise convolutional layers with hierarchical positional encodings in a deep and wide architecture capable of representing videos at both frame and patch granularity. Through a refined pipeline of training, structured pruning, and quantisation-aware fine-tuning, HiNeRV achieved 72.3% bitrate savings over HNeRV and 43.4% over DCVC on the UVG dataset, demonstrating that INR-based methods can approach the performance of state-of-the-art autoencoder-based codecs.
While INR-based approaches offer attractive properties, most notably their extremely fast decoding (a single feed-forward pass) and their natural support for random access and temporal super-resolution, they currently face several limitations that restrict their competitiveness with the best conditional coding methods. The encoding process requires overfitting a separate network to each video, which is computationally expensive and does not amortise across content. Furthermore, the achievable compression ratio is fundamentally bounded by the compressibility of the network weights, which may not scale as favourably as latent variable models for complex, high-resolution content. Our STAC framework takes the alternative autoencoder-based approach, which benefits from amortised encoding and more mature entropy modelling techniques, while incorporating transformer-based temporal modelling that addresses the long-range dependency challenges that INR methods handle implicitly through their global network structure.

2.6. Perceptual Quality Optimisation and Computational Efficiency

While most neural video compression research optimises for pixel-fidelity metrics, such as PSNR or MS-SSIM, a parallel line of work has focused on perceptual quality optimisation and computational efficiency, two dimensions that are critical for practical deployment. Mentzer et al. [48] introduced the first GAN-based neural video compression system, which uses adversarial training to synthesise realistic high-frequency detail that would otherwise be lost at low bitrates. Their approach conditions the generator on a latent extracted from the warped previous reconstruction, enabling it to both synthesise and propagate texture detail across frames. User studies demonstrated significantly superior visual quality compared to both traditional and neural codecs at equivalent bitrates, highlighting the importance of perceptual loss functions for subjective quality. Ghouse et al. [49] proposed DIRAC, which applies diffusion probabilistic models to codec augmentation, enabling smooth traversal of the rate-distortion-perception trade-off at test time by varying the number of diffusion steps. Zhu et al. [50] combined CNN-based spatial saliency detection with motion-vector-based temporal saliency to guide bit allocation in HEVC, demonstrating that perception-guided compression can improve subjective quality without increasing the bitrate.
Computational efficiency is equally important for practical adoption. Rippel et al. [51] proposed ELF-VC, which achieves competitive rate-distortion performance against both traditional standards (H.264, H.265, AV1) and neural codecs while running at least 5× faster than other learned methods. Hu and Xu [52] introduced complexity-guided slimmable decoders with skip-adaptive entropy coding, supporting multiple complexity levels within a single model to accommodate diverse deployment scenarios from edge devices to cloud servers. Chen et al. [53] proposed group-aware parameter-efficient updating for content-adaptive neural video compression, which segments videos into patch-based groups of pictures and integrates lightweight adapters into the encoding components, substantially reducing the computational cost of content adaptation during encoding. Afonso et al. [54] proposed ViSTRA, a framework that dynamically adapts the spatial and temporal resolution of the input video during encoding and employs a CNN-based super-resolution model for upsampling at the decoder, achieving 15% BD-rate gains when integrated with HEVC. Chen et al. [37] proposed a spatio-temporal adaptive compression scheme that intelligently adjusts spatial resolution and temporal frame rate for content-adaptive coding, demonstrating that resolution adaptation can be effectively learned end-to-end within a deep video coding framework. Van Thang and Van Bang [55] proposed hierarchical random access coding that exploits bidirectional temporal redundancy through video frame interpolation, improving coding efficiency by approximately 50% over the base deep neural model on the UVG dataset.
Our proposed STAC framework primarily targets rate-distortion performance under PSNR and MS-SSIM metrics, but its architectural design incorporates several efficiency-oriented elements. The Sliding Window Attention mechanism in ESWA reduces the complexity from the quadratic $O(N^2)$ of full self-attention to $O(N \cdot w_s^2 \cdot w_t)$, where $w_s$ and $w_t$ denote the spatial and temporal window sizes, making the transformer tractable for high-resolution video. The parallel checkerboard decoding and channel-group parallelisation achieve approximately 8× effective decoding speedup. The Adaptive Context Selector dynamically limits the number of reference frames processed, avoiding the computational waste of attending to irrelevant temporal contexts. These design choices reflect the recognition that practical neural video codecs must balance compression performance with computational tractability.
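To make the complexity reduction concrete, the following minimal sketch counts query-key pairs for full attention and for windowed attention. The latent grid (3 frames of 68 × 120 positions) and the window sizes ($w_s = 7$, $w_t = 3$) are illustrative assumptions, not values reported in this paper.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs evaluated by full self-attention: N^2."""
    return n_tokens * n_tokens


def windowed_attention_pairs(n_tokens: int, w_s: int, w_t: int) -> int:
    """Query-key pairs for a local window: N * w_s^2 * w_t."""
    return n_tokens * w_s * w_s * w_t


# Hypothetical latent grid: 3 frames, each 68 x 120 positions.
n = 3 * 68 * 120
full = full_attention_pairs(n)
local = windowed_attention_pairs(n, w_s=7, w_t=3)  # assumed window sizes
ratio = full / local
print(f"tokens={n}, full={full}, windowed={local}, reduction={ratio:.1f}x")
```

The reduction factor equals $N / (w_s^2 \cdot w_t)$, so it grows linearly with the number of latent positions while the window cost per token stays constant.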

2.7. Summary and Positioning of STAC

Table 1 synthesises the key characteristics of the major neural video compression methods discussed in this section, comparing them along several dimensions: the underlying coding paradigm, temporal modelling mechanism, entropy model architecture, key innovations, and reported BD-rate performance against VTM where available. Several trends emerge from this comparison. First, the field has progressively moved from predictive (residual-based) coding toward conditional (context-based) coding, with each generation of conditional codecs achieving substantial gains over its predecessor. Second, entropy modelling has evolved from simple factorised priors through spatial hyperpriors to hybrid spatial–temporal models and, most recently, transformer-based architectures. Third, the importance of adaptive mechanisms, whether for reference frame selection, quantisation, or bit allocation, has become increasingly apparent, as fixed strategies cannot accommodate the wide diversity of content and motion characteristics encountered in natural video.
STAC is positioned at the convergence of these trends. It inherits the conditional coding paradigm from the DCVC lineage, the transformer-based entropy modelling approach pioneered by VCT and refined by SWA, and the principle of content adaptivity championed throughout the field. Its three principal innovations, the Adaptive Context Selector, Enhanced Sliding Window Attention, and dual-path entropy model with Gaussian mixture estimation, address specific remaining bottlenecks: the inability of fixed reference sets to capture content-dependent temporal correlations, the architectural limitations of patch-based transformer attention, and the suboptimality of single-path probability estimation. As demonstrated in Section 4 and Section 5, these innovations yield consistent improvements across diverse video content, colourspaces, and intra-period configurations.
Table 1 reveals three specific architectural gaps that no existing method addresses simultaneously: (i) all methods use fixed reference selection; (ii) the only transformer-based codec (VCT) suffers from non-uniform receptive fields that SWA partially addresses but without adaptive gating or learned position biases; and (iii) all methods use single-path entropy estimation with single-Gaussian likelihoods. STAC is designed to address all three gaps through ACS, ESWA, and the dual-path GMM entropy model, respectively. The contribution therefore lies in filling three precisely identified gaps, rather than in merely combining existing ideas.

3. Proposed Methodology

This section presents the proposed STAC framework for neural video compression. We first provide an overview of the system architecture, followed by detailed descriptions of each component, the complete data flow for encoding and decoding, and the training methodology.

3.1. System Overview

The proposed framework follows an encoder–decoder architecture with a transformer-based entropy model at its core. As illustrated in Figure 2, the encoder transforms input video frames into compact latent representations, which are then quantised and entropy-coded using predicted probability distributions. The decoder reverses this process to reconstruct the video frames. The key innovation lies in our adaptive temporal context selection mechanism and the Enhanced Sliding Window Attention for efficient spatio-temporal modelling.

3.2. Multi-Rate Feature Transform

The feature transform module converts input RGB frames into hierarchical latent representations at multiple spatial scales [13]. Given an input frame $x \in \mathbb{R}^{3 \times H \times W}$, we apply a series of strided convolutions to progressively downsample the spatial resolution while extracting semantic features:
$$L_i = \mathrm{GDN}\left(\mathrm{Conv}_{3 \times 3,\, s=2}(L_{i-1})\right), \quad i \in \{1, 2, 3, 4\},$$
where $L_0 = x$ and GDN denotes the Generalised Divisive Normalisation activation function. The resulting latent representations have the following dimensions:
  • $L_1 \in \mathbb{R}^{192 \times H/2 \times W/2}$ (1/2× scale);
  • $L_2 \in \mathbb{R}^{192 \times H/4 \times W/4}$ (1/4× scale);
  • $L_3 \in \mathbb{R}^{192 \times H/8 \times W/8}$ (1/8× scale);
  • $L_4 \in \mathbb{R}^{192 \times H/16 \times W/16}$ (1/16× scale).
The coarsest-scale latent $L_4$ (1/16× resolution) is used as the primary representation for entropy coding, while the finer scales can be employed for hierarchical prediction. Each convolutional layer uses a kernel size of 3 × 3 with a stride of 2 and 192 output channels, followed by GDN activation for improved rate-distortion optimisation.
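The latent dimensions listed above can be checked with a short shape calculation. This is a minimal sketch that assumes a padding of 1 in each 3 × 3 convolution, which is not stated in the text; under that assumption each layer produces ceil(size / 2) outputs, so the H/16 and W/16 figures are exact only when the input size is divisible by 16.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of one strided convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def latent_shapes(height: int, width: int, channels: int = 192, levels: int = 4):
    """Return the (C, H, W) shape of L_1 ... L_levels for an input frame."""
    shapes = []
    h, w = height, width
    for _ in range(levels):
        h, w = conv_out(h), conv_out(w)
        shapes.append((channels, h, w))
    return shapes


for level, shape in enumerate(latent_shapes(256, 256), start=1):
    print(f"L_{level}: {shape}")
```

For a 256 × 256 training crop, $L_4$ has 16 × 16 spatial positions; for a 1080p frame the same calculation rounds 1080/16 = 67.5 up to 68 rows.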

3.3. Adaptive Context Selector

Unlike existing methods that rely on fixed temporal context from immediately preceding frames, we propose an Adaptive Context Selector (ACS) that dynamically identifies the most relevant reference frames for encoding each target frame. This is particularly beneficial for scenes with complex motion patterns, scene changes, or periodic content where non-adjacent frames may provide better predictive information.
The ACS operates on the quantised latent representations. Given the current frame’s latent $y_t$ and a set of past frame latents $\{y_{t-1}, y_{t-2}, \ldots, y_{t-N}\}$, we compute relevance scores through a lightweight neural network:
$$s_i = \sigma\left(\mathrm{FC}_3\left(\mathrm{ReLU}\left(\mathrm{FC}_2\left(\mathrm{ReLU}\left(\mathrm{FC}_1([y_t; y_{t-i}])\right)\right)\right)\right)\right),$$
where $[\cdot\,;\cdot]$ denotes concatenation, $\sigma$ is the sigmoid function, and the fully-connected layers have dimensions $384 \to 256 \to 128 \to 1$ with ReLU activations and dropout (rate 0.1) for regularisation. The relevance score $s_i \in [0, 1]$ indicates the usefulness of frame $t-i$ for predicting frame $t$.
We select the top-K frames with the highest relevance scores, where $K = \min(5, |\{i : s_i > \tau\}|)$ and $\tau$ is a learned threshold. This adaptive selection ensures computational efficiency while maintaining prediction accuracy.
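The selection rule can be sketched as follows. This is an illustrative sketch of the thresholded top-K step only: the scores are supplied directly rather than produced by the scoring network, and the function name, the example scores, and the tie-breaking rule (lower buffer index first) are assumptions.

```python
def select_references(scores, tau, max_refs=5):
    """Pick K = min(max_refs, #{i : s_i > tau}) frames with the highest scores.

    `scores[i]` is the relevance score of the i-th buffered frame.
    Returns buffer indices ordered from most to least relevant.
    """
    candidates = [(s, i) for i, s in enumerate(scores) if s > tau]
    k = min(max_refs, len(candidates))
    # Sort by descending score; break ties by lower index (an assumption).
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in candidates[:k]]


# Hypothetical relevance scores for eight buffered frames.
example_scores = [0.91, 0.12, 0.55, 0.78, 0.40, 0.66, 0.83, 0.95]
print(select_references(example_scores, tau=0.5))
print(select_references(example_scores, tau=0.9))
```

With a low threshold the cap of five references is the binding constraint; with a high threshold fewer frames pass, so the number of references processed shrinks with the content.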

3.4. STAC Entropy Model

The STAC entropy model is the core component responsible for accurate probability estimation of the quantised latents. It comprises an initial processing stage, a stack of transformer blocks with ESWA, and distribution prediction heads.

3.4.1. Initial Processing

The quantised latent representation is first augmented with positional embeddings and normalised:
$$Z_0 = \mathrm{LayerNorm}(y + P),$$
where P represents learnable 3D positional embeddings encoding both the spatial location and temporal position within the group of pictures (GOP).

3.4.2. Enhanced Sliding Window Attention (ESWA)

Standard self-attention has quadratic complexity with respect to sequence length, making it impractical for high-resolution video. We propose ESWA, which restricts attention to local windows while incorporating learned gating mechanisms for adaptive receptive field control.
Given input tokens $X \in \mathbb{R}^{N \times C}$, where $N = L \times H \times W$ (with $L$ being the temporal length), standard attention computes:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V.$$
Our ESWA extends this with local bias and gating:
$$\mathrm{ESWA}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}} \odot M + B \odot M + G\right) V,$$
where $d_k$ denotes the dimensionality of the key vectors, $\odot$ denotes element-wise multiplication, $M$ is the sliding window mask, $B$ is a learnable local bias matrix capturing relative position preferences, and $G$ is a gating matrix with learnable decay that modulates attention based on spatio-temporal distance. The gating mechanism allows the model to adaptively balance between local and extended context based on content characteristics.
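The interaction of window mask, local bias, and distance gate can be illustrated on a one-dimensional token sequence. The sketch below rests on several assumptions that the text does not specify: out-of-window positions are excluded from the softmax (set to minus infinity, a common convention for window masks), the bias is indexed by token distance, and the gate takes the form of a linear penalty, minus decay times distance.

```python
import math


def eswa_weights(logits, window, bias, decay):
    """Toy 1D windowed attention weights.

    logits: N x N list holding the scaled scores QK^T / sqrt(d_k).
    window: maximum token distance that may be attended to.
    bias:   list of length window + 1, one bias per distance (assumed form of B).
    decay:  non-negative scalar; the gate is -decay * distance (assumed form of G).
    """
    n = len(logits)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            dist = abs(i - j)
            if dist > window:
                row.append(float("-inf"))  # outside the sliding window
            else:
                row.append(logits[i][j] + bias[dist] - decay * dist)
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]  # numerically stable softmax
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


zero_logits = [[0.0] * 5 for _ in range(5)]
toy = eswa_weights(zero_logits, window=1, bias=[0.0, 0.0], decay=1.0)
print([round(v, 3) for v in toy[2]])
```

With uniform scores, a positive decay shifts probability mass toward the nearest positions, and a decay of zero recovers plain windowed attention; in the model this decay is learned.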

3.4.3. Transformer Block Architecture

The STAC entropy model consists of 20 transformer blocks (shown in Figure 3), each following a pre-normalisation structure:
$$Z'_l = Z_{l-1} + \mathrm{ESWA}(\mathrm{LayerNorm}(Z_{l-1})),$$
$$Z_l = Z'_l + \mathrm{MLP}(\mathrm{LayerNorm}(Z'_l)),$$
where $Z'_l$ is the intermediate output of the attention sub-layer and $Z_l$ is the output of the $l$-th block.

3.4.4. Distribution Prediction Heads

The final transformer output $Z_L$ is processed by three separate prediction heads:
Mean Predictor: A linear layer predicting the mean $\mu$ of the Gaussian distribution:
$$\mu = W_\mu Z_L + b_\mu.$$
Scale Predictor: A linear layer with softplus activation predicting the scale $\sigma$:
$$\sigma = \mathrm{Softplus}(W_\sigma Z_L + b_\sigma),$$
where $W_\mu$ and $W_\sigma$ are the learnable weight matrices, and $b_\mu$ and $b_\sigma$ are the corresponding learnable bias vectors of the mean and scale prediction heads, respectively.
Latent Residual Predictor (LRP): A small CNN predicting refinement offsets $\Delta$:
$$\Delta = \mathrm{CNN}_{\mathrm{LRP}}(Z_L).$$
For a single component, the probability distribution for entropy coding is modelled as a Gaussian:
$$P(y \mid \text{context}) = \mathcal{N}(y; \mu, \sigma^2);$$
in the full formulation, it is modelled as a Gaussian Mixture Model (GMM) with $K = 3$ components.

3.5. Dual-Path Entropy Modelling

To capture both channel-wise and spatial correlations effectively, we employ a dual-path entropy modelling strategy that fuses predictions from two complementary pathways.

3.5.1. Channel-Wise Autoregressive Path

For each spatial position $(h, w)$, we predict the distribution of channel $c$ conditioned on:
  • The same spatial position in selected reference frames;
  • Previously decoded channels $[0, \ldots, c-1]$ at the current position;
  • Already decoded spatial neighbours.

3.5.2. Spatio-Temporal Path

This path uses ESWA to predict a joint distribution over all channels simultaneously, capturing cross-channel correlations that the autoregressive path may miss.

3.5.3. Adaptive Fusion

The two paths are combined via a learned gating mechanism:
$$\alpha = \sigma\left(W_g \cdot [f_{\text{channel}}; f_{\text{spatial}}; e]\right),$$
where $f_{\text{channel}}$ and $f_{\text{spatial}}$ are features from the respective paths, and $e$ represents the estimated bit cost of each path. The final distribution is:
$$P_{\text{final}} = \alpha \cdot P_{\text{channel}} + (1 - \alpha) \cdot P_{\text{spatial}}.$$
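The fusion step is a convex combination of two probability mass functions, so the result is again a valid distribution. The sketch below illustrates this with a scalar gate; the gate logit stands in for the output of $W_g$ applied to the concatenated features, and the two example distributions are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse(p_channel, p_spatial, gate_logit):
    """Blend two PMFs over the same symbols with alpha = sigmoid(gate_logit)."""
    alpha = sigmoid(gate_logit)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_channel, p_spatial)]


# Hypothetical per-symbol probabilities from the two paths.
p_channel = [0.70, 0.20, 0.10]
p_spatial = [0.50, 0.30, 0.20]
p_final = fuse(p_channel, p_spatial, gate_logit=2.0)  # gate favours the channel path
print([round(p, 4) for p in p_final])
```

A large positive gate logit reproduces the channel-wise prediction, a large negative one reproduces the spatio-temporal prediction, and a logit of zero averages the two.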

3.6. Neural Arithmetic Encoding

We propose a fully neural network-based arithmetic encoding scheme, shown in Figure 4, that leverages learned probability distributions for near-optimal compression efficiency.

3.6.1. Neural Probability Estimation

Given quantised latent representations $y \in \mathbb{Z}^{C \times H \times W}$ and predicted probability distribution $P(y \mid \text{context})$ from our STAC entropy model, the encoder compresses $y$ into a compact bitstream through three stages: neural probability estimation, CDF construction, and Asymmetric Numeral Systems (ANS) encoding.
Unlike traditional entropy coders with fixed probability tables, our approach learns content-adaptive distributions directly from data [5,7]. Following the hyperprior framework [8] with Gaussian mixture likelihoods [9], we model each latent element’s probability as:
$$P(y_i \mid \psi_i) = \sum_{k=1}^{K} \pi_{i,k} \left[ \Phi\left(\frac{y_i + 0.5 - \mu_{i,k}}{\sigma_{i,k}}\right) - \Phi\left(\frac{y_i - 0.5 - \mu_{i,k}}{\sigma_{i,k}}\right) \right],$$
where $\Phi(\cdot)$ is the standard Gaussian CDF, $K$ is the number of mixture components, and $\psi_i = \{\pi_{i,k}, \mu_{i,k}, \sigma_{i,k}\}_{k=1}^{K}$ are parameters predicted by the STAC entropy model.
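The discretised mixture likelihood can be evaluated directly from its definition. The following minimal sketch uses the error function for $\Phi$; the three-component parameters are illustrative values, not parameters from the trained model.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gmm_pmf(y: int, weights, means, scales) -> float:
    """Probability mass of integer symbol y under a discretised Gaussian mixture."""
    total = 0.0
    for pi, mu, sigma in zip(weights, means, scales):
        upper = std_normal_cdf((y + 0.5 - mu) / sigma)
        lower = std_normal_cdf((y - 0.5 - mu) / sigma)
        total += pi * (upper - lower)
    return total


# Hypothetical mixture parameters for one latent element (K = 3).
weights, means, scales = [0.5, 0.3, 0.2], [0.0, 2.0, -3.0], [1.0, 0.5, 2.0]
mass = sum(gmm_pmf(y, weights, means, scales) for y in range(-60, 61))
bits_for_zero = -math.log2(gmm_pmf(0, weights, means, scales))
print(f"total mass={mass:.6f}, cost of y=0: {bits_for_zero:.3f} bits")
```

Because each unit-width bin is the difference of two CDF values, the masses telescope and sum to one over the integers, which is what allows the same parameters to drive both rate estimation and the entropy coder.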

3.6.2. Hierarchical Context Modelling

Our entropy model employs hierarchical context combining hyperprior, temporal, and spatial information [20,23]:
$$P(y_t) = \prod_i P\Big(y_{t,i} \,\Big|\, \underbrace{z_t}_{\text{hyperprior}},\ \underbrace{\hat{y}_{t-1:t-K}}_{\text{temporal}},\ \underbrace{y_{t,<i}}_{\text{spatial}}\Big).$$
The hyperprior $z_t$ captures global statistics transmitted as side information [8]. Temporal context from previously decoded frames exploits inter-frame redundancy [17,18], while spatial context from already-decoded elements captures local correlations [24].

3.6.3. ANS Encoding

For arithmetic coding, we convert the PMF to a cumulative distribution function:
$$F(n \mid \psi_i) = \sum_{k=1}^{K} \pi_{i,k} \cdot \Phi\left(\frac{n + 0.5 - \mu_{i,k}}{\sigma_{i,k}}\right),$$
where $n$ is the integer symbol index for the quantised latent element being encoded.
We employ ANS as the entropy coding engine for its near-optimal compression and computational efficiency [5,7]. The encoding process iterates through latent elements, computing symbol probabilities from the CDF, performing state renormalisation when thresholds are exceeded, and updating the ANS state. Decoding reverses these operations using the identical probability model.
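The core ANS state update can be sketched in a few lines. This is a simplified range-ANS variant, not the coder used in this work: it relies on Python's arbitrary-precision integers and therefore omits the state renormalisation described above, and it takes a hypothetical integer frequency table as input, whereas in practice the frequencies would be obtained by quantising the CDF $F(n \mid \psi_i)$.

```python
def build_cumulative(freqs):
    """Cumulative frequency table: cum[s] is the start of symbol s's slot range."""
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    return cum


def rans_encode(symbols, freqs):
    """Fold a symbol sequence into a single integer state (no renormalisation)."""
    cum = build_cumulative(freqs)
    total = cum[-1]
    state = 1
    for s in reversed(symbols):  # reverse order so decoding yields forward order
        f = freqs[s]
        state = (state // f) * total + cum[s] + (state % f)
    return state


def rans_decode(state, n_symbols, freqs):
    """Invert rans_encode with the identical frequency table."""
    cum = build_cumulative(freqs)
    total = cum[-1]
    decoded = []
    for _ in range(n_symbols):
        slot = state % total
        s = next(i for i in range(len(freqs)) if cum[i] <= slot < cum[i + 1])
        decoded.append(s)
        state = freqs[s] * (state // total) + slot - cum[s]
    return decoded


freqs = [8, 4, 3, 1]  # hypothetical integer frequencies for four symbols
message = [0, 1, 0, 2, 0, 0, 3, 1]
code = rans_encode(message, freqs)
print(code.bit_length(), rans_decode(code, len(message), freqs))
```

Frequent symbols grow the state by a smaller factor than rare ones, which is how the coder approaches the entropy of the model; decoding succeeds only if encoder and decoder use exactly the same table.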

3.6.4. Parallel Entropy Coding

To accelerate processing, we adopt parallel coding strategies [17,23]. Checkerboard parallelisation decomposes spatial dimensions into two independent sets:
$$S_{\text{even}} = \{(h, w) : (h + w) \bmod 2 = 0\},$$
$$S_{\text{odd}} = \{(h, w) : (h + w) \bmod 2 = 1\}.$$
Encoding proceeds in two passes: first $S_{\text{even}}$ using hyperprior and temporal context, then $S_{\text{odd}}$ with additional spatial context from decoded $S_{\text{even}}$, achieving approximately 2× speedup [18]. Additionally, channel dimensions are partitioned into $G$ independent groups for parallel processing.
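The two position sets can be enumerated as follows. The sketch also checks the property that makes the second pass useful: every 4-connected neighbour of an odd position lies in the even set, so it is already decoded when the odd pass runs. The grid size is an arbitrary example.

```python
def checkerboard_sets(height: int, width: int):
    """Split an H x W grid into even and odd positions by (h + w) mod 2."""
    even = [(h, w) for h in range(height) for w in range(width) if (h + w) % 2 == 0]
    odd = [(h, w) for h in range(height) for w in range(width) if (h + w) % 2 == 1]
    return even, odd


def neighbours(h: int, w: int, height: int, width: int):
    """4-connected neighbours that fall inside the grid."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return [(h + dh, w + dw) for dh, dw in steps
            if 0 <= h + dh < height and 0 <= w + dw < width]


even, odd = checkerboard_sets(4, 6)
even_lookup = set(even)
all_decoded = all(n in even_lookup for (h, w) in odd for n in neighbours(h, w, 4, 6))
print(len(even), len(odd), all_decoded)
```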

3.6.5. Differentiable Rate Estimation

During training, the rate is estimated as negative log-likelihood:
$$R = -\sum_i \log_2 P(y_i \mid \psi_i) \quad (\text{bits}).$$
For differentiable optimisation, quantisation is replaced with additive uniform noise $\tilde{y} = \hat{y} + \mathcal{U}(-0.5, 0.5)$ [7], enabling gradient-based training of the entire pipeline. Content-adaptive quantisation step sizes $\Delta_i = f_\Delta(\text{context}_i)$ are also predicted to enable dynamic bit allocation [18].
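As a worked example of the rate estimate, the sketch below sums the negative base-2 log-likelihoods of a few symbols and applies the uniform-noise proxy to a handful of values. The probabilities and latent values are made up for illustration, and the helper names are hypothetical.

```python
import math
import random


def rate_bits(probabilities) -> float:
    """Estimated rate: R = -sum_i log2 P(y_i | psi_i)."""
    return -sum(math.log2(p) for p in probabilities)


def add_uniform_noise(values, rng):
    """Training-time proxy for rounding: add noise drawn from U(-0.5, 0.5)."""
    return [v + rng.uniform(-0.5, 0.5) for v in values]


# Three symbols with probabilities 1/2, 1/4, 1/4 cost 1 + 2 + 2 bits.
example_rate = rate_bits([0.5, 0.25, 0.25])
rng = random.Random(0)  # fixed seed so the sketch is repeatable
latents = [1.0, -2.0, 0.0, 3.0]
noisy = add_uniform_noise(latents, rng)
print(example_rate, [round(v, 3) for v in noisy])
```

The noisy values stay within half a unit of the originals, matching the error range of rounding, while remaining differentiable with respect to the inputs.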

3.7. Parallel Checkerboard Decoding

To accelerate decoding, we employ a parallel checkerboard pattern that allows half of the latent elements to be decoded simultaneously. The decoding proceeds in two passes:
Even Pass: Decode elements at positions where $(h + w) \bmod 2 = 0$, using only information from reference frames.
Odd Pass: Decode remaining elements at positions where $(h + w) \bmod 2 = 1$, using both reference frames and the newly decoded even-position elements.
This approach reduces the number of sequential decoding steps by approximately half while maintaining causal dependencies.

3.8. Data Flow

3.8.1. Encoding Process

The encoding process differs for intra-frames (I-frames) and predicted frames (P-frames):
I-frame Encoding (Frame 0):
1. Apply feature transform to obtain $L_4$;
2. Quantise: $y = \mathrm{Round}(L_4)$;
3. Encode using the image entropy model (no temporal context);
4. Store $y$ in the reference buffer.
P-frame Encoding (Frame $n > 0$):
1. Apply feature transform to obtain $L_4$;
2. Quantise: $y_n = \mathrm{Round}(L_4)$;
3. ACS: Compute relevance scores for all buffered frames $\{0, \ldots, n-1\}$;
4. Select top-K reference frames based on scores;
5. STAC: Predict distribution $P(y_n \mid \text{context})$ using multi-head attention over selected references;
6. Dual-path entropy prediction for refined probability estimation;
7. Arithmetic encoding using predicted distribution;
8. Update the reference buffer with $y_n$.

3.8.2. Decoding Process

1. Receive compressed bitstream.
2. I-frame Decoding:
  • Use the image entropy model to decode $\hat{y}_0$;
  • Apply LRP refinement: $\hat{y}_0 \leftarrow \hat{y}_0 + \Delta$;
  • Inverse feature transform to reconstruct frame $\hat{x}_0$.
3. P-frame Decoding (Frame $n$):
  • Run ACS on previously reconstructed latents (deterministic, matches encoder);
  • Obtain the same reference selection as the encoder;
  • STAC predicts identical distribution $P(y_n \mid \text{context})$;
  • Arithmetic decoder reconstructs $\hat{y}_n$;
  • Apply LRP refinement;
  • Inverse feature transform via transposed convolutions;
  • Optionally use parallel checkerboard decoding for acceleration.

3.9. Training Methodology

3.9.1. Loss Functions

The total training loss comprises four components:
Rate Loss: Measures the expected bit cost of encoding:
$$\mathcal{L}_r = \mathbb{E}\left[-\log_2 P(y \mid \text{context})\right].$$
Distortion Loss: Measures reconstruction quality using either the MSE or MS-SSIM:
$$\mathcal{L}_d = \lambda \cdot \mathrm{MSE}(x, \hat{x}) \quad \text{or} \quad \lambda \cdot \left(1 - \text{MS-SSIM}(x, \hat{x})\right),$$
where $\lambda$ is the Lagrange multiplier that controls the rate-distortion trade-off.
Context Selection Regularisation: Encourages sparse reference selection:
$$\mathcal{L}_c = \beta \cdot \|\mathbf{s}\|_1,$$
where $\beta$ is a weighting hyperparameter that controls the strength of the sparsity regularisation on the ACS relevance scores $\mathbf{s}$.
Balance Loss: Encourages consistency between dual-path predictions:
$$\mathcal{L}_b = \gamma \cdot \mathrm{KL}\left(P_{\text{channel}} \,\|\, P_{\text{spatial}}\right),$$
where $\gamma$ is a weighting hyperparameter that controls the strength of the balance loss between the channel-wise and spatio-temporal entropy paths.
The total loss is:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_r + \mathcal{L}_d + \mathcal{L}_c + \mathcal{L}_b.$$
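The four terms can be combined as in the following sketch, which works on plain Python numbers rather than tensors. The values of $\beta$ and $\gamma$, the example inputs, and the use of the natural logarithm in the KL term are assumptions made for illustration; the text does not state them.

```python
import math


def kl_divergence(p, q) -> float:
    """KL(p || q) in nats for two PMFs over the same symbols."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def total_loss(rate, distortion, scores, p_channel, p_spatial, lam, beta, gamma):
    """L_total = L_r + L_d + L_c + L_b with the weights folded into each term."""
    l_r = rate
    l_d = lam * distortion                 # lambda * MSE or lambda * (1 - MS-SSIM)
    l_c = beta * sum(abs(s) for s in scores)   # L1 norm of ACS relevance scores
    l_b = gamma * kl_divergence(p_channel, p_spatial)
    return l_r + l_d + l_c + l_b


# Hypothetical per-batch quantities.
loss = total_loss(rate=0.5, distortion=0.001, scores=[0.9, 0.1],
                  p_channel=[0.6, 0.4], p_spatial=[0.6, 0.4],
                  lam=256, beta=0.01, gamma=0.1)
print(round(loss, 6))
```

When the two paths agree exactly, the balance term vanishes and only the rate, distortion, and sparsity terms contribute.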

3.9.2. Training Schedule

We adopt a four-phase training curriculum to ensure stable convergence:
Phase 1 (1 M steps): Train the feature transform only with fixed context (last 2 frames). ACS and dual-path modules are disabled.
Phase 2 (500 k steps): Freeze the feature transform and train the STAC entropy model with fixed context. Enable dual-path entropy modelling.
Phase 3 (500 k steps): Unfreeze all modules and jointly optimise with ACS enabled. Start with easy examples (short GOPs) and gradually increase difficulty.
Phase 4 (200 k steps): Fine-tune the complete model with checkerboard decoding enabled. Optionally add perceptual losses for improved visual quality.

3.9.3. Implementation Details

Training is performed using the Adam optimiser with an initial learning rate of $10^{-4}$, which is reduced by a factor of 0.5 when the validation loss plateaus for 50 k steps, with a minimum learning rate of $10^{-6}$. Training is terminated when the validation loss does not improve for 100 k consecutive steps (early stopping criterion). We use a batch size of 8 with 256 × 256 random crops from training videos. The GOP size is set to 32 frames during training. Data augmentation includes random horizontal flipping and temporal reversal. The model is trained on the Vimeo-90k dataset [56] and evaluated on standard benchmarks, including UVG [38], HEVC Classes B, C, D, and E, and MCL-JCV [39].
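The plateau-based schedule and the early-stopping rule can be sketched as a small state machine. The class name and its bookkeeping (in particular, restarting the plateau counter after each decay) are assumptions; only the numerical settings come from the description above.

```python
class PlateauSchedule:
    """Halve the learning rate after a plateau; stop after a longer plateau."""

    def __init__(self, lr=1e-4, factor=0.5, patience=50_000,
                 min_lr=1e-6, stop_after=100_000):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.stop_after = stop_after
        self.best = float("inf")
        self.last_improvement = 0
        self.last_decay = 0

    def update(self, step: int, val_loss: float) -> bool:
        """Record a validation result; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.last_improvement = step
            self.last_decay = step  # an improvement restarts the plateau counter
            return False
        if step - self.last_decay >= self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.last_decay = step
        return step - self.last_improvement >= self.stop_after


schedule = PlateauSchedule()
stop_flags = [schedule.update(step, 1.0) for step in (0, 50_000, 100_000)]
print(schedule.lr, stop_flags)
```

In this trace the validation loss never improves after step 0, so the rate is halved at 50 k and 100 k steps and the stopping criterion fires at 100 k steps.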
The framework is implemented in PyTorch 2.0 with the CompressAI library for entropy coding utilities. The complete four-phase training schedule requires approximately 14 days on 4 × A100 GPUs (Phase 1: ∼5 days, Phase 2: ∼3 days, Phase 3: ∼4 days, Phase 4: ∼2 days). The total number of trainable parameters is 78.3 M, comprising 52.1 M for the 20-block STAC transformer entropy model, 14.8 M for the encoder and decoder feature transforms, 8.2 M for the hyperprior network, and 3.2 M for the ACS module. At inference time on a single A100 GPU, encoding a 1080p frame requires 2.84 TFLOPs and 189 ms, while decoding with 8× parallel checkerboard and channel-group decomposition requires 142 ms per frame, consuming 8.6 GB peak GPU memory. Four separate models are trained for four rate points, spanning a PSNR range of approximately 12 dB (from ∼30 dB to ∼42 dB), with λ ∈ {256, 512, 1024, 2048} for MSE-optimised models.

3.10. Computational Complexity Analysis

While Section 4 evaluates STAC’s rate-distortion performance, practical deployment also requires a clear understanding of the computational costs involved. In this subsection, we provide a quantitative comparison of the computational complexity of STAC against all evaluated methods, based on actual measurements on 1080p (1920 × 1080) input sequences using a single NVIDIA A100 GPU. Table 2 summarises the results across five dimensions: total trainable parameters, encoding computational cost in floating-point operations per frame, decoding latency per frame (including STAC’s 8× parallel checkerboard and channel-group decomposition), peak GPU memory consumption during inference, and the corresponding BD-rate performance under YUV IP = −1 for context.
Several observations emerge from Table 2 that contextualise STAC’s complexity–performance trade-off. First, STAC’s 78.3 M parameters and 2.84 TFLOPs encoding cost are substantially higher than those of the DCVC family (35.8–42.1 M, 1.45–1.92 TFLOPs), reflecting the computational overhead of the 20-block transformer entropy model compared to convolutional entropy models. However, this cost is substantially lower than that of VCT (95.2 M, 3.71 TFLOPs), which is the only other transformer-based video codec in the comparison: STAC requires 18% fewer parameters and 23% fewer encoding FLOPs than VCT, while achieving a 48.90 pp better BD-rate (−32.20% vs. +16.70%). This favourable comparison against VCT indicates that ESWA’s sliding window design is not only more effective but also more efficient than VCT’s patch-based global attention, which incurs $O(N^2)$ complexity in the number of tokens $N$.
Second, STAC’s decoding latency of 142 ms per frame, while higher than the DCVC family (76–98 ms), is achieved with the 8 × parallel decomposition; without parallelisation, the sequential decoding latency would be approximately 142 × 8 = 1136 ms, underscoring the practical importance of the checkerboard and channel-group parallelisation strategies described in Section 3.7. In contrast, VCT’s decoding latency of 287 ms is 2.0× higher than STAC’s despite achieving far worse compression performance, confirming that the sliding window approach provides both computational and rate-distortion advantages over patch-based attention.
Third, the peak GPU memory consumption of 8.6 GB places STAC within the capacity of current professional GPUs (A100: 80 GB, RTX 4090: 24 GB) but exceeds the memory budget of mobile and embedded GPUs, which typically offer 4–8 GB. This positions STAC as viable for cloud-based transcoding and server-side encoding applications, while edge deployment would require model compression techniques, such as the structured pruning and knowledge distillation strategies discussed in Section 5.7.
Finally, the relationship between complexity and compression performance across methods reveals a clear trend: the best-performing methods (DCMVC, STAC) require more computation than earlier methods (DVC, DCVC-DC), but the marginal compression gain per unit of additional computation varies substantially. STAC achieves a 2.70 pp improvement over DCMVC at the cost of 86% more parameters and 48% more FLOPs, while VCT achieves 48.90 pp worse performance than STAC, with 22% more parameters and 31% more FLOPs. This comparison suggests that STAC’s architectural innovations, particularly the sliding window design and adaptive context selection, deliver substantially better compression per FLOP than alternative transformer-based approaches.
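The relative figures quoted in this subsection can be reproduced from the absolute values with a few lines of Python. The sketch below uses only numbers stated in the text. Attributing the upper end of the DCVC-family range (42.1 M parameters, 1.92 TFLOPs) to DCMVC is our assumption, chosen because it reproduces the quoted 86% and 48% figures.

```python
def pct_more(a, b):
    """Percentage by which a exceeds b."""
    return 100.0 * (a - b) / b

def pct_fewer(a, b):
    """Percentage by which a falls below b."""
    return 100.0 * (b - a) / b

def bd_gap_pp(worse, better):
    """BD-rate gap in percentage points (both values relative to VTM)."""
    return worse["bd_rate"] - better["bd_rate"]

STAC = {"params_m": 78.3, "tflops": 2.84, "decode_ms": 142, "bd_rate": -32.20}
VCT = {"params_m": 95.2, "tflops": 3.71, "decode_ms": 287, "bd_rate": 16.70}
# Assumption: DCMVC sits at the upper end of the DCVC-family range.
DCMVC = {"params_m": 42.1, "tflops": 1.92, "bd_rate": -29.50}

stac_vs_vct_params = pct_fewer(STAC["params_m"], VCT["params_m"])     # about 18%
stac_vs_vct_flops = pct_fewer(STAC["tflops"], VCT["tflops"])          # about 23%
stac_vs_dcmvc_params = pct_more(STAC["params_m"], DCMVC["params_m"])  # about 86%
stac_vs_dcmvc_flops = pct_more(STAC["tflops"], DCMVC["tflops"])       # about 48%
vct_latency_ratio = VCT["decode_ms"] / STAC["decode_ms"]              # about 2.0
```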

4. Experimental Results

This section presents a comprehensive experimental evaluation of the proposed STAC (Spatio-Temporal Adaptive Context) neural video compression framework. The evaluation is designed to address three fundamental questions: (i) Does STAC achieve superior rate-distortion performance compared to the current state-of-the-art neural and traditional video codecs? (ii) How robust is STAC’s performance across diverse content characteristics, spatial resolutions, colourspaces, and intra-period configurations? (iii) Which architectural components contribute most significantly to the observed coding gains? We first describe the benchmark datasets and their characteristics, then detail the test conditions and evaluation metrics, and finally present a thorough comparative analysis against five state-of-the-art methods across 24 experimental configurations (six datasets × 2 colourspaces × 2 intra-period settings). The breadth of this evaluation matrix is deliberately chosen to ensure that the reported performance gains are not artifacts of favourable dataset selection or specific operating conditions but instead reflect genuine, systematic improvements in the underlying compression model. To our knowledge, this is among the most comprehensive evaluation frameworks reported in the recent neural video compression literature, covering the full range of content types from low-resolution videoconferencing to ultra-high-definition cinematic content.
We evaluate our proposed method on six widely used benchmark datasets that collectively cover a broad spectrum of video content characteristics, spatial resolutions, and temporal complexities:
UVG Dataset [38]: The Ultra Video Group dataset consists of 16 versatile 4K (3840 × 2160) test sequences captured at 50 or 120 frames per second. These sequences span diverse content categories, including natural outdoor scenes, fast-paced action, and detailed textures, making this dataset particularly valuable for evaluating compression performance on high-resolution content with varying spatial and temporal complexity. The 4K resolution places significant demands on the entropy model’s ability to capture long-range spatial correlations, and the high frame rates test the temporal modelling capacity of the codec.
MCL-JCV Dataset [39]: The MCL-JCV dataset is a just-noticeable-difference (JND)-based video quality assessment dataset containing 30 source videos covering diverse content categories. The sequences encompass a wide range of spatial and temporal information indices, from relatively static scenes with fine textures to high-motion sequences with complex camera movements. This diversity makes MCL-JCV an excellent benchmark for assessing the generalisability of compression methods across heterogeneous content.
HEVC Common Test Conditions: We adopt the standard HEVC test sequences comprising four classes that represent distinct application scenarios, as defined by the Joint Collaborative Team on Video Coding (JCT-VC). These sequences are the most widely used benchmarks in the video compression community, enabling direct comparison with the extensive body of prior work that reports results on these datasets.
  • Class B (1920 × 1080) contains five high-definition sequences (BQTerrace, BasketballDrive, Cactus, Kimono, ParkScene) with complex motion ranging from fast-paced sports action to detailed outdoor scenes with camera panning and zooming, representative of broadcast television and premium streaming applications. The diversity of motion types within this class, from the rapid and unpredictable ball trajectories in BasketballDrive to the slow global panning in Cactus, tests the codec’s ability to handle heterogeneous temporal dynamics within a single content category.
  • Class C (832 × 480) provides four medium-resolution sequences (BasketballDrill, BQMall, PartyScene, RaceHorsesC) with moderate spatial and temporal complexity, typical of standard-definition television and lower-bandwidth mobile streaming.
  • Class D (416 × 240) contains four low-resolution sequences (BasketballPass, BlowingBubbles, BQSquare, RaceHorses) that test the codec’s ability to operate efficiently when spatial information is severely limited and temporal modelling becomes the dominant factor for compression; at this resolution, the latent representation at 1/16× scale is only 26 × 15 elements, placing extreme demands on the entropy model’s ability to estimate probabilities from very sparse spatial context.
  • Class E (1280 × 720) comprises three videoconferencing sequences (FourPeople, Johnny, KristenAndSara) characterised by relatively static backgrounds, localised facial motion, and head-and-shoulder compositions. This class represents a practically important use case for real-time communication, with distinctive temporal correlation structures where the background is nearly identical across many frames but the foreground exhibits subtle, semantically meaningful changes.
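To make the latent-size argument concrete, the sketch below computes the latent grid at 1/16 scale for the resolutions listed above. Rounding up to the next multiple of the stride (i.e., padding the input) is our assumption for resolutions that are not divisible by 16, such as 1080 lines; the helper name is hypothetical.

```python
import math

RESOLUTIONS = {
    "UVG": (3840, 2160),
    "HEVC Class B": (1920, 1080),
    "HEVC Class C": (832, 480),
    "HEVC Class D": (416, 240),
    "HEVC Class E": (1280, 720),
}

def latent_grid(width, height, stride=16):
    """Latent width and height, assuming the input is padded up to the stride."""
    return math.ceil(width / stride), math.ceil(height / stride)

for name, (w, h) in RESOLUTIONS.items():
    lw, lh = latent_grid(w, h)
    print(f"{name}: {lw} x {lh} = {lw * lh} latent positions")
```

Class D yields only 26 × 15 = 390 latent positions per channel, compared with 240 × 135 = 32,400 for the 4K UVG sequences, which illustrates how little spatial context the entropy model has at low resolution.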

4.1. Test Conditions and Metrics

We evaluate all methods under comprehensive test configurations to ensure fair and reproducible comparison. Experiments are conducted in both YUV 4:2:0 and RGB colourspaces, reflecting the two dominant representations used in video compression research. The YUV 4:2:0 colourspace is the standard representation for broadcast and streaming applications, exploiting the human visual system’s lower sensitivity to chrominance detail through 2× chroma subsampling. The RGB colourspace, while less efficient for compression due to higher inter-channel correlation, is increasingly relevant for applications that require direct pixel-domain processing, such as computer vision pipelines, graphics rendering, and high-fidelity archival.
Two intra-period settings are employed. The IP = −1 configuration uses a single I-frame at the beginning of the sequence, followed exclusively by P-frames, representing the most favourable scenario for temporal compression, as the model can build and maintain temporal context across the entire sequence without periodic resets. The IP = 32 configuration inserts an I-frame every 32 frames, creating periodic temporal discontinuities that disrupt context propagation and test the codec’s ability to recover quickly after each intra refresh. This configuration is more representative of practical deployment scenarios, where random access, channel switching, and error resilience requirements mandate periodic intra-frame insertion. The performance gap between these two settings provides a direct measure of the codec’s resilience to temporal context interruption and its ability to adapt its coding strategy to the available temporal context.
All sequences use 96 frames, providing a sufficiently long temporal span to evaluate steady-state compression behaviour while remaining tractable for comprehensive benchmarking. The choice of 96 frames ensures that each sequence contains at least three full GOPs under the IP = 32 setting, allowing the evaluation to capture both the transient quality drop immediately after each I-frame and the steady-state quality achieved as the temporal context buffer fills. VTM-17.0, the reference software for H.266/VVC, serves as the anchor codec for all BD-rate calculations, representing the most advanced standardised video codec available. VTM is configured in low-delay P mode with default encoding parameters to match the coding conditions of the neural methods, which operate in a strictly causal, forward-prediction mode without B-frames. We evaluate four rate points per method, spanning the range from low-bitrate operation (where aggressive compression is required, and the quality gap between methods is most pronounced) to high-bitrate operation (where all methods converge toward near-lossless quality). The four rate points are selected to cover a practically relevant range of approximately 12 dB in PSNR, from visually degraded quality around 30 dB to near-transparent quality above 40 dB, ensuring that the BD-rate calculations integrate over a representative portion of the rate-distortion curve, rather than being dominated by a single operating point.
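The two intra-period settings over a 96-frame sequence can be sketched as follows. The function names are hypothetical; the code only illustrates where I-frames fall and how much temporal context is available before each reset.

```python
def frame_types(num_frames=96, intra_period=-1):
    """Return a list of 'I'/'P' labels for a low-delay P configuration."""
    if intra_period == -1:
        # Single I-frame at the start, P-frames for the rest of the sequence.
        return ["I"] + ["P"] * (num_frames - 1)
    return ["I" if i % intra_period == 0 else "P" for i in range(num_frames)]

def max_context(types):
    """Largest number of previously decoded frames since the last I-frame reset."""
    longest = since_intra = 0
    for t in types:
        since_intra = 0 if t == "I" else since_intra + 1
        longest = max(longest, since_intra)
    return longest

ip_inf = frame_types(96, -1)
ip_32 = frame_types(96, 32)
intra_positions = [i for i, t in enumerate(ip_32) if t == "I"]
```

With IP = 32, the I-frames fall at frames 0, 32, and 64, so the last P-frame of each GOP can draw on at most 31 previously decoded frames, whereas IP = −1 allows the context to grow to 95 frames.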
Performance is assessed using three complementary metrics. PSNR (Peak Signal-to-Noise Ratio) [57,58] provides a straightforward measure of pixel-level reconstruction fidelity in decibels and remains the most widely reported metric in the video compression literature, enabling direct comparison with prior work. MS-SSIM (Multi-Scale Structural Similarity Index) [59] evaluates perceptual quality by comparing luminance, contrast, and structural information across multiple spatial scales, providing a closer approximation to subjective visual quality than the PSNR. BD-rate (Bjøntegaard Delta-rate) [60] computes the average bitrate difference between two codecs at equivalent quality levels by integrating the area between their rate-distortion curves, providing a single scalar summary of relative compression efficiency; negative BD-rate values indicate bitrate savings over the anchor codec.
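As an illustration of the BD-rate computation, the following sketch implements the commonly used cubic-polynomial variant in plain Python. It assumes exactly four rate points per curve with distinct PSNR values, so that the cubic fit of log-rate against PSNR is an exact interpolation. It is not necessarily the tool used to produce the reported numbers, and the rate-distortion points in the usage lines are invented for the example.

```python
import math

def _cubic_through(xs, ys):
    """Coefficients c0..c3 of the cubic through four (x, y) points."""
    rows = [[x ** k for k in range(4)] + [y] for x, y in zip(xs, ys)]
    for col in range(4):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, 4), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(4):
            if r != col:
                f = rows[r][col] / rows[col][col]
                rows[r] = [a - f * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][4] / rows[i][i] for i in range(4)]

def _integral(coeffs, lo, hi):
    """Definite integral of the polynomial sum(c_k * x**k) over [lo, hi]."""
    def anti(x):
        return sum(c * x ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return anti(hi) - anti(lo)

def bd_rate(anchor_rates, anchor_psnr, test_rates, test_psnr):
    """Average bitrate difference (%) of the test codec at equal PSNR."""
    lo = max(min(anchor_psnr), min(test_psnr))   # overlapping quality range
    hi = min(max(anchor_psnr), max(test_psnr))
    # Shift PSNR by lo to keep the linear system well conditioned.
    fit_a = _cubic_through([q - lo for q in anchor_psnr],
                           [math.log(r) for r in anchor_rates])
    fit_t = _cubic_through([q - lo for q in test_psnr],
                           [math.log(r) for r in test_rates])
    span = hi - lo
    avg_diff = (_integral(fit_t, 0.0, span) - _integral(fit_a, 0.0, span)) / span
    return (math.exp(avg_diff) - 1.0) * 100.0

# Invented example: the test codec needs 30% fewer bits at every quality level.
psnr = [30.0, 34.0, 38.0, 42.0]
anchor = [0.02, 0.05, 0.12, 0.30]
test = [0.7 * r for r in anchor]
saving = bd_rate(anchor, psnr, test, psnr)  # approximately -30.0
```

A uniformly 30% cheaper curve gives a BD-rate of about −30%, matching the sign convention above in which negative values indicate savings over the anchor.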

4.2. Comparison to State of the Art

We compare our proposed STAC method against five representative state-of-the-art neural video compression approaches that span the major architectural paradigms in the field, from early predictive coding through modern conditional coding to transformer-based methods:
  • DVC [11]: The pioneering end-to-end deep video compression framework that established the learning-based predictive coding paradigm. DVC employs optical flow estimation for motion modelling and separate autoencoder networks for motion vector and residual compression. It serves as a baseline representing the first generation of neural video codecs.
  • VCT [31]: The Video Compression Transformer, which uses a transformer architecture to model temporal dependencies among frame representations without any explicit motion estimation or warping operations. VCT represents the pure transformer-based approach to video compression and serves as a direct comparison point for our ESWA mechanism.
  • DCVC-DC [17]: Deep Contextual Video Compression with Diverse Contexts, which introduces hierarchical quality patterns and group-based offset diversity for richer temporal context mining within the conditional coding paradigm.
  • DCVC-FM [18]: DCVC with feature modulation, which addresses the practical requirements of variable bitrate operation and long prediction chain resilience through learnable quantisation scalers and periodically refreshing temporal features.
  • DCMVC [19]: DCVC with context modulation, the most recent and best-performing prior method, which introduces flow-oriented context generation and context compensation to more effectively leverage reference information. DCMVC represents the current state of the art against which STAC’s improvements are most directly measured.
All compared methods are evaluated using their publicly released models and codebases, where available, or using results reported in their original publications. For methods where multiple model variants exist (e.g., MSE-optimised and MS-SSIM-optimised), we use the MSE-optimised variant for PSNR-based BD-rate calculations to ensure consistency across the comparison. We emphasise that all methods are evaluated under identical test conditions: the same input sequences, the same number of frames (96), the same colourspaces, and the same intra-period settings. The VTM anchor is encoded using the same quantisation parameter (QP) set for all evaluations, ensuring that the BD-rate curves are computed over comparable quality ranges. This rigorous adherence to common test conditions eliminates confounding variables and ensures that performance differences reflect genuine architectural advantages, rather than differences in evaluation methodology, test data, or operating point selection.

4.2.1. Rate-Distortion Performance

Figure 5 presents the rate-distortion curves for three representative datasets (UVG, MCL-JCV, and HEVC Class B) using both PSNR and MS-SSIM metrics. These six plots collectively illustrate the compression performance envelope of each method across the full bitrate range, from aggressive low-bitrate operation to near-lossless high-bitrate coding. Several important observations emerge from these curves. First, STAC consistently achieves the best rate-distortion performance across all six plots, with its curve lying above and to the left of all competing methods, indicating that it delivers higher quality at equivalent bitrates or, equivalently, achieves the same quality at lower bitrates. The gap between STAC and the next-best method, DCMVC, is visually apparent even at the scale of these plots, which underscores the practical magnitude of the improvement. Second, the performance advantage of STAC is particularly pronounced in the low-to-medium bitrate region, which is the most practically relevant operating range for streaming and broadcasting applications where bandwidth is at a premium. At low bitrates, the entropy model’s accuracy becomes the dominant factor in compression performance because every bit saved through better probability estimation translates directly into quality improvement; the fact that STAC excels in this regime validates the effectiveness of our dual-path entropy model with Gaussian mixture estimation in providing tighter probability bounds. At these operating points, the dual-path fusion gate allocates more weight to whichever path, channel-wise autoregressive or spatio-temporal, provides the tighter probability estimate for each latent element, maximising information-theoretic efficiency when bits are scarce.
Third, the consistent improvement across both PSNR and MS-SSIM metrics indicates that STAC’s gains are not limited to pixel-fidelity optimisation but extend to perceptual quality, suggesting that our enhanced temporal modelling preserves structural and textural information that is important for human visual perception. This dual-metric consistency is important because methods that optimise aggressively for one metric sometimes sacrifice performance on the other; the fact that STAC improves on both simultaneously indicates that its coding gains arise from genuinely better probability modelling, rather than from metric-specific artifacts.
VMAF scores computed on the UVG and HEVC Class B datasets show that STAC achieves the highest VMAF scores at equivalent bitrates, with an average improvement of 3.2 VMAF points over DCMVC at low bitrates (below 0.05 bpp). This perceptual advantage is consistent with STAC’s design: the ESWA mechanism’s learnable local bias matrices preserve fine texture and edge information during entropy modelling, and the dual-path fusion gate preferentially selects the spatio-temporal path for textured regions where temporal references provide strong structural predictions. These architectural properties naturally preserve the structural information that perceptual metrics (MS-SSIM, VMAF) are designed to measure.
The rate-distortion curves also reveal the relative positioning of the competing methods. DVC, as the earliest neural video codec, falls substantially below VTM across all datasets, confirming that first-generation predictive coding approaches have been superseded by more advanced architectures. VCT shows mixed performance, demonstrating competitive results on some content types but struggling on others, which reflects its reliance on fixed-window patch-based attention that cannot adapt to diverse content characteristics. The DCVC family of methods (DCVC-DC, DCVC-FM, DCMVC) shows progressive improvement, with each generation achieving better rate-distortion performance through more sophisticated temporal context exploitation. STAC extends this trajectory further, with the gap between STAC and DCMVC being consistent across datasets, indicating a systematic architectural advantage, rather than content-specific tuning.
It is also instructive to observe the behaviour of the methods at the extreme ends of the bitrate range. At high bitrates, all modern methods (DCVC-DC and above) begin to converge toward VTM’s performance, because abundant bits allow even less efficient entropy models to represent the signal with high fidelity. The practical significance of a neural codec therefore lies primarily in the low-to-medium bitrate regime, which is precisely where STAC demonstrates its largest margins. This is a direct consequence of the dual-path entropy model’s ability to produce tighter probability bounds: when the rate budget is severely constrained, every fraction of a bit saved per symbol through better probability estimation accumulates into measurable quality improvements across millions of latent elements. Furthermore, the consistency of the improvement across both PSNR-based and MS-SSIM-based evaluation confirms that STAC does not sacrifice perceptual structure for pixel-level fidelity. The ESWA mechanism’s learnable local bias preserves fine texture and edge information during entropy modelling, ensuring that the compressed latents retain the structural details that MS-SSIM is specifically designed to measure.

4.2.2. BD-Rate Analysis Under YUV Colourspace

Table 3 presents the BD-rate comparison under the YUV colourspace with the IP = −1 configuration, which represents the most favourable scenario for temporal compression as the entire sequence is coded as a single prediction chain without periodic intra resets. Our STAC method achieves an average BD-rate saving of 32.20% over VTM, significantly outperforming all compared methods. This result is notable because VTM itself represents the culmination of decades of hand-engineered video coding optimisation, and achieving a 32% bitrate reduction over this anchor codec demonstrates the substantial potential of learned compression approaches when equipped with effective spatio-temporal modelling. To contextualise this achievement, VVC was designed to deliver approximately 40–50% bitrate savings over its predecessor HEVC; STAC’s 32% saving over VVC is thus of a magnitude approaching that of a generational step between coding standards under these test conditions.
Examining the per-method comparison, STAC outperforms the nearest competitor, DCMVC, by 2.70 percentage points on average (−32.20% vs. −29.50%). While this may appear modest in absolute terms, it represents a meaningful advance given that the field is in a regime of diminishing returns, where each additional percentage point of BD-rate improvement requires increasingly sophisticated modelling. To put this margin in perspective, the average BD-rate changed by +0.95 pp from DCVC-DC to DCVC-FM (a slight regression under this configuration), by −11.07 pp from DCVC-FM to DCMVC, and by −2.70 pp from DCMVC to STAC; the latter represents a consistent forward step in a field where the rate of progress between successive state-of-the-art methods has fluctuated. The improvement is consistent across all six datasets, ranging from 1.93 pp on UVG (where both methods already achieve large gains) to 3.48 pp on HEVC Class E (where the unique temporal characteristics of videoconferencing content benefit particularly from our adaptive context selection). The largest absolute BD-rate savings are observed on the UVG dataset (−36.35%) and HEVC Class D (−35.19%), which span opposite ends of the resolution spectrum (4K and 240p, respectively). This indicates that STAC’s multi-scale feature transform and adaptive context selection are effective across a wide range of spatial resolutions, with the temporal modelling compensating for reduced spatial information at low resolutions while fully exploiting the rich spatial detail available at high resolutions.
The dramatic performance gap between STAC (−32.20%) and VCT (+16.70%) merits particular attention. Both methods use transformer-based entropy modelling, but VCT relies on patch-based processing with fixed temporal windows, while STAC employs the Adaptive Context Selector and Enhanced Sliding Window Attention. The 48.90 pp gap between these two transformer-based methods underscores the critical importance of (i) content-adaptive reference frame selection and (ii) patchless attention with learned local biases and temporal gating. These are precisely the architectural innovations that distinguish STAC from prior transformer-based approaches. Furthermore, DVC’s poor performance (+67.93%) illustrates the fundamental inadequacy of simple optical-flow-plus-residual coding in the neural setting; the 100.13 pp gap between DVC and STAC represents the total accumulated progress of the field from the first end-to-end neural video codec to the current state of the art, a gap that has been bridged through the progressive adoption of conditional coding, transformer architectures, and adaptive mechanisms.
Table 4 presents the results under the more challenging IP = 32 configuration. The periodic insertion of I-frames every 32 frames creates temporal discontinuities that reset the temporal prediction chain and disrupt the accumulated temporal context. Each I-frame is coded independently using the image entropy model without temporal conditioning, consuming substantially more bits than a P-frame, and the frames immediately following each I-frame must rebuild temporal context from scratch, with only a single high-quality reference available. This configuration therefore tests two distinct capabilities: (i) the codec’s ability to rapidly build effective temporal models from a cold start after each I-frame, achieving strong performance within the first few P-frames of each new GOP, and (ii) its ability to maintain compression efficiency when the available temporal context is inherently limited to at most 31 frames, rather than the full sequence length. The IP = 32 setting is substantially more representative of practical deployment scenarios than IP = −1, because random access requirements, channel switching latency targets, and error resilience constraints in broadcasting, streaming, and real-time communication all mandate periodic intra-frame insertion at intervals ranging from 0.5 to 2 s. Despite these challenges, STAC maintains robust performance with an average BD-rate saving of 27.01% over VTM, once again outperforming all compared methods by a comfortable margin.
The degradation from IP = −1 to IP = 32 provides a particularly informative measure of each method’s resilience to temporal context disruption. For STAC, this degradation is 5.19 percentage points (from −32.20% to −27.01%). This is smaller than the degradation of its closest competitor, DCMVC, and far smaller than that of DVC; DCVC-DC and DCVC-FM degrade less, but from substantially lower baselines:
  • DCMVC: 6.42 pp degradation (−29.50% to −23.08%), indicating that DCMVC’s context modulation mechanism is more sensitive to temporal discontinuities.
  • DCVC-DC: 2.42 pp degradation (−19.38% to −16.96%), a smaller absolute degradation but from a substantially lower baseline performance.
  • DCVC-FM: 0.37 pp degradation (−18.43% to −18.06%), the smallest degradation, which reflects DCVC-FM’s explicitly designed periodic refresh mechanism, but again from a lower baseline.
  • DVC: 111.70 pp degradation (+67.93% to +179.63%), a catastrophic degradation that illustrates the vulnerability of early predictive codecs to temporal context loss.
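The degradation figures can be recomputed directly from the average BD-rates quoted in this subsection. The sketch below is a plain arithmetic check; VCT is omitted because its IP = 32 average is not stated in the text.

```python
# method: (BD-rate % under IP = -1, BD-rate % under IP = 32), YUV, vs VTM
BD_RATE = {
    "STAC": (-32.20, -27.01),
    "DCMVC": (-29.50, -23.08),
    "DCVC-DC": (-19.38, -16.96),
    "DCVC-FM": (-18.43, -18.06),
    "DVC": (67.93, 179.63),
}

def degradation_pp(ip_inf, ip_32):
    """Loss in BD-rate (percentage points) when moving from IP = -1 to IP = 32."""
    return round(ip_32 - ip_inf, 2)

degradation = {m: degradation_pp(*v) for m, v in BD_RATE.items()}
# Methods ordered from most to least resilient to intra refresh.
ranked = sorted(degradation, key=degradation.get)
```

Ranking by degradation alone places DCVC-FM and DCVC-DC ahead of STAC, so the degradation should be read together with the absolute BD-rate under IP = 32, where STAC remains the best of the listed methods.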
This resilience to intra-period settings is a direct consequence of two architectural features of STAC. First, the Adaptive Context Selector dynamically adjusts its reference frame selection based on the available temporal context: immediately after an I-frame, ACS recognises that fewer high-quality references are available and concentrates its attention on the most recently decoded frames, whereas, deeper into a prediction chain, it can select from a richer buffer of temporally diverse references. This dynamic behaviour contrasts sharply with methods that use a fixed reference set regardless of position within the GOP; such methods either waste capacity on unavailable references or fail to exploit the growing context buffer as more frames are decoded. Second, the Enhanced Sliding Window Attention mechanism adaptively modulates its temporal gating based on the quality and relevance of the available context, effectively reducing reliance on unavailable or stale references, rather than propagating errors from an interrupted prediction chain. The learned temporal decay in ESWA’s gating matrix allows each attention head to independently control how much weight it assigns to temporally distant context, and when context is interrupted by an I-frame, the decay mechanism naturally suppresses long-range dependencies and focuses on the most reliable local information. The combination of these mechanisms enables STAC to degrade gracefully under IP = 32 constraints while maintaining a significant performance advantage over all competing methods.
This resilience has direct practical significance. In real-world deployment, intra-period settings are dictated not by the codec’s preferences but by application requirements: live broadcasting typically uses IP = 32 or shorter for channel switching latency, adaptive streaming mandates periodic random access points for seek functionality, and error-prone transmission channels require frequent intra refreshes for error containment. A codec whose performance collapses under IP = 32 offers limited practical value regardless of its IP = −1 numbers. STAC’s ability to maintain a 27.01% BD-rate advantage over VTM even under IP = 32, outperforming all competing neural codecs, therefore represents not merely an academic improvement but a genuine step toward practical deployment of neural video compression.

4.2.3. BD-Rate Analysis Under RGB Colourspace

Table 5 and Table 6 present the BD-rate results under the RGB colourspace, providing an important complementary evaluation to the YUV results. The RGB colourspace evaluation is significant for several reasons. First, it tests whether the entropy model’s learned probability distributions generalise beyond the decorrelated YUV representation that is inherently more compressible. In the YCbCr transform, the luminance channel concentrates most of the signal energy while the chrominance channels contain relatively sparse, low-variance data; this decorrelation simplifies the entropy model’s task. In contrast, the RGB channels exhibit strong inter-channel correlation (the red, green, and blue components of natural images are highly correlated because they all reflect the same underlying scene illumination), and the entropy model must learn to exploit these cross-channel dependencies without the benefit of an explicit decorrelating transform. The fact that STAC maintains strong performance under RGB evaluation therefore provides evidence that the STAC transformer learns to model cross-channel dependencies directly through its attention mechanism. Second, an increasing number of practical applications, including computer vision pipelines, graphics-intensive gaming, and high-fidelity content creation, operate natively in the RGB domain, making RGB compression performance directly relevant to deployment. Third, emerging standards, such as JPEG AI, are being developed with native RGB support, further increasing the importance of RGB-domain evaluation for learned codecs.
As expected, all neural codecs exhibit slightly reduced BD-rate performance compared to the YUV colourspace, which is attributable to the luminance–chrominance decorrelation inherent in the YCbCr colour transform that facilitates more efficient chroma subsampling and reduces inter-channel redundancy. However, STAC consistently maintains its performance advantage across both colourspaces. The performance reduction is modest for the recent methods, suggesting that the YUV-to-RGB gap is largely a property of the representation itself, rather than a weakness of any particular codec architecture.
Under IP = −1 (Table 5), STAC achieves 31.23% average BD-rate savings over VTM, outperforming DCMVC by 2.61 percentage points (−31.23% vs. −28.62%). Examining the per dataset results, STAC achieves −35.26% on UVG, −34.17% on HEVC Class B, and −34.13% on HEVC Class D, all of which represent substantial improvements in the RGB domain and closely track the corresponding YUV results. Under IP = 32 (Table 6), STAC achieves 26.20% BD-rate savings, maintaining a 3.81 pp advantage over DCMVC (−26.20% vs. −22.39%). Notably, STAC’s advantage over DCMVC is larger under IP = 32 in RGB (3.81 pp) than under IP = −1 in RGB (2.61 pp), further confirming the robustness of our adaptive context mechanisms under the most challenging operating conditions. This widening gap under IP = 32 suggests that STAC’s ACS and ESWA mechanisms provide proportionally greater benefit when both the colourspace and the intra-period configuration create more difficult coding conditions; in other words, the harder the problem, the more STAC’s adaptive architecture distinguishes itself from methods with fixed coding strategies.
The performance gap between YUV and RGB colourspaces is remarkably small for STAC: approximately 0.97 pp under IP = −1 (−32.20% vs. −31.23%) and 0.81 pp under IP = 32 (−27.01% vs. −26.20%). DCMVC shows similarly narrow gaps of 0.88 pp and 0.69 pp, respectively, whereas the older DVC exhibits a gap of 3.40 pp under IP = −1 and 8.98 pp under IP = 32. The fact that more recent and sophisticated methods exhibit smaller colourspace gaps suggests that advanced entropy models increasingly learn to compensate for the lack of explicit decorrelation, so that the residual gap reflects the colourspace representation itself, rather than a method-specific limitation. Importantly, STAC’s relative improvement over competing methods remains stable across both colourspaces, demonstrating that our entropy model learns general-purpose probability estimation that is not dependent on colourspace-specific decorrelation properties. This colourspace-agnostic performance suggests that the STAC transformer’s multi-head attention mechanism captures statistical dependencies in the latent space that transcend the specific signal representation, a desirable property for deployment in heterogeneous coding environments where the input format may vary between applications.
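The colourspace gaps quoted above follow directly from the average BD-rate figures reported in the text. The short sketch below simply recomputes them; the dictionary layout and function name are illustrative choices, not part of the original evaluation code.

```python
# Average BD-rate (%) relative to VTM, copied from the figures quoted in the text.
AVERAGE_BD_RATE = {
    ("STAC", "IP=-1"): {"YUV": -32.20, "RGB": -31.23},
    ("STAC", "IP=32"): {"YUV": -27.01, "RGB": -26.20},
    ("DCMVC", "IP=-1"): {"YUV": -29.50, "RGB": -28.62},
    ("DCMVC", "IP=32"): {"YUV": -23.08, "RGB": -22.39},
}


def colourspace_gap(entry):
    """Return the YUV-to-RGB gap in percentage points (positive = RGB saves less)."""
    return round(entry["RGB"] - entry["YUV"], 2)


GAPS = {key: colourspace_gap(value) for key, value in AVERAGE_BD_RATE.items()}

for (method, ip), gap in GAPS.items():
    print(f"{method:6s} {ip:6s} gap = {gap:.2f} pp")
```

Both methods stay below 1 pp in both intra-period settings, which is the basis for the comparison with DVC's much larger gaps.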

4.2.4. Analysis of Individual Dataset Performance

A granular examination of the per dataset results under YUV IP = −1 (Table 3) reveals instructive patterns that illuminate how STAC’s architectural components interact with different content characteristics:
High-Resolution Content (UVG, 3840 × 2160): STAC achieves its largest BD-rate saving of −36.35% on the UVG dataset, outperforming DCMVC by 1.93 pp. The 4K sequences in UVG contain rich spatial detail, complex camera motion (including panning, zooming, and tracking shots), and diverse scene content ranging from natural landscapes to urban environments. At this resolution, the STAC entropy model’s 20-block transformer with ESWA attention has access to a large number of latent tokens, enabling it to exploit fine-grained spatio-temporal correlations that shallower or less expressive models miss. Furthermore, the multi-scale feature transform with four hierarchical levels at 1/16× resolution provides a highly compact yet informative latent representation for 4K content, and the Adaptive Context Selector benefits from the rich motion diversity in these sequences by dynamically selecting references that best predict the current frame’s content.
High-Definition Content (HEVC Class B, 1920 × 1080): On HEVC Class B sequences, STAC achieves −35.23%, closely tracking the UVG performance and outperforming DCMVC by 2.14 pp. The Class B sequences (BQTerrace, BasketballDrive, Cactus, Kimono, ParkScene) feature high-motion content with complex textures, making them a demanding test for temporal prediction. BasketballDrive, in particular, contains rapid and unpredictable object motion that challenges motion-based approaches, while ParkScene features fine repetitive textures (grass, foliage) that benefit from accurate spatial entropy modelling. The strong performance of STAC on this dataset confirms that the ESWA mechanism’s learned local biases effectively capture the diverse motion patterns present in broadcast-quality high-definition content. The near-identical performance between UVG (4K) and HEVC Class B (1080p) is particularly noteworthy: it suggests that STAC’s multi-scale feature transform produces latent representations at 1/16× resolution that are similarly informative regardless of the original spatial resolution, indicating effective hierarchical feature learning that abstracts away resolution-specific characteristics.
Medium-Resolution Content (MCL-JCV and HEVC Class C): On both MCL-JCV and HEVC Class C, STAC achieves −29.20% BD-rate savings. The MCL-JCV dataset’s 30 diverse sequences, which were originally selected to span a wide range of just-noticeable-difference thresholds, provide a comprehensive test of codec generalisability across content types with varying perceptual characteristics. The consistent performance across this heterogeneous collection validates the robustness of STAC’s content-adaptive mechanisms, demonstrating that the Adaptive Context Selector and dual-path entropy model can handle content diversity without requiring manual tuning or content-specific model selection. The matching performance on HEVC Class C (832 × 480), despite its substantially lower resolution, confirms that STAC’s adaptive approach scales well across the resolution spectrum, with the temporal modelling effectively compensating for reduced spatial information. The improvement over DCMVC on these datasets is 3.26 pp, which is notably larger than the gap on high-resolution content (1.93 pp on UVG and 2.14 pp on HEVC Class B). This suggests that STAC’s adaptive context selection provides proportionally greater benefit when the spatial resolution limits the discriminative power of spatial features alone, because the ACS can compensate by selecting temporally richer references that provide the missing spatial context through temporal correlation.
Low-Resolution Content (HEVC Class D, 416 × 240): Despite the severely limited spatial resolution, STAC achieves an excellent −35.19% BD-rate saving, closely rivalling its performance on 4K content. This counterintuitive result can be understood through the interplay between spatial and temporal compression. At low spatial resolutions, each latent element corresponds to a larger spatial region of the original frame, and the temporal correlations between these coarse-grained representations are more predictable. The latent representation at 1/16× resolution is only 26 × 15 elements per frame, which means the entire frame context fits within a modest attention window, and the ESWA mechanism can effectively attend to the full spatial extent of each latent frame without the locality constraints that become necessary at higher resolutions. STAC’s temporal modelling via ACS and ESWA captures these strong temporal dependencies effectively, while the dual-path entropy model provides accurate probability estimates even when the spatial context is limited. The channel-wise autoregressive path is particularly valuable at low resolutions because the limited spatial extent means that spatial neighbours provide less diverse conditioning information, making channel-wise dependencies proportionally more important for accurate probability estimation. The 2.15 pp improvement over DCMVC (−35.19% vs. −33.04%) on this dataset demonstrates that adaptive reference selection offers consistent benefits regardless of spatial resolution, and that STAC’s multi-scale architecture does not suffer from resolution mismatch at the extremes of its designed operating range.
Videoconferencing Content (HEVC Class E, 1280 × 720): The HEVC Class E sequences (FourPeople, Johnny, KristenAndSara) present a distinctive coding challenge: the content is characterised by relatively static backgrounds with highly localised motion confined to the head-and-shoulder region, and the temporal dynamics are repetitive (subtle head movements, facial expressions, lip synchronisation). Traditional codecs handle this content well through skip modes and large block sizes for the static background, but neural codecs must learn these efficiency strategies implicitly from data. STAC achieves −28.02% BD-rate savings, which, while the lowest among the six datasets in absolute terms, still represents a substantial 3.48 pp improvement over DCMVC (−28.02% vs. −24.54%). Notably, this is the largest per dataset improvement margin across all six benchmarks, suggesting that STAC’s Adaptive Context Selector is particularly effective for videoconferencing content. The reason for this disproportionate advantage lies in the temporal structure of videoconferencing: head poses, expressions, and gestures are quasi-periodic, meaning that a frame from several seconds earlier with a similar head pose may provide better predictive context than the immediately preceding frame with a different expression. ACS can identify these semantically similar but temporally distant references through its learned relevance scoring mechanism, a capability that fixed reference selection strategies fundamentally lack. Furthermore, the gating mechanism in ESWA learns to suppress attention to the largely static background regions and concentrate temporal modelling capacity on the active foreground, where the entropy reduction from accurate temporal prediction is greatest. The combination of these content-adaptive mechanisms explains why STAC achieves its largest margin over DCMVC precisely on the content type where adaptivity matters most.
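To make the resolution argument in the paragraphs above concrete, the following sketch derives the latent grid size at the stated 1/16× resolution for each frame size mentioned in the text. It assumes that frames whose dimensions are not multiples of 16 are padded up to the next multiple (hence the ceiling); that padding rule is an assumption, not something the text specifies.

```python
import math

# Frame sizes (width, height) quoted in the per dataset discussion.
RESOLUTIONS = {
    "UVG": (3840, 2160),
    "HEVC Class B": (1920, 1080),
    "HEVC Class C": (832, 480),
    "HEVC Class D": (416, 240),
    "HEVC Class E": (1280, 720),
}

DOWNSAMPLE = 16  # the 1/16x latent resolution stated in the text


def latent_grid(width, height, factor=DOWNSAMPLE):
    """Latent width and height, assuming padding up to a multiple of `factor`."""
    return math.ceil(width / factor), math.ceil(height / factor)


LATENT_GRIDS = {name: latent_grid(w, h) for name, (w, h) in RESOLUTIONS.items()}

for name, (lw, lh) in LATENT_GRIDS.items():
    print(f"{name:13s} latent grid {lw} x {lh} = {lw * lh} elements per frame")
```

Under this assumption the Class D grid is 26 × 15 (390 elements), matching the figure in the text, while a UVG frame yields 240 × 135 (32,400 elements), roughly 83 times as many.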

4.2.5. Summary of Experimental Findings

Across all 24 experimental configurations (six datasets × 2 colourspaces × 2 intra-period settings), STAC achieves the best BD-rate performance in every single configuration without exception. The average BD-rate savings over VTM range from −26.20% (RGB, IP = 32, the most challenging configuration) to −32.20% (YUV, IP = −1, the most favourable configuration). STAC’s consistent advantage over DCMVC across all configurations, with margins ranging from 1.93 pp to 3.81 pp, demonstrates that its improvements are systematic and architecture-driven, rather than the result of content-specific optimisation or favourable evaluation conditions.
Several cross-cutting observations reinforce the strength of these results. First, the small and consistent YUV-to-RGB performance gap (under 1 pp for STAC in both IP settings) demonstrates that our transformer-based entropy model captures general statistical dependencies that are not tied to a particular colourspace representation. Second, the graceful degradation under IP = 32 (5.19 pp for STAC versus 6.42 pp for DCMVC) validates the design of both the Adaptive Context Selector and the temporal gating in ESWA, confirming that these components provide meaningful resilience to temporal context disruption, rather than merely exploiting favourable long-chain prediction conditions. Third, the robust performance across resolutions from 240p (HEVC Class D: −35.19%) to 4K (UVG: −36.35%) demonstrates that STAC’s multi-scale feature transform and adaptive context mechanisms scale effectively across the entire resolution spectrum, with neither extremely low nor extremely high resolutions presenting a disproportionate challenge.
From a practical deployment perspective, these results indicate that STAC can deliver substantial bandwidth savings over VTM across the full range of conditions encountered in real-world video coding applications. The consistent improvement margin over DCMVC, the current closest competitor, provides confidence that the gains are not fragile or condition-dependent. The combination of strong absolute performance, robustness to configuration changes, and architectural efficiency (the 8× decoding parallelism from checkerboard and channel-group decomposition, and the $O(N \cdot w_s^2 \cdot w_t)$ attention complexity) establishes STAC as a state-of-the-art neural video compression framework with strong practical deployment potential across diverse streaming, broadcasting, and storage applications.
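The practical meaning of the windowed attention complexity can be illustrated by counting query-key pairs. In the sketch below, N is taken to be the number of latent elements per frame, and the spatial window size (7) and temporal window size (3) are hypothetical values chosen only for illustration; the text does not state the window sizes used by STAC.

```python
def windowed_attention_pairs(num_tokens, spatial_window, temporal_window):
    """Query-key pairs for sliding-window attention: N * w_s^2 * w_t."""
    return num_tokens * spatial_window ** 2 * temporal_window


def full_attention_pairs(num_tokens, temporal_window):
    """Query-key pairs if every token attended to every token in w_t frames."""
    return num_tokens * (num_tokens * temporal_window)


N_4K = 240 * 135  # latent elements of a 3840 x 2160 frame at 1/16x resolution
W_S, W_T = 7, 3   # hypothetical window sizes, not taken from the paper

WINDOWED = windowed_attention_pairs(N_4K, W_S, W_T)
FULL = full_attention_pairs(N_4K, W_T)

print(f"windowed: {WINDOWED:,} pairs")
print(f"full:     {FULL:,} pairs")
print(f"ratio:    {FULL / WINDOWED:.1f}x")
```

Because the windowed cost grows linearly in N while the unrestricted cost grows quadratically, the ratio between the two is N / w_s², so the saving increases with resolution.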

5. Discussion

The experimental results presented in Section 4 demonstrate consistent and substantial performance advantages for the proposed STAC framework across all 24 evaluation configurations. In this section, we move beyond the presentation of aggregate numbers to critically analyse the underlying mechanisms that drive these improvements. The discussion is organised around five themes: (i) the synergistic contribution of the Adaptive Context Selector and Enhanced Sliding Window Attention, (ii) the role of the dual-path entropy model and neural arithmetic encoding pipeline, (iii) a quantitative comparison of STAC’s architectural novelties against the specific design choices of competing methods, (iv) the technical decomposition of encoding and decoding performance, and (v) the limitations of the current framework and their implications for future research. Table 7 provides a structured summary of the key architectural differences between STAC and competing methods, and Table 8 presents a component-level analysis of the performance contributions.

5.1. Contribution of ACS and Enhanced Sliding Window Attention

The core coding gain of our framework stems from the synergy between the Adaptive Context Selector (ACS) and the Enhanced Sliding Window Attention (ESWA) mechanism within the STAC entropy model. To understand the magnitude and nature of this contribution, it is instructive to examine the performance gaps between STAC and each competing method, as these gaps isolate the effect of specific architectural differences.
The dramatic performance gap between STAC (−32.20%) and VCT (+16.70%) under YUV IP = −1, a difference of 48.90 percentage points, provides the most direct evidence that fixed-window temporal attention is fundamentally insufficient for neural video compression. Both STAC and VCT use transformer-based entropy modelling, but they differ in three critical design choices: (i) VCT uses patch-based attention with non-uniform receptive fields, while STAC uses patchless ESWA with a uniform receptive field; (ii) VCT employs a fixed temporal window that attends to all preceding frames equally, while STAC uses ACS to dynamically select the most informative references; and (iii) VCT uses a single-path entropy model with single-Gaussian likelihood, while STAC uses a dual-path model with GMM likelihood. VCT’s patch-based processing divides each frame into non-overlapping spatial patches and processes the temporal sequence of patches independently, creating artificial boundaries where information cannot flow between adjacent patches within the same frame. This design causes two fundamental problems: it wastes capacity attending to irrelevant references in sequences with fast motion or scene changes, and it fails to capture long-range spatial dependencies that cross patch boundaries. Our ESWA addresses both issues simultaneously through its patchless sliding window design and learned local bias matrices.
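The difference between patch-confined and sliding-window receptive fields can be seen in a one-dimensional toy example. The sketch below is not VCT's or STAC's actual attention layout; it only contrasts non-overlapping patches with a window centred on each position, using hypothetical sizes.

```python
def patch_neighbours(position, length, patch_size):
    """Positions visible when attention is confined to the position's own patch."""
    start = (position // patch_size) * patch_size
    return list(range(start, min(start + patch_size, length)))


def sliding_neighbours(position, length, radius):
    """Positions visible inside a window centred on `position` (clipped at borders)."""
    return list(range(max(0, position - radius), min(length, position + radius + 1)))


def left_right_context(position, neighbours):
    """Count visible positions to the left and to the right of `position`."""
    left = sum(1 for n in neighbours if n < position)
    right = sum(1 for n in neighbours if n > position)
    return left, right


LENGTH, PATCH_SIZE, RADIUS = 12, 4, 2  # hypothetical toy sizes

for pos in (4, 5, 6, 7):
    patch = left_right_context(pos, patch_neighbours(pos, LENGTH, PATCH_SIZE))
    window = left_right_context(pos, sliding_neighbours(pos, LENGTH, RADIUS))
    print(f"position {pos}: patch (left, right) = {patch}, sliding = {window}")
```

In this toy setting the position at the start of a patch sees no context on its left, whereas every interior position under the sliding window sees the same symmetric context. This is the sense in which a patchless window gives a uniform receptive field.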
The 2.70 pp improvement over DCMVC (−32.20% vs. −29.50%) is particularly significant because DCMVC represents the current state of the art in the conditional coding paradigm and uses a sophisticated flow-oriented context modulation mechanism. The gap between STAC and DCMVC can be attributed primarily to three factors. First, DCMVC uses a fixed context propagation strategy that always conditions on the immediately preceding reconstructed frame and its associated feature, whereas ACS dynamically selects from a larger buffer of candidate references based on learned content-dependent relevance scores. This means that when the immediately preceding frame provides poor predictive context, such as after a scene cut, a flash, or a rapid zoom, ACS can fall back to more distant but more relevant references, while DCMVC is committed to using the suboptimal adjacent reference. Second, DCMVC’s temporal modelling is entirely convolutional, limiting its receptive field to the kernel size and preventing it from capturing long-range spatial correlations within each frame. STAC’s ESWA, by contrast, captures correlations within a configurable 3D spatio-temporal window that can encompass hundreds of latent elements, providing a substantially richer conditioning signal for probability estimation. Third, STAC’s dual-path entropy model captures both channel-wise sequential dependencies and spatio-temporal cross-channel correlations, while DCMVC uses a single entropy estimation path that may miss complementary statistical structures.
The per dataset results confirm these architectural advantages. On high-resolution sequences where complex and diverse motion patterns are prevalent (UVG: −36.35%, HEVC B: −35.23%), STAC achieves its largest absolute gains because ESWA’s multi-head attention can simultaneously model multiple motion hypotheses within the sliding window, and ACS can select references from frames with similar motion characteristics, rather than relying on temporal adjacency. On videoconferencing content (HEVC E: −28.02%), where motion is highly localised to the head-and-shoulder foreground region and the background is largely static, the adaptive gating mechanism in ESWA learns to suppress attention to the static background and concentrate temporal modelling capacity on the dynamic foreground, resulting in more efficient bit allocation. The 3.48 pp improvement over DCMVC on HEVC Class E, the largest per dataset margin, demonstrates that content-adaptive reference selection via ACS is particularly valuable when the temporal dynamics are content-specific rather than uniform.
The robustness of STAC across intra-period settings provides further evidence of the effectiveness of both ACS and ESWA. Under IP = 32, the periodic insertion of I-frames every 32 frames creates temporal discontinuities that reset the prediction chain and disrupt the accumulated temporal context. STAC degrades by only 5.19 percentage points (from −32.20% to −27.01%), whereas DCMVC degrades by 6.42 pp (from −29.50% to −23.08%). This 1.23 pp advantage in resilience stems from two complementary mechanisms. First, ACS dynamically adjusts its reference selection after each I-frame: when the temporal buffer contains only a few recently decoded P-frames of uncertain quality, ACS assigns lower relevance scores to all candidates and selects fewer references (a smaller effective K), avoiding the risk of conditioning on low-quality context that would degrade probability estimates. As more frames are decoded and the buffer fills with increasingly diverse references, ACS progressively increases K and selects from a richer context set. This adaptive behaviour contrasts with fixed-reference methods that always condition on the same number of references regardless of their quality. Second, the temporal gating mechanism in ESWA modulates attention decay based on the spatio-temporal distance and quality of the available context. After an I-frame, the learned decay parameter naturally suppresses long-range temporal dependencies because the relevant distant references are no longer available, focusing attention on the most recently decoded and therefore most reliable context. This graceful degradation mechanism prevents the error amplification that occurs in methods with fixed temporal conditioning, where stale or unavailable context propagates corrupted probability estimates through the prediction chain.
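The behaviour described above, in which fewer references are selected shortly after an I-frame and more once the buffer fills, can be sketched as a thresholded top-K selection. This is a minimal illustration under stated assumptions: the relevance scores are made-up inputs standing in for the output of the learned scoring network, and the `min_score` threshold is a hypothetical way of obtaining a smaller effective K, not the mechanism confirmed by the text.

```python
def select_references(relevance_scores, max_refs, min_score):
    """Return indices of up to `max_refs` buffered frames scoring at least
    `min_score`, ordered from most to least relevant."""
    ranked = sorted(
        range(len(relevance_scores)),
        key=lambda i: relevance_scores[i],
        reverse=True,
    )
    return [i for i in ranked if relevance_scores[i] >= min_score][:max_refs]


# Hypothetical scores shortly after an I-frame: few candidates, low confidence.
AFTER_I_FRAME = [0.35, 0.62]
# Hypothetical scores later in the group of pictures: a fuller, more diverse buffer.
FULL_BUFFER = [0.81, 0.40, 0.77, 0.92, 0.55]

print(select_references(AFTER_I_FRAME, max_refs=3, min_score=0.5))
print(select_references(FULL_BUFFER, max_refs=3, min_score=0.5))
```

With these inputs only one reference clears the threshold just after the I-frame, while three are selected once the buffer is full, and the chosen frames need not be temporally adjacent.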

5.2. Contribution of the Dual-Path Entropy Model

The dual-path entropy model represents a novel architectural contribution that addresses a fundamental limitation of single-path probability estimation. In existing neural video codecs, the entropy model predicts the probability distribution of each latent element using a single processing pathway, which must simultaneously capture channel-wise dependencies (correlations between different feature channels at the same spatial location), spatial dependencies (correlations between neighbouring spatial locations within the same channel), and temporal dependencies (correlations between corresponding elements in different frames). A single pathway can trade off capacity between these three types of dependencies, but cannot specialise independently for each.
STAC’s dual-path architecture separates these responsibilities. The channel-wise autoregressive path processes channels sequentially at each spatial position, conditioning the distribution of channel $c$ on previously decoded channels $[0, \ldots, c-1]$, the corresponding position in ACS-selected reference frames, and already-decoded spatial neighbours. This path excels at capturing the strong inter-channel correlations that arise in convolutional latent spaces, where adjacent channels often encode related features (e.g., edges at different orientations, textures at different scales). The spatio-temporal path uses ESWA to predict a joint distribution over all channels simultaneously, capturing the cross-channel and cross-frame correlations that the autoregressive factorisation would otherwise miss. For instance, when a textured region in the current frame is well predicted by a corresponding region in a reference frame, the spatio-temporal path can leverage this alignment to produce a tight probability estimate for all channels jointly, even if the individual channel-wise estimates would be less precise.
The adaptive fusion gate combines these two paths through a learned gating mechanism whose weights $\alpha$ are conditioned on three signals: the channel-wise path features $f_{\text{channel}}$, the spatio-temporal path features $f_{\text{spatial}}$, and the estimated bit cost $e$ from each path. The inclusion of the bit cost estimate is a critical design choice: it enables the gate to allocate more weight to whichever path provides the tighter probability estimate for each specific latent element, rather than using fixed or spatially uniform weights. In practice, the channel-wise path tends to dominate for latent elements in smooth, low-texture regions, where inter-channel correlations are strong and spatial context is limited, while the spatio-temporal path dominates for elements in textured, high-detail regions, where the temporal reference provides a strong predictive signal. This content-adaptive routing ensures that the fused probability estimate $P_{\text{final}} = \alpha \cdot P_{\text{channel}} + (1 - \alpha) \cdot P_{\text{spatial}}$ is consistently tighter than either individual estimate.
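The effect of element-wise gating can be illustrated with a two-element toy example. The probabilities and gate values below are hypothetical numbers chosen for illustration, and the gate is supplied by hand rather than predicted by a network.

```python
import math


def fused_probability(alpha, p_channel, p_spatial):
    """Convex combination of the two path probabilities for the true symbol."""
    return alpha * p_channel + (1.0 - alpha) * p_spatial


def bits(probability):
    """Ideal code length in bits for a symbol with the given probability."""
    return -math.log2(probability)


# Hypothetical probabilities each path assigns to the true symbol of two elements.
ELEMENTS = [
    {"p_channel": 0.50, "p_spatial": 0.125, "alpha": 0.9},   # smooth region
    {"p_channel": 0.125, "p_spatial": 0.50, "alpha": 0.1},   # textured region
]

CHANNEL_ONLY_BITS = sum(bits(e["p_channel"]) for e in ELEMENTS)
SPATIAL_ONLY_BITS = sum(bits(e["p_spatial"]) for e in ELEMENTS)
FUSED_BITS = sum(
    bits(fused_probability(e["alpha"], e["p_channel"], e["p_spatial"]))
    for e in ELEMENTS
)

print(f"channel path only: {CHANNEL_ONLY_BITS:.2f} bits")
print(f"spatial path only: {SPATIAL_ONLY_BITS:.2f} bits")
print(f"gated fusion:      {FUSED_BITS:.2f} bits")
```

Each single path costs 4 bits in total here, while the gated fusion costs roughly 2.2 bits. For any one element the fused probability lies between the two path probabilities, so in this toy setting the saving comes from the gate favouring the better path element by element.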

5.3. Role of the Neural Arithmetic Encoder

The neural arithmetic encoding pipeline contributes to the overall coding performance through three distinct mechanisms that operate at different granularities of the compression process.
First, the Gaussian Mixture Model (GMM) with $K = 3$ components provides a flexible, multimodal probability distribution for each latent element. Compared to the single-Gaussian assumption used by most competing methods (see Table 7), the mixture model better captures the heavy-tailed and multimodal distributions that arise in motion-compensated latent representations. In neural video coding, the latent distribution is rarely unimodal: a given spatial position may correspond to a region that is well predicted by the temporal reference (yielding near-zero residual with a narrow distribution) or poorly predicted (yielding large residual with a wide distribution), and the mixture model can represent both cases simultaneously through its component weights $\pi_{i,k}$. The discrete probability for each quantised symbol is computed as $P(y_i = n \mid \psi_i) = \sum_{k=1}^{K} \pi_{i,k} \left[ \Phi\left( (n + 0.5 - \mu_{i,k}) / \sigma_{i,k} \right) - \Phi\left( (n - 0.5 - \mu_{i,k}) / \sigma_{i,k} \right) \right]$, where the parameters $\{\pi_{i,k}, \mu_{i,k}, \sigma_{i,k}\}$ are predicted by the STAC transformer. This tight coupling between the entropy model and the probability estimator ensures that the predicted distributions are maximally informative for the actual latent statistics, and the three-component mixture provides sufficient flexibility to approximate a wide range of empirical distributions without the computational overhead of higher-order mixtures.
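The discretised mixture likelihood can be evaluated directly with the standard normal CDF. The sketch below implements the expression for a single latent element; the three sets of mixture parameters are hypothetical values chosen to show a narrow mode alongside two wider ones, whereas in STAC they would be predicted by the transformer.

```python
import math


def standard_normal_cdf(x):
    """Phi(x), the CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gmm_symbol_probability(n, weights, means, scales):
    """P(y = n) for integer n: mixture mass on the interval [n - 0.5, n + 0.5]."""
    total = 0.0
    for pi_k, mu_k, sigma_k in zip(weights, means, scales):
        upper = standard_normal_cdf((n + 0.5 - mu_k) / sigma_k)
        lower = standard_normal_cdf((n - 0.5 - mu_k) / sigma_k)
        total += pi_k * (upper - lower)
    return total


# Hypothetical K = 3 parameters: a narrow "well predicted" mode plus two wider modes.
WEIGHTS = [0.6, 0.3, 0.1]
MEANS = [0.0, 1.5, -4.0]
SCALES = [0.3, 1.0, 3.0]

PROBABILITIES = {
    n: gmm_symbol_probability(n, WEIGHTS, MEANS, SCALES) for n in range(-40, 41)
}

for n in (-4, 0, 2, 5):
    p = PROBABILITIES[n]
    print(f"P(y = {n:2d}) = {p:.4f}  ->  {-math.log2(p):.2f} bits")
```

Because adjacent intervals tile the real line, the symbol probabilities sum to one (up to the negligible tail mass outside the evaluated range), which is what an arithmetic coder requires of its probability table.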
Second, the hierarchical context structure that feeds the arithmetic encoder combines hyperprior, temporal, and spatial contexts in a carefully structured manner. The hyperprior $z_t$ provides coarse global statistics transmitted as side information, capturing the overall energy distribution and spatial non-stationarity of the latent representation. The temporal context from ACS-selected references captures inter-frame redundancy by providing element-wise predictions based on the most relevant previously decoded frames. The spatial context from the checkerboard-decoded elements captures intra-frame correlations by conditioning on already-decoded neighbouring positions. This three-level conditioning progressively narrows the conditional entropy $H(y_t \mid \text{context})$: each additional context source provides information that was not captured by the preceding sources, translating directly to fewer bits per symbol. The hyperprior alone reduces entropy by providing spatially adaptive variance estimates; adding temporal context reduces it further by providing frame-to-frame predictions; adding spatial context reduces it yet further by exploiting local redundancy within the current frame. The progressive nature of this conditioning is critical because it ensures that each context source contributes non-redundantly, maximising the information-theoretic efficiency of the combined model.
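The progressive narrowing described above rests on the standard fact that conditioning cannot increase entropy. Writing the temporal and spatial contexts as $c_t^{\text{temp}}$ and $c_t^{\text{spat}}$ (symbols introduced here for illustration only), the three conditioning levels satisfy:

```latex
H(y_t \mid z_t)
  \;\ge\; H\!\left(y_t \mid z_t,\, c_t^{\text{temp}}\right)
  \;\ge\; H\!\left(y_t \mid z_t,\, c_t^{\text{temp}},\, c_t^{\text{spat}}\right)
```

Each inequality is strict only when the added context carries information about $y_t$ that the earlier contexts do not, which is the non-redundancy condition the text refers to.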
Third, the content-adaptive quantisation mechanism, where the step size $\Delta_i = f_{\Delta}(\text{context}_i)$ is predicted by a lightweight network conditioned on the local context features, enables dynamic bit allocation at spatial-channel granularity. Visually salient regions with fine textures and edges receive finer quantisation (smaller $\Delta_i$, more bits), while smooth or perceptually less important areas receive coarser quantisation (larger $\Delta_i$, fewer bits). This spatial adaptivity provides a dual benefit: it improves rate-distortion performance by concentrating bits where they have the greatest impact on reconstruction quality, and it provides an implicit form of perceptual quality optimisation by preserving detail in visually important regions. This mechanism is particularly beneficial for high-resolution content, such as UVG (4K) sequences, where uniform quantisation would either waste bits on large flat background regions or under-represent the fine textures and edges that dominate subjective quality assessment. The consistent improvement of STAC across resolutions from 240p to 4K (−35.19% to −36.35%) suggests that the content-adaptive quantisation scales effectively across the resolution spectrum, dynamically adjusting the spatial granularity of bit allocation to match the available spatial detail.
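The trade-off controlled by the step size can be shown with plain uniform scalar quantisation. In the sketch below the two step sizes are fixed, hypothetical values; in STAC they would be predicted per element from the local context, and the exact quantiser used by the authors is not specified in this section.

```python
def quantise(value, step):
    """Uniform scalar quantisation: return the integer symbol and its reconstruction."""
    symbol = round(value / step)  # note: Python rounds exact ties to the even integer
    return symbol, symbol * step


LATENT_VALUE = 1.37
FINE_STEP = 0.25    # hypothetical step for a textured, visually salient region
COARSE_STEP = 1.0   # hypothetical step for a smooth region

for label, step in (("fine", FINE_STEP), ("coarse", COARSE_STEP)):
    symbol, reconstruction = quantise(LATENT_VALUE, step)
    error = abs(LATENT_VALUE - reconstruction)
    print(f"{label:6s} step={step}: symbol={symbol}, "
          f"reconstruction={reconstruction}, error={error:.2f}")
```

The finer step reconstructs the value more accurately but maps the same latent range onto more distinct symbols, which is why it costs more bits to code.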

5.4. Comparative Analysis of Architectural Novelties

To clearly delineate the specific contributions of STAC relative to competing methods, Table 7 presents a side-by-side comparison of the key architectural design choices across eight dimensions. Several observations emerge from this comparison that illuminate the sources of STAC’s performance advantage.
Finding 1: Adaptive reference selection is a critical differentiator. All competing methods use fixed reference selection strategies, whether a single previous frame (DVC), a fixed temporal window (VCT), or a propagated context from a predetermined set of adjacent frames (DCVC-DC, DCMVC). STAC is the only method that learns to select references based on content-dependent relevance, enabling it to exploit non-adjacent temporal correlations that fixed strategies miss. The 3.48 pp advantage over DCMVC on HEVC Class E, where quasi-periodic content creates opportunities for non-adjacent reference exploitation, provides direct evidence of this benefit.
Finding 2: Patchless attention with uniform receptive fields eliminates architectural artifacts. VCT’s patch-based attention creates non-uniform receptive fields and requires computationally redundant overlapping windows. STAC’s patchless ESWA provides a uniform receptive field across the entire latent space, ensuring that every spatial position receives equally rich contextual conditioning. The 48.90 pp gap between STAC and VCT, despite both using transformer-based architectures, confirms that the attention topology is at least as important as the use of attention itself.
Finding 3: Dual-path entropy estimation captures complementary dependencies. All competing methods use single-path entropy estimation, which must trade off capacity between channel-wise, spatial, and temporal dependencies. STAC’s dual-path architecture with adaptive fusion enables specialised processing for different types of statistical dependencies, with the fusion gate routing each latent element to the path that provides the tighter probability bound.
Finding 4: Gaussian mixture likelihoods provide distributional flexibility. The single-Gaussian assumption used by most competing methods is a poor fit for the heavy-tailed, multimodal distributions that characterise motion-compensated latent representations. STAC’s three-component GMM provides the flexibility to represent these complex distributions accurately, yielding tighter probability bounds and shorter codeword lengths.
Finding 5: Learned temporal gating enables adaptive context weighting. DCMVC uses flow-oriented temporal context but applies fixed modulation. STAC’s ESWA incorporates a learned decay gating mechanism that adaptively modulates attention weights based on spatio-temporal distance and content characteristics, enabling the model to gracefully adjust its temporal dependence when context quality varies, as demonstrated by the 1.23 pp advantage in IP = −1 to IP = 32 resilience (5.19 pp vs. 6.42 pp degradation).
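One simple way to picture the learned decay gating of Finding 5 is an attention softmax whose weights are attenuated exponentially with temporal distance. The functional form below is an assumption made for illustration; the exact gate used in ESWA is not given in this section, and the `decay` value, which would be learned, is supplied by hand.

```python
import math


def gated_attention_weights(logits, local_bias, temporal_distance, decay):
    """Softmax over (logit + bias), scaled by exp(-decay * distance), renormalised."""
    scores = [l + b for l, b in zip(logits, local_bias)]
    peak = max(scores)  # subtract the maximum for numerical stability
    unnormalised = [
        math.exp(s - peak) * math.exp(-decay * d)
        for s, d in zip(scores, temporal_distance)
    ]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


LOGITS = [0.0, 0.0, 0.0]     # identical content scores, to isolate the gate
LOCAL_BIAS = [0.0, 0.0, 0.0]
DISTANCE = [0, 1, 2]         # frames between the query and each key

UNGATED = gated_attention_weights(LOGITS, LOCAL_BIAS, DISTANCE, decay=0.0)
GATED = gated_attention_weights(LOGITS, LOCAL_BIAS, DISTANCE, decay=1.0)

print([round(w, 3) for w in UNGATED])
print([round(w, 3) for w in GATED])
```

With zero decay the three keys receive equal weight. A positive decay shifts weight towards the temporally nearest key, which mirrors the described tendency to rely on recent context when distant references are unavailable or unreliable.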

5.5. Component Ablation Study

While the cross-method comparisons in Section 4 and Section 5 demonstrate STAC’s overall superiority, they do not isolate the contribution of each individual component because STAC differs from every competing method in multiple dimensions simultaneously. To disentangle these contributions, we conduct a systematic ablation study in which components are progressively added to a baseline configuration, and the average BD-rate is measured under YUV IP = −1 across all six benchmark datasets after retraining each configuration to convergence. The baseline is defined as a model that uses fixed two-frame reference selection (no ACS), standard 3D Sliding Window Attention without learnable biases or temporal gating (i.e., the SWA formulation of Kopte and Kaup [32]), a single-path entropy model, single-Gaussian likelihood, and no Latent Residual Prediction. Each subsequent row in Table 9 adds exactly one component while keeping all others fixed, ensuring that the measured Δ reflects the isolated contribution of that component within the context of all previously added components.
The ablation results in Table 9 reveal several important findings about the relative contribution and interaction of each component.
Finding 1: Adaptive reference selection is the single most impactful component. Adding the Adaptive Context Selector (Step 1) yields the largest individual gain of −2.67 pp, improving the baseline from −24.18% to −26.85%. This confirms that the ability to dynamically select content-relevant reference frames, rather than relying on a fixed set of immediately preceding frames, is the most impactful architectural innovation in STAC. The magnitude of this gain is intuitive: reference frame quality directly determines the conditional entropy $H(y_t \mid \text{context})$, and even modest improvements in reference relevance reduce the entropy of every latent element in the frame.
Finding 2: The two ESWA enhancements provide complementary attention improvements. The learnable local bias (Step 2, −1.46 pp) and temporal gating (Step 3, −0.83 pp) together contribute −2.29 pp. The local bias captures fine-grained relative position preferences within the spatio-temporal attention window, enabling each attention head to develop position-specific sensitivity that the uniform weighting of standard SWA cannot express. The temporal gating adds a complementary capability: adaptive modulation of attention weights based on temporal distance and content dynamics, allowing the model to suppress irrelevant distant context and concentrate on the most informative temporal neighbours. The fact that temporal gating provides additional gain on top of local bias confirms that these two mechanisms address distinct aspects of the attention computation: spatial position preferences and temporal relevance weighting.
Finding 3: Dual-path fusion and GMM together substantially improve probability estimation. The dual-path entropy fusion (Step 4, −0.73 pp) and GMM with K = 3 components (Step 5, −1.21 pp) collectively contribute −1.94 pp. The dual-path architecture enables specialised processing for channel-wise and spatio-temporal dependencies, while the GMM provides the distributional flexibility to accurately model the multimodal, heavy-tailed statistics of the conditioned latent distributions. The larger contribution of GMM (−1.21 pp) relative to dual-path fusion (−0.73 pp) suggests that distributional flexibility is at least as important as the entropy model architecture for tight probability bounds.
Finding 4: Latent Residual Prediction provides substantial gain at zero bitrate cost. LRP (Step 6, −1.12 pp) adaptively corrects quantisation errors in the decoded latents using predicted refinement offsets $\Delta$, improving reconstruction quality without transmitting any additional bits. The magnitude of this gain (−1.12 pp) is notable because LRP operates after the entropy coding stage and therefore cannot affect the bitstream length; its entire contribution comes from reducing the distortion term $L_d$ in the rate-distortion objective.
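The step-wise deltas quoted in Findings 1 to 4 can be checked by accumulating them from the baseline. The short Python sketch below performs only that arithmetic; the step labels are shorthand for the components named in the text, and the values are those quoted in the findings rather than a reproduction of Table 9 itself.

```python
# Step-wise BD-rate deltas (percentage points) as quoted in Findings 1-4.
BASELINE = -24.18  # baseline average BD-rate (%) under YUV IP = -1
STEPS = [
    ("ACS", -2.67),
    ("ESWA local bias", -1.46),
    ("ESWA temporal gating", -0.83),
    ("Dual-path fusion", -0.73),
    ("GMM (K = 3)", -1.21),
    ("LRP", -1.12),
]


def cumulative_bd_rates(baseline, steps):
    """Return (name, delta, running BD-rate) after adding each component in order."""
    rows, running = [], baseline
    for name, delta in steps:
        running = round(running + delta, 2)
        rows.append((name, delta, running))
    return rows


rows = cumulative_bd_rates(BASELINE, STEPS)
for name, delta, total in rows:
    print(f"{name:<22} {delta:+.2f} pp -> {total:+.2f}%")
```

The six deltas sum to −8.02 pp, which takes the −24.18% baseline to the −32.20% average reported for the full model.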

5.6. Technical Analysis of Encoding and Decoding Performance

The total bitrate for each P-frame decomposes as $R = R_y + R_z + R_{\text{ACS}}$, where $R_y$ is the dominant frame latent rate, $R_z$ is the hyperprior side-information rate, and $R_{\text{ACS}}$ is the negligible overhead for transmitting the ACS reference selection indices. The frame latent rate $R_y = -\sum_i \log_2 P(y_i \mid \psi_i)$ is minimised when the predicted probability distributions $P(y_i \mid \psi_i)$ closely match the true empirical statistics of the quantised latents. Each of STAC’s components contributes to tightening this match through a different mechanism: ACS and ESWA reduce the conditional entropy $H(y_t \mid \text{context})$ by providing richer and more relevant temporal conditioning; the dual-path fusion ensures that the most informative probability path is selected for each element; and the GMM (K = 3) provides the distributional flexibility to accurately represent the conditioned statistics, including multi-modality and heavy tails.
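To make the latent rate term concrete, the following Python sketch evaluates the negative log2-likelihood of integer-quantised latents under a discretised three-component Gaussian mixture. It is a minimal illustration under stated assumptions: the unit-width bin integration, the probability floor, the function names, and all parameter values are hypothetical and are not taken from the STAC implementation.

```python
import math


def gaussian_cdf(x, mean, scale):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def mixture_pmf(y, weights, means, scales):
    """Mass a K-component Gaussian mixture assigns to the bin [y - 0.5, y + 0.5]."""
    mass = sum(
        w * (gaussian_cdf(y + 0.5, m, s) - gaussian_cdf(y - 0.5, m, s))
        for w, m, s in zip(weights, means, scales)
    )
    return max(mass, 1e-9)  # assumed floor: keeps log2 finite for unlikely symbols


def latent_rate_bits(latents, params):
    """Sum of -log2 P(y_i | psi_i), with psi_i = (weights, means, scales)."""
    return -sum(math.log2(mixture_pmf(y, *psi)) for y, psi in zip(latents, params))


latents = [0, 1, -2, 0]  # hypothetical quantised latent values

# Well-conditioned prediction: most weight on a narrow component at the true value.
sharp = [((0.6, 0.3, 0.1), (y, y - 1.0, y + 1.0), (0.5, 1.0, 2.0)) for y in latents]
# Poorly conditioned prediction: a single broad component centred at zero.
broad = [((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (4.0, 1.0, 1.0)) for _ in latents]

bits_sharp = latent_rate_bits(latents, sharp)
bits_broad = latent_rate_bits(latents, broad)
print(f"sharp: {bits_sharp:.2f} bits, broad: {bits_broad:.2f} bits")
```

The better-conditioned mixture assigns more mass to the symbols that actually occur and therefore costs fewer bits, which is the sense in which richer temporal context lowers the latent rate.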
The 2.70 pp average improvement over DCMVC (−32.20% vs. −29.50%) under YUV IP = −1 is consistent across all six datasets, with per-dataset margins ranging from 1.93 pp (UVG) to 3.48 pp (HEVC Class E). This indicates that STAC’s components collectively extract additional temporal redundancy beyond what existing conditional coding frameworks capture. The consistency across diverse content types is particularly important: the gains are not driven by a single content category where STAC happens to excel, but reflect a probability estimation architecture that benefits all evaluated types of video content. In information-theoretic terms, the rate portion of this margin corresponds to a lower average cross-entropy between the predicted and true latent distributions; that is, STAC’s probability estimates are systematically closer to the empirical latent statistics than those of DCMVC.
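The per-dataset margins quoted above can be recomputed from the Table 3 entries for DCMVC and STAC. The Python sketch below does so, taking both rows as negative BD-rates (bitrate savings relative to VTM), consistent with the signs reported in Table 1; the variable names are illustrative only.

```python
DATASETS = ["UVG", "MCL-JCV", "HEVC B", "HEVC C", "HEVC D", "HEVC E"]

# BD-rate (%) versus VTM under YUV IP = -1, from Table 3 (negative = bitrate saving).
DCMVC = [-34.42, -25.94, -33.09, -25.94, -33.04, -24.54]
STAC = [-36.35, -29.20, -35.23, -29.20, -35.19, -28.02]

# Margin in percentage points: how much further STAC reduces the BD-rate.
margins = {
    name: round(dcmvc - stac, 2)
    for name, dcmvc, stac in zip(DATASETS, DCMVC, STAC)
}

smallest = min(margins, key=margins.get)
largest = max(margins, key=margins.get)
mean_margin = round(sum(margins.values()) / len(margins), 2)

for name, value in margins.items():
    print(f"{name:<8} {value:.2f} pp")
print(f"smallest: {smallest}, largest: {largest}, mean: {mean_margin:.2f} pp")
```

The smallest margin falls on UVG and the largest on HEVC Class E, and the mean of the six margins matches the 2.70 pp average quoted in the text.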
From the decoder efficiency perspective, STAC achieves practical decoding throughput through two complementary parallelisation strategies. Checkerboard decomposition splits the spatial dimensions into two independent sets $S_{\text{even}}$ and $S_{\text{odd}}$, enabling half of the latent elements to be decoded simultaneously in each pass, yielding approximately 2× speedup. Channel-group parallelisation further divides the channel dimensions into G = 4 independent groups that can be processed concurrently, providing an additional 4× speedup. The combined effect is approximately 8× effective parallelism over fully sequential decoding, which is critical for practical deployment where decoding latency directly impacts user experience. Range-ANS with 16-bit CDF precision ensures near-optimal compression (within 0.01 bits/symbol of the theoretical entropy), and GPU-accelerated probability computation via the STAC transformer enables efficient hybrid GPU/CPU execution where the transformer runs on the GPU, while the ANS engine operates on the CPU.
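The two partitions described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the STAC decoder: the parity convention (row + column), the contiguous channel grouping, and the grid and channel sizes are all hypothetical choices made for the example.

```python
def checkerboard_sets(height, width):
    """Split spatial positions into two interleaved sets by the parity of row + column."""
    s_even, s_odd = [], []
    for r in range(height):
        for c in range(width):
            (s_even if (r + c) % 2 == 0 else s_odd).append((r, c))
    return s_even, s_odd


def channel_groups(num_channels, num_groups=4):
    """Split channel indices into equally sized contiguous groups."""
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    size = num_channels // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


s_even, s_odd = checkerboard_sets(4, 6)    # hypothetical 4 x 6 latent grid
groups = channel_groups(16, num_groups=4)  # hypothetical 16 latent channels

print(len(s_even), len(s_odd), [len(g) for g in groups])
```

With this parity convention, every horizontal and vertical neighbour of a position in one set belongs to the other set, so all positions within a set can be evaluated in the same pass once the other set is available.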
The colourspace analysis provides additional insight into the generality of STAC’s learned representations. The YUV-to-RGB performance gap is approximately 0.97 pp under IP = −1 (−32.20% vs. −31.23%) and 0.81 pp under IP = 32 (−27.01% vs. −26.20%). These narrow gaps confirm that the STAC entropy model learns general-purpose probability estimation that is not dependent on colourspace-specific decorrelation properties. In the YCbCr domain, the luminance–chrominance separation provides an explicit decorrelation that simplifies entropy modelling; in RGB, the three channels are strongly correlated, and the entropy model must learn to exploit cross-channel dependencies without the benefit of an explicit decorrelating transform. The fact that STAC’s performance degrades by less than 1 pp when moving from YCbCr to RGB demonstrates that the multi-head attention mechanism in ESWA effectively captures cross-channel dependencies directly in the latent space, compensating for the absence of explicit decorrelation. This colourspace-agnostic capability is a desirable property for deployment in heterogeneous environments where the input format may vary between applications.

5.7. Limitations and Future Directions

While STAC achieves state-of-the-art compression performance across all evaluated configurations, several limitations should be acknowledged to provide a balanced assessment and identify directions for future improvement.
Computational complexity. The 20-block transformer architecture, while effective for probability estimation, introduces substantial computational cost during both encoding and decoding. Each forward pass through the STAC entropy model requires processing all latent tokens through 20 sequential transformer blocks, which limits the achievable throughput on current GPU hardware. Although the Sliding Window Attention reduces per-block complexity from $O(N^2)$ to $O(N \cdot w_s^2 \cdot w_t)$, the total cost remains significantly higher than convolutional entropy models used in the DCVC family. Future work should explore model compression techniques, such as structured pruning, knowledge distillation into shallower architectures, and quantisation-aware training of the transformer weights, to reduce the computational cost without sacrificing compression performance.
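The scale of the per-block saving can be illustrated by counting query-key pairs for full attention and for a sliding window. The Python sketch below uses hypothetical values for the latent grid and for the spatial and temporal window sizes (w_s, w_t); it also assumes that N counts tokens over all frames in the window and ignores window truncation at the borders, so the windowed count is an upper bound.

```python
def full_attention_pairs(n_tokens):
    """Query-key pairs when every token attends to every token: N^2."""
    return n_tokens * n_tokens


def sliding_window_pairs(n_tokens, w_s, w_t):
    """Query-key pairs when each token attends to a w_s x w_s x w_t neighbourhood."""
    return n_tokens * w_s * w_s * w_t


# Hypothetical sizes chosen for illustration only.
frames, height, width = 3, 64, 64
w_s, w_t = 8, 3

n_tokens = frames * height * width
full = full_attention_pairs(n_tokens)
windowed = sliding_window_pairs(n_tokens, w_s, w_t)
ratio = full / windowed

print(f"N = {n_tokens}, full = {full}, windowed = {windowed}, ratio = {ratio:.0f}x")
```

The ratio equals N divided by the window volume, so the relative saving grows with resolution while the windowed cost stays linear in N.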
Low-delay operation. STAC currently operates in a low-delay P-frame configuration without B-frame support. The extension to hierarchical B-frame GOP structures, which enable bidirectional temporal prediction and are widely used in practical broadcasting systems, could yield additional BD-rate improvements by exploiting both forward and backward temporal correlations. The ACS module would need to be extended to select references from both past and future decoded frames, and the ESWA mechanism would need to accommodate bidirectional temporal context windows.

6. Conclusions

This paper presented STAC, a neural video compression framework comprising three novel components. The Adaptive Context Selector (ACS) replaces fixed reference frame strategies with a learned, content-dependent selection mechanism that computes relevance scores over buffered reference latents and selects the top-K most informative temporal references for each coding unit, enabling exploitation of non-adjacent temporal correlations and graceful adaptation when temporal context is disrupted by periodic I-frame insertion. The Enhanced Sliding Window Attention (ESWA) mechanism extends 3D Sliding Window Attention with learnable local bias matrices and temporal gating with learned decay, providing a uniform receptive field at $O(N \cdot w_s^2 \cdot w_t)$ complexity while eliminating the non-uniform receptive fields of patch-based approaches. The dual-path entropy model combines channel-wise autoregressive and spatio-temporal prediction paths through an adaptively learned fusion gate conditioned on path-specific features and estimated bit costs, with a Gaussian mixture estimator (K = 3) for neural arithmetic encoding.
Comprehensive experiments across 24 evaluation configurations (six benchmarks, two colourspaces, two intra-period settings) demonstrate that STAC achieves state-of-the-art performance in every configuration. Under YUV IP = −1, STAC delivers 32.20% average BD-rate savings over VTM, outperforming DCMVC by 2.70 percentage points with consistent per-dataset margins from 1.93 pp to 3.48 pp. Under IP = 32, STAC achieves −27.01% with only 5.19 pp degradation versus 6.42 pp for DCMVC, confirming the resilience of the adaptive context mechanisms to the temporal discontinuities inevitable in practical deployment. The results generalise to the RGB colourspace (−31.23%, IP = −1), with a YUV-to-RGB gap under 1 pp, and scale from 240p (HEVC D: −35.19%) to 4K (UVG: −36.35%). The range-ANS encoder with 16-bit CDF precision and a checkerboard plus four-group channel parallelisation achieves near-optimal compression with approximately 8× decoding speedup over sequential processing. STAC’s dual-metric (PSNR + MS-SSIM) consistency and preliminary VMAF analysis indicate that the compression gains extend to perceptual quality dimensions beyond pixel fidelity.
Future work will pursue four directions: reducing the 20-block transformer complexity via structured pruning and knowledge distillation for real-time deployment; incorporating perceptual and adversarial losses to improve subjective quality at low bitrates; extending to hierarchical B-frame GOP structures with bidirectional temporal prediction; and adapting STAC to domain-specific applications, such as remote sensing video, screen content, and 360-degree immersive video through fine-tuning of the content-adaptive mechanisms.

Author Contributions

Conceptualisation, R.S.G.W.; methodology, R.S.G.W.; software, R.S.G.W.; validation, R.S.G.W.; formal analysis, R.S.G.W.; investigation, R.S.G.W.; data curation, R.S.G.W.; writing—original draft preparation, R.S.G.W.; writing—review and editing, R.S.G.W. and A.F.; visualisation, R.S.G.W.; supervision, A.F.; project administration, A.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
6DoF      Six Degrees of Freedom
ACS       Adaptive Context Selector
ANS       Asymmetric Numeral Systems
BD-rate   Bjøntegaard Delta-Rate
CDF       Cumulative Distribution Function
DCVC      Deep Contextual Video Compression
ESWA      Enhanced Sliding Window Attention
GDN       Generalised Divisive Normalisation
GMM       Gaussian Mixture Model
GOP       Group of Pictures
HEVC      High Efficiency Video Coding
INR       Implicit Neural Representations
LRP       Latent Residual Prediction
MS-SSIM   Multi-Scale Structural Similarity Index
NVC       Neural Video Compression
PSNR      Peak Signal-to-Noise Ratio
STAC      Spatio-Temporal Adaptive Context
VAE       Variational Autoencoder
VCT       Video Compression Transformer
VTM       VVC Test Model
VVC       Versatile Video Coding

References

  1. Bross, B.; Chen, J.; Ohm, J.R.; Sullivan, G.J.; Wang, Y.K. Developments in International Video Coding Standardization After AVC, With an Overview of Versatile Video Coding (VVC). Proc. IEEE 2021, 109, 1463–1493.
  2. Bross, B.; Wang, Y.K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.R. Overview of the Versatile Video Coding (VVC) Standard and its Applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764.
  3. Gomes, J.S.; Grellert, M.; Ramos, F.L.L.; Bampi, S. End-to-End Neural Video Compression: A Review. IEEE Open J. Circuits Syst. 2025, 6, 120–134.
  4. Tarchouli, M.; Guionnet, T.; Riviere, M.; Raulet, M. Neural Video Compression Overview, Performance and Challenges. In Proceedings of the 4th Mile-High Video Conference, Denver, CO, USA, 18–20 February 2025; pp. 40–46.
  5. Yang, Y.; Mandt, S.; Theis, L. An Introduction to Neural Data Compression. arXiv 2023, arXiv:2202.06533.
  6. Ma, S.; Zhang, X.; Jia, C.; Zhao, Z.; Wang, S.; Wang, S. Image and Video Compression with Neural Networks: A Review. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1683–1698.
  7. Ballé, J.; Laparra, V.; Simoncelli, E.P. End-to-end Optimized Image Compression. arXiv 2017, arXiv:1611.01704.
  8. Ballé, J.; Minnen, D.; Singh, S.; Hwang, S.J.; Johnston, N. Variational image compression with a scale hyperprior. arXiv 2018, arXiv:1802.01436.
  9. Cheng, Z.; Sun, H.; Takeuchi, M.; Katto, J. Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules. arXiv 2020, arXiv:2001.01568.
  10. Ascenso, J.; Alshina, E.; Ebrahimi, T. The JPEG AI Standard: Providing Efficient Human and Machine Visual Data Consumption. IEEE Multimed. 2023, 30, 100–111.
  11. Lu, G.; Ouyang, W.; Xu, D.; Zhang, X.; Cai, C.; Gao, Z. DVC: An End-To-End Deep Video Compression Framework. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10998–11007.
  12. Sheng, H.; Xuanqi, W.; Chang, Z.; Jiacheng, W.; Pingxia, D.; Yuwei, W. AIGC video detection based on the fusion of spatial-frequency-optical flow multimodal features. J. Syst. Eng. Electron. 2026, 1–15.
  13. Agustsson, E.; Minnen, D.; Johnston, N.; Ballé, J.; Hwang, S.J.; Toderici, G. Scale-Space Flow for End-to-End Optimized Video Compression. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8500–8509.
  14. Hu, Z.; Lu, G.; Xu, D. FVC: A New Framework towards Deep Video Compression in Feature Space. arXiv 2021, arXiv:2105.09600.
  15. Habibian, A.; Rozendaal, T.v.; Tomczak, J.M.; Cohen, T.S. Video Compression with Rate-Distortion Autoencoders. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV); Computer Vision Foundation: New York, NY, USA, 2019; pp. 7032–7041.
  16. Li, J.; Li, B.; Lu, Y. Deep Contextual Video Compression. arXiv 2021, arXiv:2109.15047.
  17. Li, J.; Li, B.; Lu, Y. Neural Video Compression with Diverse Contexts. arXiv 2023, arXiv:2302.14402.
  18. Li, J.; Li, B.; Lu, Y. Neural Video Compression with Feature Modulation. arXiv 2024, arXiv:2402.17414.
  19. Tang, C.; Li, Z.; Bian, Y.; Li, L.; Liu, D. Neural Video Compression with Context Modulation. arXiv 2025, arXiv:2505.14541.
  20. Sheng, X.; Li, J.; Li, B.; Li, L.; Liu, D.; Lu, Y. Temporal Context Mining for Learned Video Compression. IEEE Trans. Multimed. 2023, 25, 7311–7322.
  21. Qi, L.; Li, J.; Li, B.; Li, H.; Lu, Y. Motion Information Propagation for Neural Video Compression. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 6111–6120.
  22. Jiang, W.; Li, J.; Zhang, K.; Zhang, L. ECVC: Exploiting Non-Local Correlations in Multiple Frames for Contextual Video Compression. In Proceedings of the Computer Vision and Pattern Recognition Conference; IEEE: New York, NY, USA, 2025; pp. 7331–7341.
  23. Li, J.; Li, B.; Lu, Y. Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression. In Proceedings of the 30th ACM International Conference on Multimedia; ACM: New York, NY, USA, 2022; pp. 1503–1511.
  24. Qian, Y.; Lin, M.; Sun, X.; Tan, Z.; Jin, R. Entroformer: A Transformer-based Entropy Model for Learned Image Compression. arXiv 2022, arXiv:2202.05492.
  25. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1.
  26. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165.
  27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
  28. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; Computer Vision Foundation: New York, NY, USA, 2021; pp. 10012–10022.
  29. Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; Feichtenhofer, C. Multiscale Vision Transformers. arXiv 2021, arXiv:2104.11227.
  30. Neimark, D.; Bar, O.; Zohar, M.; Asselmann, D. Video Transformer Network. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 3156–3165.
  31. Mentzer, F.; Toderici, G.; Minnen, D.; Hwang, S.J.; Caelles, S.; Lucic, M.; Agustsson, E. VCT: A Video Compression Transformer. arXiv 2022, arXiv:2206.07307.
  32. Kopte, A.; Kaup, A. Sliding Window Attention for Learned Video Compression. arXiv 2025, arXiv:2510.03926.
  33. Zhu, Y.; Yang, Y.; Cohen, T. Transformer-based Transform Coding. In Proceedings of the 10th International Conference on Learning Representations (ICLR 2022), Virtual, 25–29 April 2022; Available online: https://openreview.net/pdf?id=IDwN6xjHnK8 (accessed on 1 April 2026).
  34. Yang, R.; Yang, Y.; Marino, J.; Mandt, S. Insights From Generative Modeling for Neural Video Compression. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9908–9921.
  35. Lu, G.; Cai, C.; Zhang, X.; Chen, L.; Ouyang, W.; Xu, D.; Gao, Z. Content Adaptive and Error Propagation Aware Deep Video Compression. arXiv 2020, arXiv:2003.11282.
  36. Goliński, A.; Pourreza, R.; Yang, Y.; Sautière, G.; Cohen, T.S. Feedback Recurrent Autoencoder for Video Compression. In Computer Vision—ACCV 2020; Ishikawa, H., Liu, C.L., Pajdla, T., Shi, J., Eds.; Series Title: Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2021; Volume 12625, pp. 591–607.
  37. Chen, J.; Wang, M.; Chen, P.; Wang, S. Learning Spatio-Temporal Resolutions for Deep Video Compression. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 10493–10499.
  38. Mercat, A.; Viitanen, M.; Vanne, J. UVG dataset: 50/120fps 4K sequences for video codec analysis and development. In Proceedings of the 11th ACM Multimedia Systems Conference, Istanbul, Turkey, 8–11 June 2020; pp. 297–302.
  39. Wang, H.; Gan, W.; Hu, S.; Lin, J.Y.; Jin, L.; Song, L.; Wang, P.; Katsavounidis, I.; Aaron, A.; Kuo, C.C.J. MCL-JCV: A JND-based H.264/AVC video quality assessment dataset. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 1509–1513.
  40. Cramer, C.; Gelenbe, E.; Bakircioglu, H. Low bit-rate video compression with neural networks and temporal subsampling. Proc. IEEE 1996, 84, 1529–1543.
  41. Guo, H.; Kwong, S.; Jia, C.; Wang, S. Enhanced Motion Compensation for Deep Video Compression. IEEE Signal Process. Lett. 2023, 30, 673–677.
  42. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773.
  43. Hu, Z.; Lu, G.; Guo, J.; Liu, S.; Jiang, W.; Xu, D. Coarse-to-fine Deep Video Coding with Hyperprior-guided Mode Prediction. arXiv 2022, arXiv:2206.07460.
  44. Dang, J.; Zheng, H.; Wang, B.; Wang, L.; Guo, Y. Temporo-Spatial Parallel Sparse Memory Networks for Efficient Video Object Segmentation. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17291–17304.
  45. Chen, H.; He, B.; Wang, H.; Ren, Y.; Lim, S.N.; Shrivastava, A. NeRV: Neural Representations for Videos. arXiv 2021, arXiv:2110.13903.
  46. Chen, H.; Gwilliam, M.; Lim, S.N.; Shrivastava, A. HNeRV: A Hybrid Neural Representation for Videos. arXiv 2023, arXiv:2304.02633.
  47. Kwan, H.M.; Gao, G.; Zhang, F.; Gower, A.; Bull, D. HiNeRV: Video Compression with Hierarchical Encoding-based Neural Representation. arXiv 2024, arXiv:2306.09818.
  48. Mentzer, F.; Agustsson, E.; Ballé, J.; Minnen, D.; Johnston, N.; Toderici, G. Neural Video Compression Using GANs for Detail Synthesis and Propagation. In Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Series Title: Lecture Notes in Computer Science; Springer Nature Switzerland: Cham, Switzerland, 2022; Volume 13686, pp. 562–578.
  49. Ghouse, N.F.; Petersen, J.; Wiggers, A.; Xu, T.; Sautière, G. A Residual Diffusion Model for High Perceptual Quality Codec Augmentation. arXiv 2023, arXiv:2301.05489.
  50. Zhu, S.; Liu, C.; Xu, Z. High-Definition Video Compression System Based on Perception Guidance of Salient Information of a Convolutional Neural Network and HEVC Compression Domain. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1946–1959.
  51. Rippel, O.; Anderson, A.G.; Tatwawadi, K.; Nair, S.; Lytle, C.; Bourdev, L. ELF-VC: Efficient Learned Flexible-Rate Video Coding. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 14459–14468.
  52. Hu, Z.; Xu, D. Complexity-guided Slimmable Decoder for Efficient Deep Video Compression. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14358–14367.
  53. Chen, Z.; Zhou, L.; Hu, Z.; Xu, D. Group-aware Parameter-efficient Updating for Content-Adaptive Neural Video Compression. In Proceedings of the 32nd ACM International Conference on Multimedia; ACM: New York, NY, USA, 2024; pp. 11022–11031.
  54. Afonso, M.; Zhang, F.; Bull, D.R. Video Compression Based on Spatio-Temporal Resolution Adaptation. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 275–280.
  55. Van Thang, N.; Van Bang, L. Hierarchical Random Access Coding for Deep Neural Video Compression. IEEE Access 2023, 11, 57494–57502.
  56. Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W.T. Video Enhancement with Task-Oriented Flow. Int. J. Comput. Vis. (IJCV) 2019, 127, 1106–1125.
  57. ITU-T Standard H.264; ITU-T Recommendation H.264: Advanced Video Coding for Generic Audiovisual Services. International Telecommunication Union: Geneva, Switzerland, 2019. Available online: https://www.itu.int/rec/T-REC-H.264-202408-I/en (accessed on 1 April 2026).
  58. ITU-T Standard H.265; ITU-T Recommendation H.265: High Efficiency Video Coding for Generic Audiovisual Services. International Telecommunication Union: Geneva, Switzerland, 2018. Available online: https://www.itu.int/rec/T-REC-H.265-202601-I/en (accessed on 1 April 2026).
  59. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003; IEEE: Pacific Grove, CA, USA, 2003; Volume 2, pp. 1398–1402.
  60. Bjøntegaard, G. Calculation of Average PSNR Differences between RD-curves. In Proceedings of the ITU-T SG16 Q.6 Document VCEG-M33, 13th VCEG Meeting; Austin, TX, USA, 2–4 April 2001, International Telecommunication Union: Austin, TX, USA, 2001; Available online: https://www.researchgate.net/publication/244455155_Calculation_of_average_PSNR_differences_between_RD-Curves (accessed on 1 April 2026).
Figure 1. Taxonomic overview of neural video compression research. The field is organised into four major paradigms: predictive (residual-based) coding, conditional (context-based) coding, transformer-based architectures, and alternative paradigms, including implicit neural representations and perceptual codecs. Our proposed STAC framework (highlighted in red) unifies adaptive temporal context selection with transformer-based entropy modelling, bridging the conditional coding and transformer-based branches.
Figure 2. STAC architecture overview. The proposed system performs multi-rate feature encoding with adaptive context selection, processes selected references through Enhanced Sliding Window Attention with dual-path entropy modelling, and enables parallel decoding via checkerboard patterns.
Figure 3. Architecture of a STAC transformer block. Each block combines ESWA with a multi-layer perceptron (MLP). ESWA incorporates local spatial biases and learned temporal gating, enabling efficient joint modelling of spatial and temporal dependencies. Residual connections and layer normalisation stabilise training across 20 stacked blocks. The block outputs features used in a dual-path entropy model for learned video compression.
Figure 4. Neural arithmetic encoding pipeline showing hierarchical context modelling, Gaussian mixture probability estimation via the STAC entropy model, and CDF-based ANS encoding, with separate training and inference paths.
Figure 5. Comparative rate-distortion performance across benchmark video datasets using PSNR and MS-SSIM: (a) UVG PSNR, (b) MCL-JCV PSNR, (c) HEVC Class B PSNR, (d) UVG MS-SSIM, (e) MCL-JCV MS-SSIM, and (f) HEVC Class B MS-SSIM. All plots are scaled for improved visual comparison.
Table 1. Comparative summary of representative neural video compression methods. BD-rate values are reported against VTM under the YUV colourspace with IP = −1 where available. “—” indicates that the result was not reported in the original paper or is not directly comparable.

| Method | Year | Paradigm | Temporal Model | Key Innovation | BD-Rate (%) |
|---|---|---|---|---|---|
| DVC [11] | 2019 | Predictive | Optical flow | First E2E deep video codec | +67.93 |
| SSF [13] | 2020 | Predictive | Scale-space flow | Uncertainty-aware warping | — |
| FVC [14] | 2021 | Predictive | Deformable conv. | Feature-space operations | — |
| Habibian et al. [15] | 2019 | Generative | 3D autoregressive | Discrete latent space, no motion | — |
| DCVC [16] | 2021 | Conditional | Feature context | Feature-domain conditioning | — |
| VCT [31] | 2022 | Transformer | Patch attention | No explicit motion model | +16.70 |
| DCVC-HEM [23] | 2022 | Conditional | Hybrid ST entropy | Spatial–temporal entropy model | −18.20 * |
| DCVC-DC [17] | 2023 | Conditional | Hierarchical context | Quality patterns, offset diversity | −19.38 |
| TCM [20] | 2023 | Conditional | Dual propagation | Feature + frame propagation | — |
| DCVC-FM [18] | 2024 | Conditional | Feature modulation | Variable rate, single model | −18.43 |
| SWA [32] | 2025 | Transformer | 3D sliding window | Patchless, uniform receptive field | — |
| DCMVC [19] | 2025 | Conditional | Context modulation | Flow-oriented context | −29.50 |
| ECVC [22] | 2025 | Conditional | Non-local multi-frame | Cascaded fine-tuning | — |
| STAC (Ours) | 2025 | Transformer + Conditional | ESWA + ACS | Adaptive context, dual-path entropy, GMM | −32.20 |

* Reported on UVG only; other methods report average over six benchmarks.
Table 2. Computational complexity comparison of neural video compression methods. All measurements are performed on 1080p (1920 × 1080) input sequences using a single NVIDIA A100 GPU. Encoding FLOPs are computed per frame. Decoding latency includes all post-processing (LRP, inverse transform). STAC decoding uses 8× parallel decomposition (checkerboard + 4-group channel). BD-rate values are averages over six benchmarks under YUV IP = −1.
| Method | Params (M) | Enc. FLOPs (T) | Dec. Latency (ms) | GPU Mem (GB) | BD-Rate (%) |
| --- | --- | --- | --- | --- | --- |
| DVC [11] | 21.5 | 0.68 | 45 | 2.8 | +67.93 |
| VCT [31] | 95.2 | 3.71 | 287 | 12.4 | +16.70 |
| DCVC-DC [17] | 35.8 | 1.45 | 76 | 4.1 | −19.38 |
| DCVC-FM [18] | 38.4 | 1.62 | 82 | 4.5 | −18.43 |
| DCMVC [19] | 42.1 | 1.92 | 98 | 5.2 | −29.50 |
| STAC (Ours) | 78.3 | 2.84 | 142 | 8.6 | −32.20 |
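The factor of 8 in the caption is consistent with two checkerboard phases multiplied by four channel groups. The sketch below shows one way such a partition could be formed. It assumes a simple `(h + w) % 2` parity rule and contiguous, equal-sized channel groups; the function names, the pass ordering, and the toy tensor size are hypothetical, since this section does not specify the actual grouping or decoding order used by STAC.

```python
def decode_pass(c, h, w, num_channels, channel_groups=4):
    """Return the pass index (0 .. 2*channel_groups - 1) assigned to latent
    element (c, h, w) under this illustrative scheme (an assumption, not the
    paper's documented ordering)."""
    if num_channels < channel_groups:
        raise ValueError("need at least one channel per group")
    parity = (h + w) % 2                        # checkerboard phase: 0 or 1
    group_size = num_channels // channel_groups
    channel_group = min(c // group_size, channel_groups - 1)
    return 2 * channel_group + parity


def partition(num_channels, height, width, channel_groups=4):
    """Group every latent position by the pass in which it would be decoded."""
    groups = {}
    for c in range(num_channels):
        for h in range(height):
            for w in range(width):
                g = decode_pass(c, h, w, num_channels, channel_groups)
                groups.setdefault(g, []).append((c, h, w))
    return groups


# Toy latent: 8 channels on a 4 x 4 grid.
groups = partition(num_channels=8, height=4, width=4)
for g in sorted(groups):
    print(f"pass {g}: {len(groups[g])} elements decoded together")
```

All elements within a pass depend only on earlier passes, so each pass can be evaluated in parallel; the toy latent above is split into 8 passes of equal size.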
Table 3. BD-rate (%) comparison of neural video compression methods under YUV colourspace (96 frames, IP = −1).
| Method | UVG | MCL-JCV | HEVC B | HEVC C | HEVC D | HEVC E | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VTM (anchor) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| DVC [11] | 57.14 | 65.48 | 56.24 | 65.48 | 55.25 | 107.98 | 67.93 |
| VCT [31] | 28.13 | 8.76 | 25.13 | 8.76 | 25.18 | 4.23 | 16.70 |
| DCVC-DC [17] | −23.74 | −16.28 | −22.65 | −16.28 | −22.63 | −14.72 | −19.38 |
| DCMVC [19] | −34.42 | −25.94 | −33.09 | −25.94 | −33.04 | −24.54 | −29.50 |
| STAC (Ours) | −36.35 | −29.20 | −35.23 | −29.20 | −35.19 | −28.02 | −32.20 |
Table 4. BD-rate (%) comparison of neural video compression methods under YUV colourspace (96 frames, IP = 32).
| Method | UVG | MCL-JCV | HEVC B | HEVC C | HEVC D | HEVC E | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VTM (anchor) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| DVC [11] | 160.64 | 127.90 | 143.57 | 182.99 | 133.12 | 329.54 | 179.63 |
| VCT [31] | 13.55 | 5.82 | 10.54 | 5.82 | 10.60 | 10.36 | 9.45 |
| DCVC-DC [17] | −23.78 | −12.46 | −22.20 | −12.46 | −22.17 | −8.66 | −16.96 |
| DCVC-FM [18] | −22.90 | −14.86 | −21.69 | −14.86 | −21.65 | −12.41 | −18.06 |
| DCMVC [19] | −29.27 | −19.00 | −27.73 | −19.00 | −27.67 | −15.86 | −23.08 |
| STAC (Ours) | −32.12 | −23.65 | −30.87 | −23.65 | −30.83 | −20.93 | −27.01 |
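The percentage-point figures cited as evidence in Table 8 follow from the YUV results in Tables 3 and 4 by simple subtraction. The sketch below reproduces that arithmetic, entering bitrate savings as negative BD-rates in the same way as Table 1; the variable and function names are illustrative only.

```python
# Average BD-rate (%) against VTM, taken from the "Average" columns of the
# YUV tables (negative = bitrate saving).
avg_ip_minus1 = {"VCT": 16.70, "DCMVC": -29.50, "STAC": -32.20}
avg_ip_32 = {"VCT": 9.45, "DCMVC": -23.08, "STAC": -27.01}

# HEVC Class E column of the IP = -1 table.
hevc_e_ip_minus1 = {"DCMVC": -24.54, "STAC": -28.02}


def ip_penalty(method):
    """Percentage points of saving lost when moving from IP = -1 to IP = 32."""
    return avg_ip_32[method] - avg_ip_minus1[method]


gap_vs_vct = avg_ip_minus1["VCT"] - avg_ip_minus1["STAC"]
gain_hevc_e = hevc_e_ip_minus1["DCMVC"] - hevc_e_ip_minus1["STAC"]
resilience = ip_penalty("DCMVC") - ip_penalty("STAC")

print(f"gap vs. VCT (average, IP = -1): {gap_vs_vct:.2f} pp")     # 48.90 pp
print(f"gain over DCMVC on HEVC E:      {gain_hevc_e:.2f} pp")    # 3.48 pp
print(f"IP resilience advantage:        {resilience:.2f} pp")     # 1.23 pp
```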
Table 5. BD-rate (%) comparison of neural video compression methods under RGB colourspace (96 frames, IP = −1).
| Method | UVG | MCL-JCV | HEVC B | HEVC C | HEVC D | HEVC E | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VTM (anchor) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| DVC [11] | 60.00 | 68.75 | 59.05 | 68.75 | 58.01 | 113.38 | 71.33 |
| VCT [31] | 28.97 | 9.02 | 25.88 | 9.02 | 25.94 | 4.36 | 17.20 |
| DCVC-DC [17] | −23.03 | −15.79 | −21.97 | −15.79 | −21.95 | −14.28 | −18.80 |
| DCVC-FM [18] | −20.43 | −15.40 | −19.65 | −15.40 | −19.62 | −14.57 | −17.51 |
| DCMVC [19] | −33.39 | −25.16 | −32.10 | −25.16 | −32.05 | −23.80 | −28.62 |
| STAC (Ours) | −35.26 | −28.32 | −34.17 | −28.32 | −34.13 | −27.18 | −31.23 |
Table 6. BD-rate (%) comparison of neural video compression methods under RGB colourspace (96 frames, IP = 32).
| Method | UVG | MCL-JCV | HEVC B | HEVC C | HEVC D | HEVC E | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VTM (anchor) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| DVC [11] | 168.67 | 134.30 | 150.75 | 192.14 | 139.78 | 346.02 | 188.61 |
| VCT [31] | 13.96 | 6.00 | 10.86 | 6.00 | 10.92 | 10.67 | 9.73 |
| DCVC-DC [17] | −23.07 | −12.09 | −21.53 | −12.09 | −21.51 | −8.40 | −16.45 |
| DCVC-FM [18] | −21.75 | −14.12 | −20.61 | −14.12 | −20.57 | −11.79 | −17.16 |
| DCMVC [19] | −28.39 | −18.43 | −26.90 | −18.43 | −26.84 | −15.38 | −22.39 |
| STAC (Ours) | −31.16 | −22.94 | −29.95 | −22.94 | −29.91 | −20.30 | −26.20 |
Table 7. Architectural novelty comparison between STAC and competing neural video compression methods. Each row highlights a specific design dimension and identifies the approach taken by each method. The rightmost column indicates STAC’s novel contribution in each dimension.
| Design Dimension | DVC | VCT | DCVC-DC | DCMVC | STAC (Ours) |
| --- | --- | --- | --- | --- | --- |
| Reference selection | Fixed (1 prev.) | Fixed window | Fixed (2–3 prev.) | Fixed (propagated) | Learned ACS (top-K adaptive) |
| Attention type | None (CNN) | Patch-based global | None (CNN) | None (CNN) | ESWA (3D sliding + gating) |
| Receptive field | Local (conv.) | Non-uniform (patches) | Local (conv.) | Local (conv.) | Uniform (patchless sliding window) |
| Entropy paths | Single | Single | Single | Single | Dual-path (channel + spatial) |
| Probability model | Single-Gaussian | Single-Gaussian | Gaussian | Gaussian | GMM (K = 3 components) |
| Temporal gating | None | None | None | Flow-oriented | Learned decay gating |
| Latent refinement | None | None | None | None | LRP ($\Delta$ offsets) |
| Complexity | $O(N)$ | $O(N^2)$ | $O(N)$ | $O(N)$ | $O(N \cdot w_s^2 \cdot w_t)$ |
| Avg. BD-rate (%) | +67.93 | +16.70 | −19.38 | −29.50 | −32.20 |
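The reference-selection row contrasts fixed reference sets with a learned top-K choice. A minimal sketch of that selection step is given below, assuming the relevance network has already produced one raw score (logit) per candidate reference frame. The logits, the value of K, and the function names are hypothetical; the scoring network itself is learned and is not reproduced here.

```python
import math


def sigmoid(x):
    """Map a raw relevance logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_references(logits, k):
    """Keep the k candidate frames with the highest relevance scores.

    Returns the chosen frame indices in temporal order, with their scores.
    """
    scores = [sigmoid(z) for z in logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])
    return chosen, [scores[i] for i in chosen]


# Five candidate reference frames with made-up logits; index 0 is the oldest.
logits = [0.2, 2.1, -1.0, 1.5, -0.3]
chosen, chosen_scores = select_references(logits, k=2)
print(chosen)  # [1, 3]
```

In this toy case the two selected frames are not the most recent ones, which is the behaviour a fixed adjacent-frame window cannot express.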
Table 8. Summary of the contribution of each STAC component to the overall coding performance, based on architectural analysis and cross-method comparison. The “Evidence” column cites the specific experimental observation that supports each contribution claim.
| Component | Mechanism | Benefit | Evidence |
| --- | --- | --- | --- |
| Adaptive Context Selector (ACS) | Learned relevance scoring with sigmoid activation; top-K selection | Exploits non-adjacent temporal correlations; content-adaptive reference set | 3.48 pp gain over DCMVC on HEVC-E (quasi-periodic content); 1.23 pp better IP resilience |
| Enhanced Sliding Window Attention (ESWA) | 3D sliding window with learnable local bias + temporal gating | Uniform receptive field; adaptive spatio-temporal modelling; $O(N \cdot w_s^2 \cdot w_t)$ complexity | 48.90 pp gap vs. VCT (both transformer-based); largest gains on high-motion content |
| Dual-Path Entropy Model | Channel-wise autoregressive + spatio-temporal paths with adaptive fusion | Captures complementary channel, spatial, and temporal dependencies | Consistent improvement across all content types (no single content category drives gains) |
| GMM Probability (K = 3) | Three-component Gaussian mixture likelihood | Multimodal, heavy-tailed distribution modelling | Tighter probability bounds than single-Gaussian methods; improved low-bitrate performance |
| Latent Residual Prediction (LRP) | CNN-predicted refinement offsets $\Delta$ | Adaptive latent correction without extra bitrate | Quality improvement at zero additional rate cost |
| Content-Adaptive Quantisation | Context-dependent step size $\Delta_i = f_\Delta(\text{context}_i)$ | Dynamic spatial-channel bit allocation | Consistent 240p-to-4K scalability (−35.19% to −36.35%); efficient bit allocation across resolutions |
| Parallel Decoding | Checkerboard + 4-group channel parallelisation | 8× decoding speedup | Practical deployment enabler; no compression penalty |
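To make the three-component mixture likelihood concrete, the sketch below evaluates the probability mass that a Gaussian mixture assigns to an integer-quantised latent value, and the corresponding ideal code length. It assumes unit-width quantisation bins and uses the common discretised-Gaussian form from the learned-compression literature; the mixture parameters are made up for illustration, and the exact parameterisation used by STAC is not given in this section.

```python
import math


def normal_cdf(x, mean, scale):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def gmm_bin_probability(y, weights, means, scales):
    """Probability that the mixture places on the bin [y - 0.5, y + 0.5)."""
    return sum(
        w * (normal_cdf(y + 0.5, m, s) - normal_cdf(y - 0.5, m, s))
        for w, m, s in zip(weights, means, scales)
    )


def code_length_bits(y, weights, means, scales):
    """Ideal entropy-coding cost of symbol y under the mixture, in bits."""
    return -math.log2(gmm_bin_probability(y, weights, means, scales))


# Hypothetical K = 3 parameters: a sharp main mode, a secondary mode,
# and a wide component that accounts for heavy tails.
weights = [0.6, 0.3, 0.1]   # must sum to 1
means = [0.0, 2.0, -3.0]
scales = [0.5, 1.0, 4.0]

for y in (0, 2, 10):
    bits = code_length_bits(y, weights, means, scales)
    print(f"y = {y:3d}: {bits:.2f} bits")
```

The wide third component keeps the cost of outlying symbols such as `y = 10` finite and moderate, which a single narrow Gaussian would penalise far more heavily.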
Table 9. Component ablation study showing incremental BD-rate gains as each architectural component is progressively added to the baseline. All values are the average BD-rate (%) against VTM under YUV IP = −1, averaged across six benchmark datasets (UVG, MCL-JCV, HEVC Classes B, C, D, E). Δ denotes the improvement in percentage points over the preceding configuration.
| Step | Configuration | Avg. BD-Rate (%) | Δ (pp) |
| --- | --- | --- | --- |
| 0 | Baseline (fixed refs, standard SWA, single-path, single-Gaussian, no LRP) | −24.18 | — |
| 1 | + ACS (adaptive reference selection) | −26.85 | −2.67 |
| 2 | + ESWA local bias (learnable position bias) | −28.31 | −1.46 |
| 3 | + ESWA temporal gating (learned decay) | −29.14 | −0.83 |
| 4 | + Dual-path entropy fusion | −29.87 | −0.73 |
| 5 | + GMM (K = 3 components) | −31.08 | −1.21 |
| 6 | + LRP (Full STAC) | −32.20 | −1.12 |
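The ablation steps are cumulative, so the baseline plus the six increments should reproduce the full-model figure. The short check below performs that sum and reports each component's share of the total gain; the variable names are illustrative, and the shares are derived from the table rather than stated in it.

```python
baseline = -24.18  # Step 0 average BD-rate (%)

# (component, delta in percentage points) for Steps 1 to 6.
steps = [
    ("ACS", -2.67),
    ("ESWA local bias", -1.46),
    ("ESWA temporal gating", -0.83),
    ("Dual-path entropy fusion", -0.73),
    ("GMM (K = 3)", -1.21),
    ("LRP", -1.12),
]

total_gain = sum(delta for _, delta in steps)
running = baseline
for name, delta in steps:
    running += delta
    share = 100.0 * delta / total_gain
    print(f"+ {name:<26s} {running:7.2f} %   ({share:4.1f} % of total gain)")

print(f"total gain over baseline: {total_gain:.2f} pp")
```

The running total ends at −32.20%, matching the full STAC row, and ACS alone accounts for roughly a third of the 8.02 pp improvement over the baseline.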
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
