Article

QP-Adaptive Dual-Path Residual Integrated Frequency Transformer for Data-Driven In-Loop Filter in VVC

Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 70101, Taiwan
* Author to whom correspondence should be addressed.
Sensors 2025, 25(13), 4234; https://doi.org/10.3390/s25134234
Submission received: 17 June 2025 / Revised: 3 July 2025 / Accepted: 4 July 2025 / Published: 7 July 2025
(This article belongs to the Special Issue Multimodal Sensing Technologies for IoT and AI-Enabled Systems)

Abstract

As AI-enabled embedded systems such as smart TVs and edge devices demand efficient video processing, Versatile Video Coding (VVC/H.266) becomes essential for bandwidth-constrained Multimedia Internet of Things (M-IoT) applications. However, its block-based coding often introduces compression artifacts. While CNN-based methods effectively reduce these artifacts, maintaining robust performance across varying quantization parameters (QPs) remains challenging. Recent QP-adaptive designs like QA-Filter show promise but are still limited. This paper proposes DRIFT, a QP-adaptive in-loop filtering network for VVC. DRIFT combines a lightweight frequency fusion CNN (LFFCNN) for local enhancement and a Swin Transformer-based global skip connection for capturing long-range dependencies. LFFCNN leverages octave convolution and introduces a novel residual block (FFRB) that integrates multiscale extraction, QP adaptivity, frequency fusion, and spatial-channel attention. A QP estimator (QPE) is further introduced to mitigate double enhancement in inter-coded frames. Experimental results demonstrate that DRIFT achieves BD rate reductions of 6.56% (intra) and 4.83% (inter), with an up to 10.90% gain on the BasketballDrill sequence. Additionally, LFFCNN reduces the model size by 32% while slightly improving the coding performance over QA-Filter.

1. Introduction

Emerging technologies such as the Internet of Things (IoT) and artificial intelligence (AI) are gradually contributing to improvements in human lifestyles. The widespread deployment of IoT devices has led to their close integration with various aspects of daily life, allowing not only human–device interaction but also inter-device communication for more collaborative services [1,2]. With advancements in data transmission and storage technologies, such as fifth-generation (5G) networks and big data, these IoT devices have evolved from performing only simple, single-purpose tasks to supporting multimedia-oriented applications, referred to as the Multimedia Internet of Things (M-IoT) [3,4]. M-IoT, especially when integrated with computer vision and AI, has facilitated significant developments in vision-based monitoring, such as road safety surveillance, smart agriculture, industrial automation, and healthcare services [4]. However, M-IoT is still subject to a variety of constraints that vary across different demands and scenarios. These may involve requirements for high storage or computational capacity, as well as for low power consumption or low latency. Additionally, the increasing resolution of modern devices (e.g., 4K and 8K ultra-high definition (UHD)) has introduced a potential challenge: bandwidth limitations caused by high-bit-rate content transmission [5,6].
To address these challenges, more efficient coding technologies are required [7], highlighting the importance of the High Efficiency Video Coding (HEVC/H.265) standard [8], developed by the Joint Collaborative Team on Video Coding formed by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. HEVC has been widely adopted in smart surveillance devices. However, even HEVC is unlikely to meet the growing demands of next-generation applications. Consequently, its successor, the Versatile Video Coding (VVC/H.266) standard [9], was finalized in July 2020, offering 50% greater coding efficiency than HEVC and providing a solid foundation for high-performance video communication in resource-constrained environments. Several studies [10,11] have validated VVC implementations on hardware platforms (e.g., NVIDIA Jetson) to explore their feasibility in embedded systems.
Despite the advancements in VVC, block-based hybrid coding methods still face significant challenges. Inevitably, transformation and quantization during the encoding process result in compression artifacts such as blocking, ringing, and blurring. Blocking artifact removal has seen notable improvements in recent works [12,13,14]. To address these issues, VVC incorporates in-loop filters, including the deblocking filter (DBF [15]), sample adaptive offset (SAO [16]), and adaptive loop filter (ALF [17]), as shown in Figure 1. These filters sequentially reduce blocking artifacts, mitigate ringing effects, and fine-tune the image quality. Although these conventional in-loop filters effectively reduce artifacts, they are predefined and cannot fully adapt to the dynamic nature of distortions. As a result, data-driven or AI-enabled approaches have emerged as a promising alternative.
Convolutional neural networks (CNNs) have proven to be powerful tools for numerous low-level vision tasks (e.g., super resolution [18], defogging [19], denoising [20]). Inspired by these successes, researchers have integrated these architectures into the fundamental modules of conventional coding tools, such as intra prediction [21,22,23], inter prediction [24,25,26], CU partitioning [27], and in-loop filtering [28,29,30,31,32,33,34,35,36].
In recent years, CNN-based in-loop filters [31,32,33,34,35,36] have shown significant effectiveness in reducing compression artifacts. However, many of these approaches encounter challenges when adapting to varying QPs. This limitation often requires training separate models for different compression levels, leading to increased time and resource consumption. Thus, several studies [35,36] have been dedicated to addressing this challenge by developing QP-adaptive approaches. However, there remains room for improvement in terms of computational complexity and compression performance.
Transformer-based architectures have gained increased attention in image and video restoration tasks due to their strong ability to model long-range dependencies and capture global contexts. For example, SwinIR [37] demonstrates impressive performance in image restoration by leveraging shifted window-based self-attention. Inspired by such advances, several transformer-based in-loop filtering methods have also been proposed [28,29,30], showing promising improvements in artifact removal and compression quality.
In this paper, we introduce a dual-path residual integrated frequency transformer (DRIFT), an efficient QP-adaptive network incorporating a lightweight frequency fusion convolutional neural network (LFFCNN) as the main processing path and a SwinIR transformer global skip connection (SGS) as the residual path. The experimental results show BD rate reductions of 6.56% and 4.83% in the intra and inter modes, respectively. Overall, our primary contributions are summarized below.
  • Our LFFCNN utilizes octave convolution to separate features into high- and low-frequency components, reducing the computational complexity. It incorporates a proposed frequency fusion residual block (FFRB) with four fundamental modules: MSBF for multiscale feature extraction, LFSQAM for QP adaptation, FFM for frequency information exchange, and HAM for the spatial and channel attention mechanism. Notably, LFFCNN achieves comparable BD rate reductions to QA-Filter [36] while reducing the parameter count by 32%.
  • LFFCNN enhances local features, whereas SGS leverages Swin transformers to capture long-range dependencies and global contexts. This complementary design significantly improves DRIFT’s performance. Experimental results demonstrate that integrating SGS enhances the BD rate reduction by an additional 1% in intra mode compared to LFFCNN alone.
  • A quantization parameter estimator (QPE) is proposed to mitigate double enhancement effects. This addition significantly enhances DRIFT’s coding efficiency, yielding a 0.59% improvement in the BD rate reduction for the inter mode.
  • The proposed DRIFT, as an AI-enabled method, is beneficial in improving the quality of reconstructed images on M-IoT devices that support the VVC standard.
The rest of this paper is organized as follows. Section 2 presents a review of related work, followed by the proposed DRIFT in Section 3. Section 4 shows the experimental results. Finally, Section 5 concludes the article.

2. Related Work

2.1. CNN-Based In-Loop Filters

CNN-based in-loop filters have made substantial strides in mitigating compression artifacts. As a pioneering work, Dai et al. introduced VRCNN [31], employing different kernel sizes within the same layer to extract multiscale features. Kim et al. proposed IACNN [32], expanding the breadth of receptive fields through a parallel network structure. Ding et al. introduced SEFCNN [33] with two serial subnets for feature extraction and enhancement, uniquely addressing the double enhancement issue by selectively applying enhancement at the frame level in low-delay P (LDP) mode and the CU level in random access (RA) mode. Pan et al. proposed EDCNN [34], combining parallel and serial approaches in their feature information fusion block. These diverse architectures demonstrate the potential of CNNs in adaptive filtering for various compression scenarios.
To further improve the generalizability, researchers have explored QP-adaptive methods. X. Song et al. [35] adopted QP maps as additional inputs to better control the filtering strength. However, this approach was limited by its reliance on bias adjustments and input-layer-only QP integration. Liu et al. addressed these limitations with QA-Filter [36], introducing a frequency and spatial QP-adaptive mechanism (FSQAM). This approach integrates QP information into each convolutional layer through weight modulation, allowing for the dynamic adjustment of the filtering strength across the entire network. The FSQAM mechanism represents a significant advancement in QP-adaptive filtering, but there remains room for improvement in the coding efficiency.

2.2. Vision Transformer

Vision transformers (ViT) [38] have revolutionized computer vision tasks by leveraging self-attention mechanisms to capture long-range dependencies in images. Whereas traditional CNNs process images through local receptive fields, transformers can model global relationships across the entire image space. However, the computational complexity of a standard ViT scales quadratically with the image size, making it challenging to process high-resolution images effectively. To address this limitation, SwinIR [37] introduces a hierarchical transformer architecture based on shifted windows, which computes self-attention within local windows while maintaining the ability to model cross-window connections through the shifting mechanism.
The effectiveness of SwinIR has been demonstrated across various image restoration tasks [39,40,41]. In this work, we adopt SwinIR with modifications to its mask mechanism to serve as a global skip connection, which we refer to as SwinIR transformer global skip (SGS). The attention mechanism enables the fusion of global context information while preserving local detailed features through the skip connection pathway.

3. Proposed Method

Rather than replacing existing components of the current VVC in-loop filter, we integrate our DRIFT between the DBF and SAO, as illustrated in Figure 2. Oversmoothing is a common issue for CNN-based filters, especially in inter prediction. To address this, DRIFT leverages the QP confidence from our QPE module to refine the input QP values. Additionally, we adopt the residual mapping (RM) module, following the method of Liu et al. [42], to further mitigate double enhancement. This design aims to improve the quality of decoded frames in both intra and inter modes. This section begins with a description of the network architecture, followed by the details of the FFRB, and concludes with the proposed QPE.

3.1. Network Architecture

As shown in Figure 3, the proposed DRIFT consists of two branches: LFFCNN as the main processing branch to enhance local details and SGS as a residual path to capture long-range dependencies. LFFCNN consists of three parts: octave convolution, FFRBs, and reconstruction layers. This architecture mitigates compression artifacts by integrating global and local features.

3.1.1. Octave Convolution

In LFFCNN, the input reconstructed frame is first processed by octave convolution [43] for frequency-based feature decomposition. Let $\alpha \in [0, 1]$ denote the ratio of channels allocated to low-frequency features. The general form of octave convolution processes input features $X_H \in \mathbb{R}^{(1-\alpha_{in}) c_{in} \times h \times w}$ and $X_L \in \mathbb{R}^{\alpha_{in} c_{in} \times h \times w}$ into output features $Y_H \in \mathbb{R}^{(1-\alpha_{out}) c_{out} \times h \times w}$ and $Y_L \in \mathbb{R}^{\alpha_{out} c_{out} \times h \times w}$ as

$$Y_H = F_{H \to H}(X_H^{in}) + F_{L \to H}(X_L^{in}), \qquad Y_L = F_{H \to L}(X_H^{in}) + F_{L \to L}(X_L^{in})$$

where $F_{H \to H}(\cdot)$ and $F_{L \to L}(\cdot)$ represent the intra-frequency convolutions, while $F_{L \to H}(\cdot)$ and $F_{H \to L}(\cdot)$ denote the inter-frequency convolutions with upsampling and pooling operations, respectively. Since our model starts with a single input stream, where $X_L^{in} = 0$, and we set $\alpha = 0.25$, the computation can be simplified to

$$Y_H = F_H(x), \qquad Y_L = F_L(\mathrm{AvgPool}(x))$$

where $x \in \mathbb{R}^{1 \times h \times w}$ is the input feature of the entire network, while $F_H(\cdot)$ and $F_L(\cdot)$ represent $3 \times 3$ convolutional layers.
This frequency-based decomposition offers two key advantages compared to direct feature extraction. First, it reduces the computational complexity and memory usage by processing some of the features at a lower resolution. Second, performing convolution on the downsampled features is equivalent to applying a larger receptive field on the original features, enabling the convolutional layers to capture broader spatial contexts without additional computational overhead.
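To make the simplified decomposition above concrete, the following is a minimal PyTorch sketch of the single-stream octave split with $\alpha = 0.25$; the total channel width of 64 is an illustrative assumption, not a value taken from this subsection.

```python
import torch
import torch.nn as nn

class OctaveSplit(nn.Module):
    """Minimal sketch of the simplified octave decomposition (single input stream,
    alpha = 0.25): the high-frequency path keeps full resolution, while the
    low-frequency path is average-pooled to half resolution first.
    The total channel count (64) is an assumed value for illustration."""
    def __init__(self, c_out=64, alpha=0.25):
        super().__init__()
        c_low = int(alpha * c_out)            # channels allocated to low-frequency features
        c_high = c_out - c_low                # channels allocated to high-frequency features
        self.f_high = nn.Conv2d(1, c_high, kernel_size=3, padding=1)
        self.f_low = nn.Conv2d(1, c_low, kernel_size=3, padding=1)
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, x):                     # x: (N, 1, h, w) reconstructed luma block
        y_high = self.f_high(x)               # Y_H = F_H(x), full resolution
        y_low = self.f_low(self.pool(x))      # Y_L = F_L(AvgPool(x)), half resolution
        return y_high, y_low

x = torch.randn(2, 1, 64, 64)
y_h, y_l = OctaveSplit()(x)                   # (2, 48, 64, 64) and (2, 16, 32, 32)
```

Because the low-frequency branch operates on the pooled input, each of its convolutions covers a larger spatial extent of the original frame, which mirrors the enlarged-receptive-field argument above.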

3.1.2. Frequency Fusion Residual Blocks

For hierarchical feature learning and the mapping of low-level features to high-level representations, the high- and low-frequency features $Y_H$ and $Y_L$ are processed in parallel through a cascade of 24 FFRBs:

$$[Y_H^{l+1}, Y_L^{l+1}] = F_{\mathrm{FFRB}}^{l}([Y_H^{l}, Y_L^{l}])$$

where $Y_H^{1} = Y_H$, $Y_L^{1} = Y_L$, and $l \in \{1, 2, \ldots, 24\}$. Let $[\hat{Y}_H, \hat{Y}_L]$ denote the final refined features after all FFRBs:

$$[\hat{Y}_H, \hat{Y}_L] = [Y_H^{25}, Y_L^{25}]$$
The detailed architecture of each FFRB will be presented in Section 3.2.

3.1.3. Reconstruction Layers

The refined features are further processed through our reconstruction module, where we employ subpixel convolution for the upsampling of the low-frequency features. For a feature map $X \in \mathbb{R}^{c \times h \times w}$, the subpixel convolution operation can be formulated as

$$PS(X)_{c, h, w} = X_{c \cdot r^2,\, h/r,\, w/r}$$

where $PS(\cdot)$ denotes the pixel shuffle operation that rearranges the elements of an $H \times W \times C \cdot r^2$ tensor into a tensor of shape $rH \times rW \times C$, and $r$ is the upscaling factor.
Let $f_{k \times k}(\cdot)$ represent a $k \times k$ convolutional layer. The reconstruction process can then be expressed as

$$\hat{Y}_L = PS(f_{3 \times 3}(\hat{Y}_L)), \qquad \hat{Y}_H = f_{3 \times 3}(\hat{Y}_H), \qquad Y_{\mathrm{LFFCNN}} = \hat{Y}_L + \hat{Y}_H$$
This design leverages subpixel convolution instead of traditional upsampling methods to generate more detailed high-frequency information during the upscaling process. Through channel shuffling into spatial locations, it effectively preserves and enhances fine details in the reconstructed features.
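A minimal PyTorch sketch of this reconstruction step is given below; the branch widths (48 high-frequency and 16 low-frequency channels), the upscaling factor r = 2, and the single-channel output are illustrative assumptions rather than values stated in this subsection.

```python
import torch
import torch.nn as nn

class Reconstruction(nn.Module):
    """Minimal sketch of the reconstruction layers: the low-frequency branch is
    upsampled with sub-pixel convolution (3x3 conv to r^2 channels + PixelShuffle),
    the high-frequency branch uses a plain 3x3 conv, and the two results are summed."""
    def __init__(self, c_high=48, c_low=16, r=2):
        super().__init__()
        self.conv_low = nn.Conv2d(c_low, r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)                  # (N, r^2, h/r, w/r) -> (N, 1, h, w)
        self.conv_high = nn.Conv2d(c_high, 1, kernel_size=3, padding=1)

    def forward(self, y_high, y_low):
        up_low = self.shuffle(self.conv_low(y_low))        # PS(f_3x3(Y_L))
        out_high = self.conv_high(y_high)                  # f_3x3(Y_H)
        return up_low + out_high                           # Y_LFFCNN

out = Reconstruction()(torch.randn(2, 48, 64, 64), torch.randn(2, 16, 32, 32))
print(out.shape)  # torch.Size([2, 1, 64, 64])
```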

3.1.4. SwinIR Transformer Global Skip

The SGS branch is adapted from SwinIR [37], with modifications to the window size and embedding dimension to align with the specific requirements of our task. In this work, we adjust them to 8 and 60, respectively. As illustrated in Figure 4c, SGS focuses on capturing long-range dependencies, providing a global structural representation of the entire network input x. Its output, Y S G S , is combined with the detailed local features Y L F F C N N produced by the LFFCNN branch to form the final output:
$$Y_{SGS} = F_{SGS}(x_{DRIFT}), \qquad Y_{DRIFT} = Y_{SGS} + Y_{LFFCNN}$$
This fusion of global and local features leverages the complementary strengths of both branches, resulting in enhanced overall visual quality. Specifically, SGS provides a global coarse structure, while LFFCNN refines local details, addressing compression artifacts effectively.

3.2. Details of FFRB

The FFRB consists of four key modules: multiscale branch fusion (MSBF), the lightweight FSQAM (LFSQAM), a frequency fusion module (FFM), and a hybrid attention module (HAM). The architecture diagrams of each corresponding module are shown in Figure 5 and Figure 6. The following subsections will detail the structure and function of each module.

3.2.1. Multiscale Branch Fusion

Inspired by VRCNN [31], and considering that VVC introduces more diverse block sizes and coding tools than HEVC, which leads to more varied distortion distributions, we propose MSBF. Following the design philosophy of MADNet [44], MSBF extracts features at different scales through three parallel branches. Let $x_{in} \in \mathbb{R}^{c \times h \times w}$ denote the input feature; MSBF can be formulated as

$$F_{left} = f_{3 \times 3}(f_{d2}(f_{1 \times 1}(x_{in}))), \qquad F_{mid} = f_{3 \times 3}(f_{d3}(f_{1 \times 1}(x_{in}))), \qquad F_{right} = f_{3 \times 3}(x_{in})$$

where the convolutional layer $f_{1 \times 1}(\cdot)$ reduces the channel dimension from $c$ to $c/4$, and $f_{dr}(\cdot)$ denotes a $3 \times 3$ depthwise dilated convolution with dilation rate $r$. The output features $F_{left}, F_{mid} \in \mathbb{R}^{c/4 \times h \times w}$ and $F_{right} \in \mathbb{R}^{c/2 \times h \times w}$ are then fused through

$$y_{fused} = f_{1 \times 1}([F_{left}, F_{mid}, F_{right}])$$

where $[\cdot]$ denotes channel-wise concatenation. Then, $f_{1 \times 1}$ produces the final output feature $y_{fused} \in \mathbb{R}^{c \times h \times w}$.
This design brings three main advantages: (1) the parallel branches with different receptive fields effectively capture multiscale feature representations required for handling VVC’s diverse block sizes, (2) the use of dilated convolutions enables larger receptive fields without increasing the computational complexity, and (3) the channel reduction strategy through 1 × 1 convolutions significantly reduces the number of parameters and the computational cost compared to standard convolution, while maintaining feature extraction capabilities.
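To make the branch structure concrete, below is a minimal PyTorch sketch of MSBF following the formulation above; the channel width (c = 48) and the omission of activation functions between layers are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class MSBF(nn.Module):
    """Minimal sketch of multiscale branch fusion: two branches reduce channels to c/4
    with 1x1 convs, apply 3x3 depthwise dilated convs (rates 2 and 3) and a 3x3 conv;
    the right branch is a plain 3x3 conv to c/2 channels. The three outputs are
    concatenated and fused back to c channels by a 1x1 conv."""
    def __init__(self, c=48):
        super().__init__()
        c4, c2 = c // 4, c // 2

        def branch(dilation):
            return nn.Sequential(
                nn.Conv2d(c, c4, 1),                                                    # channel reduction
                nn.Conv2d(c4, c4, 3, padding=dilation, dilation=dilation, groups=c4),   # depthwise dilated conv
                nn.Conv2d(c4, c4, 3, padding=1),
            )

        self.left = branch(2)
        self.mid = branch(3)
        self.right = nn.Conv2d(c, c2, 3, padding=1)
        self.fuse = nn.Conv2d(c4 + c4 + c2, c, 1)

    def forward(self, x):
        feats = torch.cat([self.left(x), self.mid(x), self.right(x)], dim=1)
        return self.fuse(feats)   # y_fused, back to c channels

y = MSBF()(torch.randn(2, 48, 64, 64))
```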

3.2.2. Lightweight FSQAM

In order to enhance the QP adaptiveness while reducing the parameters, we follow Liu et al.’s [36] FSQAM design and replace the original 5 × 5 convolution layers with 3 × 3 ones, which we refer to as LFSQAM. In VVC, the relationship between the QP and the quantization step ( Q s t e p ) is defined as
$$Q_{step} = 2^{(QP - 4)/6}$$

Here, FSQAM consists of two main components: a frequency QP-adaptive mechanism (FQAM) and a spatial QP-adaptive mechanism (SQAM). Let $z_{in} \in \mathbb{R}^{c \times h \times w}$ denote the input feature map. FQAM adaptively adjusts the filtering strength in the frequency domain by decomposing convolution kernels to operate on specific frequencies. The channel-wise scaling factor is introduced as

$$s_i = \frac{1}{1 + \theta \cdot Q_{step}^2}, \quad i \in \{1, \ldots, c\}$$

where $\theta$ is a learnable parameter. This scaling factor effectively modulates the convolution operation based on the compression quality:

$$z_{out} = (w \cdot s) * z_{in} + (b \cdot s) = (w * z_{in} + b) \cdot s$$

where $w$ and $b$ denote the convolution weights and bias terms, respectively. Whereas FQAM considers channel-wise adaptation, SQAM extends the QP-adaptive mechanism to the spatial domain by leveraging maximum and average pooling operations to capture diverse spatial information. Let $z_{sq}$ denote the input feature to SQAM. The spatial attention weights are computed through

$$A = \sigma(f_{5 \times 5}([\mathrm{MaxPool}(z_{sq}), \mathrm{AvgPool}(z_{sq})]))$$

This spatial attention mechanism helps to identify regions requiring different enhancement levels based on local feature characteristics and compression artifacts. The final output feature $\hat{z}_{sq}$ is obtained by

$$\hat{z}_{sq} = A \odot z_{sq}$$
As shown in Figure 5b, our lightweight implementation modifies the original design by replacing the 5 × 5 convolution with a 3 × 3 convolution in LFSQAM, reducing the computational complexity while maintaining the adaptive capability for varying compression qualities.
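As a concrete illustration of the QP-adaptive scaling, the sketch below computes Q_step from the QP and folds the resulting factor s into a 3 × 3 convolution (for example, QP = 32 gives Q_step = 2^(28/6) ≈ 25.4). Treating θ as a single learnable scalar per layer and the channel width of 48 are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def qstep(qp: float) -> float:
    """Quantization step from QP as in VVC: Qstep = 2^((QP - 4) / 6)."""
    return 2.0 ** ((qp - 4.0) / 6.0)

class FQAMConv(nn.Module):
    """Minimal sketch of the frequency QP-adaptive convolution: the output of a 3x3 conv
    is scaled by s = 1 / (1 + theta * Qstep^2), which is equivalent to modulating the
    weights and bias by s, i.e., (w * z + b) * s."""
    def __init__(self, c_in=48, c_out=48):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.theta = nn.Parameter(torch.tensor(1e-3))   # learnable parameter theta

    def forward(self, z, qp):
        s = 1.0 / (1.0 + self.theta * qstep(qp) ** 2)   # scaling factor for this QP
        return self.conv(z) * s

# The same layer adapts its response to the compression level:
z = torch.randn(1, 48, 64, 64)
layer = FQAMConv()
out_qp22 = layer(z, qp=22)   # Qstep = 8,     larger s
out_qp37 = layer(z, qp=37)   # Qstep ~ 45.25, smaller s
```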

3.2.3. Frequency Fusion Module

Building upon the concept of octave convolution [43], we design an FFM to facilitate the information exchange between high- and low-frequency features obtained from LFSQAM. In contrast to octave convolution, which employs four $3 \times 3$ convolutional layers for frequency information interaction, our FFM achieves efficient feature fusion through a more lightweight architecture with only two additional $1 \times 1$ convolutional layers for channel alignment, alongside the subpixel convolution operations. Let $H_{in}$ and $L_{in}$ denote the input high- and low-frequency features, respectively. The FFM operations can be formulated as

$$H_{out} = H_{in} + f_{1 \times 1}(L_{in}), \qquad L_{out} = L_{in} + \mathrm{SubPixelConv}(H_{in})$$
where S u b P i x e l C o n v ( · ) denotes the subpixel convolution path consisting of three cascaded convolutional layers followed by a pixel shuffle operation, as illustrated in Figure 6a. This design enables efficient bidirectional communication between frequency components while maintaining relatively low computational complexity.

3.2.4. Hybrid Attention Module

The HAM adaptively adjusts the proportions of original features and learned residual features through a combination of spatial and channel attention mechanisms. Let $x_{skip}$ and $x_{main}$ denote the skip branch input and main branch input, respectively. We first perform an interlaced concatenation operation:

$$X_{ic} = I(x_{skip}, x_{main})$$

where $I(\cdot)$ represents the interlaced concatenation operation that consists of channel-wise concatenation followed by channel shuffling. The concatenated feature $X_{ic} \in \mathbb{R}^{2c \times h \times w}$ is then processed through spatial ($F_s$) and channel ($F_c$) attention paths:

$$F_s = f_{1 \times 1}(L(f^{g}_{3 \times 3}(X_{ic}))), \qquad F_c = f_{1 \times 1}(\mathrm{GAP}(X_{ic}))$$

where $f^{g}_{3 \times 3}$ denotes $3 \times 3$ group convolution for efficient spatial feature extraction, and $L(\cdot)$ represents the leaky ReLU activation function with negative slope 0.2. Next, $F_c$ is upsampled and combined with $F_s$ to obtain the attention weights:

$$[w_{skip}, w_{main}] = I(\sigma(F_s + UP(F_c)))$$

where $w_{skip}, w_{main} \in \mathbb{R}^{c \times h \times w}$ are the skip branch weights and main branch weights, respectively. Finally, the attention weights are applied to their respective input features to generate the output feature:

$$y_{main} = w_{skip} \odot x_{skip} + w_{main} \odot x_{main}$$

where $\sigma(\cdot)$ denotes the sigmoid function, $\odot$ represents element-wise multiplication, and $UP(\cdot)$ is an unpooling operation.
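A minimal PyTorch sketch of HAM is shown below; the interlaced split of the sigmoid output into w_skip and w_main, the group count, and the channel width are assumptions made for illustration, and broadcasting stands in for the explicit UP(·) operation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def interlace(a, b):
    """Interlaced concatenation: channel-wise concat followed by channel shuffle,
    so channels of the two inputs alternate (a0, b0, a1, b1, ...)."""
    n, c, h, w = a.shape
    return torch.stack([a, b], dim=2).view(n, 2 * c, h, w)

class HAM(nn.Module):
    """Minimal sketch of the hybrid attention module: a spatial path (group 3x3 conv
    + leaky ReLU + 1x1 conv) and a channel path (global average pooling + 1x1 conv)
    are summed, passed through a sigmoid, and the resulting 2c weight maps are
    de-interlaced into w_skip / w_main before the weighted sum of the inputs."""
    def __init__(self, c=48, groups=4):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(2 * c, 2 * c, 3, padding=1, groups=groups),
            nn.LeakyReLU(0.2),
            nn.Conv2d(2 * c, 2 * c, 1),
        )
        self.channel = nn.Conv2d(2 * c, 2 * c, 1)

    def forward(self, x_skip, x_main):
        x_ic = interlace(x_skip, x_main)                       # (N, 2c, h, w)
        f_s = self.spatial(x_ic)                               # spatial attention path
        f_c = self.channel(F.adaptive_avg_pool2d(x_ic, 1))     # channel attention path, (N, 2c, 1, 1)
        w = torch.sigmoid(f_s + f_c)                           # broadcasting replaces explicit UP(.)
        w_skip, w_main = w[:, 0::2], w[:, 1::2]                # de-interlace into the two weight maps
        return w_skip * x_skip + w_main * x_main

y = HAM()(torch.randn(1, 48, 32, 32), torch.randn(1, 48, 32, 32))
```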

3.3. Quantization Parameter Estimator

To further mitigate the double enhancement issue in inter mode, we introduce a QPE to calibrate the QP values. Our approach is inspired by Zagoruyko et al.'s [45] two-channel architecture for comparing image patches, which we adapt to estimate the QP values efficiently. As illustrated in Figure 7, the proposed QPE processes deblocked input frames $x_{qpe}$ through four feature extract blocks (FEBs), followed by an inception block and fully connected layers for QP confidence estimation. The feature extraction process can be formulated as

$$F_i = \mathrm{FEB}_i(F_{i-1}), \quad i \in \{1, \ldots, 4\}$$

where $F_0 = x_{qpe}$, and $F_i$ represents the intermediate feature maps extracted by the $i$-th FEB with configuration $(c_{in}, c_{out})$, as shown in Figure 7b. Each FEB can be expressed as

$$F_{out} = \mathrm{MaxPool}(B(R(f_{k \times k}(F_{in}))))$$

where $R(\cdot)$ represents ReLU activation, and $B(\cdot)$ is the BatchNorm operation. The extracted features $F_4$ are further processed through an inception block with multiple parallel convolution paths. Figure 7c details the structure of our inception block. The output features are channel-wise concatenated with $F_4$:

$$\hat{F} = [f_{1 \times 1}(F_4), f^{1}_{3 \times 3}(F_4), f^{2}_{3 \times 3}(F_4), f^{avg}_{3 \times 3}(F_4), F_4]$$

where $f^{1}_{3 \times 3}$ and $f^{2}_{3 \times 3}$ represent two different $3 \times 3$ convolution paths, and $f^{avg}_{k \times k}$ indicates $k \times k$ average pooling followed by $1 \times 1$ convolution. The confidence scores $s_{qp}$ for the different QP values are then computed through

$$s_{qp} = \sigma(\mathrm{FC}(\mathrm{GAP}(\hat{F}))) \in \mathbb{R}^4$$

where $\mathrm{GAP}(\cdot)$ represents global average pooling, $\mathrm{FC}(\cdot)$ denotes fully connected layers, and $s_{qp}$ contains the confidence scores for the candidate QP values of 22, 27, 32, and 37. For inter prediction, these confidence scores from all patches are aggregated to determine the overall QP value of the reconstructed frame. This estimated QP is then used to adjust the QP input to DRIFT, mitigating double enhancement effects and potentially improving the overall video coding quality. A comprehensive evaluation of the QPE module's impact on system performance will be presented in Section 4.5.
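To summarize the QPE data flow, the following is a minimal PyTorch sketch; the per-FEB channel configuration, the inception branch widths, and the single fully connected layer are assumptions, since only the block structure is specified above.

```python
import torch
import torch.nn as nn

class FEB(nn.Module):
    """Feature extract block per the equation above: conv -> ReLU -> BatchNorm -> MaxPool."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, k, padding=k // 2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(c_out),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.body(x)

class QPE(nn.Module):
    """Minimal sketch of the QP estimator: four FEBs, an inception-style block whose
    branches are concatenated with the input feature, then GAP + FC + sigmoid yielding
    confidence scores for QP in {22, 27, 32, 37}."""
    def __init__(self):
        super().__init__()
        chans = [(1, 16), (16, 32), (32, 64), (64, 64)]      # assumed (c_in, c_out) per FEB
        self.febs = nn.Sequential(*[FEB(ci, co) for ci, co in chans])
        c = 64
        self.b1 = nn.Conv2d(c, 32, 1)
        self.b2 = nn.Conv2d(c, 32, 3, padding=1)
        self.b3 = nn.Conv2d(c, 32, 3, padding=1)
        self.b4 = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1), nn.Conv2d(c, 32, 1))
        self.fc = nn.Linear(32 * 4 + c, 4)

    def forward(self, x):                                     # x: deblocked patch, (N, 1, 64, 64)
        f4 = self.febs(x)
        f_hat = torch.cat([self.b1(f4), self.b2(f4), self.b3(f4), self.b4(f4), f4], dim=1)
        pooled = torch.flatten(nn.functional.adaptive_avg_pool2d(f_hat, 1), 1)
        return torch.sigmoid(self.fc(pooled))                 # confidence scores for QP 22/27/32/37

scores = QPE()(torch.randn(2, 1, 64, 64))                     # shape (2, 4)
```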

4. Experimental Results

This section begins by describing the experimental settings in Section 4.1. Section 4.2, Section 4.3, and Section 4.4 present the objective, subjective, and complexity evaluations, respectively. Finally, an ablation study is presented in Section 4.5.

4.1. Experimental Settings

For the training process of DRIFT, we utilized the DIV2K dataset [46], which comprises 900 diverse, high-quality RGB images. We allocated 800 images for training and 100 for validation. The dataset was processed using the VVC Test Model (VTM), where the reconstructed images after the deblocking filter served as model inputs, while the original uncompressed images were used as the ground truth. By segmenting the reconstructed images into non-overlapping 64 × 64 blocks, we obtained 2,250,624 training samples and 285,824 validation samples across four QPs. The model was trained using the mean square error (MSE) as the loss function and implemented using PyTorch 1.12.1, with weights initialized using Kaiming initialization [47] and optimized using the Adam optimizer [48]. Training was completed in approximately 22 epochs with a batch size of 72.
The QPE training utilized 120 videos from the REDS dataset [49], encoded under random access (RA) settings. Patches were labeled based on the mean square error (MSE) between the reconstructed and original frames, where patches with the lowest MSE received a label of 1, and others were assigned 0. The PyTorch model was trained for 100 epochs with a batch size of 64, utilizing the Adam optimizer for all layers.
For evaluation, we integrated the trained models into VTM-12.3 [31] via the LibTorch framework [50]. Following common test conditions (CTC) [51], we tested four QP values (22, 27, 32, and 37) under both all-intra (AI) and random access (RA) configurations. The evaluation used the first 64 frames of the VVC test sequences and was conducted on a system equipped with an Intel i7-10700 CPU and an NVIDIA RTX 3090Ti GPU. To assess the rate distortion performance, the BD rate was calculated based on the PSNR across the four QP settings. This metric reflects the average bit rate savings achieved at equivalent reconstruction quality, serving as a standard indicator for codec efficiency.
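For reference, the sketch below reproduces the standard Bjøntegaard delta-rate computation used for BD rate figures of this kind; it is not code from this paper, and the rate/PSNR points in the example are hypothetical rather than measured values.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Standard BD-rate procedure: fit cubic polynomials of log10(bitrate) versus PSNR
    for anchor and test codecs, integrate both over the overlapping PSNR range, and
    report the average bitrate difference in percent (negative = bitrate savings)."""
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)          # average log10 rate difference
    return (10 ** avg_diff - 1) * 100               # percent bitrate change

# Hypothetical rate (kbps) / PSNR (dB) points at QP 37/32/27/22:
anchor = ([800, 1500, 2900, 5600], [33.1, 35.4, 37.8, 40.2])
test = ([760, 1430, 2780, 5400], [33.2, 35.5, 37.9, 40.3])
print(bd_rate(anchor[0], anchor[1], test[0], test[1]))
```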

4.2. Objective Evaluation

To assess the coding performance of our proposed DRIFT, we adopt the BD rate on the luma component as the evaluation criterion. DRIFT outperforms previous methods [32,33,34,36] in both the AI and RA configurations. Under the AI configuration (Table 1), DRIFT achieved an average BD rate reduction of 6.56%, with a reduction of up to 10.90% for the BasketballDrill sequence. Under the RA configuration (Table 2), DRIFT attained an average BD rate reduction of 4.83%. Whereas QA-Filter [36] shows only a marginal improvement of 0.13% over EDCNN [34] in the RA configuration, compared to its more substantial gain in the AI configuration, this performance disparity can be attributed to the impact of double enhancement. In contrast, DRIFT demonstrates consistent improvements of 1.15% and 1.00% over QA-Filter in the AI and RA configurations, respectively, showing promising results in mitigating the oversmoothing issue commonly encountered in in-loop filtering approaches.

4.3. Subjective Evaluation

Subjective quality evaluations were conducted using two video sequences, BQSquare and BasketballDrive, as shown in Figure 8. These sequences were encoded under the random access (RA) configuration with QP = 37. We compare the reconstructions from the ground truth (GT), the VTM baseline, QA-Filter [36], and our proposed DRIFT method to assess the visual quality. In the BQSquare sequence, DRIFT demonstrates enhanced structural preservation, particularly in the umbrella frame details, achieving better reconstruction quality compared to QA-Filter. For the BasketballDrive sequence, focusing on wall textures and railing structures, VTM's reconstruction exhibits noticeable discontinuities in the railing texture, while QA-Filter introduces artifacts in the lower railing region. DRIFT, however, preserves the texture continuity in both the walls and railings without introducing spurious artifacts.

4.4. Complexity Evaluation

We analyze the model parameters and encoding time complexity of our proposed methods. Table 3 shows that our LFFCNN achieves a compact model size of 1.22 M parameters, reducing the parameter count by 32% compared to QA-Filter [36]. The complete DRIFT framework, which integrates the LFFCNN and SGS modules, requires 2.09 M parameters. Table 4 compares the encoding time complexity with the VTM benchmark using GPU acceleration. Our method introduces approximately 8% more encoding complexity than QA-Filter. This increase in computational overhead can be attributed to the cascaded residual blocks and data dependencies between modules, as well as the transformer-based computations in the SGS component. Several approaches could be explored to address the complexity issue in future work. Model compression techniques such as pruning and quantization could reduce the computational overhead. Inter-layer skipping mechanisms between residual blocks could also be implemented to improve the processing efficiency. These optimizations could potentially maintain the coding performance while reducing the computational demands.

4.5. Ablation Study

4.5.1. Effectiveness of QPE

CNN-based loop filters often suffer from double enhancement issues, particularly in inter prediction. To address this, we incorporate the QPE module into our framework. As shown in Table 5, without the QPE, LFFCNN achieves a −4.59% BD rate reduction on average, whereas the complete DRIFT framework further improves this to −5.18%. This improvement demonstrates that our QPE module effectively mitigates the double enhancement problem by providing appropriate QP compensation.

4.5.2. Effectiveness of SGS

Table 6 presents the BD rate analysis results, where LFFCNN without SGS achieves a BD rate reduction comparable to that of QA-Filter [36] (0.23% better) with fewer parameters. After incorporating the SGS module, our DRIFT framework further improves the coding efficiency to a BD rate of −6.73%. This significant improvement of 0.94% in the BD rate reduction validates the effectiveness of our SGS module in capturing global dependencies and enhancing the overall coding performance.

5. Conclusions

In this paper, we propose DRIFT, an efficient QP-adaptive in-loop filtering network for VVC, targeting enhanced video quality in bandwidth- and resource-constrained environments such as IoT and AI-enabled embedded systems. First, LFFCNN is introduced with octave convolution and the FFRB design, achieving a 32% reduction in parameters and a slight improvement in BD rate reduction over QA-Filter. Through the integration of a Swin transformer-based global skip connection (SGS), the network's feature extraction capabilities are further enhanced, contributing an additional 0.92% BD rate reduction in the AI configuration. To address the double enhancement issue in inter prediction, a QP estimator (QPE) is employed to mitigate repeated filtering effects. Overall, DRIFT achieves 6.56% and 4.83% BD rate reductions under the AI and RA configurations, respectively, with a reduction of up to 10.90% for the BasketballDrill sequence in intra mode. These results demonstrate that DRIFT offers a powerful and lightweight in-loop filtering solution well suited to video processing on AI-enabled systems and M-IoT devices.
In future work, we plan to explore further architectural optimizations to reduce the computational overhead while maintaining the filtering performance. Additionally, broader experimental validation on more diverse datasets and codec configurations will be conducted to assess the generalization capabilities. Finally, potential deployment on hardware platforms (e.g., NVIDIA Jetson) will be investigated to enable real-time operation in embedded and low-power systems.

Author Contributions

Conceptualization, C.-H.Y., C.-T.N. and K.-Y.H.; Methodology, C.-T.N., Z.-W.W. and C.-P.P.; Validation, Z.-W.W. and C.-P.P.; Formal analysis, C.-H.Y. and C.-T.N.; Investigation, C.-T.N., Z.-W.W. and C.-P.P.; Data curation, C.-H.Y., Z.-W.W. and C.-P.P.; Writing—original draft, C.-H.Y.; Writing—review and editing, K.-Y.H.; Supervision, P.-Y.C. and K.-Y.H.; Project administration, P.-Y.C.; Funding acquisition, P.-Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council, Taiwan, under grant number NSTC 111-2221-E-006-175-MY3.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kim, J.; Lee, J.; Kim, J.; Yun, J. M2M service platforms: Survey, issues, and enabling technologies. IEEE Commun. Surv. Tutor. 2013, 16, 61–76. [Google Scholar] [CrossRef]
  2. Cao, Y.; Jiang, T.; Han, Z. A survey of emerging M2M systems: Context, task, and objective. IEEE Internet Things J. 2016, 3, 1246–1258. [Google Scholar] [CrossRef]
  3. Floris, A.; Atzori, L. Managing the quality of experience in the multimedia Internet of Things: A layered-based approach. Sensors 2016, 16, 2057. [Google Scholar] [CrossRef]
  4. Nauman, A.; Qadri, Y.A.; Amjad, M.; Zikria, Y.B.; Afzal, M.K.; Kim, S.W. Multimedia Internet of Things: A comprehensive survey. IEEE Access 2020, 8, 8202–8250. [Google Scholar] [CrossRef]
  5. Bouaafia, S.; Khemiri, R.; Messaoud, S.; Ben Ahmed, O.; Sayadi, F.E. Deep learning-based video quality enhancement for the new versatile video coding. Neural Comput. Appl. 2022, 34, 14135–14149. [Google Scholar] [CrossRef]
  6. Choi, Y.J.; Lee, Y.W.; Kim, J.; Jeong, S.Y.; Choi, J.S.; Kim, B.G. Attention-based bi-prediction network for versatile video coding (vvc) over 5g network. Sensors 2023, 23, 2631. [Google Scholar] [CrossRef]
  7. Guo, H.; Zhou, Y.; Guo, H.; Jiang, Z.; He, T.; Wu, Y. A Survey on Recent Advances in Video Coding Technologies and Future Research Directions. IEEE Trans. Broadcast. 2025, 2, 666–671. [Google Scholar] [CrossRef]
  8. Sullivan, G.J.; Ohm, J.R.; Han, W.J.; Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
  9. Bross, B.; Wang, Y.K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.R. Overview of the versatile video coding (VVC) standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764. [Google Scholar] [CrossRef]
  10. Farhat, I.; Cabarat, P.L.; Menard, D.; Hamidouche, W.; Déforges, O. Energy Efficient VVC Decoding on Mobile Platform. In Proceedings of the 2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP), Poitiers, France, 27–29 September 2023; pp. 1–6. [Google Scholar]
  11. Saha, A.; Roma, N.; Chavarrías, M.; Dias, T.; Pescador, F.; Aranda, V. GPU-based parallelisation of a versatile video coding adaptive loop filter in resource-constrained heterogeneous embedded platform. J. Real-Time Image Process. 2023, 20, 43. [Google Scholar] [CrossRef]
  12. Lin, L.; Yu, S.; Zhou, L.; Chen, W.; Zhao, T.; Wang, Z. PEA265: Perceptual assessment of video compression artifacts. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 3898–3910. [Google Scholar] [CrossRef]
  13. Lin, L.; Wang, M.; Yang, J.; Zhang, K.; Zhao, T. Toward Efficient Video Compression Artifact Detection and Removal: A Benchmark Dataset. IEEE Trans. Multimed. 2024, 26, 10816–10827. [Google Scholar] [CrossRef]
  14. Jiang, N.; Chen, W.; Lin, J.; Zhao, T.; Lin, C.W. Video compression artifacts removal with spatial-temporal attention-guided enhancement. IEEE Trans. Multimed. 2023, 26, 5657–5669. [Google Scholar] [CrossRef]
  15. List, P.; Joch, A.; Lainema, J.; Bjontegaard, G.; Karczewicz, M. Adaptive deblocking filter. IEEE Trans. Circuits Syst. Video Technol. 2003, 13, 614–619. [Google Scholar] [CrossRef]
  16. Fu, C.M.; Alshina, E.; Alshin, A.; Huang, Y.W.; Chen, C.Y.; Tsai, C.Y.; Hsu, C.W.; Lei, S.M.; Park, J.H.; Han, W.J. Sample adaptive offset in the HEVC standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1755–1764. [Google Scholar] [CrossRef]
  17. Tsai, C.Y.; Chen, C.Y.; Yamakage, T.; Chong, I.S.; Huang, Y.W.; Fu, C.M.; Itoh, T.; Watanabe, T.; Chujoh, T.; Karczewicz, M.; et al. Adaptive loop filtering for video coding. IEEE J. Sel. Top. Signal Process. 2013, 7, 934–945. [Google Scholar] [CrossRef]
  18. Park, S.C.; Park, M.K.; Kang, M.G. Super-resolution image reconstruction: A technical overview. IEEE Signal Process. Mag. 2003, 20, 21–36. [Google Scholar] [CrossRef]
  19. Xu, Y.; Wen, J.; Fei, L.; Zhang, Z. Review of video and image defogging algorithms and related studies on image restoration and enhancement. IEEE Access 2015, 4, 165–188. [Google Scholar] [CrossRef]
  20. Tian, C.; Fei, L.; Zheng, W.; Xu, Y.; Zuo, W.; Lin, C.W. Deep learning on image denoising: An overview. Neural Netw. 2020, 131, 251–275. [Google Scholar] [CrossRef]
  21. Dumas, T.; Galpin, F.; Bordes, P. Iterative training of neural networks for intra prediction. IEEE Trans. Image Process. 2020, 30, 697–711. [Google Scholar] [CrossRef]
  22. Park, D.; Kang, D.U.; Kim, J.; Chun, S.Y. Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 327–343. [Google Scholar]
  23. Zhu, L.; Kwong, S.; Zhang, Y.; Wang, S.; Wang, X. Generative adversarial network-based intra prediction for video coding. IEEE Trans. Multimed. 2019, 22, 45–58. [Google Scholar] [CrossRef]
  24. Huo, S.; Liu, D.; Li, B.; Ma, S.; Wu, F.; Gao, W. Deep network-based frame extrapolation with reference frame alignment. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 1178–1192. [Google Scholar] [CrossRef]
  25. Murn, L.; Blasi, S.; Smeaton, A.F.; Mrak, M. Improved CNN-based learning of interpolation filters for low-complexity inter prediction in video coding. IEEE Open J. Signal Process. 2021, 2, 453–465. [Google Scholar] [CrossRef]
  26. Pan, Z.; Zhang, P.; Peng, B.; Ling, N.; Lei, J. A CNN-based fast inter coding method for VVC. IEEE Signal Process. Lett. 2021, 28, 1260–1264. [Google Scholar] [CrossRef]
  27. Li, T.; Xu, M.; Tang, R.; Chen, Y.; Xing, Q. DeepQTMT: A deep learning approach for fast QTMT-based CU partition of intra-mode VVC. IEEE Trans. Image Process. 2021, 30, 5377–5390. [Google Scholar] [CrossRef] [PubMed]
  28. Kathariya, B.; Li, Z.; Van der Auwera, G. Joint Pixel and Frequency Feature Learning and Fusion via Channel-wise Transformer for High-Efficiency Learned In-Loop Filter in VVC. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 4070–4083. [Google Scholar] [CrossRef]
  29. Kathariya, B.; Li, Z.; Wang, H.; Coban, M. Multi-stage spatial and frequency feature fusion using transformer in cnn-based in-loop filter for vvc. In Proceedings of the 2022 Picture Coding Symposium (PCS), San Jose, CA, USA, 7–9 December 2022; pp. 373–377. [Google Scholar]
  30. Tong, O.; Chen, X.; Wang, H.; Zhu, H.; Chen, Z. Swin Transformer-Based In-Loop Filter for VVC Intra Coding. In Proceedings of the 2024 Picture Coding Symposium (PCS), Taichung, Taiwan, 12–14 June 2024; pp. 1–5. [Google Scholar]
  31. Dai, Y.; Liu, D.; Wu, F. A convolutional neural network approach for post-processing in HEVC intra coding. In Proceedings of the MultiMedia Modeling: 23rd International Conference, MMM 2017, Reykjavik, Iceland, 4–6 January 2017; Proceedings, Part I 23. Springer: Cham, Switzerland, 2017; pp. 28–39. [Google Scholar]
  32. Kim, Y.; Soh, J.W.; Park, J.; Ahn, B.; Lee, H.S.; Moon, Y.S.; Cho, N.I. A pseudo-blind convolutional neural network for the reduction of compression artifacts. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1121–1135. [Google Scholar] [CrossRef]
  33. Ding, D.; Kong, L.; Chen, G.; Liu, Z.; Fang, Y. A switchable deep learning approach for in-loop filtering in video coding. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1871–1887. [Google Scholar] [CrossRef]
  34. Pan, Z.; Yi, X.; Zhang, Y.; Jeon, B.; Kwong, S. Efficient in-loop filtering based on enhanced deep convolutional neural networks for HEVC. IEEE Trans. Image Process. 2020, 29, 5352–5366. [Google Scholar] [CrossRef]
  35. Song, X.; Yao, J.; Zhou, L.; Wang, L.; Wu, X.; Xie, D.; Pu, S. A practical convolutional neural network as loop filter for intra frame. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 1133–1137. [Google Scholar]
  36. Liu, C.; Sun, H.; Katto, J.; Zeng, X.; Fan, Y. QA-Filter: A QP-adaptive convolutional neural network filter for video coding. IEEE Trans. Image Process. 2022, 31, 3032–3045. [Google Scholar] [CrossRef]
  37. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
  38. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  39. Long, Y.; Wang, X.; Xu, M.; Zhang, S.; Jiang, S.; Jia, S. Dual self-attention Swin transformer for hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3275146. [Google Scholar] [CrossRef]
  40. Chi, K.; Yuan, Y.; Wang, Q. Trinity-Net: Gradient-guided Swin transformer-based remote sensing image dehazing and beyond. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3285228. [Google Scholar] [CrossRef]
  41. Fan, C.; Liu, T.; Liu, K. SUNet: Swin transformer UNet for image denoising. arXiv 2022, arXiv:2202.14009. [Google Scholar]
  42. Liu, C.; Sun, H.; Katto, J.; Zeng, X.; Fan, Y. A convolutional neural network-based low complexity filter. arXiv 2020, arXiv:2009.02733. [Google Scholar]
  43. Chen, Y.; Fan, H.; Xu, B.; Yan, Z.; Kalantidis, Y.; Rohrbach, M.; Yan, S.; Feng, J. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3435–3444. [Google Scholar]
  44. Lan, R.; Sun, L.; Liu, Z.; Lu, H.; Pang, C.; Luo, X. MADNet: A fast and lightweight network for single-image super resolution. IEEE Trans. Cybern. 2020, 51, 1443–1453. [Google Scholar] [CrossRef]
  45. Zagoruyko, S.; Komodakis, N. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4353–4361. [Google Scholar]
  46. Agustsson, E.; Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 126–135. [Google Scholar]
  47. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
  48. Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  49. Nah, S.; Baik, S.; Hong, S.; Moon, G.; Son, S.; Timofte, R.; Lee, K.M. NTIRE 2019 Challenge on Video Deblurring and Super-Resolution: Dataset and Study. In Proceedings of the The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  50. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 721, 8026–8037. [Google Scholar]
  51. Suehring, K.; Li, X. Common Test Conditions and Software Reference Configurations, document JVET-G1010; Joint Video Exploration Team (JVET): Geneva, Switzerland, 2017. [Google Scholar]
Figure 1. Schematic of in-loop filters in VVC.
Figure 2. Diagram of in-loop filters integrated with DRIFT in VVC.
Figure 3. Architecture of DRIFT. DRIFT consists of two components: (1) LFFCNN as the main processing branch, which includes octave convolution, 24 FFRBs, and reconstruction layers, and (2) SGS as the global skip auxiliary path. The final output combines both local and global features.
Figure 4. (a) Original frame. (b) LFFCNN output. (c) SGS output.
Figure 5. Architectures of (a) multiscale branch fusion (MSBF) module and (b) lightweight FSQAM (LFSQAM) module.
Figure 6. Architectures of (a) frequency fusion module (FFM) and (b) hybrid attention module (HAM). IC denotes interlaced concatenation.
Figure 7. Architecture of QP estimator (QPE). (a) QPE. (b) FEB. (c) Inception block. C denotes channel-wise concatenation.
Figure 8. Subjective quality comparison for BasketballDrive (upper) and BQSquare (lower). (a,b) Original frame. (c) VTM. (d) QA-Filter. (e) DRIFT.
Table 1. Comparison of BD rate (%) under all-intra (AI) configuration.

Class | Sequence | IACNN [32] | SEFCNN [33] | EDCNN [34] | QA-Filter [36] | DRIFT
A1 (3840 × 2160) | Tango2 | −0.97 | −2.42 | −2.78 | −3.82 | −5.23
A1 (3840 × 2160) | FoodMarket4 | −1.26 | −3.82 | −3.66 | −6.02 | −6.93
A1 (3840 × 2160) | Campfire | −0.93 | −1.71 | −1.92 | −2.59 | −3.15
A2 (3840 × 2160) | CatRobot1 | −1.90 | −3.43 | −3.77 | −5.01 | −5.91
A2 (3840 × 2160) | DaylightRoad2 | −1.10 | −2.08 | −2.33 | −3.03 | −3.97
A2 (3840 × 2160) | ParkRunning3 | −1.58 | −2.95 | −3.26 | −4.40 | −5.17
B (1920 × 1080) | MarketPlace | −1.60 | −2.99 | −3.44 | −4.50 | −5.51
B (1920 × 1080) | RitualDance | −3.33 | −5.96 | −6.50 | −8.00 | −9.38
B (1920 × 1080) | Cactus | −1.70 | −2.97 | −3.44 | −4.47 | −5.29
B (1920 × 1080) | BasketballDrive | −0.93 | −2.38 | −2.94 | −4.03 | −5.53
B (1920 × 1080) | BQTerrace | −0.99 | −1.76 | −2.08 | −2.63 | −3.54
C (832 × 480) | BasketballDrill | −4.11 | −6.86 | −7.40 | −9.16 | −10.90
C (832 × 480) | BQMall | −3.01 | −5.08 | −5.59 | −6.65 | −8.03
C (832 × 480) | PartyScene | −2.29 | −3.38 | −3.56 | −4.25 | −4.98
C (832 × 480) | RaceHorses | −1.32 | −2.10 | −2.38 | −2.88 | −3.95
D (416 × 240) | BasketballPass | −3.80 | −6.44 | −6.88 | −8.19 | −9.43
D (416 × 240) | BQSquare | −3.45 | −5.37 | −5.66 | −6.52 | −8.07
D (416 × 240) | BlowingBubbles | −3.23 | −4.73 | −4.97 | −5.84 | −6.78
D (416 × 240) | RaceHorses | −3.87 | −5.15 | −5.41 | −6.01 | −7.11
E (1280 × 720) | FourPeople | −3.39 | −5.68 | −6.17 | −7.55 | −9.22
E (1280 × 720) | Johnny | −2.79 | −4.87 | −5.40 | −6.91 | −8.31
E (1280 × 720) | KristenAndSara | −2.92 | −4.79 | −5.26 | −6.46 | −7.85
Average | | −2.29 | −3.95 | −4.31 | −5.41 | −6.56
Table 2. Comparison of BD rate (%) under random access (RA) configuration.

Class | IACNN [32] | SEFCNN [33] | EDCNN [34] | QA-Filter [36] | DRIFT
A1 Average | −1.52 | −3.04 | −3.16 | −3.07 | −3.30
A2 Average | −1.96 | −3.17 | −3.41 | −3.42 | −4.56
B Average | −1.64 | −2.79 | −3.10 | −3.44 | −3.67
C Average | −1.93 | −2.93 | −3.13 | −3.18 | −3.63
D Average | −3.27 | −4.63 | −4.83 | −4.65 | −5.45
E Average | −2.70 | −4.21 | −4.59 | −5.22 | −5.60
Overall Average | −2.17 | −3.46 | −3.70 | −3.83 | −4.83
Table 3. Comparison of the number of parameters.

Method | Parameters (M)
IACNN [32] | 0.37
SEFCNN [33] | 2.57
EDCNN [34] | 18.20
QA-Filter [36] | 1.78
LFFCNN | 1.22
DRIFT | 2.09
Table 4. Comparison of encoding time complexity (%).

Class | QA-Filter [36] | DRIFT
B | 103.56 | 110.59
C | 103.14 | 110.00
D | 107.45 | 109.74
E | 105.01 | 118.08
Average | 104.79 | 112.38
Table 5. Comparison of BD rate (%) with and without QPE.

Class | LFFCNN | DRIFT
A1 | −3.16 | −3.79
A2 | −5.22 | −5.96
B | −3.79 | −4.28
C | −3.84 | −4.56
D | −5.58 | −6.00
E | −5.67 | −6.36
Average | −4.59 | −5.18
Table 6. Comparison of BD rate (%) with and without SGS.

Class | QA-Filter [36] | LFFCNN | DRIFT
A1 | −4.31 | −4.20 | −5.04
A2 | −3.03 | −3.20 | −3.97
B | −4.73 | −4.68 | −5.85
C | −5.74 | −6.10 | −6.97
D | −6.64 | −7.13 | −7.85
E | −6.97 | −7.40 | −8.46
Average | −5.56 | −5.79 | −6.73
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
