Article

DREFNet: Deep Residual Enhanced Feature GAN for VVC Compressed Video Quality Improvement

1 Department of Electronics and Information Convergence Engineering, Kyung Hee University, Yongin 17104, Republic of Korea
2 Department of Electronic Engineering, Kyung Hee University, Yongin 17104, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(10), 1609; https://doi.org/10.3390/math13101609
Submission received: 1 April 2025 / Revised: 8 May 2025 / Accepted: 11 May 2025 / Published: 14 May 2025
(This article belongs to the Special Issue Intelligent Computing with Applications in Computer Vision)

Abstract

In recent years, the use of video content has grown exponentially, leading to an increased reliance on various video codecs for efficient compression and transmission. However, several challenges are associated with codecs such as H.265/High Efficiency Video Coding and H.266/Versatile Video Coding (VVC) that can impact video quality and performance. One significant challenge is the trade-off between compression efficiency and visual quality. While advanced codecs can significantly reduce file sizes, they introduce artifacts such as blocking, blurring, and color distortion, particularly in high-motion scenes. Different compression tools in modern video codecs are vital for minimizing artifacts that arise during the encoding and decoding processes. While the advanced algorithms used by these modern codecs can effectively decrease file sizes and enhance compression efficiency, they frequently struggle to eliminate artifacts entirely. Post-processing applied after the initial decoding can significantly improve visual clarity and restore details that may have been compromised during compression. In this paper, we introduce a Deep Residual Enhanced Feature Generative Adversarial Network as a post-processing method aimed at further improving the quality of reconstructed frames from the advanced codec VVC. By utilizing the benefits of Deep Residual Blocks and Enhanced Feature Blocks, the generator network aims to make the reconstructed frame as similar as possible to the original frame. The discriminator network, a crucial element of our proposed method, plays a vital role in guiding the generator by evaluating the authenticity of generated frames. By distinguishing between fake and original frames, the discriminator enables the generator to improve the quality of its output. This feedback mechanism ensures that the generator learns to create more realistic frames, ultimately enhancing the overall performance of the model. The proposed method shows significant gains for Random Access (RA) and All Intra (AI) configurations while improving Video Multimethod Assessment Fusion (VMAF) and Multi-Scale Structural Similarity Index Measure (MS-SSIM). Considering VMAF, our proposed method obtains 13.05% and 11.09% Bjøntegaard Delta Rate (BD-Rate) gains for the RA and AI configurations, respectively. In the case of the luma component MS-SSIM, the RA and AI configurations obtain 5.00% and 5.87% BD-Rate gains, respectively, after employing the proposed network.
MSC:
94A08

1. Introduction

The increasing demand for high quality video content has led to significant challenges in bandwidth usage, particularly as video resolutions and frame rates rise. Video codecs address these challenges by improving compression efficiency, allowing for reduced bandwidth requirements while maintaining video quality, thus enhancing the streaming experience for users. Next-generation codecs, such as H.265/High Efficiency Video Coding (HEVC) [1] and H.266/Versatile Video Coding (VVC) [2], have emerged as vital solutions in this landscape, offering substantial bandwidth savings without compromising on video quality. These codecs utilize advanced compression techniques that optimize data transmission, enabling service providers to deliver high-resolution content, including 4K and 8K videos, more efficiently. As a result, users can enjoy smoother streaming experiences even in bandwidth-constrained environments.
Nonetheless, video codecs encounter several difficulties when delivering video data under bandwidth constraints. A key issue is achieving an optimal balance between compression efficiency and video quality. Complex scenes may suffer from artifacts such as blurriness or pixelation, which fail to accurately capture the details of the original content when the bitrate is inadequate. Additionally, rigid bitrate limits can lead to buffering or reductions in quality during fluctuating network conditions, making it challenging to ensure a consistent viewing experience.
Video codecs address the balance between bitrate and visual quality by utilizing sophisticated compression techniques that assess video frames, remove redundant information, and refine encoding methods. These approaches enable codecs to preserve high visual quality at reduced bitrates, improving efficiency while reducing artifacts in intricate scenes. The VVC employs a range of advanced techniques to effectively balance bitrate and visual quality. A significant approach is adaptive block size coding, which enables the codec to adjust the size of coding units dynamically according to the complexity of the content. This adaptability allows for precise capture of intricate details while utilizing larger blocks for simpler areas, thereby optimizing data usage. Furthermore, VVC integrates sophisticated motion estimation and compensation methods that predict pixel values from previous frames, significantly reducing the amount of data required for encoding. The inclusion of in-loop filtering also improves visual quality by mitigating compression artifacts, ensuring a smooth viewing experience even at lower bitrates. Together, these strategies empower VVC to deliver high quality video content efficiently, positioning it as a top choice for contemporary streaming applications.
Although the VVC is sophisticated, it has certain limitations that have an impact on the quality of reconstructed frames. The high complexity of its algorithms can result in processing inefficiencies, and even with advancements in compression, artifacts may still emerge. Implementing post-processing networks at the decoder side can effectively address the limitations of VVC techniques by enhancing the quality of the reconstructed video frames. By analyzing the decoded frames, the post-processing network can apply advanced filtering and enhancement techniques to restore fine details and improve overall visual fidelity. Additionally, these networks can adaptively adjust their processing based on the content characteristics, allowing for targeted improvements in areas that require more attention. Ultimately, integrating post-processing networks can bridge the gap between the capabilities of the VVC and the expectations of viewers, leading to a more satisfying and high quality viewing experience.
Recent advancements in convolutional neural network (CNN)-based architecture have significantly enhanced the quality of video and image reconstruction. These architectures have evolved to incorporate more sophisticated techniques, such as residual learning and attention mechanisms [3], which allow them to focus on critical areas of an image while effectively reducing artifacts. Additionally, the use of multi-scale feature extraction [4] has improved the ability of networks to capture both global context and local details, resulting in more accurate and visually appealing outputs. Furthermore, advancements in training methodologies, including the use of large-scale datasets and transfer learning [5], have enhanced the robustness and generalization of these models, making them more effective across diverse content types. As a result, CNN-based post-processing architectures are also becoming increasingly vital in optimizing visual quality in various applications, from streaming services to video conferencing. These networks are adept at learning intricate patterns within images, facilitating noise reduction and artifact correction, which results in clearer and more visually appealing outcomes across a range of applications.
While CNN-based networks demonstrate impressive performance, they can encounter challenges in producing visually satisfying perceptual quality in certain situations. In contrast, generative adversarial network (GAN)-based architectures offer a compelling solution by introducing adversarial training, which allows the model to not only learn to generate high quality images but also improve perceptual realism through the dynamic interplay between the generator and discriminator. This adversarial framework enables the model to capture finer details, enhance texture representation, and produce visually plausible images that are often more realistic and coherent. As a result, GANs are particularly well suited for tasks where high visual quality is paramount, making them a more effective choice in the proposed method over traditional CNN-based approaches.
In this paper, we introduce a Deep Residual Enhanced Feature Generative Adversarial Network (DREFNet) as a post-processing technique aimed at enhancing the reconstruction quality of the VVC while maintaining a constant bitrate. The architecture comprises a generator network and a discriminator network. To enhance the generator’s generalization across low to high quantization parameter (QP) ranges, a QP map [6] is provided as prior information alongside the reconstructed image input. The generator network incorporates Deep Residual Blocks (DRBs) and Enhanced Feature Blocks (EFBs) to facilitate improved training of deeper layers and more effective learning of complex features. The main contributions of this study are summarized as follows:
  • The proposed post-processing network is designed for both Random Access (RA) and All Intra (AI) scenarios by utilizing a single model generation method for each specific scenario across different QP ranges.
  • The input to the generator network consists of the reconstructed image, which is accompanied by a QP map to enhance the generator’s generalization across different QP ranges.
  • The proposed network utilizes Deep Residual Blocks (DRBs) and Enhanced Feature Blocks (EFBs) within the generator network, enabling it to learn robust features while effectively training deeper architecture.
  • A two-stage training strategy is implemented in which the generator network is initially trained using the Structural Similarity Index Measure (SSIM) loss function, followed by the training of GAN architecture in the second stage, which employs a perceptual loss function.

2. Related Works

In recent years, considerable progress has been made in the domain of image and video enhancement, primarily fueled by the incorporation of deep learning methodologies. This section presents a review of the literature related to image and video enhancement.

2.1. Deep Learning-Based Image Enhancement Approach

Dong et al. [7] proposed a deep learning method named super-resolution convolutional neural network (SRCNN) for single image super-resolution (SISR) that learns an end-to-end mapping between low-resolution and high-resolution images using a deep convolutional neural network (CNN). This method treats the mapping as a unified process, optimizing all layers jointly, in contrast to traditional sparse-coding-based SR techniques that address components separately. To enhance the speed and quality of [7], Dong et al. [8] further proposed a compact hourglass-shaped CNN structure named fast super-resolution convolutional neural network (FSRCNN). The redesign of the SRCNN focuses on three key aspects: first, a deconvolution layer is added at the end of the network to learn the mapping directly from the original low-resolution image to the high-resolution image without interpolation. Second, the mapping layer is reformulated to reduce the input feature dimension before mapping and then expand it afterward. Third, the model employs smaller filter sizes with an increased number of mapping layers. As a result, the proposed model achieves over 40 times speed improvement while delivering superior restoration quality. In [9], Kim et al. presented a single-image SR technique that utilizes a very deep convolutional network (VDSR) inspired by VGG-net [10]. The authors developed a deep convolutional network for single-image super-resolution, achieving improved accuracy with a final model consisting of 20 weight layers. By using many small filters, the method effectively captures contextual information across large image areas. Tai et al. [11] introduced the Deep Recursive Residual Network (DRRN), a very deep CNN model with up to 52 convolutional layers designed for SISR. By employing residual learning on both global and local scales, DRRN addresses the challenges of training deep networks while maintaining a concise architecture. Additionally, recursive learning is utilized to manage model parameters as depth increases. In [12], Lim et al. presented the Enhanced Deep Super-Resolution Network (EDSR), which leverages deep convolutional neural networks (DCNN) and residual learning techniques to achieve superior performance in super-resolution compared to existing state-of-the-art methods. The significant improvements in EDSR result from optimizing the network by removing unnecessary modules found in conventional residual networks and expanding the model size while stabilizing the training process. Additionally, the authors introduce a multi-scale deep super-resolution system (MDSR) and a training method that enables the reconstruction of high-resolution images at various upscaling factors using a single model. While these approaches show improvements, they also have several limitations. SRCNN, despite introducing deep learning to super-resolution, faces challenges with small image regions, slow convergence, and scalability limited to a single resolution. FSRCNN, while faster, may struggle with heavily degraded images, particularly those with low resolution or noise. VDSR addresses slow convergence by using high learning rates but introduces the risk of exploding gradients, which can destabilize training, even with gradient clipping. While deeper networks (16–19 layers) improve performance, extremely deep models (e.g., 50+ layers) can suffer from vanishing gradients, instability, and optimization challenges.
The DRRN network uses recursive learning with weight sharing to reduce model size, but this may limit the network’s ability to capture complex features, potentially affecting its performance on highly detailed images.
Ledig et al. [13] proposed a generative adversarial network for image super-resolution (SRGAN), a GAN designed for SISR that addresses the challenge of recovering fine texture details at a large upscaling factor of 4. While traditional optimization-based methods often focus on minimizing mean squared reconstruction error, resulting in high peak signal-to-noise ratios (PSNR) but lacking high-frequency details, SRGAN employs a novel perceptual loss function combining adversarial loss and content loss. Wang et al. [14] introduced the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN), which builds upon the original SRGAN to improve visual quality by addressing issues with artifacts in generated textures. There are three key components: network architecture, adversarial loss, and perceptual loss. In this method, the Residual-in-Residual Dense Block (RRDB) is introduced as the fundamental building unit, omitting batch normalization for better performance. Additionally, the discriminator is adapted to predict relative realness, inspired by relativistic GANs, and the perceptual loss is enhanced by utilizing features before activation to provide stronger supervision for brightness consistency and texture recovery. To further enhance the perceptual quality of ESRGAN, Rakotonirina et al. [15] proposed an improved network architecture featuring a new block called the Residual-in-Residual Dense Residual Block (RRDRB), which offers greater capacity than the original RRDB used in ESRGAN. Additionally, noise inputs to the generator network are introduced to leverage stochastic variation, resulting in more realistic textures in the generated images. Chai et al. [16] introduced a very deep Residual Channel Attention Generative Adversarial Network (RCA-GAN) for real-world image super-resolution. They designed a channel attention mechanism that dynamically adjusts feature channels by capturing interdependence, thereby enhancing the most informative features. Additionally, they leveraged a generative adversarial network to produce more natural and visually appealing results. In [17], Wang et al. introduced a framework known as the Controllable Feature Space Network (CFSNet), which features two branches targeting different objectives and can dynamically learn coupling coefficients across multiple layers and channels. This capability provides users with enhanced control over the quality of restored images. The framework allows users to smoothly adjust between various objectives, such as finding a balance between perception and distortion in image super-resolution or optimizing the trade-off between noise reduction and detail preservation.

2.2. Deep Learning-Based Video Enhancement Approach

In [18], Zhao et al. present a joint luma and chroma in-loop filtering network for VVC that features a model selection mechanism based on a multi-scale convolutional neural network. The main contributions consist of a residual-based multi-scale CNN model aimed at efficiently filtering both luminance and chrominance components simultaneously. Furthermore, a model selection strategy is introduced to improve the adaptivity and robustness of the CNN-based filtering process. Chen et al. [19] introduced a dense residual convolutional neural network (DRN) for in-loop filtering in VVC that employs a residual learning module, dense shortcuts, and bottleneck layers. This design addresses issues related to gradient vanishing, promotes feature reuse, and optimizes computational efficiency. In [20], Huang et al. introduced an efficient variable CNN-based in-loop filter (VCNN) for VVC that integrates an attention module designed to adaptively recalibrate channel-wise features. This module assigns varying weights to channels based on quantization parameters (QPs) and frame types (FTs). Additionally, the filter incorporates a residual block to selectively highlight informative features while diminishing the impact of less useful ones. Zhang et al. [21] developed a CNN-based in-loop filter for VVC intra coding that utilizes depthwise separable convolution and an attention mechanism to enhance the network’s efficiency and reduce its weight. They introduced two fundamental modules, the residual attention block (RAB) and the weakly connected attention block (WCAB), to effectively extract and refine features. Additionally, a multi-stage training strategy based on progressive learning is implemented to optimize the network’s learning capabilities. However, four different types of input (i.e., reconstructed frame, predicted frame, partition frame, and QP map) must be given to their proposed network. In [22], Zhang et al. developed an NN-based in-loop filter for VVC, named RTNN, which combines ResBlock and Transformer architectures to effectively reduce complex compression artifacts. The RTNN efficiently extracts deep features and captures both local and non-local correlations. It also includes a novel attention module for improved feature refinement through auxiliary information. Additionally, a multi-stage training strategy is implemented to consider QP distance, enhancing the learning capabilities of the RTNN. Extensive usage of residual blocks makes their proposed network computationally complex. Wang et al. [23] developed a three-branch network architecture to generate different compensations for the Y, U, and V components. This approach aims to reduce blocking artifacts and replace the de-blocking filter by utilizing partition information to identify distortion at block boundaries for quality enhancement, along with incorporating QP as prior knowledge. In [24] Zhang et al. developed a CNN post-processing filter named WCDANN, which utilizes depth wise separable convolution and an attention mechanism to enhance the quality of compressed images by effectively learning residual information. WCDANN comprises two key modules: the weakly connected dense attention block (WCDAB) and the residual attention block (RAB), both aimed at extracting important residual features. 
It features residual connections in each module to promote the flow of information and incorporates two attention mechanisms, a channel attention block (CAB) and a channel spatial attention block (CSAB), to strengthen significant residual features in the outputs of the RAB and WCDAB. Despite deploying a complex network, the coding gain for RA is not very significant (<3%). In [25], Lin et al. developed a partition-aware convolutional neural network (CNN) that integrates coding unit (CU) size information with distorted decoded frames to effectively reduce HEVC-induced artifacts. The proposed method also introduced an adaptive-switching neural network (ASN) comprising multiple independent CNNs to handle variations in content and distortion within compressed video frames, further reducing visual artifacts. Additionally, an iterative training procedure is proposed to train these CNNs with an emphasis on different local patch-wise classes. However, an improper initialization method in iterative training can make network convergence harder. Guan et al. [26] introduced a Multi-Frame Quality Enhancement (MFQE) approach that employs a Bidirectional Long Short-Term Memory (BiLSTM) detector to locate Peak Quality Frames (PQFs) in compressed video. Their PQF detector network requires many hand-crafted features for training. They designed a Multi-Frame Convolutional Neural Network (MF-CNN) to enhance the quality of low-quality frames by using the non-PQF and its two nearest PQFs as input. This method aims to improve the overall quality of compressed video by effectively utilizing information from surrounding frames. In [27], Ma et al. developed MFRNet, a convolutional neural network (CNN) architecture for post-processing (PP) and in-loop filtering (ILF) in video compression. The network consists of four cascading multi-level feature review residual dense blocks (MFRBs), each designed to extract features through dense connections and a multi-level residual learning structure. Additionally, MFRBs enhance information flow by reusing high-dimensional features from the previous block. In [28], Zhang et al. proposed one of the first GAN-based architectures for improving the quality of VVC and AOMedia Video 1 (AV1) compressed frames. They deployed a modified version of the generator network from [13] as their generator. Das et al. [29] proposed VVC-PPFF as a post-processing network to enhance VVC-compressed frames. Their network employs a hierarchical feature fusion mechanism, combining features from early and deeper convolutional layers. This approach captures both low-level details and high-level semantic information, leading to a more comprehensive understanding of the video content. However, at the same time, it increases the computational demand.
In the field of image enhancement, GANs have garnered substantial attention for their ability to produce visually appealing results, with numerous works demonstrating their capacity to improve image quality, detail preservation, and artifact reduction. GANs, owing to their ability to learn complex mappings from low-resolution to high-resolution data, have been particularly successful in areas such as super-resolution, denoising, and inpainting. However, despite the impressive advancements in image-level GAN applications, the realm of video enhancement remains comparatively underexplored. Video enhancement is intrinsically more complex than image enhancement due to the temporal dimension, where not only spatial features but also motion coherence across frames must be preserved. This introduces challenges such as maintaining consistency in motion, temporal stability, and reducing artifacts across frames, all while improving visual fidelity. These temporal dependencies are often ignored or oversimplified in existing methods focused primarily on static images. Given these challenges, there is a clear need for innovative approaches that leverage GAN structures to specifically address the unique demands of video enhancement. The proposed method integrates frame-level quality enhancement into a unified GAN framework, offering a more robust solution for video enhancement that preserves both spatial and temporal features.

3. Proposed Method

Previous state-of-the-art methods have already shown that using residual architectures and residual blocks as the backbone of a deep learning network can successfully remove different compression artifacts generated by video codecs. However, excessive use of residual blocks and propagating features from all previous blocks make the overall network computationally expensive. Additionally, to increase the quantitative results, visual pleasantness is often sacrificed. To address these issues, we propose a novel GAN-based architecture designed as a post-processing network, in which we find the optimal number of residual blocks to keep the network simple and maintain a balance between quantitative results and subjective quality. The proposed network is elaborated in this section in terms of its overall structure and training strategy. This architecture aims to enhance the visual quality of videos by effectively utilizing the strengths of GANs, which are known for their ability to generate high-fidelity images. The proposed network incorporates a generator that synthesizes improved frames while a discriminator evaluates the realism of the generated outputs against real video frames.

3.1. Generator Architecture

Figure 1 illustrates the complete architecture of the generator network. This architecture includes an initial convolution layer, a Deep Residual Block (DRB) layer, an Enhanced Feature Block (EFB) layer, a refinement convolution layer, and a final convolution layer. The reconstructed image is input into the network along with a QP map serving as prior information. Incorporating a QP map as prior knowledge along with the reconstructed image when feeding data into a generator network provides notable benefits for improving the quality of the output images. The QP map offers essential contextual insights regarding the compression levels and quantization effects present in various parts of the image, enabling the generator to make more informed choices during the reconstruction phase. By adding this extra layer of information, the generator is better equipped to maintain crucial details and textures that could be compromised by noise or compression artifacts. Consequently, the images generated are of higher quality, enhancing visual fidelity and boosting the performance of the discriminator network. A more precise discriminator can effectively distinguish between real and generated images, resulting in a more robust training process and ultimately leading to superior outcomes. The QP map fed into the network matches the spatial dimensions of the compressed input image, enabling the network to handle the compression information in a spatially aligned way. Each QP map value is normalized to a standard range, as outlined in Equation (1):
$QP_{map}(i, j) = \frac{QP(i, j)}{QP_{max}}.$
In this context, i denotes the horizontal pixel coordinates, while j indicates the vertical pixel coordinates, with the maximum quantization parameter QPmax set to 63. Here, i varies from 1 to W (i.e., width), and j ranges from 1 to H (i.e., height). Within the framework of VVC, QPmax defines the highest level of compression that can be applied to each coding unit within a frame.
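For illustration, the QP-map construction of Equation (1) and its concatenation with the reconstructed frame can be sketched in PyTorch as follows. The helper names (build_qp_map, prepare_generator_input) are ours, and a single frame-level QP is assumed rather than a per-block map.

```python
import torch

QP_MAX = 63.0  # maximum QP in VVC, as defined above

def build_qp_map(qp: int, height: int, width: int) -> torch.Tensor:
    """Create a normalized QP plane matching the spatial size of the frame.
    A single frame-level QP is assumed; a per-block QP map would be filled
    block-wise instead."""
    return torch.full((1, 1, height, width), qp / QP_MAX)

def prepare_generator_input(recon: torch.Tensor, qp: int) -> torch.Tensor:
    """Concatenate the reconstructed frame (N, C, H, W) with its QP map
    along the channel axis, as done before the initial convolution."""
    n, _, h, w = recon.shape
    qp_map = build_qp_map(qp, h, w).expand(n, -1, -1, -1).to(recon.device)
    return torch.cat([recon, qp_map], dim=1)

# Example: a 240x240 single-channel patch compressed at QP 37
patch = torch.rand(1, 1, 240, 240)
gen_in = prepare_generator_input(patch, qp=37)
print(gen_in.shape)  # torch.Size([1, 2, 240, 240])
```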
The combined input is then processed through the initial convolution layer, which has 64 output channels and uses a 3 × 3 kernel. Opting for 64 channels enables the network to extract a wide variety of features from the input data, allowing it to effectively learn different patterns and representations. Furthermore, the 3 × 3 kernel size provides a good balance between capturing local spatial information and ensuring computational efficiency. This configuration allows the network to concentrate on small, localized areas of the input, which is essential for identifying intricate details and textures.
The output from the initial block is subsequently passed through the DRB layer, as shown in Figure 2a. This DRB layer comprises additional convolution layers and rectified linear units (ReLU) that feature dense residual connections. Dense residual connections in convolutional layers offer significant advantages for improving the performance of networks. By facilitating direct pathways for gradients during backpropagation, these connections help mitigate the vanishing gradient problem, allowing for more effective training of deeper networks. This residual connection encourages feature reuse, as each layer can access the outputs of all preceding layers, leading to richer feature representations and improved learning efficiency. Additionally, dense residual connections enhance the flow of information throughout the network, which can result in better convergence. In this network, we employ 16 DRBs to balance training efficiency with optimal enhancement.
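The exact internal layout of the DRB in Figure 2a is not reproduced here; the following is a minimal PyTorch sketch of a densely connected residual block consistent with the description above (convolution and ReLU layers whose outputs are concatenated, fused back to 64 channels, and added to the block input). The growth rate and the number of inner convolutions are assumptions, not values stated in the paper.

```python
import torch
import torch.nn as nn

class DeepResidualBlock(nn.Module):
    """Illustrative densely connected residual block: each convolution sees
    the concatenation of all previous feature maps, and the fused output is
    added back to the block input (growth and num_convs are assumptions)."""
    def __init__(self, channels: int = 64, growth: int = 32, num_convs: int = 4):
        super().__init__()
        self.convs = nn.ModuleList()
        in_ch = channels
        for _ in range(num_convs):
            self.convs.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True)))
            in_ch += growth
        self.fuse = nn.Conv2d(in_ch, channels, kernel_size=1)  # back to 64 channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))
        return x + self.fuse(torch.cat(feats, dim=1))
```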
The output from the DRB is subsequently processed through the EFB layer, as depicted in Figure 2b. The EFB block is composed of both convolution and deconvolution layers. Convolution layers excel at feature extraction, allowing the network to capture essential patterns and details from the input data while reducing spatial dimensions. In contrast, deconvolution layers, also known as transposed convolution layers, are effective for upsampling, enabling the network to reconstruct higher-resolution outputs from the learned features. This synergy allows for a more comprehensive understanding of the data, as the convolution layers can distill important information, while the deconvolution layers can effectively translate that information back into a more detailed representation. Together, they facilitate improved performance, leading to outputs that are both accurate and visually appealing.
The output from the EFB then goes through a refinement convolution layer that uses a 3 × 3 kernel to produce a more refined feature output following the EFB layer. The output from the refinement convolution layer is subsequently combined with the output of the initial convolution layer through a skip connection. This approach allows for the integration of features from both layers, facilitating the preservation of important information while enhancing the overall representation. By leveraging skip connections, the network can effectively mitigate the loss of detail that may occur during the processing stages of DRB and EFB, leading to improved performance. This technique promotes better feature flow throughout the network and aids in maintaining the integrity of the original input data.
Finally, an output convolution layer is added after the combined output of the initial convolution layer and the refinement convolution layer. The reconstructed image is then integrated with the output of this output convolution layer via a skip connection. This setup allows for the effective merging of features from both the processed layers and the original input, ensuring that important details are preserved in the final output. By utilizing skip connections, the network can enhance the quality of the reconstructed image, leading to improved visual fidelity and overall performance in the reconstruction task.
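Putting these pieces together, a sketch of the generator data flow described in this subsection is given below, reusing the DeepResidualBlock sketch above. The class names (EnhancedFeatureBlock, DREFNetGenerator) are ours, the EFB internals are simplified to one strided convolution followed by one transposed convolution, a single luma channel plus the QP-map channel is assumed at the input, and even spatial sizes are assumed so that the down/upsampling pair restores the original resolution.

```python
import torch
import torch.nn as nn

class EnhancedFeatureBlock(nn.Module):
    """Simplified EFB: a strided convolution distills features at reduced
    resolution and a transposed convolution restores the spatial size
    (the exact layout in Figure 2b may differ)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),
            nn.ReLU(inplace=True))
        self.up = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)

    def forward(self, x):
        return self.up(self.down(x))

class DREFNetGenerator(nn.Module):
    """Sketch of the generator data flow described in Section 3.1."""
    def __init__(self, in_channels: int = 2, channels: int = 64, num_drbs: int = 16):
        super().__init__()
        self.head = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.drbs = nn.Sequential(*[DeepResidualBlock(channels) for _ in range(num_drbs)])
        self.efb = EnhancedFeatureBlock(channels)
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)
        self.tail = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, recon_with_qp):
        recon = recon_with_qp[:, :1]          # reconstructed frame (first channel)
        shallow = self.head(recon_with_qp)    # initial convolution
        deep = self.efb(self.drbs(shallow))   # 16 DRBs followed by the EFB
        fused = self.refine(deep) + shallow   # skip from the initial convolution
        return self.tail(fused) + recon       # global skip from the reconstructed frame
```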

3.2. Discriminator Architecture

Figure 3 illustrates the architecture of the discriminator network. In this study, we employed the discriminator network described in [13]. The successful image classifier VGG [10] inspires the network architecture, while the discriminator network adheres to the design principles of a GAN as outlined in [30]. In a GAN structure, the discriminator network plays a crucial role in distinguishing between real and generated data. Its primary function is to evaluate the authenticity of the input data by assigning a probability score that indicates whether the data is real (i.e., from the training set) or fake (i.e., produced by the generator). The discriminator is trained simultaneously with the generator in a competitive setting, where the generator aims to produce increasingly realistic data to fool the discriminator, while the discriminator strives to improve the ability to correctly identify real versus fake samples. This adversarial process drives both networks to enhance their performance, ultimately leading to the generation of high quality synthetic data that closely resembles the real data distribution.
The discriminator network in [13] comprises eight convolutional layers with 3 × 3 filter kernels whose channel counts increase from 64 to 512, using strided convolutions to reduce the image resolution each time the number of feature maps doubles. The use of multiple convolutional layers with an increasing number of feature maps allows the network to capture a wide range of features at different levels of abstraction, enhancing its ability to discern subtle differences between real and generated images. The strided convolutions effectively reduce the spatial dimensions of the input, which not only helps in managing computational complexity but also enables the network to focus on the most relevant features as it processes the data. The network structure also includes two fully connected layers after the 512 feature maps and a Sigmoid activation function applied after the second dense layer to generate a probability score reflecting the perceived quality difference between real and generated image blocks. The incorporation of fully connected layers after the convolutional layers allows for a more comprehensive analysis of the extracted features, facilitating better decision-making regarding the authenticity of the input. The application of a Sigmoid activation function at the output provides a clear probability score, making it easier to interpret the discriminator’s confidence in its classification.
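A sketch of this SRGAN-style discriminator is given below. The dense-layer size (1024 units before the final score) follows [13], while the adaptive average pooling before the dense layers is our simplification to keep the head independent of the patch size.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True))

class Discriminator(nn.Module):
    """SRGAN-style discriminator sketch: eight 3x3 conv layers growing from
    64 to 512 channels with strided downsampling, followed by two dense
    layers and a sigmoid output."""
    def __init__(self, in_channels: int = 1):
        super().__init__()
        cfg = [(64, 1), (64, 2), (128, 1), (128, 2),
               (256, 1), (256, 2), (512, 1), (512, 2)]
        layers = [nn.Conv2d(in_channels, 64, 3, padding=1),
                  nn.LeakyReLU(0.2, inplace=True)]
        in_ch = 64
        for out_ch, stride in cfg[1:]:
            layers.append(conv_block(in_ch, out_ch, stride))
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # keeps the head independent of patch size
            nn.Flatten(),
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(1024, 1),
            nn.Sigmoid())

    def forward(self, x):
        return self.classifier(self.features(x))
```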
Overall, this architecture enhances the discriminator’s performance by improving its ability to generalize and accurately classify images, which in turn drives the generator to produce higher quality outputs. This adversarial training dynamic is crucial for the overall effectiveness of the GAN, leading to the generation of more realistic synthetic data.

3.3. Train Dataset Preparation

In our experiment, we utilized the BVI-DVC [31] dataset to train the network. This dataset comprises 200 video sequences sourced from various public video datasets, featuring both natural and challenging characteristics. The videos primarily have a spatial resolution of 3840 × 2160. To enhance data diversity, these 200 videos were spatially downsampled to resolutions of 1920 × 1080, 960 × 540, and 480 × 270, resulting in a total of 800 sequences across four different resolutions. In our experiment, all sequences are converted from MP4 to YUV420 chroma subsampling with a bit depth of 10. For network input, non-overlapping patches of size 240 × 240 were extracted from each video frame, with 10 frames taken from each video sequence. The video sequences were compressed using five different QP values (i.e., 22, 27, 32, 37, and 42) in RA and AI configurations utilizing VVenC [32] and VVdeC [33]. This process yielded a total of 1,860,000 patches, with class A (i.e., 3840 × 2160) containing 1,440,000 patches, class B (i.e., 1920 × 1080) containing 320,000 patches, class C (i.e., 960 × 540) containing 80,000 patches, and class D (i.e., 480 × 270) containing 20,000 patches. To further enhance data diversity, we randomly flipped the patches. Given that we employed five different QP values to create the compressed input, we randomly selected 50,000 patches from class A, 50,000 patches from class B, 80,000 patches from class C, and 20,000 patches from class D to develop the RA and AI models. As a result, we randomly chose a total of 200,000 patches from various spatial image resolution classes, each with different QP ranges as shown in Figure 4.
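A minimal sketch of the non-overlapping 240 × 240 patch extraction is shown below (the function name is ours, luma plane only). Discarding partial border tiles is our assumption, though it is consistent with the patch counts reported above (for example, 1920 × 1080 yields 8 × 4 = 32 patches per frame).

```python
import numpy as np

def extract_patches(frame: np.ndarray, patch: int = 240) -> list:
    """Split one luma frame (H, W) into non-overlapping patch x patch tiles,
    discarding any partial border tiles."""
    h, w = frame.shape[:2]
    patches = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            patches.append(frame[top:top + patch, left:left + patch])
    return patches

# Example: a 1920x1080 frame yields 8 x 4 = 32 patches of 240x240
frame = np.zeros((1080, 1920), dtype=np.uint16)  # 10-bit samples stored in uint16
print(len(extract_patches(frame)))  # 32
```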
Equation (2) depicts the model created for the RA and AI configurations represented by
$\text{CNN models} = \begin{cases} Model_{QP=22\sim42}^{RA}, & \text{when } 22 \le QP \le 42 \\ Model_{QP=22\sim42}^{AI}, & \text{when } 22 \le QP \le 42 \end{cases}$
where $Model_{QP=22\sim42}^{RA}$ and $Model_{QP=22\sim42}^{AI}$ denote the generated models associated with the RA and AI configurations for the various QP ranges.

3.4. Training Methodology

We conducted our training in two stages. In the first stage, we trained the generator network using SSIM loss. In the second stage, we initiated training with the generator model from the first stage, employing the perceptual loss function. Utilizing a two-stage training approach in GANs, where the first stage employs SSIM loss and the second stage utilizes perceptual loss, offers significant advantages in generating high quality images. The SSIM loss focuses on preserving structural integrity and similarity to the original images, ensuring that the generated outputs maintain essential features. In contrast, perceptual loss emphasizes higher-level content and texture similarities, allowing the model to capture more abstract representations. Together, these loss functions enhance the GAN’s ability to generalize, resulting in outputs that are not only visually coherent but also rich in detail and fidelity. This dual approach effectively balances the need for both pixel-level accuracy and perceptual quality, leading to a more robust training process.
1.
Stage I: In stage I for both the RA and AI configurations, the generator model is created using the SSIM [34] loss function. Utilizing SSIM loss in stage I for the generator network offers significant advantages in enhancing the perceptual quality of generated images. SSIM, which focuses on structural similarity, evaluates images based on luminance, contrast, and structural information, allowing the generator to produce outputs that are more visually coherent and aligned with human perception. This focus on structural fidelity helps the generator learn to preserve important details and textures, resulting in images that are not only closer to the target distribution but also more aesthetically pleasing. When transitioning to stage II, where the model operates as a GAN architecture, the foundation established by the SSIM loss ensures that the adversarial training process is more effective. The generator is better equipped to create high quality images that can effectively challenge the discriminator, leading to improved overall performance of the GAN. This synergy between the SSIM loss in the initial training phase and the adversarial framework in the subsequent stage results in a more robust model capable of generating realistic and high-fidelity images. The SSIM loss function is represented as shown in Equation (3):
$L_{SSIM} = 1 - \frac{\left(2\mu_x\mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)}$
where the mean, variance, and covariance computed over a tile around a pixel are expressed as $(\mu_x, \mu_y)$, $(\sigma_x, \sigma_y)$, and $\sigma_{xy}$, respectively, and $C_1$ and $C_2$ are two constants used to stabilize the contrast and structure components. A code sketch of this loss is given at the end of this subsection.
2.
Stage II: In stage II, the entire GAN architecture is trained utilizing a perceptual loss function. During this phase, the generator is optimized using a combined loss function denoted as L g . Equation (4) represents the combined loss for the generator.
$L_g = L_{SSIM} + \gamma L_{l2} + \eta L_a.$
Here, $L_{SSIM}$ represents the SSIM loss between the output of the generator and the target, while $L_{l2}$ denotes the L2 loss between the two. The values of $\gamma$ and $\eta$ are set to 0.025 and 0.005, respectively; they indicate the weights placed on the L2 loss and the adversarial loss. Since we are dealing with a balance between quantitative and qualitative improvements of compressed frames through our proposed model, the weights of the different loss components had to be chosen carefully: the SSIM loss was given the highest priority, then the L2 loss, and finally the adversarial loss, in which the output of the discriminator network has influence. In our research, the values of $\gamma$ and $\eta$ were adopted from [28]. $L_a$ is defined as the adversarial loss for the generator, as illustrated in Equation (5):
$L_a = 10^{-3} \cdot \frac{1}{N}\sum_{n=1}^{N} -\log D_{\Phi}\!\left(G_{\Psi}(I_R)\right).$
In this context, $D_{\Phi}(G_{\Psi}(I_R))$ represents the probability that the reconstructed image $G_{\Psi}(I_R)$ is a real image. $N$ signifies the number of samples.
At the same time, the discriminator network is optimized by utilizing the discriminator loss function as stated in (6):
$L_d = L_{d,r} + L_{d,f}$
where $L_{d,r}$ and $L_{d,f}$ are the discriminator losses for real and fake images, respectively.
$L_{d,r} = -\frac{1}{N}\sum_{n=1}^{N}\left[ D_{\Phi}(I_R)\ln D_{\Phi}(I_R) + \left(1 - D_{\Phi}(I_R)\right)\ln\left(1 - D_{\Phi}(I_R)\right) \right]$
$L_{d,f} = -\frac{1}{N}\sum_{n=1}^{N}\left[ D_{\Phi}\!\left(G_{\Psi}(I_{inp})\right)\ln D_{\Phi}\!\left(G_{\Psi}(I_{inp})\right) + \left(1 - D_{\Phi}\!\left(G_{\Psi}(I_{inp})\right)\right)\ln\left(1 - D_{\Phi}\!\left(G_{\Psi}(I_{inp})\right)\right) \right].$
In the above two equations, $I_R$ and $I_{inp}$ denote the reference (real) image and the input image, respectively. Utilizing a combined loss of SSIM loss, mean squared error (MSE) loss, and adversarial loss in a GAN architecture provides several benefits. SSIM loss helps preserve the perceptual quality and structural consistency of the generated images by evaluating similarity at a structural level. MSE loss focuses on pixel-level precision and smoothness, aiding in the refinement of finer details and minimizing high-frequency noise. Adversarial loss pushes the generator to create more realistic images by making them indistinguishable from real ones, enhancing their overall visual quality. This blend of losses encourages the generator to learn both fine-grained details and broader structural patterns, leading to images that are both realistic and visually cohesive. The combined loss also benefits the discriminator during the second stage of training. In Stage I, the generator is initially trained using SSIM loss, which helps it generate images that closely resemble the structural patterns of real data. In Stage II, as the discriminator is trained, the incorporation of SSIM, MSE, and adversarial losses allows it to better generalize by considering not only pixel-wise differences but also the structural and perceptual characteristics of the images. The adversarial loss enhances the discriminator’s ability to distinguish between real and fake images, while the SSIM and MSE losses help evaluate the overall quality and authenticity of the generated images. This comprehensive approach strengthens the discriminator, enabling it to more effectively assess generated images, leading to improved performance and better generalization across various types of images and conditions.
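The training objectives of both stages can be sketched as follows. The SSIM loss of Equation (3) is computed with a uniform window via average pooling (a Gaussian window is also common) and the usual default constants $C_1 = 0.01^2$ and $C_2 = 0.03^2$ for inputs scaled to [0, 1], which are assumptions rather than values stated in the paper. The generator loss follows Equations (4) and (5) with $\gamma = 0.025$ and $\eta = 0.005$; for the discriminator, a standard binary cross-entropy form is used as a stand-in for Equations (7) and (8).

```python
import torch
import torch.nn.functional as F

GAMMA, ETA = 0.025, 0.005  # weights of the L2 and adversarial terms in Eq. (4)
EPS = 1e-8                 # numerical guard inside the logarithms

def ssim_loss(x: torch.Tensor, y: torch.Tensor, window: int = 11,
              c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """L_SSIM of Eq. (3): 1 minus the mean SSIM over tiles around each pixel,
    computed here with a uniform window. Inputs are expected in [0, 1] with
    shape (N, C, H, W)."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, window, stride=1, padding=pad)
    var_x = F.avg_pool2d(x * x, window, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, stride=1, padding=pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, window, stride=1, padding=pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim_map.mean()

def generator_loss(fake: torch.Tensor, target: torch.Tensor,
                   d_fake: torch.Tensor) -> torch.Tensor:
    """Combined stage II generator loss of Eq. (4): SSIM + weighted L2 +
    weighted adversarial term (adversarial scale 1e-3 as in Eq. (5))."""
    l_ssim = ssim_loss(fake, target)
    l_l2 = F.mse_loss(fake, target)
    l_adv = 1e-3 * torch.mean(-torch.log(d_fake + EPS))
    return l_ssim + GAMMA * l_l2 + ETA * l_adv

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Standard binary cross-entropy stand-in for Eqs. (7)-(8): real patches
    should score close to 1, generated patches close to 0."""
    loss_real = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
    loss_fake = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return loss_real + loss_fake
```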

3.5. Training Configuration

In Stage I, both the RA and AI models were trained for 30 epochs. In Stage II, the RA model continued with 30 epochs, while the AI model was trained for 50 epochs. The two-stage training takes around 10 days for the RA configuration and around 14 days for the AI configuration. A learning rate of 0.0001 was employed throughout the training process. The Adam optimization [35] algorithm was utilized, with decay rates set to $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Adam (Adaptive Moment Estimation) combines two well-known techniques, momentum and RMSprop: the former accelerates gradient descent, while the latter sets the learning rate adaptively and prevents the diminishing learning rate problem. Dynamic learning rates, bias correction, and efficient performance are the main strengths of Adam. For both the RA and AI coding configurations, the batch size was set to 32. For data augmentation, we apply random horizontal and vertical flips in a probabilistic manner.
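A sketch of the corresponding optimizer and augmentation setup is shown below, reusing the generator and discriminator sketches from earlier in this section; the names (opt_g, opt_d, augment) and the 0.5 flip probabilities are ours, since the paper only states that flips are applied probabilistically.

```python
import random
import torch

# Adam with the learning rate and decay rates reported above
generator = DREFNetGenerator()
discriminator = Discriminator()
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.9, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.9, 0.999))

def augment(recon: torch.Tensor, target: torch.Tensor):
    """Random horizontal/vertical flips applied identically to the
    reconstructed patch and its reference (flip probabilities are assumed)."""
    if random.random() < 0.5:
        recon, target = torch.flip(recon, dims=[-1]), torch.flip(target, dims=[-1])
    if random.random() < 0.5:
        recon, target = torch.flip(recon, dims=[-2]), torch.flip(target, dims=[-2])
    return recon, target
```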
Regarding the selection of key hyperparameters, a comprehensive guideline and possible directions for further experiments are provided in Appendix A.

4. Experimental Evaluations and Discussion

4.1. Test Environment Configuration

To evaluate the performance of the proposed method, we utilized the JVET NNVC CTC [36] sequences that were excluded from the training dataset. A total of 22 sequences, divided into categories A1, A2, B, C, D, and E, were employed to assess the RA and AI configurations in accordance with the JVET NNVC CTC guidelines. Specifically, we conducted tests on RA and AI using classes A1, A2, B, C, and D, as well as classes A1, A2, B, C, D, and E, respectively. The test QP values for all configurations were set at 22, 27, 32, 37, and 42.

4.2. Quality Metric

The final video quality was assessed using the Multi-Scale Structural Similarity Index Measure (MS-SSIM) [37] and the Video Multimethod Assessment Fusion (VMAF) metric [38], with coding efficiency measured by the Bjøntegaard Delta Rate (BD-Rate) [39].
a.
VMAF: Video Multimethod Assessment Fusion (VMAF) is a perceptual video quality metric developed by Netflix to evaluate the visual quality of compressed and distorted videos in a way that aligns with human perception. VMAF combines multiple quality assessment features, including image quality metrics such as Visual Information Fidelity (VIF) and Detail Loss Metric (DLM), fused by support vector machine (SVM) regression to predict subjective video quality scores accurately. The VMAF score is computed as a weighted combination of these features, where a regression model, typically trained on human opinion scores, maps the extracted features to a final quality prediction. The VMAF model can be represented as:
$VMAF = f\left(Q_1, Q_2, \ldots, Q_n\right)$
where $Q_1, Q_2, \ldots, Q_n$ represent different quality assessment features, and $f(\cdot)$ is a machine learning model, often a support vector machine (SVM) or a neural network, trained on a dataset of subjective quality ratings.
b.
MS-SSIM: MS-SSIM is a widely used quality assessment metric for evaluating the perceptual similarity between an original and a distorted image. It extends the Structural Similarity Index (SSIM) by considering image structures at multiple scales, thereby improving its correlation with human visual perception. MS-SSIM is computed by progressively downsampling the image and combining SSIM measurements at different scales using a weighted geometric mean. The formulation of MS-SSIM is given by:
$\text{MS-SSIM}(x, y) = \prod_{j=1}^{M}\left[l_j(x, y)\right]^{\alpha_j} \times \prod_{j=1}^{M}\left[c_j(x, y)\cdot s_j(x, y)\right]^{\beta_j}$
where $l_j(x, y)$, $c_j(x, y)$, and $s_j(x, y)$ represent the luminance, contrast, and structure components at scale $j$, respectively, while $\alpha_j$ and $\beta_j$ are the corresponding weighting factors.
c.
BD-Rate: BD-Rate is a widely used metric for evaluating the performance of video codecs, particularly in terms of bitrate efficiency and quality. It quantifies the difference in bitrate required to achieve the same level of video quality between two different encoding methods or codec implementations. The BD-Rate is expressed as a percentage, indicating how much more or less bitrate is needed for one codec to match the quality of another. To calculate BD-Rate, the rate-distortion curves of the two codecs are analyzed, and the area between these curves is computed. A negative BD-Rate indicates that the new codec is more efficient, requiring less bitrate for the same quality, while a positive BD-Rate suggests that it requires more bitrate.
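For reference, the standard Bjøntegaard computation can be sketched as follows (a cubic fit of log-bitrate against quality, integrated over the overlapping quality range). This follows the classical formulation rather than any specific JVET reference implementation, and the function name is ours.

```python
import numpy as np

def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test) -> float:
    """Bjøntegaard delta rate (%): average bitrate difference at equal
    quality. Negative values mean the test codec needs less bitrate.
    Quality can be VMAF, MS-SSIM, or PSNR; four or five RD points per
    curve are typical."""
    lr_a = np.log(np.asarray(rate_anchor, dtype=float))
    lr_t = np.log(np.asarray(rate_test, dtype=float))
    q_a = np.asarray(quality_anchor, dtype=float)
    q_t = np.asarray(quality_test, dtype=float)

    # Cubic fit of log-rate as a function of quality for each curve
    p_a = np.polyfit(q_a, lr_a, 3)
    p_t = np.polyfit(q_t, lr_t, 3)

    # Integrate both fits over the overlapping quality interval
    lo, hi = max(q_a.min(), q_t.min()), min(q_a.max(), q_t.max())
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)

    # Average log-rate difference, converted back to a percentage
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0
```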

4.3. Experimental Setup

The experiment was carried out utilizing PyTorch v2.4.0 [40] as the deep learning framework on an Ubuntu operating system. The hardware setup included two AMD EPYC 7513 32-core CPUs, 384 GB of RAM, and an NVIDIA A6000 GPU.

4.4. Quantitative Analysis

The compression performance of the proposed architecture is presented in Table 1 and Table 2, showcasing its effectiveness in both RA and AI configurations. JVET employs the BD-Rate metric to assess bitrate reduction, with a lower BD-Rate value indicating improved coding efficiency.
From Table 1, it is observed that the overall VMAF reduction is −13.05%, while the MS-SSIM reductions are −5.00% for the luma (Y) component and −18.30% and −19.82% for the chroma (U and V) components in the RA scenario. For classes A1 and A2, both at 4K resolution (i.e., 3840 × 2160), the average VMAF reductions are −16.65% and −18.84%, respectively. The Y-MSSSIM reductions for these two classes are −5.90% and −5.72%. The results indicate a notable coding gain achieved without resorting to aggressive compression techniques on 4K resolution image classes, despite their larger file sizes. Additionally, for class D with a resolution of 416 × 240 (WQVGA), the average VMAF coding gain is −11.15%, demonstrating the effectiveness of the proposed method for low-resolution video sequences as well.
From Table 2, the analysis of the average BD rate savings for the AI configuration reveals significant efficiency improvements, particularly for class A, which shows a VMAF savings of −21.79%. This substantial reduction indicates that the AI-driven approach effectively optimizes video quality while minimizing bitrate, allowing for enhanced performance in high-resolution scenarios. In contrast, class D demonstrates a more modest gain of −6.71%, suggesting that while the proposed method is beneficial for low-resolution video sequences, the impact is less pronounced compared to higher-resolution classes. Overall, these findings highlight the adaptability of the AI configuration across different video classes, showcasing its potential to deliver quality improvements while managing bandwidth effectively.
To further assess the proposed method, we compared its results with those from the recently published VVC post-processing paper [29], as presented in Table 3. The findings reveal that the proposed method achieves an overall VMAF gain of −10.01% and an overall Y-MSSSIM gain of −1.51% compared to [29], indicating a notable improvement in visual quality. This analysis suggests that the RA configuration not only enhances the perceptual quality of the video but also demonstrates its effectiveness in optimizing both luma and chroma components. The gains in VMAF and Y-MSSSIM highlight the potential of the proposed method to deliver superior visual experiences, reinforcing its viability as a robust solution for post-processing in VVC. VVC-PPFF [29] was trained as a single network architecture without a guiding network. In contrast, our proposed DREFNet is a generative adversarial network (GAN)-based architecture in which the discriminator network guides the generator network toward the best performance, and it was trained in two stages in two different manners. VVC-PPFF mainly focused on PSNR gain, even at the cost of degraded visual appeal, whereas DREFNet keeps a balance between objective and subjective performance improvement. Therefore, the network architecture, the adversarial training, and a suitable combination of loss functions contribute to the superior results. Although VVC-PPFF [29] is a more computationally complex model because of its feature fusion part, we obtain superior results compared to it.
Table 4 compares the average BD-Rate of our proposed method against another GAN-based post-processing network reported in [28]. They tested their network on the Random Access (RA) configuration only, whereas we tested on both the RA and All Intra (AI) configurations. Considering VMAF, they achieved on average a 13.85% coding gain over VVC, while we achieved a 13.27% coding gain, which is very close. In particular, based on VMAF, the coding gain for class A is higher with our proposed model. Their average PSNR gain was 0.9%, whereas our method achieves a higher PSNR gain of 5.20%.

4.5. Ablation Study

We performed an ablation study by excluding two main parts of our proposed model, the DRB and the QP map, one at a time. Table 5 lists the VMAF results on class C and class D after excluding each part. The results show that the DRB and the QP map have a greater effect on AI configuration performance than on RA.

4.6. Qualitative Analysis

Figure 5 and Figure 6 illustrate the visual quality comparison between VVC-compressed images and the output from the proposed network for the RA scenario. In class C sequence BQMall, the proposed output demonstrates effective detail preservation, while in the class B sequence BasketballDrive, an improvement in clarity is noted for the output of the proposed method.
Figure 7 and Figure 8 present a visual quality comparison between VVC-compressed images and the output from the proposed network for the AI scenario. In the class C sequence BasketballDrill, the proposed output exhibits enhanced sharpness, while in the class E sequence Johnny, a reduction in artifacts is observed in the output from the proposed method.

4.7. Rate Distortion Plot Analysis

Figure 9, Figure 10, Figure 11 and Figure 12 illustrate the rate-distortion plots (RD curves) depicting the relationship between VMAF and bitrate, as well as Y-MSSSIM and bitrate. The rate-distortion plot is a valuable tool for visualizing the trade-off between video quality and bitrate, specifically in the context of VMAF versus bitrate and Y-MSSSIM versus bitrate.
In the VMAF vs. bitrate plot, we typically observe a curve that illustrates how increasing the bitrate leads to improved perceptual quality, as indicated by higher VMAF scores. Figure 9 and Figure 10 present the VMAF vs. bitrate RD curves for five video sequences in the RA scenario and six video sequences in the AI scenario. The curves indicate that the proposed method demonstrates enhanced gains for both the RA and AI sequences.
Similarly, the Y-MSSSIM vs. bitrate plot provides insights into the structural similarity of the video content, showing how bitrate adjustments affect the preservation of luma details. As bit rate increases, Y-MSSSIM scores generally improve, reflecting better retention of visual features. Figure 11 and Figure 12 display the Y-MSSSIM vs. bitrate RD curves for five video sequences in the RA scenario and six video sequences in the AI scenario. The curves reveal that the proposed method also achieves enhanced gains for both the RA and AI sequences when compared to the VVC. Analyzing both plots together allows for a comprehensive understanding of how different encoding strategies impact video quality.
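A minimal plotting sketch for such RD curves is shown below (the function name is ours); since the proposed method is a post-processing step at constant bitrate, the proposed curve shares the anchor's bitrates but with higher quality scores, and the same routine applies to Y-MSSSIM by swapping the y-axis values.

```python
import matplotlib.pyplot as plt

def plot_rd_curve(bitrate_vvc, vmaf_vvc, bitrate_prop, vmaf_prop, title):
    """Plot one VMAF-vs-bitrate RD curve pair (VVC anchor vs. proposed)."""
    plt.figure()
    plt.plot(bitrate_vvc, vmaf_vvc, "o-", label="VVC anchor")
    plt.plot(bitrate_prop, vmaf_prop, "s-", label="Proposed")
    plt.xlabel("Bitrate (kbps)")
    plt.ylabel("VMAF")
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()
```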

4.8. Model Complexity

In this research, we focus on keeping a balance between visual quality and quantitative metric improvement of VVC compressed frames, while also limiting complexity by finding the optimal number of blocks in different parts of the network. To evaluate the computational complexity of the proposed model, we consider three metrics: total number of parameters, floating point operations (FLOPs), and inference time. Note that FLOPs and inference time depend on the input size; for example, class A consists of high-resolution videos while class D consists of low-resolution videos, so the inference time and FLOPs are higher for the former. Considering the total number of parameters (~2 M), we can conclude that our model is reasonably lightweight. Table 6 presents the computational cost of the proposed network at the inference stage for different video classes.
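A simple way to obtain the parameter count and a rough per-frame inference time is sketched below (the function name is ours, and the resolutions passed in are examples); FLOPs counting requires an external profiler and is omitted here.

```python
import time
import torch

def measure_complexity(model: torch.nn.Module, height: int, width: int):
    """Report the parameter count and a rough single-frame inference time
    for the given resolution."""
    params = sum(p.numel() for p in model.parameters())
    x = torch.rand(1, 2, height, width)  # reconstructed frame + QP-map channel
    model.eval()
    with torch.no_grad():
        model(x)                          # warm-up run
        start = time.perf_counter()
        model(x)
        elapsed = time.perf_counter() - start
    return params, elapsed

# Example: class D (WQVGA, 416x240) input
# params, secs = measure_complexity(DREFNetGenerator(), 240, 416)
```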

4.9. Discussion

The proposed post-processing network effectively addresses both RA and AI scenarios by employing a unified model generation approach tailored to specific scenarios across various QP ranges. By incorporating a QP map alongside the reconstructed image as input to the generator network, we enhance the model’s ability to generalize across different QP settings, which is crucial for maintaining image quality in diverse encoding conditions. The integration of Deep Residual Blocks (DRBs) and Enhanced Feature Block (EFB) within the generator facilitates the learning of robust features, allowing the network to train deeper architectures without succumbing to the vanishing gradient problem. Our two-stage training strategy further optimizes performance; initially, the generator is trained using the SSIM loss function, which focuses on perceptual quality, followed by training the GAN architecture in the second stage, utilizing a perceptual loss function to refine the output quality. The proposed post-processing network establishes a strong basis for utilizing GAN in video coding. A potential future direction involves improving the GAN architecture to more effectively capture and leverage temporal dependencies between frames, which is crucial for ensuring visual consistency in dynamic video content. Furthermore, investigating advanced training techniques could enable a gradual enhancement of the output generated, resulting in higher quality results. Additionally, further exploration of loss function optimization could lead to even more realistic and visually appealing outputs.

5. Conclusions

In this paper, we proposed a post-processing network that effectively addresses the challenges associated with both RA and AI scenarios by employing a unified model generation approach tailored to various QP ranges. By incorporating a QP map alongside the reconstructed image as input, the network enhances its ability to generalize across different QP settings, which is vital for maintaining high image quality under diverse encoding conditions. The use of DRBs and the EFB within the generator facilitates the learning of robust features, allowing deeper architectures to be trained successfully without encountering vanishing gradients. Furthermore, our two-stage training strategy, which begins with the SSIM loss function and transitions to a perceptual loss function for the GAN architecture, optimizes the output quality by focusing on both structural fidelity and perceptual realism. The experimental results demonstrate notable BD-Rate improvements of −13.05% and −5.00% for VMAF and Y MS-SSIM in the RA scenario, respectively, and −11.09% and −5.87% for VMAF and Y MS-SSIM in the AI scenario.

Author Contributions

Conceptualization, T.D.; methodology, T.D. and K.C.; software, T.D.; validation, T.D. and K.C.; formal analysis, T.D.; investigation, T.D. and K.C.; resources, T.D.; data curation, T.D. and K.C.; writing—original draft preparation, T.D.; writing—review and editing, T.D. and K.C.; visualization, T.D.; supervision, K.C.; project administration, K.C.; funding acquisition, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study are openly available in [31,36].

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Selection of Hyperparameters

In this research, we have categorized the potential hyperparameters into three groups; the first and second groups are interrelated.
The first group of hyperparameters relates to deep learning network training in general.
  • Learning rate
Controls how much the model weights are updated during training. A smaller value leads to slower convergence but better stability. It is common practice to start with 1 × 10⁻³ and adjust based on convergence behavior.
  • Batch Size
Number of training samples used in one forward/backward pass. Affects memory use and training dynamics. In general, smaller sizes can generalize better; larger ones train faster.
  • Number of Epochs
The number of complete passes through the training dataset. It is suggested to use early stopping to prevent overfitting.
  • Optimizer
Algorithm used to update model weights based on gradients. Common choices include Adam, SGD, and RMSProp; Adam is typically a good default. A minimal sketch combining these first-group settings is shown after this list.
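As referenced in the optimizer item above, the following is a minimal sketch of how these first-group settings might be wired together in PyTorch. Apart from the 1 × 10⁻³ starting learning rate mentioned above, all values (the scheduler, the patience, the epoch cap, and the stand-in model) are illustrative assumptions rather than the exact DREFNet training recipe.

```python
import torch
import torch.nn as nn

# Illustrative first-group settings; the stand-in model replaces the generator.
model = nn.Conv2d(4, 3, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)          # learning rate
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):                       # number of epochs (upper bound)
    # ... one pass over the training set with the chosen batch size ...
    val_loss = 0.0                             # ... evaluate on validation data ...
    scheduler.step(val_loss)                   # lower the learning rate on plateaus
    if val_loss < best_val:                    # early stopping on validation loss
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```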
The second group of hyperparameters is specific to our proposed DREFNet.
  • Number of residual blocks
In our proposed network, sixteen Deep Residual Blocks (DRBs) are used; this number was determined experimentally to balance model complexity against the quantitative gain. A generic residual-block sketch is given after this list.
  • Activation function
Function applied to introduce non-linearity in the model. ReLU is the standard choice and is used in the proposed DREFNet; LeakyReLU can also be considered for better gradient flow.
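The exact layer composition of the DRB follows Figure 2a; purely to illustrate how the residual-block count and activation choice interact, a generic residual block with ReLU is sketched below, with the channel width and kernel size chosen as assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Generic residual block with ReLU, used only to illustrate the second
    hyperparameter group; the actual DRB layout follows Figure 2a."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)    # identity skip mitigates vanishing gradients

# Sixteen such blocks stacked, mirroring the DRB count in the proposed generator.
trunk = nn.Sequential(*[ResidualBlock() for _ in range(16)])
print(trunk(torch.randn(1, 64, 48, 48)).shape)
```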
The third group is connected to the loss function.
  • Weight on different components
Equation (4) describes the loss function for the generator network. The relative contribution of the individual terms to the final loss can also be treated as a hyperparameter; we believe this weighting helps keep both the subjective and the objective performance metrics stable. A hedged sketch of such a weighted loss is given below.
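Since the exact weights of Equation (4) are not restated here, the following is only a hedged sketch of a weighted generator loss with perceptual, adversarial, and pixel terms; the weight values, the precise term mix, and the function names are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake, real, disc_fake_logits, feat_extractor,
                   w_perc=1.0, w_adv=1e-3, w_pix=1e-2):
    """Weighted generator loss in the spirit of Equation (4).

    The three weights and the exact term mix are illustrative assumptions;
    only the presence of perceptual and adversarial components is stated in
    the paper. `feat_extractor` is a frozen feature network (e.g., VGG).
    """
    # Perceptual term: distance between deep features of output and original.
    perc = F.l1_loss(feat_extractor(fake), feat_extractor(real))
    # Adversarial term: push the discriminator to label generated frames real.
    adv = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    # Pixel term: keeps low-level fidelity stable during GAN training.
    pix = F.l1_loss(fake, real)
    return w_perc * perc + w_adv * adv + w_pix * pix

# Toy check with random tensors and an identity "feature extractor".
fake, real = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(generator_loss(fake, real, torch.randn(1, 1), lambda t: t))
```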

References

  1. Sullivan, G.J.; Ohm, J.-R.; Han, W.-J.; Wiegand, T. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
  2. Bross, B.; Wang, Y.K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.R. Overview of the Versatile Video Coding (VVC) Standard and Its Applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764. [Google Scholar] [CrossRef]
  3. Xu, S.; Nan, C.; Yang, H. Low-Light Image Enhancement Network Based on Hierarchical Residual and Attention Mechanism. In Proceedings of the 2024 43rd Chinese Control Conference (CCC), Kunming, China, 28–31 July 2024; pp. 7498–7503. [Google Scholar] [CrossRef]
  4. Zhao, S.; Mei, X.; Ye, X.; Guo, S. MSFE-UIENet: A Multi-Scale Feature Extraction Network for Marine Underwater Image Enhancement. J. Mar. Sci. Eng. 2024, 12, 1472. [Google Scholar] [CrossRef]
  5. Kalaivani, A.; Jayapriya, P.; Devi, A.S. Innovative Approaches in Deep Neural Networks: Enhancing Performance through Transfer Learning Techniques. In Proceedings of the 2024 3rd International Conference on Automation, Computing and Renewable Systems (ICACRS), Pudukkottai, India, 4–6 December 2024; pp. 857–864. [Google Scholar] [CrossRef]
  6. Wang, M.-Z.; Wan, S.; Gong, H.; Ma, M.-Y. Attention-Based Dual-Scale CNN In-Loop Filter for Versatile Video Coding. IEEE Access 2019, 7, 145214–145226. [Google Scholar] [CrossRef]
  7. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  8. Dong, C.; Loy, C.C.; Tang, X. Accelerating the Super-Resolution Convolutional Neural Network. arXiv 2016, arXiv:1608.00367. [Google Scholar] [CrossRef]
  9. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar] [CrossRef]
  10. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015; Available online: https://ora.ox.ac.uk/objects/uuid:60713f18-a6d1-4d97-8f45-b60ad8aebbce (accessed on 16 March 2025).
  11. Tai, Y.; Yang, J.; Liu, X. Image Super-Resolution via Deep Recursive Residual Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2790–2798. [Google Scholar] [CrossRef]
  12. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1132–1140. [Google Scholar] [CrossRef]
  13. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.P.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 105–114. [Google Scholar] [CrossRef]
  14. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Loy, C.C. ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. In Computer Vision—ECCV 2018 Workshops; Leal-Taixé, L., Roth, S., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; Volume 11133, pp. 63–79. [Google Scholar] [CrossRef]
  15. Rakotonirina, N.C.; Rasoanaivo, A. ESRGAN+: Further Improving Enhanced Super-Resolution Generative Adversarial Network. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–9 May 2020; pp. 3637–3641. [Google Scholar] [CrossRef]
  16. Cai, J.; Meng, Z.; Ho, C.M. Residual Channel Attention Generative Adversarial Network for Image Super-Resolution and Noise Reduction. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: Seattle, WA, USA, 2020; pp. 1852–1861. [Google Scholar] [CrossRef]
  17. Wang, W.; Guo, R.; Tian, Y.; Yang, W. CFSNet: Toward a Controllable Feature Space for Image Restoration. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Seoul, Republic of Korea, 2019; pp. 4139–4148. [Google Scholar] [CrossRef]
  18. Zhao, Y.; Lin, K.; Wang, S.; Ma, S. Joint Luma and Chroma Multi-Scale CNN In-loop Filter for Versatile Video Coding. In Proceedings of the 2022 IEEE International Symposium on Circuits and Systems (ISCAS), Austin, TX, USA, 27 May–1 June 2022; pp. 3205–3209. [Google Scholar] [CrossRef]
  19. Chen, S.; Chen, Z.; Wang, Y.; Liu, S. In-Loop Filter with Dense Residual Convolutional Neural Network for VVC. In Proceedings of the 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Shenzhen, China, 6–8 August 2020; pp. 149–152. [Google Scholar] [CrossRef]
  20. Huang, Z.; Sun, J.; Guo, X.; Shang, M. One-for-all: An Efficient Variable Convolution Neural Network for In-loop Filter of VVC. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2342–2355. [Google Scholar] [CrossRef]
  21. Zhang, H.; Jung, C.; Liu, Y.; Li, M. Lightweight CNN-Based in-Loop Filter for VVC Intra Coding. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 1635–1639. [Google Scholar] [CrossRef]
  22. Zhang, H.; Liu, Y.; Jung, C.; Liu, Y.; Li, M. RTNN: A Neural Network-Based In-Loop Filter in VVC Using Resblock and Transformer. IEEE Access 2024, 12, 104599–104610. [Google Scholar] [CrossRef]
  23. Wang, M.; Wan, S.; Gong, H.; Yu, Y.; Liu, Y. An Integrated CNN-based Post Processing Filter For Intra Frame in Versatile Video Coding. In Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 18–21 November 2019; pp. 1573–1577. [Google Scholar] [CrossRef]
  24. Zhang, H.; Jung, C.; Zou, D.; Li, M. WCDANN: A Lightweight CNN Post-Processing Filter for VVC-Based Video Compression. IEEE Access 2023, 1, 83400–83413. [Google Scholar] [CrossRef]
  25. Lin, W.; He, X.; Han, X.; Liu, D.; See, J.; Zou, J.; Xiong, H.; Wu, F. Partition-Aware Adaptive Switching Neural Networks for Post-Processing in HEVC. IEEE Trans. Multimed. 2020, 22, 2749–2763. [Google Scholar] [CrossRef]
  26. Guan, Z.; Xing, Q.; Xu, M.; Yang, R.; Liu, T.; Wang, Z. MFQE 2.0: A New Approach for Multi-Frame Quality Enhancement on Compressed Video. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 949–963. [Google Scholar] [CrossRef] [PubMed]
  27. Ma, D.; Zhang, F.; Bull, D.R. MFRNet: A New CNN Architecture for Post-Processing and In-loop Filtering. IEEE J. Sel. Top. Signal Process. 2021, 15, 378–387. [Google Scholar] [CrossRef]
  28. Zhang, F.; Ma, D.; Feng, C.; Bull, D.R. Video Compression with CNN-based Post Processing. IEEE Multimed. 2021, 28, 74–83. [Google Scholar] [CrossRef]
  29. Das, T.; Liang, X.; Choi, K. Versatile Video Coding-Post Processing Feature Fusion: A Post-Processing Convolutional Neural Network with Progressive Feature Fusion for Efficient Video Enhancement. Appl. Sci. 2024, 14, 8276. [Google Scholar] [CrossRef]
  30. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2016, arXiv:1511.06434. [Google Scholar] [CrossRef]
  31. Ma, D.; Zhang, F.; Bull, D.R. BVI-DVC: A Training Database for Deep Video Compression. IEEE Trans. Multimed. 2022, 24, 3847–3858. [Google Scholar] [CrossRef]
  32. Wieckowski, A.; Brandenburg, J.; Hinz, T.; Bartnik, C.; George, V.; Hege, G.; Helmrich, C.; Henkel, A.; Lehmann, C.; Stoffers, C.; et al. Vvenc: An Open And Optimized Vvc Encoder Implementation. In Proceedings of the 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China, 5–9 July 2021; pp. 1–2. [Google Scholar] [CrossRef]
  33. Wieckowski, A.; Hege, G.; Bartnik, C.; Lehmann, C.; Stoffers, C.; Bross, B.; Marpe, D. Towards A Live Software Decoder Implementation For The Upcoming Versatile Video Coding (VVC) Codec. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 3124–3128. [Google Scholar] [CrossRef]
  34. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  35. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar] [CrossRef]
  36. Alshina, E.; Liao, R.-L.; Liu, S.; Segall, A. JVET common test conditions and evaluation procedures for neural network based video coding technology. In Proceedings of the Joint Video Experts Team (JVET) 30th Meeting, JVET-AC2016-v1, Virtually, 23–27 January 2023. [Google Scholar]
  37. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 2, pp. 1398–1402. [Google Scholar] [CrossRef]
  38. Netflix Technology Blog. Toward a Practical Perceptual Video Quality Metric. Medium. Available online: https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652 (accessed on 6 November 2023).
  39. Bjontegaard, G. Calculation of average PSNR differences between RD-curves. In Proceedings of the Video Coding Experts Group (VCEG) Thirteenth Meeting, VCEG-M33, Austin, TX, USA, 2–4 April 2001. [Google Scholar]
  40. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: San Francisco, CA, USA, 2019; Available online: https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html (accessed on 4 March 2023).
Figure 1. Generator architecture of proposed GAN.
Figure 2. (a) Deep Residual Block (b) Enhanced Feature Block.
Figure 3. Discriminator architecture of proposed GAN.
Figure 4. Training data distribution.
Figure 5. Visualization of C class sequence BQMall for RA configuration: (a) original frame, (b) reconstructed frame, and (c) proposed network output.
Figure 6. Visualization of B class sequence BasketballDrive for RA configuration: (a) original frame, (b) reconstructed frame, and (c) proposed network output.
Figure 7. Visualization of C class sequence BasketballDrill for AI configuration: (a) original frame, (b) reconstructed frame, and (c) proposed network output.
Figure 8. Visualization of E class sequence Johnny for AI configuration: (a) original frame, (b) reconstructed frame, and (c) proposed network output.
Figure 9. RD curve (VMAF vs. Bitrate) performance comparison for five different test sequences in RA configuration.
Figure 10. RD curve (VMAF vs. Bitrate) performance comparison for six different test sequences in AI configuration.
Figure 11. RD curve (Y-MSSSIM vs. Bitrate) performance comparison for five different test sequences in RA configuration.
Figure 12. RD curve (Y-MSSSIM vs. Bitrate) performance comparison for six different test sequences in AI configuration.
Table 1. The compression performance of the proposed method for RA configuration. All values are BD-Rate (%).
Class | Sequence | VMAF | Y-MSSSIM | U-MSSSIM | V-MSSSIM
A1 | Tango2 | −15.11% | −5.99% | −29.82% | −21.67%
A1 | FoodMarket4 | −12.80% | −4.16% | −12.14% | −13.81%
A1 | Campfire | −22.05% | −7.54% | −11.88% | −27.63%
A1 | Average | −16.65% | −5.90% | −17.95% | −21.04%
A2 | CatRobot | −20.47% | −6.62% | −23.29% | −17.91%
A2 | DaylightRoad2 | −24.47% | −7.57% | −25.71% | −22.67%
A2 | ParkRunning3 | −11.58% | −2.97% | −7.31% | −7.31%
A2 | Average | −18.84% | −5.72% | −18.77% | −15.97%
B | MarketPlace | −17.44% | −3.41% | −21.26% | −17.61%
B | RitualDance | −14.81% | −4.52% | −16.56% | −18.46%
B | Cactus | −6.52% | −4.61% | −20.42% | −15.71%
B | BasketballDrive | −12.92% | −5.00% | −20.58% | −23.12%
B | BQTerrace | −9.16% | −4.70% | −15.99% | −21.03%
B | Average | −12.17% | −4.45% | −18.96% | −19.19%
C | BasketballDrill | −10.70% | −4.55% | −21.31% | −25.82%
C | BQMall | −8.29% | −6.61% | −20.63% | −22.37%
C | PartyScene | −9.54% | −4.27% | −7.46% | −13.21%
C | RaceHorses | −7.45% | −3.54% | −17.94% | −23.26%
C | Average | −8.99% | −4.74% | −16.83% | −21.17%
D | BasketballPass | −13.80% | −5.72% | −24.88% | −23.22%
D | BQSquare | −17.32% | −5.49% | −14.73% | −23.12%
D | BlowingBubbles | −4.30% | −3.90% | −14.17% | −14.11%
D | RaceHorses | −9.17% | −3.87% | −21.56% | −24.52%
D | Average | −11.15% | −4.74% | −18.83% | −21.24%
Overall | | −13.05% | −5.00% | −18.30% | −19.82%
Table 2. The compression performance of the proposed method for AI configuration. All values are BD-Rate (%).
Class | Sequence | VMAF | Y-MSSSIM | U-MSSSIM | V-MSSSIM
A1 | Tango2 | −17.14% | −5.58% | −17.81% | −14.78%
A1 | FoodMarket4 | −17.54% | −5.21% | −12.75% | −13.50%
A1 | Campfire | −26.35% | −10.06% | −5.79% | −17.43%
A1 | Average | −20.34% | −6.95% | −12.12% | −15.24%
A2 | CatRobot | −23.01% | −8.72% | −15.34% | −15.18%
A2 | DaylightRoad2 | −26.12% | −8.70% | −20.62% | −18.84%
A2 | ParkRunning3 | −16.24% | −2.19% | −4.99% | −8.24%
A2 | Average | −21.79% | −6.54% | −13.65% | −14.09%
B | MarketPlace | −18.75% | −3.87% | −17.28% | −14.72%
B | RitualDance | −20.80% | −5.15% | −15.97% | −16.70%
B | Cactus | −1.90% | −5.35% | −10.94% | −13.84%
B | BasketballDrive | −7.21% | −5.10% | −16.24% | −19.75%
B | BQTerrace | −3.51% | −5.77% | −14.90% | −18.23%
B | Average | −10.44% | −5.05% | −15.07% | −16.65%
C | BasketballDrill | −8.40% | −4.82% | −16.89% | −23.17%
C | BQMall | −6.53% | −7.58% | −17.91% | −20.24%
C | PartyScene | −4.67% | −4.14% | −10.12% | −13.83%
C | RaceHorses | −4.85% | −3.74% | −12.94% | −17.88%
C | Average | −6.11% | −5.07% | −14.46% | −18.78%
D | BasketballPass | −9.73% | −4.28% | −20.27% | −22.92%
D | BQSquare | −7.92% | −5.34% | −10.69% | −15.21%
D | BlowingBubbles | −1.72% | −4.42% | −13.04% | −14.27%
D | RaceHorses | −7.47% | −3.31% | −18.54% | −22.08%
D | Average | −6.71% | −4.34% | −15.64% | −18.62%
E | FourPeople | −3.52% | −10.01% | −11.67% | −14.92%
E | Johnny | −6.78% | −8.61% | −16.59% | −18.06%
E | KristenAndSara | −3.74% | −7.28% | −12.13% | −15.92%
E | Average | −4.68% | −8.63% | −13.47% | −16.30%
Overall | | −11.09% | −5.87% | −14.25% | −16.81%
Table 3. BD-Rate comparison between the proposed method and VVC-PPFF [29] for RA configuration. All values are BD-Rate (%).
Class | VVC-PPFF [29]: VMAF / Y-MSSSIM / U-MSSSIM / V-MSSSIM | Proposed: VMAF / Y-MSSSIM / U-MSSSIM / V-MSSSIM
A1 | −5.78% / −3.89% / −14.59% / −15.67% | −16.55% / −5.90% / −17.95% / −21.04%
A2 | −4.43% / −3.54% / −14.83% / −10.52% | −18.84% / −5.72% / −18.77% / −15.97%
B | 1.16% / −2.78% / −16.21% / −15.13% | −12.17% / −4.45% / −18.96% / −19.19%
C | −3.82% / −3.94% / −16.48% / −17.60% | −8.99% / −4.74% / −16.83% / −21.17%
D | −4.38% / −3.58% / −16.32% / −17.00% | −11.15% / −4.74% / −18.83% / −21.24%
Overall | −3.04% / −3.49% / −15.82% / −15.40% | −13.05% / −5.00% / −18.30% / −19.82%
Table 4. BD-Rate comparison between the proposed method and another GAN [28] for RA configuration. All values are BD-Rate (%).
Class | Proposed: VMAF / Y-PSNR | [28]: VMAF / Y-PSNR
A | −18.06% / −11.57% | −11.7% / +0.1%
B | −12.17% / −4.83% | −16.05% / 0.0%
C | −9.3% / −4.78% | −13.05% / −1.2%
D | −11.4% / −5.88% | −13.4% / −2.45%
Overall | −13.27% / −5.20% | −13.85% / −0.9%
Table 5. BD-Rate comparison based on VMAF for RA and AI configurations for the ablation study. All values are BD-Rate (%).
Class | DRB excluded: RA / AI | QP map excluded: RA / AI
C | −2.37% / +2.58% | −0.27% / +1.49%
D | −2.40% / +2.66% | −0.19% / +0.57%
Table 6. Computational cost of proposed model for different resolution videos at inference.
Class | Resolution | Inference Time/Frame | FLOPs (G) | Total Parameters (M)
A1 | 3840 × 2160 | 6~7 s | 8769 | 2.14
A2 | 3480 × 2160 | 6~7 s | 8768 | 2.14
B | 1920 × 1080 | ~3 s | 2192 | 2.14
C | 832 × 480 | <1 s | 422.5 | 2.14
D | 416 × 240 | <1 s | 105.5 | 2.14
E | 1280 × 720 | ~2 s | 974 | 2.14
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
