Deep Learning Post-Filtering Using Multi-Head Attention and Multiresolution Feature Fusion for Image and Intra-Video Quality Enhancement

The paper proposes a novel post-filtering method based on convolutional neural networks (CNNs) for quality enhancement of RGB/grayscale images and video sequences. The lossy images are encoded using common image codecs, such as JPEG and JPEG2000. The video sequences are encoded using previous and ongoing video coding standards, high-efficiency video coding (HEVC) and versatile video coding (VVC), respectively. A novel deep neural network architecture is proposed to estimate fine refinement details for full-, half-, and quarter-patch resolutions. The proposed architecture is built using a set of efficient processing blocks designed based on the following concepts: (i) the multi-head attention mechanism for refining the feature maps, (ii) the weight sharing concept for reducing the network complexity, and (iii) novel block designs of layer structures for multiresolution feature fusion. The proposed method provides substantial performance improvements compared with both common image codecs and video coding standards. Experimental results on high-resolution images and standard video sequences show that the proposed post-filtering method provides average BD-rate savings of 31.44% over JPEG and 54.61% over HEVC (x265) for RGB images, Y-BD-rate savings of 26.21% over JPEG and 15.28% over VVC (VTM) for grayscale images, and 15.47% over HEVC and 14.66% over VVC for video sequences.


Introduction
In image compression, the main objective is to develop efficient algorithms that minimize the amount of data required to represent the visual information. Nowadays, the consumption of image and video content is constantly increasing, which calls for the design of novel compression methods with highly increased coding performance.
Current compression methods employed on images include conventional image codecs such as JPEG [1] and JPEG2000 [2], denoted here simply as J2K. The JPEG codec [1] was developed by the Joint Photographic Experts Group (JPEG) and remains the most common format for image compression. The J2K standard [2] improves the compression performance over its predecessor and offers several functionalities, including resolution and quality scalability and region-of-interest coding. JPEG and J2K lossy compressed images often suffer from compression artifacts such as blocking artifacts, color bleeding, and ringing effects. JPEG is designed based on the discrete cosine transform (DCT), and the block-based nature of this codec leads to blocking effects at high compression ratios. The global nature of the discrete wavelet transform (DWT) employed in J2K overcomes the blocking effects typical for JPEG compression, but ringing effects and color bleeding are observed at high compression ratios.
A common solution to enhance the quality of lossy images is to employ different filtering techniques which reduce the effect of coding artifacts. In this work, we propose a novel filtering method based on convolutional neural networks (CNNs) designed to enhance the quality of high-resolution images and video sequences by post-processing the decoded images and video frames, without applying any other modification to the corresponding coding framework.
In our prior work, we propose to replace the built-in filtering module in HEVC [3] with efficient deep learning (DL)-based filtering tools and enhance the quality of HEVC-compressed video and light field (LF) images. In [4], we propose a frame-wise multiscale CNN for quality enhancement of HEVC decoded videos, built based on inception and residual learning blocks (IResLBs) [5], where additional information is extracted from the HEVC decoder to guide the network. In [6], we propose a multiscale CNN to enhance the quality of LF images using macro-pixel volumes. In this work, we introduce a novel DL-based filtering method for enhancing the quality of high-resolution lossy compressed RGB/grayscale images and video sequences. The proposed filtering method follows a simple post-processing approach where a neural network is employed to estimate refinement details which are subsequently added to the decoded images/frames in order to enhance their quality.
In summary, the novel contributions and findings of this paper are as follows: (1) an efficient CNN-based post-filtering method is proposed, where the architecture is designed based on the following concepts: (i) the multi-head attention mechanism for refining the feature maps, (ii) the weight sharing concept for reducing the network complexity, and (iii) novel block designs of layer structures for multiresolution feature fusion; (2) the current resolution patch is processed using feature maps extracted from both the input patch and higher-resolution patches; (3) the network is trained to estimate the refinement details at full-, half-, and quarter-patch resolutions; (4) the extensive experimental evaluation over RGB/grayscale images and video sequences demonstrates that the proposed method offers substantial performance improvements when filtering images encoded by common lossy codecs and video sequences encoded by recent video coding standards, by reducing the coding artifacts specific to each codec.
The remainder of this paper is organized as follows. Section 2 overviews the state-of-the-art methods for image and video quality enhancement. Section 3 describes the proposed filtering method. Section 4 presents the experimental validation on RGB images, grayscale images, and video sequences. Finally, Section 5 draws the conclusions of this work.

Related Work
The design of novel DL-based tools to integrate into HEVC has become a hot research topic, and several filtering tools have been proposed. In [13], an artifact reduction CNN (ARCNN) architecture is proposed to reduce the artifacts in JPEG compressed images using a sequence of four convolution layers. ARCNN was one of the first architectures able to achieve more than 1 dB improvement in PSNR over JPEG compression on classical test images. In [14], the authors proposed a super-resolution CNN (SRCNN) to obtain high-resolution images from the corresponding low-resolution images in an end-to-end manner that outperforms the state-of-the-art in both reconstruction quality and computational speed. In [15], the authors proposed the variable-filter-size residue-learning CNN (VRCNN) architecture, built based on ARCNN [13] and employed to replace the conventional HEVC built-in filters, the de-blocking filter (DBF) [16] and sample adaptive offset (SAO) [17], when filtering the HEVC reconstructed intra-frames. In [18], a VRCNN architecture with batch normalization is proposed, where the new method is called VRCNN-BN and provides competitive results when enhancing the quality of HEVC decoded sequences. An iterative post-filtering method based on an RNN design was proposed in [19]. In [20], the authors employ a CNN-based autoencoder to estimate the radial blur and enhance the deblurred image. In [21], an image enhancement algorithm based on an image-derived graph for weak illumination images is proposed. In [22], a DL-based hue-correction scheme is proposed based on the constant-hue plane in the RGB color space for image enhancement. In [23], a post-filtering DL-based method for quality enhancement of brain images is proposed, where the method operates on 3D volumetric data with a high dynamic range.
In [24], a dual-stream recursive residual network is proposed, which consists of structure and texture streams for separately reducing the specific artifacts related to high-frequency or low-frequency components in JPEG compressed images. The method is called STRNN and it is the current state-of-the-art method for artifact reduction of JPEG compressed grayscale images. In [25], a novel multilevel progressive refinement generative adversarial network (MPRGAN) is employed to filter the intra-coded frames, where the generative network refines the reconstructed frame progressively in order to maximize the error of the adversarial network when distinguishing between the enhanced frame and the original frame. In [26], the authors proposed the frame enhancement CNN (FECNN) architecture which contains nine convolutional layers and introduces the residual learning scheme for filtering intra-frames to achieve a fast convergence speed in network training and also a higher filtering performance compared with VRCNN. In [27], a residual-based video restoration network (residual-VRN) is proposed based on residual learning blocks to enhance the quality of decoded HEVC intra-frames. In [28], a recursive residual CNN (RRCNN) architecture is proposed based on the recursive learning and residual learning schemes, which provides an important advantage over the state-of-the-art methods.
In one approach, DL-based filtering tools are introduced inside the main coding loop to filter the inter-predicted frames. In [29], based on the SRCNN [14] structure, the authors proposed a CNN architecture for in-loop filtering called IFCNN. IFCNN was designed to replace the HEVC built-in SAO [17] filter and to perform frame filtering after DBF [16] in the HEVC framework. In [30], the authors proposed a deep CNN architecture called DCAD employed to perform quality enhancement for all intra- and inter-frames after applying the traditional HEVC built-in DBF [16] and SAO filters [17]. In [31], the authors proposed a CNN architecture called RHCNN which contains residual highway units designed based on residual learning units to enhance the HEVC reconstructed frames. In [32], the authors proposed to perform content-aware CNN-based filtering of the HEVC decoded frames. In [33], a partition-aware architecture is proposed, where the CU partition size is used as additional information. In [34], a GAN-based perceptual training strategy is employed to post-process VVC and AV1 results. In [35], a novel architecture which exploits multilevel feature review residual dense blocks is proposed.
In another approach, DL-based filtering tools were proposed to filter the current inter-coded frame by making use of multiple previously reconstructed frames. In [36], the authors proposed a DL-based multi-frame in-loop filter (MIF) to enhance the current frame using several previously reconstructed frames. In [37], a multi-channel long-short-term dependency residual network (MLSDRN) was proposed to perform quality enhancement in addition to the DBF&SAO filtering. In [38], the authors proposed a filtering method based on the decoder-side scalable convolutional neural network (DS-CNN), where a DS-CNN-I model is employed to enhance the intra-coded frames. In [39], the authors proposed a DL-based filtering method, QENet, designed to work outside the HEVC coding loop. However, QENet still makes use of multiple frames, considering the temporal correlations among them when filtering the inter-coded frames. In summary, in recent years, several deep-learning-based solutions were proposed to either replace the built-in HEVC in-loop filtering methods or to further refine the reconstructed frame. This is in contrast to the recent post-filtering deep-learning-based methods, which show that an improved coding solution can be obtained by simply removing the built-in HEVC in-loop filtering methods.
The research area of neural-network-based loop filtering is intensively studied at the MPEG meetings. Many contributions [40][41][42][43] were recently proposed to enhance the VVC quality. In recent years, the attention mechanism [44] has become a powerful tool for improving the network performance. In [45], the attention module is used for low-light image enhancement. In [46], a new non-local attention module is proposed for low-light image enhancement using multiple exposure image sequences.
In this paper, we follow an approach where no modifications are applied to the codec: the proposed CNN-based post-filtering method is employed to further filter the reconstructed image or decoded video frame. The quality enhancement experiments are carried out on three types of data: RGB images, grayscale images, and intra-predicted video frames. The proposed filtering method enhances the quality of lossy images encoded using traditional image codecs, such as JPEG [1] and J2K [2], and of video sequences encoded using the latest video coding standards, HEVC [3] and VVC [47]. Figure 1 depicts the proposed CNN-based filtering method, which is designed to simply post-process a lossy image compressed using a traditional image or video codec. One can note that no modifications are applied to the encoder-decoder framework. The input image/frame is split into h × w blocks which are used as input patches for the deep neural network. The proposed architecture, called ASQE-CNN, is built around three concepts: (i) the multi-head attention mechanism, where channel and spatial attention are employed to refine the feature maps; (ii) the weight sharing concept, where a single convolutional layer is employed twice inside a specific block of layers; and (iii) the novel multiresolution feature fusion block design used to fuse current-resolution feature maps with lower-/higher-resolution feature maps. ASQE-CNN operates at three patch resolutions (full, half, and quarter) and introduces in its design new network branches used to extract feature maps from the input patch and to estimate the refinement details at each resolution, i.e., it operates at the h × w, h/2 × w/2, and h/4 × w/4 patch resolutions. In the first part of ASQE-CNN, these three resolution feature maps are obtained by processing the input patch, in contrast to the general approach where the input patch is used only for processing the full-patch resolution.
In the last part of ASQE-CNN, the final refinement details are extracted from the full-, half-, and quarter-patch resolutions, in contrast to the general approach where only the full-patch resolution is used.

Network Design
The ASQE-CNN architecture is built using eight types of blocks of layers that are depicted in Figure 3.
The convolutional block attention module (CBAM) was proposed in [44] and uses both channel and spatial attention. CBAM is employed here in the design of the multi-head attention (MHA) block, where the input feature maps are channel-split into K sets of feature maps and each set is processed separately by a different CBAM block. The MHA block is motivated by the observation that network architectures usually contain many channels for processing the current patch; here, we propose to apply attention over only a few dozen channels at a time by splitting the input channels into K sets of feature maps.
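As an illustration, the channel-splitting idea behind the MHA block can be sketched as follows. This is a simplified numpy sketch, not the authors' implementation: the CBAM branches are reduced to an SE-style channel gate with random placeholder weights and a parameter-free spatial gate; only the split-into-K-heads structure follows the description above.

```python
import numpy as np

def channel_attention(x, reduction=4):
    # x: (C, H, W). Squeeze spatially, then gate each channel.
    # The shared MLP weights are random placeholders (untrained).
    c = x.shape[0]
    avg = x.mean(axis=(1, 2))                  # (C,)
    mx = x.max(axis=(1, 2))                    # (C,)
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0)
    scale = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))  # sigmoid gate
    return x * scale[:, None, None]

def spatial_attention(x):
    # x: (C, H, W). Pool along channels, gate each spatial position
    # (a parameter-free stand-in for CBAM's 7x7 convolution).
    avg = x.mean(axis=0)
    mx = x.max(axis=0)
    gate = 1.0 / (1.0 + np.exp(-(avg + mx)))
    return x * gate[None, :, :]

def multi_head_attention_block(x, K=4):
    # Split the C input channels into K groups and refine each group
    # with its own attention head, as in the MHA block.
    groups = np.split(x, K, axis=0)
    refined = [spatial_attention(channel_attention(g)) for g in groups]
    return np.concatenate(refined, axis=0)
```

With K = 4 and 64 channels, each head attends over only 16 feature maps, matching the "few dozen channels" motivation.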
The convolution block (CB) contains a sequence of a 2D convolution (Conv2D) layer equipped with a 3 × 3 kernel, N filters, and stride s; a batch normalization (BN) layer [48]; and a rectified linear unit (ReLU) activation layer [49]. Similarly, the deconvolution block (DB) contains a sequence of a deconvolution (Deconv2D) layer equipped with a 3 × 3 kernel, N filters, and stride s; a BN layer; and a ReLU activation layer. For simplicity, the strides s = (2, 2) and s = (4, 4) are denoted as "/2" and "/4", respectively. DB is used to perform simple patch processing and to increase the patch resolution.
The attention-based shared weights block (ASB) proposes a more efficient patch processing design where an MHA block is inserted between a CB and a Conv2D layer, and the trainable weights are shared between the two Conv2D layers. The skip connection branch is added to the current branch after Conv2D, while BN and ReLU layers are used to further process the feature maps. Moreover, the attention-based shared weights residual block (ASRB) is introduced as a residual learning [50] bottleneck block built using ASB blocks.
The multiresolution feature fusion block design is implemented using the low-resolution feature fusion (LFF) and high-resolution feature fusion (HFF) blocks. These blocks are designed using a similar strategy as ASB, where the MHA block is replaced by an add block which combines the current feature maps with the processed lower-/higher-resolution feature maps.
The ASQE-CNN network contains three parts. The first part is called pre-processing and is used to extract N feature maps from the input patch at three patch resolutions, full, half (s = /2), and quarter (s = /4), by employing three CB blocks. The second part is called multiresolution feature fusion (MFF), where the three resolution patches undergo complex processing using multiresolution feature fusion based on ASRB, HFF, and LFF blocks. To reduce the inference time, ASQE-CNN processes the full-resolution blocks using MHA blocks with K = 1 and ASB blocks instead of ASRB blocks. The last part is called multiresolution refinement, where two DB blocks and three Conv2D layers, equipped with 3 × 3 kernels, are used to estimate the final refinement details corresponding to each resolution. One can note that ASQE-CNN adopts a new strategy for processing multiresolution patches, where the number of channels, N, remains constant throughout the network, in contrast to the general approach where the number of channels is doubled when the patch resolution is halved. Our experiments show that increasing the complexity for lower-resolution patches does not yield a commensurate improvement in network performance.
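The constant-channel multiresolution layout can be illustrated with a small shape-flow sketch. Average pooling and nearest-neighbour upsampling below are simplified stand-ins for the strided CB blocks and the DB blocks; the actual blocks are learned convolutions.

```python
import numpy as np

def downsample(x, s):
    # Stand-in for a strided CB block: average pooling by factor s,
    # keeping the channel count N constant (the paper's design choice).
    n, h, w = x.shape
    return x.reshape(n, h // s, s, w // s, s).mean(axis=(2, 4))

def upsample(x, s):
    # Stand-in for a DB (deconvolution) block: nearest-neighbour upsampling.
    return x.repeat(s, axis=1).repeat(s, axis=2)

N, h, w = 64, 32, 32
full = np.random.default_rng(1).standard_normal((N, h, w))
half = downsample(full, 2)      # (N, h/2, w/2), still N channels
quarter = downsample(full, 4)   # (N, h/4, w/4), still N channels
# Multiresolution refinement: bring all three resolutions back to
# full size and fuse them, mirroring the last part of ASQE-CNN.
fused = full + upsample(half, 2) + upsample(quarter, 4)
```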
The ASQE-CNN model is obtained using N = 64 and K = 4, and contains around 0.89 million (M) parameters. ASQE-CNN is designed to provide an improved performance using a reduced number of parameters. Note that without applying the weight sharing approach, 1.48 M parameters must be trained. Our experiments show that consecutive Conv2D layers may contain similar weights and a too-complex architecture is affected by the vanishing gradient problem; see Section 4.6.

Loss Function
The loss function consists of the summation of a mean squared error (MSE)-based term and an ℓ2 regularization term, used to prevent model overfitting. Let us denote as Θ_ASQE-CNN the set of all learned parameters of the ASQE-CNN model, X_i the i-th input patch in the training set, and Y_i the corresponding original patch, both of size w × h × c, where c denotes the number of color channels, i.e., c = 3 for RGB images and c = 1 for grayscale images and video sequences. Let F(·) be the function which processes X_i using Θ_ASQE-CNN to compute the enhanced patch Ŷ_i = F(X_i, Θ_ASQE-CNN). The loss function is formulated as follows:

L(Θ_ASQE-CNN) = (1/M) Σ_{i=1}^{M} ‖Y_i − F(X_i, Θ_ASQE-CNN)‖²₂ + λ ‖Θ_ASQE-CNN‖²₂,

where M is the number of patches in a training batch and λ is the regularization weight, set here as λ = 0.01. The Adam optimization algorithm [51] is employed.
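The loss above can be sketched in a few lines of numpy, where `forward` stands in for F(·) and `params` for the list of weight tensors in Θ_ASQE-CNN (both are placeholders, not the authors' code):

```python
import numpy as np

def asqe_loss(params, X, Y, forward, lam=0.01):
    # MSE data term plus an l2 penalty on the trainable weights,
    # matching the loss described above (lambda = 0.01).
    Y_hat = forward(X, params)
    mse = np.mean((Y - Y_hat) ** 2)
    l2 = lam * sum(np.sum(p ** 2) for p in params)
    return mse + l2
```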

Experimental Validation
The experimental setup used to validate the proposed DL-based post-filtering method is described in Section 4.1. The experimental results for quality enhancement of decoded RGB images, grayscale images, and video sequences are reported in Sections 4.2-4.4, respectively. The complexity of the proposed network architecture is discussed in Section 4.5. Finally, an ablation study for the proposed network design is presented in Section 4.6.

Experimental Setup
The paper aims to provide an efficient post-filtering method for enhancing the quality of high-resolution images and video sequences. The deep neural network models are trained using input patches extracted from the DIV2K dataset [52,53]. The training dataset in DIV2K is denoted DIV2K_train_HR and contains 800 high-resolution images. The proposed method is employed to enhance the quality of both the lossy images encoded using two traditional image codecs, JPEG [1] and J2K [2], and each frame in the video sequences encoded using the previous and ongoing video coding standards, HEVC [3] and VVC [47], respectively. No modifications are applied to the codec, as ASQE-CNN is employed as a post-filtering method; HEVC and VVC still apply their built-in in-loop filters. In our prior works [4,6], we proved that such a built-in method can have a negative effect on the final HEVC results if a CNN is employed.
The MATLAB implementation of the JPEG codec [1] is used to generate lossy images for each of the following six quality factor points: q_i ∈ {90, 85, 70, 40, 20, 10}, where i = 1 : 6. Note that the same results can be obtained using the Python Imaging Library (PIL). A neural network model is trained for each q_i value and for each color format. Therefore, 6 + 6 = 12 neural network models are trained using DIV2K_train_HR to enhance the quality of the RGB and grayscale images in DIV2K_valid_HR and LIVE1.
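The PIL route mentioned above can be sketched as follows. The gradient test image and the in-memory round trip are illustrative only; the experiments in the paper use the MATLAB codec on the DIV2K images.

```python
from io import BytesIO

import numpy as np
from PIL import Image

def jpeg_roundtrip(img_array, quality):
    # Encode an RGB uint8 array as JPEG at the given quality factor
    # and return (decoded array, compressed size in bytes).
    buf = BytesIO()
    Image.fromarray(img_array).save(buf, format="JPEG", quality=quality)
    size = buf.tell()
    buf.seek(0)
    return np.asarray(Image.open(buf)), size

# A smooth gradient image: higher quality factors spend more bits.
x = np.tile(np.arange(256, dtype=np.uint8), (256, 1))
rgb = np.stack([x, x.T, x], axis=-1)
_, size_q10 = jpeg_roundtrip(rgb, 10)
_, size_q90 = jpeg_roundtrip(rgb, 90)
```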
The MATLAB implementation of the J2K codec [2] is used to generate lossy images at the following six compression ratio points: cr_i ∈ {13, 16, 20, 30, 50, 100}. Similarly to JPEG, a neural network model is trained for each cr_i value and for each color format. Hence, 12 models are trained using DIV2K_train_HR: six using RGB images and six using grayscale images. All 12 models are employed to enhance the image quality in the DIV2K_valid_HR [52] and LIVE1 [54] datasets.
The x265 library implementation of HEVC [3], available in the FFmpeg [57] framework, is employed to encode single images and video sequences using the all-intra profile. For the RGB and grayscale image case, the lossy images are generated for each of the following six constant rate factor (CRF) points, crf_i ∈ {16, 20, 24, 28, 32, 36}, i.e., the -c:v libx265 -preset veryslow -crf crf_i parameters are used to encode the images. A neural network model is trained for each crf_i value and for each color format. For the video sequence case, the lossy frames are generated for each of the following four standard quantization parameter (QP) values, qp_j ∈ {22, 27, 32, 37}, where j = 1 : 4, i.e., the -x265-params keyint=1:qp=qp_j parameters are used to encode the video sequences. A neural network model is trained for each qp_j value. A total of 16 network models are trained: six models are used to enhance the grayscale images and six models to enhance the RGB images in DIV2K_valid_HR and LIVE1, and four models are used to enhance the Y component in HEVC-VTSEQ.
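The two x265 invocations quoted above can be collected into a short command sketch; the input and output file names are placeholders, and only the codec parameters come from the text.

```shell
# All-intra HEVC at a fixed CRF (image experiments, crf_i = 28 shown):
ffmpeg -i input.png -c:v libx265 -preset veryslow -crf 28 image_crf28.mp4

# All-intra HEVC at a fixed QP (video experiments, qp_j = 32 shown);
# keyint=1 forces every frame to be intra-coded:
ffmpeg -i input.y4m -c:v libx265 -x265-params keyint=1:qp=32 video_qp32.mp4
```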
In this work, the VVC Test Model (VTM) [58], which is the reference software implementation of VVC [47], is employed to encode single images and video sequences in the all-intra profile. Similarly to HEVC, the lossy images are generated for each of the four standard QP values. Please note that the VVC runtime is extremely large, e.g., a single high-resolution image is encoded in around 15 minutes (min), and 15 min × 4 QP values × 800 images = 48,000 min ≈ 33 days are needed to generate the training data. Four models are trained using input patches extracted from the RGB images in DIV2K_train_HR [52] and used to report results for quality enhancement of RGB images. Four models are trained using patches extracted from the luminance channel after color transforming the images in DIV2K_train_HR [52] and used to report results for quality enhancement of both grayscale images and video sequences. Hence, eight network models are trained: four models are used to enhance grayscale images (DIV2K_valid_HR and LIVE1) and video sequences (HEVC-VTSEQ), and four models to enhance RGB images (DIV2K_valid_HR and LIVE1).
In our previous work [4], a frame-wise CNN-based filtering method was proposed for video quality enhancement. In contrast, in this work, we propose a block-based approach for quality enhancement. Due to its block-based nature, the enhanced image may still be affected by blocking artifacts. The experiments show that we can further reduce the effect of coding artifacts by applying the proposed method a second time on the same lossy compressed image, whereby the top 16 lines and left 16 columns are cropped, i.e., the input patches are extracted a second time from the cropped input image, enhanced, and then concatenated to obtain the second enhanced image. The two enhanced images are then fused using alpha blending in the RGB image case and averaging in the grayscale image case. Note that the fusion is performed only over the area of the second enhanced image, while the top 16 lines and left 16 columns are copied from the first enhanced image.
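The two-pass, crop-and-blend procedure can be sketched as follows, with `enhance` standing in for the patch-wise ASQE-CNN filtering and a blending weight of 0.5 assumed for the alpha blending (the exact weight is not stated above):

```python
import numpy as np

def two_pass_enhance(img, enhance, crop=16, alpha=0.5):
    # Pass 1: enhance the full image.
    # Pass 2: crop the top `crop` rows and left `crop` columns so the
    # patch grid no longer lines up with pass 1, then enhance again.
    first = enhance(img)
    second = enhance(img[crop:, crop:])
    out = first.copy()
    # Blend the overlap; the top rows / left columns outside the second
    # image's footprint are copied from the first enhanced image.
    out[crop:, crop:] = alpha * first[crop:, crop:] + (1 - alpha) * second
    return out
```

Shifting the patch grid between the two passes is what breaks up the residual blocking artifacts at the patch boundaries.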
The image distortion is reported using both the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [59]. The PSNR between an original image Y and a reconstructed image Ŷ is computed as follows:

PSNR(Y, Ŷ) = 10 log₁₀( (2^b − 1)² / MSE[Y, Ŷ] ),

where MSE[Y, Ŷ] is the corresponding mean squared error and b is the bit precision. The SSIM index [59] for Y and Ŷ is computed as follows:

SSIM(Y, Ŷ) = ( (2 μ_Y μ_Ŷ + c₁)(2 σ_{YŶ} + c₂) ) / ( (μ_Y² + μ_Ŷ² + c₁)(σ_Y² + σ_Ŷ² + c₂) ),

where μ_Y and μ_Ŷ are the means of Y and Ŷ, respectively; σ_Y² and σ_Ŷ² are the variances of Y and Ŷ, respectively; σ_{YŶ} is the covariance of Y and Ŷ; and c₁ = (k₁ L)², c₂ = (k₂ L)² are used to stabilize the division with a weak denominator, with default values k₁ = 0.01, k₂ = 0.03, and L = 2^b − 1 for b-bit precision.
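The two quality metrics can be computed as follows. Note that the SSIM here uses a single global window for brevity, whereas the reference implementation [59] averages local windowed statistics, so values differ slightly in practice.

```python
import numpy as np

def psnr(y, y_hat, b=8):
    # Peak signal-to-noise ratio for b-bit images, in dB.
    mse = np.mean((y.astype(np.float64) - y_hat.astype(np.float64)) ** 2)
    peak = 2 ** b - 1
    return 10 * np.log10(peak ** 2 / mse)

def ssim_global(y, y_hat, b=8, k1=0.01, k2=0.03):
    # Single-window (global) SSIM using the formula above.
    y = y.astype(np.float64)
    y_hat = y_hat.astype(np.float64)
    L = 2 ** b - 1
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu1, mu2 = y.mean(), y_hat.mean()
    v1, v2 = y.var(), y_hat.var()
    cov = ((y - mu1) * (y_hat - mu2)).mean()
    return ((2 * mu1 * mu2 + c1) * (2 * cov + c2)) / \
           ((mu1 ** 2 + mu2 ** 2 + c1) * (v1 + v2 + c2))
```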
The quality enhancement results are compared using the two Bjøntegaard metrics [60], the Bjøntegaard delta-rate (BD-rate) and the Bjøntegaard delta-PSNR (BD-PSNR). BD-PSNR was introduced to evaluate the quality gains of one video codec versus another at an equivalent bitrate. More exactly, when the same number of bits is spent to encode the video using each codec, BD-PSNR measures by how much one codec provides a better quality than the other. Similarly, BD-rate measures the bitrate savings at an equivalent quality. In this paper, we use the publicly available Python implementation [61]. The bitrate is measured in bits per channel (bpc), defined as the ratio between the compressed image file size and the product of the number of image pixels and the number of channels.
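A compact sketch of the BD-rate calculation is given below: fit log-rate as a cubic in PSNR for each codec, integrate over the overlapping quality range, and convert the mean log-rate difference into a percentage. This follows the standard Bjøntegaard procedure, not necessarily the exact code of [61].

```python
import numpy as np

def bd_rate(rates1, psnrs1, rates2, psnrs2):
    # Bjontegaard delta-rate of codec 2 vs. codec 1, in percent
    # (negative values mean codec 2 saves bitrate).
    lr1, lr2 = np.log10(rates1), np.log10(rates2)
    p1 = np.polyfit(psnrs1, lr1, 3)
    p2 = np.polyfit(psnrs2, lr2, 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnrs1), min(psnrs2))
    hi = min(max(psnrs1), max(psnrs2))
    P1, P2 = np.polyint(p1), np.polyint(p2)
    avg1 = (np.polyval(P1, hi) - np.polyval(P1, lo)) / (hi - lo)
    avg2 = (np.polyval(P2, hi) - np.polyval(P2, lo)) / (hi - lo)
    return (10 ** (avg2 - avg1) - 1) * 100
```

For example, a codec that needs half the bitrate of the anchor at every quality point yields a BD-rate of exactly -50%.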
We note that in the considered experiments, we evaluate the proposed method on several data types and coding standards. The proposed method functions as a post-processing method, decoupled from the underlying coding methodology. In this sense, in contrast to customized in-loop filtering methods which require modifications of the standard implementation, the proposed method remains standard-compliant and can be used as a post-filtering tool on image and video data already encoded with existing coding standards. To address the random access (RA) profile of existing video coding standards, it is expected that the proposed method would need to be performed in-loop to maximize coding performance. However, adapting it to in-loop filtering makes it codec-specific and departs from the basic idea of devising a CNN-based post-processing method followed in this work. Investigating how the proposed post-filtering method can be adapted to the RA configuration is left as a topic of further investigation.

Table 3 presents the average BD-rate savings and BD-PSNR improvements over the two test sets, DIV2K_valid_HR [52] and LIVE1 [54]. In the DIV2K_valid_HR case, ASQE-CNN provides average BD-rate savings of 31.44% and an average BD-PSNR improvement of 2.006 dB compared with the traditional JPEG codec, and average BD-rate savings of 16.02% and an average BD-PSNR improvement of 0.905 dB compared with the traditional J2K codec. Moreover, outstanding quality enhancement results are obtained compared with the HEVC standard, with average BD-rate savings of 54.61% and an average BD-PSNR improvement of 3.448 dB, and compared with the VVC standard, with average BD-rate savings of 19.39% and an average BD-PSNR improvement of 1.201 dB.
In the LIVE1 case, ASQE-CNN provides (i) average BD-rate savings of 31.44% and an average BD-PSNR improvement of 2.006 dB for JPEG; (ii) average BD-rate savings of 16.02% and an average BD-PSNR improvement of 0.905 dB for J2K; (iii) average BD-rate savings of 52.78% and an average BD-PSNR improvement of 3.738 dB for HEVC; and (iv) average BD-rate savings of 18.79% and an average BD-PSNR improvement of 1.3056 dB for VVC. One can note that ASQE-CNN provides impressive results over both test sets.

Figure 8 presents a visual comparison between the following methods: JPEG operating at the lowest quality parameter, q_6 = 10; J2K operating at the highest compression ratio, cr_6 = 100; HEVC operating at the highest constant rate factor, crf_6 = 36; VVC operating at qp_4 = 37; and the proposed CNN-based post-filtering method, JPEG+ASQE-CNN, J2K+ASQE-CNN, HEVC+ASQE-CNN, and VVC+ASQE-CNN, respectively. The figure shows that the quality of the highly distorted JPEG image, having a PSNR of only 27.87 dB and an SSIM of 0.7825, was substantially enhanced by JPEG+ASQE-CNN to 29.44 dB in PSNR and 0.8374 in SSIM. The coding artifacts, such as color bleeding and blocking artifacts, were reduced, and most of the fine details were estimated by the ASQE-CNN models, e.g., see the penguin's beak and the rocky background. One can note that J2K offers a better compression performance than JPEG; however, in the zoomed-in area the coding artifacts are still visible.

Figure 10 presents quality enhancement results computed based on the Bjøntegaard metrics [60] for every image in DIV2K_valid_HR [52] and LIVE1 [54]. Note that both the Y-BD-PSNR improvement and the Y-BD-rate savings have a high variation over the two datasets. Table 6 presents the video quality enhancement results computed as Y-BD-rate savings for HEVC-VTSEQ [56].
The results are computed for all available frames in each video sequence by employing ASQE-CNN to post-process the HEVC [3]-decoded videos, and are compared with the results of the six most recent state-of-the-art methods: VRCNN [15], FECNN [26], MLSDRN [37], RRCNN [28], VRCNN-BN [18], and FQE-CNN [4]. The results obtained for the first 30 frames in each video sequence by employing ASQE-CNN to post-process the VVC [47]-decoded videos are also reported in Table 6. ASQE-CNN achieves an outstanding performance compared with the state-of-the-art methods for all video sequences, except for B5, and provides 15.47% Y-BD-rate savings compared with HEVC and 14.66% Y-BD-rate savings compared with VVC. Table 7 presents quality enhancement results computed as the Y-BD-PSNR improvement. ASQE-CNN provides an improvement of around 1.13 dB compared with HEVC [3] and 0.88 dB compared with VVC [47]. Figure 12 presents a pseudo-colored image comparison between HEVC and VVC at qp_4 = 37 and their post-filtered versions, HEVC+ASQE-CNN and VVC+ASQE-CNN, respectively. ASQE-CNN estimates fine details for ≈45% of the pixels and provides the same result for ≈25% of the pixels.

Ablation Study
The ASQE-CNN architecture is designed based on the following three main concepts: (a) the attention mechanism; (b) the weight sharing approach; and (c) the proposed MFF modules. Here, we study how important it is to integrate each one of these concepts into ASQE-CNN. The first version is designed without the use of the attention mechanism, where all the MHA blocks are removed. This version is called noAttention. The second version is designed without the use of the weight sharing approach. Therefore, the ASB, LFF, and HFF blocks are designed using two convolutional layers with separately trained weights instead of a single convolutional layer employed twice inside the block. This version is called noWeighSharing. The third version is designed without the use of the MFF modules, i.e., by following the classical multiresolution patch processing design (U-Net). In this case, the following branches in the proposed ASQE-CNN architecture are removed: (1) the half- and quarter-resolution branches which extract feature maps from the input patch in the first part of the architecture; and (2) the half- and quarter-resolution branches which provide the extra refinement in the last part of the architecture. This version is called noMFF (or U-Net). Figure 13 presents the average performance results over DIV2K_valid_HR [52], while Table 8 presents the average Bjøntegaard metrics and the runtime using a batch size (bs) of 100 input patches. All three versions are affected by a performance drop compared with the proposed architecture.
The following conclusions can be drawn: (1) the attention mechanism has the highest influence on the network design, as the proposed architecture achieves around 12% BD-rate savings compared with noAttention; (2) although noWeighSharing is more complex (contains more parameters), it performs worse, as the proposed architecture achieves around 11.45% BD-rate savings compared with it; (3) the classical multiresolution patch processing design (U-Net) is improved by the proposed MFF modules, as the proposed architecture achieves around 11.54% BD-rate savings compared with noMFF.

Conclusions
The paper introduces a novel CNN-based post-filtering method for enhancing the quality of lossy images encoded using traditional codecs and of video sequences encoded using recent video coding standards. A novel deep neural network architecture, ASQE-CNN, is proposed to estimate the refinement details. ASQE-CNN incorporates in its design three concepts: the attention mechanism, implemented using a multi-head attention block; the weight sharing concept, where a single convolutional layer is employed twice in a specific block; and the proposed MFF blocks, LFF and HFF, designed to fuse current-resolution feature maps with lower-/higher-resolution feature maps. It operates at three resolutions and uses new network branches to extract feature maps from the input patch and to estimate refinement details at each resolution. Experimental results demonstrate substantial average BD-rate and BD-PSNR improvements over traditional image and video codecs. The paper demonstrates the potential of CNN-based post-filtering methods for widely used codecs.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results. All authors read and approved the final manuscript.

Abbreviations
The following abbreviations are used in this manuscript: