Article

Memory-Efficient Discrete Cosine Transform Domain Weight Modulation Transformer for Arbitrary-Scale Super-Resolution

Department of Artificial Intelligence Convergence, Chonnam National University, Gwangju 61186, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(18), 3954; https://doi.org/10.3390/math11183954
Submission received: 17 August 2023 / Revised: 15 September 2023 / Accepted: 16 September 2023 / Published: 18 September 2023

Abstract
Recently, several arbitrary-scale models have been proposed for single-image super-resolution, and their importance has been emphasized for applications such as satellite image processing, high-resolution (HR) displays, and video-based surveillance. However, existing approaches require retraining the baseline integer-scale model to fit their networks, and their training is slow. This paper proposes a network that solves these problems by restoring the high-frequency information lost at the remaining arbitrary scale while keeping the baseline integer-scale model unchanged. The proposed network extends an integer-scaled image to an arbitrary-scale target in the discrete cosine transform (DCT) spectral domain. We also modulate the high-frequency restoration weights of the depthwise multi-head attention to use memory efficiently. Finally, we demonstrate the performance of the proposed network against existing state-of-the-art models in terms of peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) scores, as well as its flexibility through integration with existing integer-scale models. The results show that the proposed network restores HR images appropriately by improving the sharpness of low-resolution (LR) images.

1. Introduction

We address arbitrary-scale super-resolution (SR), which upsamples images by decimal (floating-point) scale factors in single-image SR (SISR). SISR, which aims to recover high-resolution (HR) images from low-resolution (LR) counterparts, is an active research topic in computer vision with immense potential in applications such as video games, satellite imagery, medical imaging, surveillance, monitoring, video enhancement, and security. SISR is also recognized as a challenging task due to its ill-posed nature, and various methods have been proposed [1,2,3,4]. These SISR methods, referred to as integer-scale SR [1,2,3,4], learn the characteristics of the LR image and upsample it with a fixed upsampling layer. Most integer-scale SISR methods consist of a deep neural network (DNN) and an upsampling layer based on pixel shuffling. The limitation of these pixel-shuffling modules is that they cannot generate SR images at noninteger scales. Therefore, many approaches require a separate DNN model for each upsampling layer, usually restricted to a limited set of integers (e.g., ×2, ×3, or ×4). DNN models trained with fixed integer-scale upsampling layers cannot perform arbitrary-scale SR because they operate only at their fixed integer scale. Furthermore, separate DNN models trained in this way ignore the correlation among SR tasks at different scales, leading to discontinuous representations and limited performance. In addition, the memory cost of storing a model for each integer-scale factor across the wide range of scales encountered in real-world scenarios makes selecting and deploying SISR models problematic in practice. These shortcomings limit their applicability and flexibility in real-world scenarios. To address these limitations, arbitrary-scale SR has emerged and received considerable attention. We note that arbitrary-scale denotes a floating-point scale.
For example, in display applications, arbitrary-scale SR can fit an HR image to an input image of any size. Moreover, arbitrarily zooming in on an image by rolling a mouse wheel is a common requirement, and arbitrary-scale SR can be used to identify the details of an object in a satellite image, as shown in Figure 1. Examples like these demonstrate that arbitrary-scale SR is essential. In addition to CiaoSR [5], a pioneering work in arbitrary-scale SR, various methods [6,7,8] have been proposed. However, these methods suffer from long training times because they must be trained on LR images at multiple scales, and from memory problems caused by generating several LR images. To solve these problems, this paper proposes a network that performs arbitrary-scale SR by training in the discrete cosine transform (DCT) [9] domain at the minimum and maximum scales using conventional integer-scale weights. Unlike existing methods, our proposed network trains directly in the DCT domain, which allows for better high-frequency reconstruction. The proposed network extends an integer-scaled image to an arbitrary-scale target in the DCT domain. It also addresses the limitation that existing integer-scale SR models operate only at a fixed integer scale. When the image is extended, high-frequency components become scarce as the arbitrary scale increases. Thus, we use depthwise multi-head attention and a depthwise feed-forward network, which learn directly from the DCT domain, to restore these components. The advantages include convergence over fewer training epochs and sparser weight matrices more conducive to reduced computation [10]. We also adjust the high-frequency restoration weights of the depthwise multi-head attention with a coefficient for each scale so that a single model can efficiently handle arbitrary-scale SR. This paper demonstrates the performance of the proposed network through experiments against existing state-of-the-art models. We also demonstrate the network's flexibility through experiments integrating it with existing integer-scale models. The contributions are summarized as follows:
  • We propose m-DCTformer, a transformer network structure with direct training in the DCT domain for arbitrary-scale SR. Unlike traditional arbitrary-scale models, training from DCT components can improve reconstruction performance by focusing on high-frequency information and converge over fewer training epochs and computations.
  • The m-DCTformer inserts a weight-modulation layer into the network trained at the minimum scale to modulate the existing weights up to the maximum scale. The weights handle arbitrary-scales by modulating the amount depending on the coefficient value, solving computational and memory problems that traditional arbitrary-scale SR models have when training contiguous LR images.
  • The m-DCTformer demonstrates its flexibility through integration with the existing integer-scale SR model, and its applicability in real-world scenarios is verified through experiments.

2. Related Work

2.1. Single Image Super-Resolution

The SISR technique [11,12,13,14,15] is well known in computer vision and aims to generate an HR image from a single LR counterpart. Early deep learning models, such as the SR convolutional neural network (SRCNN) [16] and fast SRCNN [17], use shallow architectures to learn mappings from LR to HR images. Very deep SR [18] and enhanced deep SR (EDSR) [19] further increase model depth using residual connections, allowing them to learn more complex mappings and improve reconstruction quality. The efficient sub-pixel CNN [20] introduced an approach that learns an upsampling filter array in LR space, extracting feature maps from LR images and upsampling them to produce the HR output. In addition, the residual dense network [21] uses a residually dense structure to learn hierarchical representations from all feature maps. Moreover, the second-order attention network [22] employs feature maps of input images to model relationships between neighboring pixels and dynamically highlights important features. SRFlow [23], a super-resolution method using normalizing flow, models the conditional probability distribution between HR and LR images, facilitating accurate transformation. SwinIR [24] integrates the Swin transformer, a hierarchical transformer whose representations are computed using shifted windows, into image restoration tasks, including SISR, and displays remarkable performance in various SR tasks. However, these methods share a fundamental limitation: they are specifically trained and optimized for certain integer scales, rendering them ineffective in handling noninteger or arbitrary-scale SR scenarios. This research addresses this limitation by introducing a weight-modulation mechanism tailored to the DCT domain that can effectively adjust the weights for high-frequency restoration according to the desired scale. Compared to these integer-scale SISR methods, our method handles arbitrary-scale SR, making it more applicable to real-world scenarios such as the one in Figure 1. Furthermore, the modulation is performed by a scale-dependent coefficient, allowing the model to accommodate integer and noninteger scales and providing a more comprehensive solution for arbitrary-scale SR. This advance overcomes the scale limitation of SwinIR, increasing the flexibility and applicability of SR models.

2.2. Arbitrary Scale Super-Resolution

Arbitrary-scale SISR aims to enhance flexibility by accommodating integer- and noninteger-scale factors, overcoming the shortcomings of conventional SISR. Meta-SR [25] advances SR models by facilitating arbitrary-scale SR using a meta-upscale module. In addition, SRWarp [26] introduces a blend of warping and SR techniques, delivering an adaptive warping layer for resampling kernel prediction and multiscale blending for richer information extraction from the input. However, in SRWarp, replacing the upscale module of existing integer-scale SR models led to a drop in performance. ArbSR [27] employs a plug-in module with dynamic scale-aware filters, offering effective management of various scale factors but struggling with integer-scale factors. In addition, LTE [28] emphasizes high-frequency details for arbitrary-scale SISR by estimating dominant frequencies and Fourier coefficients but tends to favor learning low-frequency components, which might prevent it from capturing minute high-frequency details. Further, CiaoSR introduced a continuous implicit attention-in-attention network, promoting the adaptive aggregation of local features, but its reliance on attention mechanisms might not be universally effective across all SR scenarios. This paper proposes a weight-modulation mechanism in the DCT domain to address these challenges. Because the proposed mechanism is trained directly on DCT coefficients, it can better restore high-frequency details. Moreover, our weight modulation addresses model capacity limitations compared to traditional arbitrary-scale SR models. The model aims to offer a memory-efficient solution capable of delivering superior performance for integer- and noninteger-scale factors without sacrificing the quality of the SR images.

3. Method

This section describes the proposed m-DCTformer framework. In Figure 2, the m-DCTformer applies integer-scale SR to an LR image and proceeds with the SR to the target decimal scale. First, we present an overview of the proposed framework, followed by the detailed implementation of modules.

3.1. Overview

The main goal is to proceed with arbitrary-scale SR up to the target scale based on the results of integer-scale SR. Figure 2 illustrates a comprehensive block diagram of the proposed m-DCTformer. Given an LR image as input, we proceed with an integer-scale SR using the appropriate weights for each scale factor (×2, ×3 and ×4). The result of the integer-scale SR is input into a depthwise two-dimensional (2D) DCT and expanded using zero padding by the target decimal scale. The expanded DCT domain is divided into high and low frequencies. In this process, the high-frequency components are lost due to the expansion. The process to restore these components is described next.
First, the high-frequency DCT components are embedded into low-level features using a 3 × 3 convolution. The embedded low-level features are input into weight-modulation transformer blocks arranged as an encoder-decoder. Second, the input low-level features are subjected to downsampling and upsampling; the downsampling and upsampling operations are pixel unshuffle and pixel shuffle, respectively. In addition, skip connections are used to restore the high-frequency components. The downsampling and upsampling processes with skip connections preserve the fine structural characteristics of the restored high-frequency DCT detail components. Furthermore, the original high-frequency DCT components are added to the restored high-frequency components produced by the last 3 × 3 convolution layer. Finally, the restored high-frequency component is combined with the low-frequency component and transformed into the spatial domain using the depthwise 2D inverse DCT (IDCT). The integer-scale SR model is frozen, and the network is trained with an L1 loss between the transformed spatial-domain SR image and the ground-truth (GT) image. In summary, the main points of the m-DCTformer are as follows:
First, using the results of integer-scale SR, the m-DCTformer extracts the high-frequency components using the depthwise 2D DCT. This process is described in detail in Section 3.3.1. Second, the extracted high-frequency components are processed with depthwise multi-head attention and a depthwise feed-forward network to restore the lost high-frequency components. This process is described in detail in Section 3.3.2. Third, weight modulation modulates the weights trained at the maximum scale factor to the target arbitrary scale. This process is described in detail in Section 3.3.3.
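For concreteness, the following PyTorch sketch illustrates the encoder-decoder layout described above: a 3 × 3 embedding, pixel-unshuffle downsampling, pixel-shuffle upsampling, a skip connection, and a residual output over the high-frequency DCT input. The module names, channel widths, and the stand-in block are illustrative assumptions rather than the authors' implementation; in the actual m-DCTformer, the stand-in block corresponds to the weight-modulation transformer block of Section 3.3.

```python
import torch
import torch.nn as nn

class StandInBlock(nn.Module):
    """Placeholder for the weight-modulation transformer block of Section 3.3."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.body(x)

class HighFreqEncoderDecoder(nn.Module):
    """Illustrative U-shaped restorer over the high-frequency DCT components."""
    def __init__(self, in_ch=3, width=48):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, width, 3, padding=1)            # low-level feature embedding
        self.enc = StandInBlock(width)
        self.down = nn.Sequential(nn.PixelUnshuffle(2),               # H/2 x W/2, 4*width channels
                                  nn.Conv2d(4 * width, 2 * width, 1))
        self.mid = StandInBlock(2 * width)
        self.up = nn.Sequential(nn.Conv2d(2 * width, 4 * width, 1),
                                nn.PixelShuffle(2))                   # back to H x W, width channels
        self.dec = StandInBlock(width)
        self.out = nn.Conv2d(width, in_ch, 3, padding=1)              # last 3 x 3 convolution

    def forward(self, hf_dct):
        x0 = self.embed(hf_dct)
        e = self.enc(x0)
        m = self.mid(self.down(e))
        d = self.dec(self.up(m) + e)     # skip connection preserves fine structure
        return hf_dct + self.out(d)      # add the original high-frequency DCT components
```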

3.2. Depthwise 2D DCT

A spatial domain image can be transformed into a spectral domain image. This paper uses the DCT, which decomposes the image into a cosine function and produces only real values for the spectral representation. A discrete image of size N × M input in the 2D spatial domain can be represented by a DCT in the frequency domain as follows:
$$F(u,v) = \alpha(u)\,\beta(v) \sum_{x=0}^{N-1} \sum_{y=0}^{M-1} f(x,y)\,\gamma(x,y,u,v), \tag{1}$$
$$\gamma(x,y,u,v) = \cos\!\left(\frac{\pi(2x+1)u}{2N}\right) \cos\!\left(\frac{\pi(2y+1)v}{2M}\right), \tag{2}$$
$$\alpha(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u \neq 0, \end{cases} \tag{3}$$
$$\beta(v) = \begin{cases} \sqrt{1/M}, & v = 0 \\ \sqrt{2/M}, & v \neq 0, \end{cases} \tag{4}$$
$$f(x,y) = \sum_{u=0}^{N-1} \sum_{v=0}^{M-1} \alpha(u)\,\beta(v)\,F(u,v)\,\gamma(x,y,u,v). \tag{5}$$
In Equation (1), $f(x,y)$ is the pixel value at position $(x,y)$ in the input image, and $F(u,v)$ represents the DCT coefficient at position $(u,v)$; this corresponds to the depthwise 2D DCT in Figure 2. Equations (2)–(4) define the cosine basis function and the normalization constants. Conversely, an image transformed into the frequency domain can be transformed back into the spatial domain using the 2D IDCT, as presented in Equation (5). This process is depicted as the depthwise 2D IDCT in Figure 2. The high-frequency mask divides the transformed DCT into high and low frequencies. The mask M is expressed as follows:
$$M(u,v) = \begin{cases} 0, & D(u,v) \le d \\ 1, & \text{otherwise}, \end{cases} \tag{6}$$
where D denotes the zig-zag scanning index in Figure 2, and d denotes the parameter used to extract the high-frequency components. The high-frequency DCT components are extended from the integer-scale factor to the target arbitrary-scale factor; thus, high-frequency content is lacking. The energy conservation coefficient restores the image brightness by multiplying the coefficients when the image is expanded in the DCT domain. However, high-frequency components are still lacking. To solve this, we restore the missing high-frequency components of the DCT domain with several weight-modulation transformer blocks.
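As a reference for Equations (1)–(6), the following NumPy/SciPy sketch applies an orthonormal 2D DCT per channel, zero-pads the spectrum to the target arbitrary-scale size with an assumed square-root-of-area energy-conservation coefficient, and builds the high-frequency mask; the zig-zag index $D(u,v)$ is approximated here by the anti-diagonal order $u+v$, which is an assumption for illustration.

```python
import numpy as np
from scipy.fft import dct, idct  # type-II DCT and its inverse with orthonormal scaling

def depthwise_dct2(x):
    """Orthonormal 2D DCT applied independently to each channel of an (H, W, C) array
    (Equations (1)-(4))."""
    return dct(dct(x, axis=0, norm='ortho'), axis=1, norm='ortho')

def depthwise_idct2(X):
    """Per-channel 2D inverse DCT (Equation (5))."""
    return idct(idct(X, axis=0, norm='ortho'), axis=1, norm='ortho')

def expand_dct(X, out_h, out_w):
    """Zero-pad the spectrum to the target arbitrary-scale size and apply an
    energy-conservation coefficient (assumed here to be the square root of the
    area ratio, which preserves the mean brightness carried by the DC term)."""
    h, w, c = X.shape
    out = np.zeros((out_h, out_w, c), dtype=X.dtype)
    out[:h, :w] = X
    return out * np.sqrt((out_h * out_w) / (h * w))

def high_freq_mask(h, w, d):
    """Mask of Equation (6): 0 where the zig-zag index D(u, v) <= d, 1 otherwise.
    The zig-zag index is approximated by the anti-diagonal order u + v."""
    u, v = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    return (u + v > d).astype(np.float32)
```

For a ×2.5 target, for instance, an H × W spectrum would be padded to roughly 2.5H × 2.5W, masked (e.g., with d = 20 as in Section 4.1), restored, recombined with the low-frequency part, and transformed back with the depthwise 2D IDCT.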

3.3. Weight Modulation Transformer

3.3.1. Depthwise Multi-Head Attention

We note that the depthwise multi-head attention receives as input the high-frequency components obtained from the depthwise 2D DCT, and the query (Q), key (K), and value (V) projections are extracted from these components using depthwise convolution. Depthwise multi-head attention applies self-attention [29] across channels, repeated h times independently according to the hyperparameter h, as depicted in Figure 3. Another key point is the use of depthwise convolutions to emphasize the local context. From the layer-normalized input tensor, the query and key projections are generated and enriched with local context: a 1 × 1 convolution aggregates the pixelwise cross-channel context, and a 3 × 3 depthwise convolution encodes the channelwise spatial context. The convolutional layers of the depthwise multi-head attention are bias-free. Next, we reshape the query and key projections such that their dot-product interaction generates a transposed attention map A. The depthwise multi-head attention can be expressed as follows:
$$F_a = \mathrm{Attention}(Q, K, V) + F_0, \tag{7}$$
$$\mathrm{Attention}(Q, K, V) = V \cdot \mathrm{softmax}\!\left(K \cdot Q / \epsilon\right), \tag{8}$$
where $F_0$ and $F_a$ are the input and output feature maps, respectively, and $\cdot$ denotes the dot product. In Figure 3, the first weight-modulation transformer block receives as input the feature $F_0$ extracted through a 3 × 3 convolution layer. The Q, K, and V matrices are obtained after the input tensor is reshaped. The learnable $\epsilon$ controls the magnitude of the dot product of K and Q before the softmax function is applied. After applying the softmax to the attention map obtained from Q and K, we multiply it by V and add the result to $F_0$ to obtain $F_a$. Because depthwise convolution is involved in obtaining Q, K, and V, we obtain an attention map that is more relevant to the high-frequency components lost during the training process. $F_a$ is then restored by the depthwise feed-forward network.
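The following PyTorch sketch shows one way to realize the depthwise multi-head attention of Equations (7)–(8): bias-free 1 × 1 and 3 × 3 depthwise convolutions produce Q, K, and V, the channel-wise (transposed) attention map A is computed per head, and a learnable $\epsilon$ scales the dot product before the softmax. The output 1 × 1 projection and the head layout are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DepthwiseMultiHeadAttention(nn.Module):
    """Channel-wise (transposed) self-attention with depthwise convolutions,
    following Equations (7)-(8); a sketch, not the authors' exact implementation."""
    def __init__(self, channels, heads):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.eps = nn.Parameter(torch.ones(heads, 1, 1))   # learnable epsilon (dot-product scale)
        self.qkv = nn.Conv2d(channels, channels * 3, 1, bias=False)            # pixel-wise cross-channel context
        self.qkv_dw = nn.Conv2d(channels * 3, channels * 3, 3, padding=1,
                                groups=channels * 3, bias=False)               # channel-wise spatial context
        self.project = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, f0):
        b, c, h, w = f0.shape
        q, k, v = self.qkv_dw(self.qkv(f0)).chunk(3, dim=1)
        # Reshape so attention is computed across channels (transposed attention map A).
        q = q.reshape(b, self.heads, c // self.heads, h * w)
        k = k.reshape(b, self.heads, c // self.heads, h * w)
        v = v.reshape(b, self.heads, c // self.heads, h * w)
        attn = torch.softmax((q @ k.transpose(-2, -1)) / self.eps, dim=-1)     # softmax of the Q-K dot product
        out = (attn @ v).reshape(b, c, h, w)                                   # V weighted by the attention map
        return f0 + self.project(out)                                          # F_a = Attention(Q, K, V) + F_0
```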

3.3.2. Depthwise Feed-Forward Network

The regular feed-forward network operates on each pixel location separately and identically. However, we adopted the depthwise feed-forward network in Figure 3. As in the regular feed-forward network, two 1 × 1 convolutions are performed. The input layer 1 × 1 convolution expands the feature channels, and the output layer 1 × 1 convolution reduces them back to the original input dimension. Then, we extract the features using a 3 × 3 depthwise convolution, allowing the extraction of spatial information from neighboring pixel positions, which can be used to learn local features for effective reconstruction. A gate mechanism forms an elementwise multiplication of two parallel paths output from the depthwise convolution. Prior to this multiplication, we applied the Gaussian error linear unit (GELU) [30] activation function to one of the parallel paths, introducing nonlinearity into the model. This activation function allows the model to learn and adapt to complex data patterns. The gate mechanism contributes to the robustness of the model, enhancing its ability to adapt to different image patterns. Given an attentional tensor F a , a depthwise feed-forward network can be expressed as follows:
$$F_r = F_a + \mathrm{Gate}(F_c), \tag{9}$$
$$F_c = \mathrm{Conv}_{1\times 1}(F_a), \tag{10}$$
$$\mathrm{Gate}(F_c) = \mathrm{Conv}_{1\times 1}\!\big(\sigma(\mathrm{Conv}_d(F_c)) \odot \mathrm{Conv}_d(F_c)\big), \tag{11}$$
where ⊙ denotes elementwise multiplication, σ denotes the GELU activation function, $\mathrm{Conv}_{1\times 1}$ represents the 1 × 1 convolution layer, and $\mathrm{Conv}_d$ indicates the depthwise convolution layer. The depthwise feed-forward network enables effective restoration of the high-frequency components through depthwise convolution. By incorporating the 3 × 3 depthwise convolution and the GELU activation function into a gate mechanism, the depthwise feed-forward network provides a more refined and context-aware approach to feature transformation. Furthermore, $F_r$, the output of Equation (9), is the recovered high-frequency component. In this case, the number of dimensions is the product of the hyperparameter h and the dimension of $F_0$, the input feature of the depthwise multi-head attention. This process leads to improved representational learning capabilities and, consequently, better performance in tasks such as high-frequency restoration.
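A corresponding sketch of the gated depthwise feed-forward network of Equations (9)–(11) is given below; the channel-expansion factor of the hidden layer is an assumed hyperparameter.

```python
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseFeedForward(nn.Module):
    """Gated depthwise feed-forward network of Equations (9)-(11); the expansion
    factor of the hidden channels is an assumed hyperparameter."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Conv2d(channels, hidden * 2, 1, bias=False)   # Conv_1x1: F_c with two parallel paths
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, 3, padding=1,
                                groups=hidden * 2, bias=False)         # Conv_d: local spatial context
        self.project = nn.Conv2d(hidden, channels, 1, bias=False)      # Conv_1x1 back to the input width

    def forward(self, f_a):
        x1, x2 = self.dwconv(self.expand(f_a)).chunk(2, dim=1)
        gated = F.gelu(x1) * x2              # GELU on one path, element-wise gate (Equation (11))
        return f_a + self.project(gated)     # F_r = F_a + Gate(F_c)  (Equation (9))
```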

3.3.3. Weight Modulation

The weights of Q, K, and V trained at the maximum scale, $w(s_{x,\max}, s_{y,\max})$, can be modulated using the weight-modulation layers and appropriate coefficients $\lambda_x$ and $\lambda_y$, which are the scale coefficients of the horizontal modulation filter $m_x$ and the vertical modulation filter $m_y$, respectively. The weight modulation can be formulated as follows:
$$w(s_x, s_y) = w(s_{x,\max}, s_{y,\max}) * \lambda_x m_x * \lambda_y m_y, \tag{12}$$
where * denotes a convolution operation. Each weight starts at the maximum scale factor $(s_{x,\max}, s_{y,\max})$ and is modulated to the target arbitrary-scale factor $(s_x, s_y)$. Figure 4 depicts the process of the weight-modulation layer. We design the modulation layers separately as 3 × 1 and 1 × 3 convolution layers to process the horizontal and vertical directions, respectively. Each layer is trained independently at the minimum arbitrary-scale factors $s_{x,\min}$ and $s_{y,\min}$. Therefore, we modulate the weights w with the coefficients $\lambda_x$ and $\lambda_y$ according to the target arbitrary-scale factor. The modulated feature f and the coefficients $\lambda_x$ and $\lambda_y$ used for weight modulation can be expressed as follows:
$$f_y = (f * w) * \lambda_y w_y = f * (w * \lambda_y w_y), \tag{13}$$
$$f_x = (f * w) * \lambda_x w_x = f * (w * \lambda_x w_x), \tag{14}$$
$$\lambda_x = \frac{s_{x,\max} - s_x}{s_{x,\max} - s_{x,\min}}, \tag{15}$$
$$\lambda_y = \frac{s_{y,\max} - s_y}{s_{y,\max} - s_{y,\min}}. \tag{16}$$
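A literal sketch of Equations (12)–(16) is given below: the projection weights trained at the maximum scale are convolved with 1 × 3 and 3 × 1 modulation filters scaled by $\lambda_x$ and $\lambda_y$. The identity-plus-residual application (so that $\lambda = 0$ leaves the maximum-scale weights unchanged) and the default (2.1, 4.9) scale range, taken from the training scales reported in Section 4.3, are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def modulation_coefficients(s_x, s_y, s_min=2.1, s_max=4.9):
    """Scale-dependent coefficients of Equations (15)-(16); the (2.1, 4.9) range
    follows the minimum and maximum training scales reported in Section 4.3."""
    lam_x = (s_max - s_x) / (s_max - s_min)
    lam_y = (s_max - s_y) / (s_max - s_min)
    return lam_x, lam_y

def modulate_weight(w_max, m_x, m_y, lam_x, lam_y):
    """Modulate a projection weight trained at the maximum scale toward the target
    scale (Equation (12)). w_max: (out_ch, in_ch, kh, kw); m_x: (1, 1, 1, 3) horizontal
    filter; m_y: (1, 1, 3, 1) vertical filter. The identity-plus-residual form is an
    assumption so that lambda = 0 returns the unmodified maximum-scale weights."""
    out_ch, in_ch, kh, kw = w_max.shape
    w = w_max.reshape(out_ch * in_ch, 1, kh, kw)        # treat each 2D kernel as one map
    w = w + lam_x * F.conv2d(w, m_x, padding=(0, 1))    # horizontal 1 x 3 modulation
    w = w + lam_y * F.conv2d(w, m_y, padding=(1, 0))    # vertical 3 x 1 modulation
    return w.reshape(out_ch, in_ch, kh, kw)

# Hypothetical usage: modulate a 3 x 3 projection weight toward a x3.3 target scale.
lam_x, lam_y = modulation_coefficients(3.3, 3.3)
w = modulate_weight(torch.randn(48, 48, 3, 3),
                    torch.randn(1, 1, 1, 3), torch.randn(1, 1, 3, 1), lam_x, lam_y)
```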

4. Experiment

This section presents the experimental results and discusses their implications. We introduce the experimental setup and the datasets used to train and evaluate the m-DCTformer. Finally, we analyze the results of the experiments comparing the m-DCTformer with other models.

4.1. Experimental Setup

During training, we input a batch of LR images into the framework, following the previous work LTE [28]. The LR images were cropped into 64 × 64 patches, which were augmented by random horizontal flips, vertical flips, and 90° rotations. We set the batch size to 4 and used the Adam optimizer [31] with the L1 loss instead of the mean squared error (MSE or L2) loss. The m-DCTformer has six weight-modulation transformer blocks, whose dimensions are 48, 96, 192, 96, 48, and 48. We train the m-DCTformer for 1000 epochs, initializing the learning rate to 1 × 10⁻⁴ and decaying it by a factor of 0.5 at epochs 200, 400, 600, and 800. We set the hyperparameter d to 20 and the h values to [4, 6, 6]. Moreover, we set the weighting for the gradient moving average to 0.9 and the weighting for the squared-gradient moving average to 0.999 in the Adam optimizer. Furthermore, we used an NVIDIA RTX 3090 24 GB GPU for training. The coefficient of each weight modulation used in testing was calculated from the ratio between the maximum and minimum arbitrary scales. We used the classical method EDSR and the state-of-the-art hybrid attention transformer (HAT) [32] for integer-scale SR.
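The configuration above corresponds to the following PyTorch training sketch; the model and data loader are stand-ins, since the construction of the m-DCTformer itself is described in Section 3.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

# Stand-ins for the m-DCTformer and the DIV2K patch loader (assumptions for illustration).
model = torch.nn.Conv2d(3, 3, 3, padding=1)
train_loader = [(torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64))]  # batch size 4, 64x64 patches

optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))      # moving-average weights 0.9 / 0.999
scheduler = MultiStepLR(optimizer, milestones=[200, 400, 600, 800], gamma=0.5)
criterion = torch.nn.L1Loss()                                          # L1 loss instead of MSE/L2

for epoch in range(1000):
    for lr_patch, gt_patch in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(lr_patch), gt_patch)
        loss.backward()
        optimizer.step()
    scheduler.step()
```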

4.2. Dataset

We used the DIV2K dataset [33] to train CiaoSR, LTE, and our m-DCTformer. It consists of 1000 images at 2K resolution and provides low-resolution counterparts at downsampling scales of ×2, ×3, and ×4 generated by bicubic interpolation. We evaluate the performance on the validation sets of Set5 [34], Set14 [35], Urban100 [36], and the real-world SR dataset [37] in terms of the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM).
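The PSNR and SSIM values can be computed, for example, with scikit-image (assuming a recent version that supports channel_axis); whether the metrics are computed on RGB or on the luminance channel is not specified here, so the sketch below uses RGB.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(sr, gt):
    """PSNR (dB) and SSIM for one 8-bit RGB SR/GT pair of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, sr, data_range=255)
    ssim = structural_similarity(gt, sr, data_range=255, channel_axis=-1)
    return psnr, ssim

# Toy check with a slightly perturbed copy of a random ground-truth image.
gt = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
sr = np.clip(gt.astype(int) + np.random.randint(-5, 6, gt.shape), 0, 255).astype(np.uint8)
print(evaluate_pair(sr, gt))
```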

4.3. Quantitative and Qualitative Results on the Set5, Set14, and Urban100 Datasets

This section compares the performance of the proposed m-DCTformer with the state-of-the-art arbitrary-scale SR methods CiaoSR and LTE. Each arbitrary-scale method is trained on the bicubic LR DIV2K dataset. The proposed model is trained only with LR images at scale factors of ×2.1 and ×4.9, whereas the other models use scale factors of ×2, ×3, and ×4. We evaluate the models quantitatively using the PSNR and SSIM. Table 1, Table 2 and Table 3 show the quantitative results of the other approaches and our m-DCTformer using EDSR and HAT for integer SR on the Set5, Set14, and Urban100 datasets, respectively. Table 1 shows that our EDSR + m-DCTformer outperforms EDSR + CiaoSR by an average of 0.23 dB in PSNR and 0.0462 in SSIM and outperforms EDSR + LTE by an average of 0.17 dB and 0.0045, respectively, while our HAT + m-DCTformer outperforms HAT + CiaoSR by an average of 0.71 dB and 0.0116 and outperforms HAT + LTE by an average of 0.43 dB and 0.0076 in terms of PSNR and SSIM, respectively. Table 2 shows that our EDSR + m-DCTformer outperforms EDSR + CiaoSR by an average of 0.15 dB and 0.0449 and outperforms EDSR + LTE by an average of 0.1 dB and 0.0037, and our HAT + m-DCTformer outperforms HAT + CiaoSR by an average of 0.24 dB and 0.0043 and outperforms HAT + LTE by an average of 0.51 dB and 0.0098 in terms of PSNR and SSIM, respectively. Table 3 shows that our EDSR + m-DCTformer outperforms EDSR + CiaoSR by an average of 0.64 dB and 0.0055 and outperforms EDSR + LTE by an average of 0.22 dB and 0.0061, and our HAT + m-DCTformer outperforms HAT + CiaoSR by an average of 0.72 dB and 0.0048 and outperforms HAT + LTE by an average of 1.02 dB and 0.0078 in terms of PSNR and SSIM, respectively. Compared to the other methods, the model demonstrates high PSNR performance, especially on Urban100. As shown in Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, Figure 16 and Figure 17, the m-DCTformer exhibits the closest results to the GT image compared to the other models. Figure 5 and Figure 6 show the qualitative results of the m-DCTformer and the other models based on EDSR for integer SR, obtained from the Set5 dataset. Figure 7 and Figure 8 show the corresponding results on the Set14 dataset, and Figure 9, Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14 show the results on the Urban100 dataset. Figure 15, Figure 16 and Figure 17 show the qualitative results of the m-DCTformer and the other models based on HAT for integer SR, obtained from the Set14 and Urban100 datasets. In particular, the m-DCTformer recovers handrails, building lines, and other fine structures very well on the Urban100 dataset. This outcome is because the depthwise 2D DCT module converts the image to the DCT domain and extracts the high frequencies to restore the damaged high frequencies, and because the weight modulation is divided into horizontal and vertical directions to modulate the weights of the existing maximum noninteger-scale factor.

4.4. Qualitative Results on Real-World Dataset

This section uses the real-world dataset to provide a qualitative comparison with the other methods. This dataset contains real noisy images with no GT images. Figure 18, Figure 19 and Figure 20 show the qualitative results of the m-DCTformer and the other models based on HAT for integer SR, obtained from the real-world dataset for a scale factor of ×2.9. The figures indicate that the m-DCTformer recovers high-frequency DCT components robustly, even in a noisy real-world environment. Therefore, the m-DCTformer is suitable for real-world scenarios.

4.5. Complexity

Reconstructing high-quality images such as those in DIV2K consumes considerable memory during evaluation. Table 4 compares the m-DCTformer with other arbitrary-scale SR methods in terms of floating-point operations (FLOPs), parameters, model capacity, and time in an NVIDIA RTX 3090 24 GB environment. The FLOPs are computed as the theoretical number of multiply-add operations in the network, and the parameters are the number of parameters in the network. The model capacity denotes the size of the model in megabytes (MB), and the time in Table 4 denotes the inference time for a single image. The output image size is 256 × 256, and the scale factor is ×2.5. The m-DCTformer has the lowest FLOPs and the fastest time because it learns high-frequency DCT components. It also has the smallest memory size because it handles arbitrary scales from the minimum to the maximum with weight modulation. This efficient memory size makes it applicable to real-world scenarios.
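For reference, a parameter count and a 32-bit model-capacity estimate comparable to the "Parameters" and "Model Capacity" columns of Table 4 can be obtained as follows; the authors' exact measurement procedure may differ.

```python
import torch

def model_footprint(model):
    """Parameter count (millions) and model capacity (MB), assuming 32-bit weights."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params / 1e6, n_params * 4 / (1024 ** 2)

print(model_footprint(torch.nn.Conv2d(3, 64, 3)))  # example with a single convolution layer
```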

4.6. Ablation Study

This section presents an experiment investigating the influence of the proposed modules on SR performance in terms of PSNR and SSIM. Table 5 presents the quantitative results of the m-DCTformer on the Set5 dataset for ×2.5, and Table 6 presents the corresponding results on the Urban100 dataset; bold represents the best PSNR and SSIM scores. As presented in Table 5 and Table 6, without training in the DCT domain and without weight modulation, the m-DCTformer achieves a PSNR of 35.50 dB on the Set5 dataset (29.58 dB on Urban100). Adding weight modulation alone increases the PSNR to 35.53 dB. This result suggests that modulating the trained weights is effective for restoration when performing arbitrary-scale SR. Next, when learning in the DCT domain is applied, the PSNR increases significantly, by 0.42 dB. This result demonstrates that learning in the DCT domain robustly restores the lost high-frequency information. Furthermore, combining weight modulation with DCT domain learning increases the PSNR by 0.50 dB in total, suggesting that the m-DCTformer is effective for arbitrary-scale SR by restoring the high-frequency information lost during image degradation and modulating the weights to the arbitrary scale through weight modulation. Finally, we performed an ablation study on d, the parameter in Equation (6). In Table 7, we tested ×3.5 on the Set14 dataset after training with d set to 10, 20, and 30; bold represents the best PSNR and SSIM scores. We observed that d = 20 is the most robust setting for high-frequency DCT component restoration.

5. Discussion

In this section, we discuss the limitations of the proposed model. First, the m-DCTformer relies on an existing integer-scale SR model. Because we adopt an existing integer-scale SR model as the backbone network, our performance depends heavily on the performance of the integer-scale SR. In future research, it may be beneficial to study models that directly perform arbitrary-scale SR without an integer-scale SR model. Second, our model is limited to bicubic degradation: the LR images used for training and testing were generated by bicubic interpolation. These bicubic LR images may not utilize high-level semantic information because bicubic interpolation performs only pixel-level interpolation while ignoring structural features. Therefore, future research may benefit from utilizing a variety of LR degradations instead of being limited to bicubic LR images.

6. Conclusions

This paper proposes the m-DCTformer, a transformer-based network that learns high-frequency DCT components. The proposed m-DCTformer takes an integer-scale SR method as the backbone and processes the remaining arbitrary scale. The depthwise multi-head attention and depthwise feed-forward network of the m-DCTformer are trained at the maximum remaining arbitrary scale in the DCT domain and restore the lost high-frequency components from the high-frequency image given as input; Q, K, and V form an attention map, and the depthwise feed-forward network restores the high-frequency components. In addition, weight modulation is learned at the minimum remaining arbitrary scale to modulate the weights learned at the maximum scale, and the learned weight modulation modulates the weights of Q, K, and V in the depthwise multi-head attention. In conclusion, the m-DCTformer manages memory efficiently, and we demonstrated its performance through flexible combination with existing integer-scale SR models.

Author Contributions

Conceptualization, S.B.Y.; methodology, M.H.K. and S.B.Y.; software, M.H.K.; validation, M.H.K.; formal analysis, M.H.K. and S.B.Y.; investigation, M.H.K. and S.B.Y.; resources, M.H.K. and S.B.Y.; data curation, M.H.K. and S.B.Y.; writing—original draft preparation, M.H.K. and S.B.Y.; writing—review and editing, M.H.K. and S.B.Y.; visualization, S.B.Y.; supervision, S.B.Y.; project administration, S.B.Y.; funding acquisition, S.B.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Industrial Fundamental Technology Development Program (No. 20018699) funded by MOTIE of Korea and the IITP grant funded by the Korea government (MSIT) (No. 2021-0-02068, RS-2023-00256629).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, Y.; Huang, Y.; Wang, K.; Qi, G.; Zhu, J. Single image super-resolution reconstruction with preservation of structure and texture details. Mathematics 2023, 11, 216. [Google Scholar] [CrossRef]
  2. Cha, Z.; Xu, D.; Tang, Y.; Jiang, Z. Meta-Learning for Zero-Shot Remote Sensing Image Super-Resolution. Mathematics 2023, 11, 1653. [Google Scholar] [CrossRef]
  3. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 184–199. [Google Scholar]
  4. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  5. Cao, J.; Wang, Q.; Xian, Y.; Li, Y.; Ni, B.; Pi, Z.; Zhang, K.; Zhang, Y.; Timofte, R.; Van Gool, L. Ciaosr: Continuous implicit attention-in-attention network for arbitrary-scale image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 20–22 June 2023; pp. 1796–1807. [Google Scholar]
  6. Yao, J.E.; Tsao, L.Y.; Lo, Y.C.; Tseng, R.; Chang, C.C.; Lee, C.Y. Local Implicit Normalizing Flow for Arbitrary-Scale Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 20–22 June 2023; pp. 1776–1785. [Google Scholar]
  7. Song, G.; Sun, Q.; Zhang, L.; Su, R.; Shi, J.; He, Y. OPE-SR: Orthogonal Position Encoding for Designing a Parameter-free Upsampling Module in Arbitrary-scale Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 20–22 June 2023; pp. 10009–10020. [Google Scholar]
  8. Yun, J.S.; Yoo, S.B. Single image super-resolution with arbitrary magnification based on high-frequency attention network. Mathematics 2021, 10, 275. [Google Scholar] [CrossRef]
  9. Ahmed, N.; Natarajan, T.; Rao, K.R. Discrete cosine transform. IEEE Trans. Comput. 1974, 100, 90–93. [Google Scholar] [CrossRef]
  10. Ghosh, A.; Chellappa, R. Deep feature extraction in the DCT domain. In Proceedings of the 2016 23rd International Conference on Pattern Recognition, Cancun, Mexico, 4–8 December 2016; pp. 3536–3541. [Google Scholar]
  11. Kim, M.H.; Yun, J.S.; Yoo, S.B. Multiregression spatially variant blur kernel estimation based on inter-kernel consistency. Electron. Lett. 2023, 59, e12805. [Google Scholar] [CrossRef]
  12. Yun, J.S.; Na, Y.; Kim, H.H.; Kim, H.I.; Yoo, S.B. HAZE-Net: High-Frequency Attentive Super-Resolved Gaze Estimation in Low-Resolution Face Images. In Proceedings of the Asian Conference on Computer Vision, Macau SAR, China, 4–8 December 2022; pp. 3361–3378. [Google Scholar]
  13. Yun, J.S.; Yoo, S.B. Kernel-attentive weight modulation memory network for optical blur kernel-aware image super-resolution. Opt. Lett. 2023, 48, 2740–2743. [Google Scholar] [CrossRef] [PubMed]
  14. Na, Y.; Kim, H.H.; Yoo, S.B. Shared knowledge distillation for robust multi-scale super-resolution networks. Electron. Lett. 2022, 58, 502–504. [Google Scholar] [CrossRef]
  15. Lee, S.J.; Yoo, S.B. Super-resolved recognition of license plate characters. Mathematics 2021, 9, 2494. [Google Scholar] [CrossRef]
  16. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  17. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 391–407. [Google Scholar]
  18. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  19. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 22–25 July 2018; pp. 136–144. [Google Scholar]
  20. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  21. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2472–2481. [Google Scholar]
  22. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 11065–11074. [Google Scholar]
  23. Lugmayr, A.; Danelljan, M.; Van Gool, L.; Timofte, R. Srflow: Learning the super-resolution space with normalizing flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part V; Springer International Publishing: Cham, Switzerland, 2020; pp. 715–732. [Google Scholar]
  24. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 19–21 June 2021; pp. 1833–1844. [Google Scholar]
  25. Hu, X.; Mu, H.; Zhang, X.; Wang, Z.; Tan, T.; Sun, J. Meta-SR: A magnification-arbitrary network for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1575–1584. [Google Scholar]
  26. Son, S.; Lee, K.M. SRWarp: Generalized image super-resolution under arbitrary transformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–21 June 2021; pp. 7782–7791. [Google Scholar]
  27. Wang, L.; Wang, Y.; Lin, Z.; Yang, J.; An, W.; Guo, Y. Learning a single network for scale-arbitrary super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 19–21 June 2021; pp. 4801–4810. [Google Scholar]
  28. Lee, J.; Jin, K.H. Local texture estimator for implicit representation function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 1929–1938. [Google Scholar]
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–15. [Google Scholar]
  30. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  31. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  32. Chen, X.; Wang, X.; Zhou, J.; Dong, C. Activating more pixels in image super-resolution transformer. arXiv 2022, arXiv:2205.04437. [Google Scholar]
  33. Agustsson, E.; Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 22–25 July 2017; pp. 126–135. [Google Scholar]
  34. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the 23rd British Machine Vision Conference, Surrey, UK, 3–7 September 2012. [Google Scholar]
  35. Zeyde, R.; Elad, M.; Protter, M. On single image scale-up using sparse-representations. In Proceedings of the Curves and Surfaces: 7th International Conference, Avignon, France, 24–30 June 2010; pp. 711–730. [Google Scholar]
  36. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 5197–5206. [Google Scholar]
  37. Lugmayr, A.; Danelljan, M.; Timofte, R. Ntire 2020 challenge on real-world image super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 16–18 June 2020; pp. 494–495. [Google Scholar]
Figure 1. Example application of arbitrary-scale super-resolution in a satellite image processing task.
Figure 2. Architecture of the m-DCTformer.
Figure 3. Architecture of the weight modulation transformer.
Figure 4. Weight modulation process.
Figure 5. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×4.9 on the Set5 dataset.
Figure 6. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×4.9 on the Set5 dataset.
Figure 7. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×4.9 on the Set14 dataset.
Figure 8. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×4.9 on the Set14 dataset.
Figure 9. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×4.9 on the Urban100 dataset.
Figure 10. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×4.9 on the Urban100 dataset.
Figure 11. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×4.9 on the Urban100 dataset.
Figure 12. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×4.9 on the Urban100 dataset.
Figure 13. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×3.1 on the Urban100 dataset.
Figure 14. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×3.1 on the Urban100 dataset.
Figure 15. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×4.9 on the Set14 dataset.
Figure 16. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×4.9 on the Urban100 dataset.
Figure 17. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×4.9 on the Urban100 dataset.
Figure 18. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×2.9 on the real-world dataset.
Figure 19. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×2.9 on the real-world dataset.
Figure 20. Qualitative comparison of the m-DCTformer with other arbitrary-scale super-resolution methods for a scale factor of ×2.9 on the real-world dataset.
Table 1. Quantitative results with other approaches and our m-DCTformer with EDSR and HAT on the Set5 dataset. Bold represents the best peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) scores.
Dataset: Set5
Method | EDSR [19] + CiaoSR [5] | EDSR [19] + LTE [28] | EDSR [19] + ours
Scale | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM
2.1 | 37.35 | 0.9313 | 37.47 | 0.9559 | 37.51 | 0.9563
2.2 | 36.88 | 0.9265 | 37.05 | 0.9527 | 37.12 | 0.9533
2.3 | 36.58 | 0.9220 | 36.69 | 0.9494 | 36.77 | 0.9498
2.4 | 36.18 | 0.9176 | 36.30 | 0.9458 | 36.36 | 0.9462
2.5 | 35.86 | 0.9131 | 35.93 | 0.9425 | 35.99 | 0.9426
2.6 | 35.54 | 0.9086 | 35.65 | 0.9394 | 35.67 | 0.9393
2.7 | 35.23 | 0.9042 | 35.28 | 0.9362 | 35.31 | 0.9360
2.8 | 34.92 | 0.8996 | 34.97 | 0.9329 | 35.01 | 0.9326
2.9 | 34.60 | 0.8956 | 34.75 | 0.9298 | 34.74 | 0.9293
3.1 | 34.13 | 0.8871 | 34.19 | 0.9228 | 34.40 | 0.9257
3.2 | 33.92 | 0.8834 | 34.03 | 0.9200 | 34.22 | 0.9231
3.3 | 33.65 | 0.8786 | 33.76 | 0.9166 | 33.98 | 0.9201
3.4 | 33.45 | 0.8742 | 33.53 | 0.9131 | 33.72 | 0.9169
3.5 | 33.23 | 0.8708 | 33.30 | 0.9104 | 33.47 | 0.9138
3.6 | 32.95 | 0.8667 | 32.98 | 0.9054 | 33.12 | 0.9090
3.7 | 32.75 | 0.8628 | 32.73 | 0.9018 | 32.93 | 0.9064
3.8 | 32.53 | 0.5857 | 32.58 | 0.8989 | 32.73 | 0.9032
3.9 | 32.35 | 0.8546 | 32.37 | 0.8945 | 32.51 | 0.8993
4.1 | 31.99 | 0.8474 | 31.98 | 0.8875 | 32.17 | 0.8942
4.2 | 31.83 | 0.8440 | 31.84 | 0.8843 | 32.10 | 0.8914
4.3 | 31.62 | 0.8394 | 31.60 | 0.8801 | 31.85 | 0.8878
4.4 | 31.38 | 0.8356 | 31.43 | 0.8769 | 31.71 | 0.8853
4.5 | 31.24 | 0.8328 | 31.23 | 0.8722 | 31.56 | 0.8825
4.6 | 30.98 | 0.8279 | 31.08 | 0.8684 | 31.43 | 0.8797
4.7 | 30.85 | 0.8251 | 30.88 | 0.8645 | 31.21 | 0.8761
4.8 | 30.70 | 0.8221 | 30.75 | 0.8623 | 31.03 | 0.8733
4.9 | 30.53 | 0.8166 | 30.50 | 0.8565 | 30.84 | 0.8690
Method | HAT [32] + CiaoSR [5] | HAT [32] + LTE [28] | HAT [32] + ours
Scale | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM
2.1 | 37.46 | 0.9522 | 37.67 | 0.9567 | 37.94 | 0.9587
2.2 | 36.97 | 0.9472 | 37.27 | 0.9537 | 37.52 | 0.9556
2.3 | 36.67 | 0.9429 | 36.90 | 0.9505 | 37.14 | 0.9522
2.4 | 36.24 | 0.9414 | 36.51 | 0.9471 | 36.71 | 0.9488
2.5 | 35.94 | 0.9401 | 36.17 | 0.9441 | 36.27 | 0.9452
2.6 | 35.60 | 0.9396 | 35.87 | 0.9409 | 35.96 | 0.9419
2.7 | 35.31 | 0.9351 | 35.51 | 0.9377 | 35.55 | 0.9382
2.8 | 35.00 | 0.9305 | 35.26 | 0.9348 | 35.24 | 0.9349
2.9 | 34.71 | 0.9268 | 35.00 | 0.9319 | 35.02 | 0.9319
3.1 | 34.22 | 0.9218 | 34.50 | 0.9255 | 35.00 | 0.9305
3.2 | 33.99 | 0.9200 | 34.32 | 0.9226 | 34.80 | 0.9278
3.3 | 33.72 | 0.9151 | 34.07 | 0.9197 | 34.52 | 0.9250
3.4 | 33.56 | 0.9119 | 33.86 | 0.9164 | 34.31 | 0.9226
3.5 | 33.30 | 0.9084 | 33.60 | 0.9132 | 34.13 | 0.9201
3.6 | 33.01 | 0.9059 | 33.25 | 0.9085 | 33.83 | 0.9162
3.7 | 32.80 | 0.9018 | 33.08 | 0.9055 | 33.57 | 0.9128
3.8 | 32.60 | 0.8984 | 32.89 | 0.9025 | 33.44 | 0.9101
3.9 | 32.39 | 0.8942 | 32.64 | 0.8975 | 33.13 | 0.9065
4.1 | 32.01 | 0.8899 | 32.28 | 0.8919 | 32.92 | 0.9032
4.2 | 31.87 | 0.8843 | 32.22 | 0.8889 | 32.78 | 0.9009
4.3 | 31.71 | 0.8807 | 32.00 | 0.8852 | 32.61 | 0.8977
4.4 | 31.50 | 0.8769 | 31.78 | 0.8812 | 32.50 | 0.8963
4.5 | 31.34 | 0.8726 | 31.62 | 0.8778 | 32.29 | 0.8929
4.6 | 31.10 | 0.8701 | 31.44 | 0.8739 | 32.11 | 0.8894
4.7 | 30.97 | 0.8676 | 31.28 | 0.8703 | 31.99 | 0.8873
4.8 | 30.79 | 0.8659 | 31.10 | 0.8680 | 31.73 | 0.8846
4.9 | 30.67 | 0.8601 | 30.90 | 0.8627 | 31.62 | 0.8819
Table 2. Quantitative results with other approaches and our m-DCTformer with EDSR and HAT on the Set14 dataset. Bold represents the best peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) scores.
Dataset: Set14
Method | EDSR [19] + CiaoSR [5] | EDSR [19] + LTE [28] | EDSR [19] + ours
Scale | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM
2.1 | 32.91 | 0.8801 | 33.19 | 0.9117 | 33.30 | 0.9123
2.2 | 32.52 | 0.8690 | 32.70 | 0.9029 | 32.83 | 0.9036
2.3 | 32.13 | 0.8594 | 32.31 | 0.8955 | 32.41 | 0.8960
2.4 | 31.81 | 0.8510 | 31.99 | 0.8888 | 32.08 | 0.8885
2.5 | 31.50 | 0.8418 | 31.63 | 0.8807 | 31.71 | 0.8803
2.6 | 31.19 | 0.8336 | 31.33 | 0.8727 | 31.40 | 0.8720
2.7 | 30.92 | 0.8252 | 31.05 | 0.8656 | 31.08 | 0.8645
2.8 | 30.73 | 0.8188 | 30.78 | 0.8583 | 30.79 | 0.8569
2.9 | 30.43 | 0.8093 | 30.50 | 0.8511 | 30.51 | 0.8495
3.1 | 30.13 | 0.7958 | 30.12 | 0.8374 | 30.25 | 0.8387
3.2 | 29.88 | 0.7878 | 29.94 | 0.8310 | 30.02 | 0.8323
3.3 | 29.70 | 0.7811 | 29.73 | 0.8239 | 29.83 | 0.8258
3.4 | 29.53 | 0.7742 | 29.54 | 0.8173 | 29.65 | 0.8196
3.5 | 29.37 | 0.7678 | 29.39 | 0.8110 | 29.49 | 0.8134
3.6 | 29.20 | 0.7614 | 29.24 | 0.8058 | 29.33 | 0.8085
3.7 | 29.05 | 0.7550 | 29.03 | 0.7988 | 29.10 | 0.8014
3.8 | 28.89 | 0.7497 | 28.86 | 0.7925 | 28.95 | 0.7956
3.9 | 28.75 | 0.7431 | 28.76 | 0.7879 | 28.83 | 0.7914
4.1 | 28.48 | 0.7309 | 28.48 | 0.7755 | 28.61 | 0.7816
4.2 | 28.35 | 0.7259 | 28.36 | 0.7701 | 28.51 | 0.7771
4.3 | 28.23 | 0.7209 | 28.19 | 0.7637 | 28.41 | 0.7720
4.4 | 28.10 | 0.7154 | 28.10 | 0.7583 | 28.29 | 0.7677
4.5 | 27.96 | 0.7105 | 27.95 | 0.7530 | 28.11 | 0.7621
4.6 | 27.84 | 0.7052 | 27.81 | 0.7474 | 27.93 | 0.7570
4.7 | 27.73 | 0.6998 | 27.72 | 0.7426 | 27.85 | 0.7529
4.8 | 27.60 | 0.6944 | 27.62 | 0.7377 | 27.71 | 0.7480
4.9 | 27.49 | 0.6910 | 27.46 | 0.7309 | 27.59 | 0.7424
Method | HAT [32] + CiaoSR [5] | HAT [32] + LTE [28] | HAT [32] + ours
Scale | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM
2.1 | 33.73 | 0.9179 | 33.57 | 0.9148 | 34.16 | 0.9200
2.2 | 33.27 | 0.9108 | 33.01 | 0.9064 | 33.62 | 0.9114
2.3 | 32.90 | 0.9014 | 32.64 | 0.8995 | 33.21 | 0.9044
2.4 | 32.56 | 0.8956 | 32.30 | 0.8925 | 32.81 | 0.8975
2.5 | 32.22 | 0.8873 | 31.88 | 0.8839 | 32.41 | 0.8894
2.6 | 31.89 | 0.8808 | 31.58 | 0.8760 | 32.03 | 0.8812
2.7 | 31.61 | 0.8729 | 31.26 | 0.8689 | 31.66 | 0.8732
2.8 | 31.31 | 0.8643 | 30.99 | 0.8619 | 31.34 | 0.8655
2.9 | 31.03 | 0.8576 | 30.72 | 0.8551 | 31.04 | 0.8577
3.1 | 30.95 | 0.8490 | 30.30 | 0.8415 | 31.01 | 0.8496
3.2 | 30.38 | 0.8409 | 30.11 | 0.8353 | 30.74 | 0.8437
3.3 | 30.22 | 0.8335 | 29.91 | 0.8283 | 30.52 | 0.8368
3.4 | 30.03 | 0.8282 | 29.75 | 0.8220 | 30.29 | 0.8304
3.5 | 29.84 | 0.8200 | 29.58 | 0.8157 | 30.11 | 0.8243
3.6 | 29.68 | 0.8156 | 29.41 | 0.8104 | 29.94 | 0.8197
3.7 | 29.69 | 0.8088 | 29.21 | 0.8031 | 29.69 | 0.8134
3.8 | 29.35 | 0.8042 | 29.03 | 0.7973 | 29.52 | 0.8074
3.9 | 29.20 | 0.7998 | 28.95 | 0.7929 | 29.41 | 0.8030
4.1 | 28.88 | 0.7901 | 28.70 | 0.7816 | 29.14 | 0.7941
4.2 | 28.73 | 0.7825 | 28.59 | 0.7761 | 29.02 | 0.7888
4.3 | 28.63 | 0.7780 | 28.45 | 0.7702 | 28.87 | 0.7836
4.4 | 28.47 | 0.7725 | 28.37 | 0.7656 | 28.81 | 0.7798
4.5 | 28.33 | 0.7704 | 28.11 | 0.7588 | 28.67 | 0.7748
4.6 | 28.22 | 0.7638 | 27.94 | 0.7534 | 28.51 | 0.7710
4.7 | 28.09 | 0.7595 | 27.86 | 0.7488 | 28.42 | 0.7666
4.8 | 27.96 | 0.7438 | 27.76 | 0.7437 | 28.28 | 0.7624
4.9 | 27.83 | 0.7401 | 27.71 | 0.7386 | 28.17 | 0.7571
Table 3. Quantitative results with other approaches and our m-DCTformer with EDSR and HAT on the Urban100 dataset. Bold represents the best peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) scores.
Dataset: Urban100
Method | EDSR [19] + CiaoSR [5] | EDSR [19] + LTE [28] | EDSR [19] + ours
Scale | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM
2.1 | 30.85 | 0.9196 | 31.70 | 0.9224 | 32.00 | 0.9246
2.2 | 30.25 | 0.9105 | 31.20 | 0.9150 | 31.42 | 0.9166
2.3 | 29.65 | 0.9002 | 30.74 | 0.9076 | 30.87 | 0.9082
2.4 | 29.32 | 0.8948 | 30.33 | 0.9004 | 30.37 | 0.8999
2.5 | 28.95 | 0.8873 | 29.92 | 0.8929 | 29.91 | 0.8917
2.6 | 28.65 | 0.8798 | 29.56 | 0.8856 | 29.47 | 0.8831
2.7 | 28.14 | 0.8649 | 29.21 | 0.8783 | 29.07 | 0.8750
2.8 | 28.06 | 0.8665 | 28.89 | 0.8712 | 28.70 | 0.8667
2.9 | 27.87 | 0.8601 | 28.60 | 0.8643 | 28.38 | 0.8588
3.1 | 27.16 | 0.8460 | 28.07 | 0.8507 | 28.46 | 0.8573
3.2 | 26.94 | 0.8380 | 27.83 | 0.8441 | 28.20 | 0.8508
3.3 | 26.66 | 0.8318 | 27.60 | 0.8372 | 27.93 | 0.8437
3.4 | 26.47 | 0.8242 | 27.38 | 0.8307 | 27.69 | 0.8372
3.5 | 26.24 | 0.8176 | 27.16 | 0.8239 | 27.46 | 0.8304
3.6 | 26.09 | 0.8120 | 26.96 | 0.8173 | 27.24 | 0.8239
3.7 | 25.88 | 0.8041 | 26.76 | 0.8107 | 27.03 | 0.8172
3.8 | 25.72 | 0.7990 | 26.58 | 0.8044 | 26.82 | 0.8109
3.9 | 25.55 | 0.7930 | 26.41 | 0.7981 | 26.64 | 0.8046
4.1 | 25.22 | 0.7806 | 26.07 | 0.7851 | 26.43 | 0.7964
4.2 | 25.07 | 0.7741 | 25.91 | 0.7788 | 26.27 | 0.7907
4.3 | 24.94 | 0.7686 | 25.77 | 0.7731 | 26.12 | 0.7853
4.4 | 24.78 | 0.7643 | 25.62 | 0.7666 | 25.97 | 0.7795
4.5 | 24.49 | 0.7494 | 25.49 | 0.7610 | 25.84 | 0.7744
4.6 | 24.57 | 0.7550 | 25.34 | 0.7544 | 25.69 | 0.7685
4.7 | 24.26 | 0.7401 | 25.21 | 0.7487 | 25.55 | 0.7629
4.8 | 30.70 | 0.8221 | 25.09 | 0.7431 | 25.42 | 0.7579
4.9 | 30.53 | 0.8166 | 24.96 | 0.7369 | 25.28 | 0.7522
Method | HAT [32] + CiaoSR [5] | HAT [32] + LTE [28] | HAT [32] + ours
Scale | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM
2.1 | 32.66 | 0.9302 | 32.45 | 0.9295 | 33.32 | 0.9373
2.2 | 32.10 | 0.9253 | 31.93 | 0.9227 | 32.63 | 0.9302
2.3 | 31.38 | 0.9138 | 31.47 | 0.9159 | 32.00 | 0.9227
2.4 | 31.10 | 0.9101 | 31.03 | 0.9092 | 31.45 | 0.9150
2.5 | 30.71 | 0.9036 | 30.63 | 0.9025 | 30.92 | 0.9072
2.6 | 30.35 | 0.8977 | 30.24 | 0.8956 | 30.43 | 0.8992
2.7 | 29.46 | 0.8842 | 29.89 | 0.8890 | 29.98 | 0.8913
2.8 | 29.73 | 0.8857 | 29.56 | 0.8822 | 29.57 | 0.8834
2.9 | 29.45 | 0.8794 | 29.27 | 0.8759 | 29.19 | 0.8756
3.1 | 28.98 | 0.8758 | 28.70 | 0.8731 | 30.26 | 0.8876
3.2 | 28.81 | 0.8701 | 28.46 | 0.877 | 29.94 | 0.8821
3.3 | 28.65 | 0.8728 | 28.22 | 0.8704 | 29.61 | 0.8758
3.4 | 28.51 | 0.8692 | 27.98 | 0.8643 | 29.32 | 0.8699
3.5 | 28.18 | 0.8601 | 27.76 | 0.8581 | 29.02 | 0.8638
3.6 | 27.96 | 0.8552 | 27.55 | 0.8520 | 28.75 | 0.8578
3.7 | 27.79 | 0.8497 | 27.36 | 0.8459 | 28.49 | 0.8519
3.8 | 27.53 | 0.8424 | 27.17 | 0.8400 | 28.27 | 0.8463
3.9 | 27.41 | 0.8374 | 26.99 | 0.8339 | 28.04 | 0.8405
4.1 | 27.23 | 0.8367 | 26.62 | 0.8319 | 28.18 | 0.8404
4.2 | 26.92 | 0.8290 | 26.46 | 0.8259 | 27.98 | 0.8354
4.3 | 26.75 | 0.8242 | 26.32 | 0.8205 | 27.82 | 0.8311
4.4 | 26.65 | 0.8191 | 26.16 | 0.8144 | 27.63 | 0.8258
4.5 | 26.61 | 0.8153 | 26.03 | 0.8093 | 27.45 | 0.8211
4.6 | 26.31 | 0.8102 | 25.88 | 0.8033 | 27.30 | 0.8163
4.7 | 26.12 | 0.8054 | 25.74 | 0.7976 | 27.13 | 0.8112
4.8 | 25.92 | 0.8007 | 25.62 | 0.7923 | 26.99 | 0.8071
4.9 | 25.71 | 0.7952 | 25.48 | 0.7864 | 26.84 | 0.8021
Table 4. Comparison of the m-DCTformer with arbitrary-scale super-resolution approaches for FLOPs, parameters, model capacity, and time.
Method | FLOPs (G) | Parameters (M) | Model Capacity (MB) | Time (ms)
EDSR [19] + CiaoSR [5] | 599 | 42.83 | 490 | 1340
EDSR [19] + LTE [28] | 618 | 39.74 | 454 | 483
EDSR [19] + ours | 538 | 47.75 | 155 | 391
Table 5. Quantitative results with the m-DCTformer on the Set5 dataset for ×2.5. Bold represents the best PSNR and SSIM scores.
DCT Domain | Weight Modulation | PSNR | SSIM
- | - | 35.50 | 0.9335
- | 🗸 | 35.52 | 0.9335
🗸 | - | 35.93 | 0.9424
🗸 | 🗸 | 35.99 | 0.9426
Table 6. Quantitative results with the m-DCTformer on the Urban100 dataset for ×2.5. Bold represents the best PSNR and SSIM scores.
DCT Domain | Weight Modulation | PSNR | SSIM
- | - | 29.58 | 0.8841
- | 🗸 | 29.61 | 0.8842
🗸 | - | 29.84 | 0.8899
🗸 | 🗸 | 29.91 | 0.8917
Table 7. Quantitative results with the m-DCTformer on the Set14 dataset for ×3.5. Bold represents the best PSNR and SSIM scores.
The Parameter d | PSNR (Set14) | SSIM (Set14)
10 | 29.38 | 0.8128
20 | 29.49 | 0.8134
30 | 29.46 | 0.8133