Residual Dense Swin Transformer for Continuous-Scale Super-Resolution Algorithm

Abstract: Single-image super-resolution has a wide range of application scenarios and has therefore long been a hotspot in the field of computer vision. However, designing a continuous-scale super-resolution algorithm with excellent performance remains a difficult problem. To solve it, we propose a continuous-scale SR algorithm based on a Transformer, called the residual dense Swin Transformer (RDST). First, we design a residual dense Transformer block (RDTB) to enhance the information flow through the network and extract local fusion features. Then, we use multilevel feature fusion to obtain richer feature information. Finally, we use an upsampling module based on the local implicit image function (LIIF) to obtain continuous-scale super-resolution results. We test RDST on multiple benchmarks. The experimental results show that RDST achieves SOTA performance on in-distribution fixed-scale super-resolution tasks and significantly improves (by 0.1∼0.6 dB) out-of-distribution arbitrary-scale super-resolution. Extensive experiments show that RDST uses fewer parameters while outperforming SOTA SR methods.


Introduction
Single-image super-resolution (SISR) refers to the technical means of restoring a low-resolution image to a high-resolution image. It is widely used in the fields of medical imaging [1,2], remote sensing [3,4], and monitoring and security [5,6]. Therefore, this technology has long been a research hotspot in computer vision. In most of today's application scenarios, people expect to enlarge an image to an arbitrary scale without losing its high-frequency details. However, because a single low-resolution image can correspond to multiple different high-resolution images, SISR is an ill-posed problem. How to use a single model to approximate the optimal solution in the super-resolution space at arbitrary magnification is still an open problem. Therefore, studying a continuous-scale super-resolution algorithm with excellent performance is of great significance.
SISR algorithms can be divided into two categories: traditional methods and deep-learning-based methods. Yang [7] drew on the idea of compressed sensing, performed sparse representation of low-resolution images, and used prior knowledge to complete dictionary learning of high-resolution images to achieve super-resolution reconstruction; Gao et al. [8] used locally linear embedding from manifold learning to achieve a linear mapping from the low-resolution space to the high-resolution space. Both Glasner [9] and Huang [10] proposed example-based super-resolution methods; the difference is that the latter transforms patches to find more similar patches across the low-resolution and high-resolution images. However, the super-resolution effect of these traditional methods is limited, and they struggle to meet real-life application requirements. Algorithms based on deep learning, especially on convolutional neural networks, exhibit excellent performance that traditional methods lack. Since Dong [11] first brought CNNs into the SR field, countless SISR algorithms have been developed and the SRCNN structure has been improved. Most of them use residual connections, dense connections, and iterative supervision to continuously deepen the CNN [12-15]. Although this approach alleviates the limited receptive field of fixed-size convolutions to a certain extent, it does not fundamentally solve the problem of global information loss. In addition, there are very few studies on super-resolution at arbitrary scales. Lim et al. [16] trained multiple upsampling modules to achieve multiscale super-resolution at integer multiples. Refs. [17,18] used pooling layers and local implicit functions to achieve arbitrary-scale super-resolution, but both focused on building modules that can upsample at arbitrary scales while ignoring the importance of feature extraction.
The development of the Transformer [19] in the field of CV, especially the emergence of ViT [20] and the Swin Transformer [21], has provided new ideas for scholars in the field of SISR. ViT was the first method to successfully apply the Transformer to computer vision and match or even surpass the performance of CNNs. It slices the image into patches and performs patch embedding, using the result as the Transformer's input sequence. Building on this, some scholars [22-24] applied it to super-resolution tasks and achieved new SOTA performance at the time. The Swin Transformer then borrowed ideas from CNNs, introducing a shifted window to enhance local feature extraction and reduce computation. On this basis, ref. [25] introduced this window mechanism to further improve the super-resolution effect. However, none of these Transformer-based methods makes full use of both low-level and high-level information, and none achieves image super-resolution at a continuous scale.
Inspired by the Swin Transformer and LIIF, we propose the residual dense Swin Transformer to solve continuous-scale super-resolution with excellent performance. We propose the residual dense Transformer block (RDTB) structure on the basis of the Swin Transformer. By introducing residual and dense connections, we realize information interaction among all levels; we propose local feature fusion (LFF) to promote feature fusion within a block and design global feature fusion (GFF) to achieve information flow between blocks. Through information complementation between the low and high levels, the network can attend to the low-frequency and high-frequency information of the image simultaneously, and the Transformer's self-attention mechanism can take both local and global information in the image into account. We combine the patch-embedding characteristic of the Transformer with the local implicit continuous expression of the image, better coupling feature extraction with the upsampling module to achieve continuous-scale super-resolution reconstruction.
In summary, our contributions are as follows: (1) A high-performance super-resolution network, RDST, is proposed. The network makes full use of the low- and high-level information in the image and is combined with an LIIF upsampling module to achieve continuous-scale super-resolution reconstruction with a single model. (2) A novel RDTB structure is proposed, which uses LFF to fuse features locally within blocks and GFF to fuse features globally between blocks. At the same time, it combines shallow information to fully exploit the information in low-resolution images. (3) Comparison experiments at fixed in-distribution multiples and continuous-scale experiments on the benchmarks show that RDST matches the state-of-the-art (SOTA) methods at fixed multiples and greatly improves super-resolution results at magnifications outside the training distribution.

Related Work
This section gives a brief review of the CNN-based and Transformer-based SISR methods.

CNN-Based Super-Resolution Method
With the rise of deep learning, especially convolutional neural networks, CNN-based SISR algorithms have made brilliant achievements. SRCNN [11], proposed by Dong et al., is the pioneering work applying CNNs to SR. With the help of sparse coding, they introduced CNNs to SR tasks, setting a precedent for deep-learning-based SISR. Later, in response to the slow speed of SRCNN [11], they proposed FSRCNN [26], which uses post-upsampling and deconvolution layers to reduce network parameters and greatly improves speed, although its super-resolution quality is not improved over SRCNN [11]. The VDSR [12] of Kim et al. enhances the super-resolution effect by deepening the network, at the cost of greatly increased parameters. DRCN [13] deepens the network through recursion and uses residual learning and recursive supervision to stabilize training; although its parameter count is reduced, its computation is not, and the network is also difficult to train. To overcome the training difficulties caused by network deepening, SRResNet introduces local residuals. DRRN [27] combines local residuals, global residuals, and convolutional-layer recursion to reduce the computational cost and improve performance. EDSR removes the BN layers in the residual block and stacks deeper networks with the saved computation. SRDenseNet draws on DenseNet [28] and uses the complementary fusion of features at different depths for super-resolution. RDN [15] further improves on DRRN by applying residual dense blocks with local and global residual learning. RCAN [29] introduces channel attention into the residual block and uses the RCAB structure to improve the network's expression ability. MSRN [30] extracts rich feature information from a multiscale perspective. Although CNN-based methods have achieved much, the locality of the convolution kernel always limits the global feature extraction ability of such networks, which cannot fundamentally achieve effective fusion of global and local features.

Transformer-Based Super-Resolution Method
After [19,31,32] made brilliant achievements in NLP, scholars tried to apply the Transformer to computer vision, challenging the dominance of CNNs. With the introduction of ViT, DeiT [33], the Swin Transformer, etc., scholars have proposed Transformer-based SISR methods. IPT [22] introduces the Transformer into low-level vision tasks, using ImageNet pretraining and multitask learning, and performs well on SISR datasets; ESRT [23] combines a CNN backbone with a Transformer, using the Transformer's powerful global modeling capability to enhance the CNN. SwinIR [25] builds the RSTB structure on the Swin Transformer, effectively using the sliding-window mechanism for long-distance modeling and obtaining better performance with fewer parameters. Although these algorithms achieve varying degrees of improvement, current Transformer-based super-resolution algorithms all focus on applying the Transformer to fixed-multiple super-resolution. They fail to solve arbitrary-multiple super-resolution, and they fail to make full use of the low-level and high-level information in the network.

Network Architecture
As can be seen in Figure 1, RDST consists of a shallow feature extraction module, multiple RDTBs for multilevel feature extraction, a multilevel feature fusion module, and an LIIF-based upsampling module; the following subsections describe each part.

Shallow and Multilevel Feature Extraction
First, we use a 3 × 3 convolutional layer to extract the shallow features from the low-resolution image I_LR ∈ R^(H×W×C_in), which is expressed in Formula (1):

F_0 = H_SFE(I_LR), (1)

where H_SFE(·) refers to the shallow feature extraction module, and H, W, C_in, and C_out are the height, width, number of input channels, and number of channels of the shallow features F_0 ∈ R^(H×W×C_out), respectively. On the one hand, the convolutional layer makes good use of the underlying features of the image to restore an image more in line with human perception; on the other hand, it is conducive to subsequent global residual learning and stabilizes network training. Subsequently, we use multiple RDTBs to extract the features of each level F_i,LF ∈ R^(H×W×C_out), which is expressed in Formula (2):

F_i,LF = H_RDTB-i(F_i-1,LF), i = 1, ..., D, (2)

where H_RDTB-i(·) represents the ith of D RDTBs, F_i,LF is the feature it extracts, and F_0,LF = F_0. Each RDTB takes the output of the previous RDTB as its input, uses the Swin Transformer layers (STLs) in the block to extract image features, and uses local feature fusion to enhance the feature interaction within the block; local residual connections are introduced to stabilize network training and strengthen the network's feature expression ability. Finally, the final feature expression is obtained through the multilevel feature fusion module H_MLFF(·), which is expressed in Formula (3):

F_MF = H_MLFF(F_0, F_1,LF, ..., F_D,LF) = H_GFF(F_1,LF, ..., F_D,LF) + F_0, (3)

where H_GFF(·) represents the global feature fusion function between blocks. Through multilevel feature fusion and the introduction of the global residual, the network makes full use of the low-level and high-level features of the image to improve reconstruction.
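Formulas (1)-(3) describe a simple sequential data flow. The following sketch traces only the shapes of that flow; the bodies of `h_sfe` and `h_rdtb` are toy stand-ins (a channel projection and a residual nonlinearity), not the paper's implementation:

```python
import numpy as np

def h_sfe(i_lr, c_out=64):
    """Stand-in for the 3x3 shallow feature extraction conv:
    maps (H, W, C_in) -> (H, W, C_out) via a channel projection."""
    h, w, c_in = i_lr.shape
    rng = np.random.default_rng(0)
    weight = rng.standard_normal((c_in, c_out)) * 0.01
    return i_lr @ weight

def h_rdtb(f_in):
    """Stand-in RDTB: any shape-preserving transform with a residual."""
    return f_in + 0.1 * np.tanh(f_in)

def rdst_features(i_lr, num_rdtb=6):
    f0 = h_sfe(i_lr)                  # Formula (1)
    feats, f = [], f0
    for _ in range(num_rdtb):         # Formula (2): chained RDTBs
        f = h_rdtb(f)
        feats.append(f)
    f_gf = np.mean(feats, axis=0)     # stand-in for the H_GFF fusion
    return f_gf + f0                  # Formula (3): global residual

f_mf = rdst_features(np.ones((48, 48, 3)))
print(f_mf.shape)  # (48, 48, 64)
```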

Upsampling Module Using LIIF
Inspired by [18], we use a local implicit image function f_θ(·) in the upsampling module to express the discrete image continuously, namely, I = f_θ(z, x). The input of the function is any coordinate x to be predicted together with the corresponding feature vector z, and the output is the RGB value I at that coordinate. The feature vector corresponding to a predicted coordinate cannot be obtained directly, so it is estimated using the feature vectors of the four coordinates nearest to the predicted coordinate. The specific super-resolution process is as follows. We first perform feature unfolding on the fusion feature F_MF in the upsampling module, enriching each feature vector in the feature map with the information of its 3 × 3 neighborhood:

F̂_MF^(n,i,j) = Concat({F_MF^(i+k, j+l)}), k, l ∈ {-1, 0, 1}, (4)

where F_MF^(n,i,j) represents the nth feature vector in the fusion feature F_MF, with coordinates (i, j). Then, we use the nearby feature vectors to predict the RGB value at the query coordinate x_q:

I(x_q) = Σ_{t ∈ {00,01,10,11}} (S_t / S) · f_θ(F̂_MF^(n,t), x_q − v_t^(n)), (5)

where v_t^(n) represents the coordinate of the corresponding feature vector F̂_MF^(n,t); S_t is the area of the rectangle whose diagonal connects x_q and the coordinate opposite v_t^(n); S is the total area corresponding to the four feature-vector coordinates; and f_θ(·) predicts the RGB value at the query coordinate. Considering that the relationship between the pixel to be predicted and its surrounding pixels differs at different magnifications, a cell parameter is also fed into f_θ(·); it encodes the pixel size at the current magnification. In the actual prediction process, a five-layer MLP is used to achieve continuous-scale super-resolution of the image.
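The local ensemble of Formula (5) can be sketched in a few lines: the prediction at a continuous coordinate is the area-weighted sum of the decoder's outputs at the four nearest latent vectors. The decoder `f_theta` below is a trivial stand-in for the paper's five-layer MLP, and the coordinate convention and omission of the cell parameter are simplifying assumptions:

```python
import numpy as np

def liif_query(feat, coord, f_theta):
    """Predict the value at continuous coord (x, y) in [0, H) x [0, W)
    by area-weighting predictions from the four nearest latent vectors.
    `feat` has shape (H, W, C); `f_theta(z, rel)` is the decoder."""
    x, y = coord
    h, w, _ = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, h - 1), min(y0 + 1, w - 1)
    out, total = 0.0, 0.0
    for xi, yi in [(x0, y0), (x0, y1), (x1, y0), (x1, y1)]:
        # Weight each corner by the area of the rectangle diagonal to it,
        # so nearer latent vectors get larger weights (S_t / S in the text).
        wgt = abs((x - (x0 + x1 - xi)) * (y - (y0 + y1 - yi)))
        rel = np.array([x - xi, y - yi])   # relative coordinate x_q - v_t
        out += wgt * f_theta(feat[xi, yi], rel)
        total += wgt
    return out / total

feat = np.zeros((8, 8, 1))
feat[..., 0] = np.arange(8)[:, None]       # channel 0 stores the row index
f_theta = lambda z, rel: z[0]              # trivial decoder: return the latent
print(liif_query(feat, (1.5, 2.5), f_theta))  # 1.5
```

With this trivial decoder the ensemble reduces to bilinear interpolation; a learned MLP makes the interpolation content-adaptive.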

Residual Dense Transformer Block
As can be seen in Figure 1, an RDTB is composed of several STLs and a convolutional layer. Taking the ith RDTB as an example: for the input fusion feature F_i-1,LF, feature extraction and learning are performed through the multilayer STLs, and local feature fusion lets features at different levels flow interactively, enhancing the RDTB's local information extraction capacity. Finally, a residual connection is introduced to obtain the fusion feature F_i,LF. This is expressed in Formula (6):

F_i,LF = F_i-1,LF + H_LFF(F_i,1, F_i,2, ..., F_i,2M), (6)

where H_LFF(·) represents the local feature fusion function in the block and F_i,j is the output of the jth of 2M STLs.
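Formula (6) can be sketched as a short forward pass: run the STLs sequentially, fuse the collected intermediate outputs, and add the block-level residual. The stand-in "STLs" and the averaging fusion below are illustrative placeholders, not the paper's layers:

```python
import numpy as np

def rdtb(f_in, stls, h_lff):
    """Residual dense Transformer block (Formula (6), illustrative):
    chain the STLs, fuse all collected outputs with the local feature
    fusion function, then add the block-level residual."""
    feats, f = [f_in], f_in
    for stl in stls:
        f = stl(f)
        feats.append(f)
    return f_in + h_lff(np.concatenate(feats, axis=-1))

# Toy stand-ins: shape-preserving "STLs" and an averaging "1x1-conv" fusion.
stls = [lambda x: np.tanh(x) for _ in range(6)]
h_lff = lambda z: z.reshape(*z.shape[:-1], -1, 64).mean(axis=-2)

f_in = np.ones((4, 4, 64))
f_out = rdtb(f_in, stls, h_lff)
print(f_out.shape)  # (4, 4, 64)
```

Note that the fusion sees the block input and every STL output, which is what makes the connectivity "dense" rather than purely sequential.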

Swin Transformer Layer
The STL is adapted from the self-attention-based Transformer structure; its specific structure is shown in Figure 1. It uses window multihead self-attention (W-MSA) to calculate attention within each window, which solves the problem of the Transformer's huge computational cost on images; it also uses shifted-window multihead self-attention (SW-MSA) to realize information interaction between windows and thus achieve global information modeling. The specific process is expressed in the following formulas:

F̂_i,j = W-MSA(LN(F_i,j-1)) + F_i,j-1, (7)
F_i,j = MLP(LN(F̂_i,j)) + F̂_i,j,
F̂_i,j+1 = SW-MSA(LN(F_i,j)) + F_i,j,
F_i,j+1 = MLP(LN(F̂_i,j+1)) + F̂_i,j+1,

where F_i,j denotes the output feature of the jth STL in the ith RDTB, F̂_i,j is the intermediate output of (S)W-MSA, j ∈ {2, 4, ..., 2M}, and LN(·) is layer normalization. Since the STL calculates self-attention over the patches within a window, its position encoding also differs from that of the traditional ViT. Using relative position encoding, the self-attention within a window can be expressed as

Attention(Q, K, V) = SoftMax(QK^T/√d + B)V,

where Q, K, and V are the query, key, and value matrices, respectively; d is the query/key dimension; and B is the learnable relative position bias.
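The window-attention computation above is small enough to sketch directly. The single-head version below partitions the feature map into non-overlapping windows and applies SoftMax(QK^T/√d + B)V within each, with Q = K = V taken as the raw window tokens for simplicity (no learned projections, which is an assumption of this sketch):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def window_attention(x, win=8, b=None):
    """Single-head W-MSA over non-overlapping win x win windows.
    `x` has shape (H, W, C); `b` is the optional relative-position bias
    of shape (win*win, win*win)."""
    h, w, c = x.shape
    out = np.empty_like(x)
    for i in range(0, h, win):
        for j in range(0, w, win):
            tokens = x[i:i + win, j:j + win].reshape(-1, c)  # (win*win, C)
            scores = tokens @ tokens.T / np.sqrt(c)          # QK^T / sqrt(d)
            if b is not None:
                scores = scores + b                          # + B
            out[i:i + win, j:j + win] = (softmax(scores) @ tokens).reshape(win, win, c)
    return out

y = window_attention(np.random.default_rng(0).standard_normal((16, 16, 32)))
print(y.shape)  # (16, 16, 32)
```

SW-MSA can be obtained from the same routine by cyclically shifting the map (e.g. `np.roll(x, (-win // 2, -win // 2), axis=(0, 1))`) before attention and shifting back afterwards, which is what lets information cross window boundaries.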

Multilevel Feature Fusion
It can be seen in Figure 1 that, after each RDTB extracts its local fusion features, we apply multilevel feature fusion, which makes full use of the low-level and high-level information extracted by the network to enhance its feature expression ability. Multilevel feature fusion can be divided into two steps: global feature fusion and global residual learning.

Global Feature Fusion
Global feature fusion performs further information exchange on the local fusion features extracted by each level of RDTB. We concatenate the fusion features F_i,LF of all levels, first use a 1 × 1 convolution to achieve channel-wise information interaction and reduce network parameters, and then use a 3 × 3 convolution to enhance local context, obtaining the global fusion feature F_GF. This can be expressed as

F_GF = Conv_3×3(Conv_1×1(Concat(F_1,LF, ..., F_D,LF))),

where Concat(·) represents concatenation along the channel dimension.
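The concat → 1 × 1 conv → 3 × 3 conv pipeline can be written out explicitly: a 1 × 1 convolution is just a matrix multiply over the channel axis, and a 3 × 3 convolution is a sum of nine shifted channel projections. All weights below are toy values chosen so the result is easy to check:

```python
import numpy as np

def gff(level_feats, w1, w3):
    """Global feature fusion (illustrative): concat per-level features on
    the channel axis, mix channels with a 1x1 conv (w1: (D*C, C)), then
    apply a 3x3 conv (w3: (3, 3, C, C)) with zero padding."""
    z = np.concatenate(level_feats, axis=-1)   # (H, W, D*C)
    z = z @ w1                                 # 1x1 conv -> (H, W, C)
    h, w, c = z.shape
    zp = np.pad(z, ((1, 1), (1, 1), (0, 0)))   # zero-pad for the 3x3 conv
    out = np.zeros_like(z)
    for dy in range(3):
        for dx in range(3):                    # sum of 9 shifted projections
            out += zp[dy:dy + h, dx:dx + w] @ w3[dy, dx]
    return out

feats = [np.ones((5, 5, 4)) for _ in range(6)]         # D=6 levels, C=4
w1 = np.full((24, 4), 1.0 / 24.0)                      # 1x1 conv: average
w3 = np.zeros((3, 3, 4, 4)); w3[1, 1] = np.eye(4)      # 3x3 conv: center tap
out = gff(feats, w1, w3)
print(out.shape)  # (5, 5, 4)
```

With the identity center tap, the 3 × 3 stage passes the 1 × 1 output through unchanged, so `out` is all ones here; a learned kernel would mix the 3 × 3 neighborhood instead.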

Global Residual Learning
In order to introduce more high-frequency information from the image before upsampling, we use global residual learning: a long skip connection adds the shallow features F_0 extracted above to the global fusion features F_GF to obtain the final multiscale fusion feature F_MF. The long skip connection enables the network to learn residual information at a coarse-grained level, further improving its feature expression ability. The specific process can be expressed as

F_MF = F_GF + F_0.

Experiments

Dataset and Metrics
During the training process, we used the 800 high-definition images in DIV2K [34] as the model's training set; in the testing phase, we evaluated the model on several recognized benchmarks: Set5 [35], Set14 [36], BSD100 [37], Urban100 [36], and Manga109 [38]. At the same time, in order to evaluate and compare SR algorithms more objectively, we used PSNR and SSIM [39] as the metrics of model performance. It is worth noting that Transformer-based SR methods process images in blocks, so our algorithm used the same data boundary processing as SwinIR in the experiments.
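PSNR, the primary metric reported in the tables below, is straightforward to compute from the mean squared error; the sketch below assumes 8-bit images (data range 255):

```python
import numpy as np

def psnr(hr, sr, data_range=255.0):
    """Peak signal-to-noise ratio in dB between ground truth `hr` and
    reconstruction `sr` (same-shape arrays)."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

hr = np.full((8, 8), 100.0)
sr = hr + 10.0                   # constant error of 10 -> MSE = 100
print(round(psnr(hr, sr), 2))    # 28.13
```

In SR papers PSNR is typically computed on the Y (luminance) channel after cropping a scale-sized border; that convention is not shown here.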

Implementation Details
During the training process, we set the RDTB number, STL number, window size, embedding dimension, and attention head number to 6, 6, 8, 64, and 8, respectively. We randomly cropped low-resolution images into 48 × 48 tiles as the input. We used the Adam optimizer to train the model for 1000 epochs; the batch size was set to 64, the initial learning rate was set to 0.0001, and the learning rate was halved every 200 epochs. In the training phase, the magnification was the same within each batch, with its value drawn randomly from 1 to 4. Our model was implemented in the PyTorch framework and trained on 4 Tesla V100 GPUs. The L1 loss was used to optimize the parameters of RDST:

Loss = ∥I_HR − I_SR∥_1, (14)

where I_HR and I_SR represent the high-resolution (ground truth) and reconstructed super-resolution images, respectively.
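The two scalar ingredients of this training setup, the L1 loss of Formula (14) and the step learning-rate schedule, can be sketched directly (function names are illustrative):

```python
import numpy as np

def l1_loss(i_hr, i_sr):
    """Formula (14): mean absolute error between HR ground truth
    and the reconstructed SR output."""
    return np.mean(np.abs(i_hr - i_sr))

def learning_rate(epoch, base_lr=1e-4, halve_every=200):
    """Step schedule from the paper: start at 1e-4, halve every 200 epochs."""
    return base_lr * 0.5 ** (epoch // halve_every)

loss = l1_loss(np.ones(4), np.zeros(4))   # 1.0
lr_start, lr_late = learning_rate(0), learning_rate(999)
```

In PyTorch this schedule corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)`.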

Comparative Experiment
We compared our algorithm with several typical fixed-multiple SISR algorithms, including SRCNN, DRRN, SRDenseNet, EDSR, and RCAN. Each algorithm was tested on the 5 benchmarks. It should be noted that the comparison metrics were taken from the original papers; the SRCNN and EDSR numbers on Manga109 were taken from RCAN, and the DRRN numbers from RDN. RDST-s* denotes the RDST-s model trained on DIV2K + Flickr2K.

In-Distribution
Table 1 shows the PSNR of each algorithm at the fixed multiples ×2, ×3, and ×4. It can be seen that our RDST achieved the best performance. Compared with the classic neural network algorithms SRCNN, DRRN, and SRDenseNet, RDST shows powerful feature extraction capabilities. Figure 2 shows the visual results of our algorithm and the classic SR algorithms for fixed-multiple super-resolution, with the ×4, ×3, and ×2 scale factors shown from top to bottom. For "img078" in Urban100 and "zebra" in Set14, the super-resolution result of RDST preserves the texture details in the image; compared with the other methods, it has fewer artifacts and is more suitable for human perception. For "bird" in Set5, our result is also very close to the original HR image. The good visual effects show that RDST makes full use of multilevel features and the Transformer's global modeling capability.

Out of Distribution
Different from ordinary fixed-magnification SR methods, our proposed RDST can achieve super-resolution at any multiple with the help of LIIF. In order to further explore how different encoders combine with LIIF, CNN-based models were selected, including the EDSR baseline (EDSR(b)) and RDN, and compared with the proposed RDST. RDST-t, RDST-s, and RDST-b refer to the tiny, small, and base versions of RDST; their numbers of RDTBs and of STLs per block are four, six, and eight, respectively.
As Table 2 shows, RDST-s and RDST-b captured almost all of the best PSNR values at each scale. Especially for the out-of-distribution PSNR, our method is generally 0.1∼0.6 dB higher than the CNN-based models combined with LIIF. This finding fully proves the excellent generalization ability of RDST and the powerful feature extraction ability of the RDTB we designed. The strong out-of-distribution performance is also due to the combination of the Transformer's unique encoding with LIIF's continuous image expression. Figures 3-5 show the visual results at 6, 18, and 30 times magnification, respectively. It can be clearly seen from the figures that our proposed RDST achieves good visual results even at multiples outside the training distribution. Compared with other methods, RDST better retains texture details such as the "glass boundary" and "railing shape", keeps more high-frequency details of the image, and produces high-quality super-resolution images more suitable for the human eye.

Impact of LFF and GFF

Table 3 shows the impact of LFF and GFF on the performance of the model. The four models in the table have the same RDTB number (6), STL number (6), window size (8), channel number (64), and attention head number (8), and all were tested on Manga109. From the PSNR values in the table, it can be seen that adding LFF and GFF enhances the flow of information through the network and improves performance, verifying the effectiveness of LFF and GFF. It is worth noting that we also found a very interesting phenomenon: for in-distribution results, the model with only LFF obtained the best effect at each magnification, while for out-of-distribution results, the model combining LFF and GFF obtained the best effect at each magnification. We conjecture that this is because in-distribution super-resolution relies more on the high-level features of the network, and LFF can provide enough local semantic information to reconstruct the image, whereas out-of-distribution super-resolution requires all levels of the network to complement each other to achieve a better reconstruction.

Impact of Head Number
Table 4 shows the influence of the number of attention heads in the Transformer structure on model performance; all models were tested on Manga109. To compare the impact of different numbers of attention heads on RDST more intuitively, we also plotted the PSNR at the three scale factors within and outside the distribution as a line chart, as shown in Figure 6. For convenience of presentation, we denote the models as RDST1, RDST2, RDST4, and RDST8.
Combining the data in Table 4 with the lines in Figure 6, we can clearly see that for in-distribution super-resolution, RDST8 obtains the best PSNR, whereas for out-of-distribution super-resolution, RDST2 obtains the best PSNR. Previous studies have shown that different attention heads in the same Transformer layer can learn information in different subspaces, but most heads share the same attention pattern. Therefore, we conjecture that for in-distribution super-resolution tasks, where the scaling factor is small, different feature information can be obtained from different heads, and heads with similar attention patterns can supplement one another, improving the performance of the model. When the model performs an out-of-distribution super-resolution task, the scaling factor is large and heads with the same attention pattern cannot complement each other well; on the contrary, a large number of heads introduces unnecessary noise-like information, causing the model's performance to decline.

Conclusions
This paper proposed RDST, a Transformer-based super-resolution model that can perform continuous-scale super-resolution tasks with excellent performance. Based on the Transformer, we introduced dense connections and local residual learning and designed the RDTB with stronger feature extraction capabilities. Through multilevel feature fusion, we make full use of the information at each layer of the model, and LIIF then expresses the fused features continuously to obtain continuous-scale super-resolution results. The proposed RDST was tested on multiple benchmarks and achieved performance close to or better than SOTA methods at fixed in-distribution multiples, and it produced considerable improvements over the other methods at arbitrary out-of-distribution multiples. In general, the overall performance of RDST is better than that of state-of-the-art SR methods.

Figure 6 .
Figure 6. PSNR for different numbers of attention heads.

Table 1 .
Comparison with classical SISR methods. Best and second best performance are in red and blue, respectively.

Table 2 .
Quantitative comparison (average PSNR) with CNN methods on benchmark datasets. Best and second best performance are in red and blue, respectively.

Table 3 .
Ablation study of LFF and GFF. Best performance is in red.

Table 4 .
Ablation study of head number. Best performance is in red.