Image Super-Resolution via Dual-Level Recurrent Residual Networks

Recently, feedforward super-resolution networks based on deep learning were proposed to learn the representation of a low-resolution (LR) input and the non-linear mapping from these inputs to a high-resolution (HR) output, but such methods cannot fully exploit the interdependence between LR and HR images. In this paper, we retain the feedforward architecture and introduce residuals at a dual level, proposing the dual-level recurrent residual network (DLRRN) to generate HR images with rich details and satisfactory visual quality. Compared with feedforward networks that operate at a fixed spatial resolution, the dual-level recurrent residual block (DLRRB) in DLRRN utilizes information in both the LR and HR spaces. The circulating signals in the DLRRB enhance spatial details through mutual guidance between the two directions (LR to HR and HR to LR). Specifically, the LR information of the current layer is generated from the HR and LR information of the previous layer; then the HR information of the previous layer and the LR information of the current layer jointly generate the HR information of the current layer, and so on. The proposed DLRRN has a strong early reconstruction ability and gradually restores the final high-resolution image. Extensive quantitative and qualitative evaluations on benchmark datasets were carried out, and the experimental results show that our network achieves excellent results in terms of network parameters, visual effects and objective performance metrics.


Introduction
Image super-resolution (SR), which reconstructs an HR image from the corresponding LR image, is an important image processing technique in computer vision. It has applications in many areas of the real world, such as medical imaging [1], surveillance and security [2] and satellite imaging [3].
The SR task has an inherent ill-posed problem: multiple different HR images can be recovered from a single LR image. To address this, researchers have proposed a number of SR reconstruction methods, which can be divided into two categories according to the reconstruction process: traditional methods (such as interpolation-based [4] or reconstruction-based [5] methods) and deep-learning-based methods. At present, the typical approach is to learn the non-linear LR-HR mapping through neural networks [2,6-8] to construct HR images. These networks compute a series of feature maps from LR images; the resolution is then increased by one or more upsampling layers to construct the final HR image. Compared with these purely feedforward methods, it is believed that using a feedback connection to guide the task can produce results that are better suited to the human visual system, i.e., visually satisfactory results [9].
Dong et al. [10] first applied a CNN model to the SR task and proposed SRCNN, which predicts the non-linear LR-HR mapping with a three-layer convolutional network. Its reconstruction results are significantly better than those of traditional methods. The advantage of the deep learning method comes from two key factors. Firstly, increasing the depth of the CNN model allows it to learn more complex mappings from LR to HR and improves SR performance. Secondly, adding residual connections to the network (globally [7], locally [11] or jointly [8]) can effectively alleviate the gradient vanishing and exploding problems caused by simply stacking more layers to deepen the network.
Although these deep-learning-based methods achieve superior results, they also have shortcomings. The main problem is that the deeper the network, the more parameters it requires and the more storage it occupies. A recursive structure is usually adopted to reduce network parameters. Networks with recursive structures (e.g., DRCN [7] and DRRN [8]) work at a single spatial resolution and, similar to most CNN-based approaches, transmit information in a purely feedforward manner.
In this paper, we add an additional level to the residual branch of the classical feedforward structure, so that our model becomes a dual-level network operating in different resolution spaces. Specifically, the HR-level (HRL) information is used to refine the LR-level (LRL) information through feedback connections, while LRL information enriches HRL information through feedforward connections, finally producing SR results with rich details that are visually satisfying. The DLRRB is composed of multiple groups of densely connected cross-level feature fusion blocks of the HRL (CLFFB_S) and of the LRL (CLFFB_L). We use the output of CLFFB_S (that is, the hidden information of the DLRRB, as shown in Figure 1a) as the feedback information in our network. The hidden information (F_lr,out^t and F_sr,out^t) in the DLRRB of each iteration is used to modulate the input of the next iteration and to output I_SR^t. To give our network an early reconstruction ability and obtain clearer SR images, as in [12], we input the LR image into each iteration and form a loss between the output SR and HR at every iteration. The principle of the feedback scheme in our network is that the HRL information in the feedback flow can refine the LR image features, and the refined LR features can guide the network to gradually construct better SR images. Our network orders the successive iterations toward the target HR image from easy to hard according to the difficulty of recovering the LR image. Such a learning process allows our DLRRN to handle more complex degradations, which the experimental results also confirm.
The DLRRN proposed in this paper differs from DSRN [13] in three points. Firstly, our network performs mutual correction on the feature maps of the image, whereas DSRN processes images directly, which increases the memory consumption of the network. Secondly, the proposed DLRRN outputs an SR image at each iteration, which provides the network with an early reconstruction ability and lets it deal with more complex degradation. Thirdly, we use the output of the last iteration as the final image, whereas the final output of DSRN is an LR image that must then be upsampled. In general, DLRRN and DSRN differ considerably in both performance and network structure.
In summary, our main contributions are as follows:
• This paper proposes a single-image super-resolution network via dual-level recurrent residuals (DLRRN), which uses both feedforward and feedback connections to generate HR images with rich details. This recursive structure with feedback connections has a small number of parameters while providing a powerful early reconstruction capability.
• Inspired by [14], a cross-level feature fusion block (CLFFB) for the SR task is designed as the core part of the DLRRB, which enhances information by effectively processing the cross-level information flow.
• Since the self-attention module [15] can describe the spatial correlation of any two positions in an image, we use it to propose the self-attention feature extraction block (SAFEB). SAFEB models local features by contextual relevance; it cooperates with the applied MS-SSIM [16] to improve reconstruction performance and produce better visual effects.
The remainder of this paper is arranged as follows: The second section introduces classic deep-learning-based super-resolution algorithms and recent attempts to apply feedback connections to super-resolution. The third section presents the details of our network. The fourth section covers the implementation details of our experiments and the analysis of the results. The fifth section summarizes this paper and notes some limitations of the algorithm.

Deep-Learning-Based Image Super-Resolution
Due to the powerful learning ability of deep learning, many scholars have introduced it into computer vision tasks (including SR) with excellent results. Dong et al. [10] proposed the first CNN-based SR method, SRCNN, which introduced a three-layer convolutional network to learn the complex mapping from LR to HR and was trained end-to-end. Conceptually, the CNN-based SR reconstruction process consists of three stages: feature extraction, non-linear mapping and image reconstruction. The VDSR proposed by Kim et al. [6] learns the LR-to-HR representation by stacking 20 convolutional layers. In [8], skip connections and adjustable gradient clipping are adopted to overcome the gradient vanishing and exploding that may occur as the network deepens. However, the deeper the model, the more parameters it needs, which is not conducive to practical applications. Reducing network parameters without sacrificing performance has therefore become a research hotspot: the DRCN [7] loops the same recursive layer 16 times, which effectively reduces parameters without reducing performance, and additionally uses skip connections and recursive supervision to alleviate training difficulties. A variety of skip connections have been used in SR tasks to improve reconstruction performance: the residual skip connections of [17] were applied in SRResNet [18] and EDSR [19]; SRDenseNet [20] applies the dense skip connections of [21]; and Zhang et al. [22] proposed RDN using local/global residual and dense skip connections. These networks can use or combine hierarchical features in a bottom-up manner through skip connections, but the shallow features extracted by the first few layers lack sufficient contextual information, and since they are reused in subsequent layers, this limits the reconstruction capability of the network. At the same time, skip connections make the network deeper, greatly increasing its parameters; such a large-capacity network occupies substantial storage resources and is prone to over-fitting. To solve these problems and give the network better generalization ability, this work proposes DLRRN with a recursive structure, in which LRL features are corrected by HRL features carrying more contextual information in a top-down information flow, while LRL information enriches HRL features in a bottom-up manner. In particular, the recursive structure in the DLRRN (shown in Figure 1b) plays a crucial role in implementing the feedback process.

Feedback Mechanism
A feedback network divides the prediction of the non-linear mapping from the input to the target space into multiple steps, so that the model has a self-correcting ability. In recent years, many network architectures have applied feedback mechanisms to various visual tasks [23-25].
Some researchers have attempted to introduce feedback mechanisms into SR tasks. The DBPN proposed by Haris et al. [23] realizes iterative error feedback through up- and down-projection units. The feedback block (FB) designed in [11] directly iterates convolution and deconvolution to realize up- and down-sampling, and feedback is realized through the output of the FB. To make the feedback mechanism suitable for image SR, this paper carefully designs a CLFFB as the basic module in DLRRN, instead of the simple, repeated up- and down-sampling of [11]. The information in our CLFFB is efficiently inter-corrected between the HRL and LRL via cross-level connections. The experimental results also demonstrate the excellent reconstruction performance of our carefully designed CLFFB.

Attention Mechanism
An attention module can model long-range dependencies and has been widely used in many tasks [11,15,26]. The study in [15] first proposed a self-attention mechanism to describe the global dependencies of inputs and applied it to machine translation. The work in [27] introduced self-attention mechanisms to learn better image generators. Subsequently, various attention modules have been widely used in computer vision tasks.
An attention module models features with learned weights to update them. For example, SENet [28] generates feature vectors in the channel direction through a global pooling operation, then learns the correlation among channels from these vectors, highlighting channel maps that carry much information and suppressing unimportant channel features according to the channel weights. CBAM [14] focuses on salient regions by extending the SE module to the spatial dimension. More and more attention mechanisms are being used in SR tasks: SFTGAN [29] adopts a spatial feature transformation layer to give the generated SR images more realistic and visually pleasing textures; the study in [30] explored the potential of reference-based super-resolution on remote sensing images, utilizing rich texture information from HR reference images to reconstruct details in LR images; and the study in [31] learned predicted convolution kernels and channel modulation coefficients from unsupervised degradation representations to handle various degradation models. To capture rich context and produce visually satisfactory SR images, this paper introduces a self-attention mechanism to SR and crafts SAFEB to better represent features with intra-class compactness.

Methods
This section introduces the details of our network architecture. Section 3.1 briefly introduces the overall network architecture. Section 3.2 presents the basic block of DLRRN, the DLRRB, which is composed of densely connected CLFFBs to handle the information flow. Section 3.3 introduces the CLFFB, the core part of our network, which enhances information by effectively handling the cross-level information flow. SAFEB is introduced in Section 3.4; because the self-attention mechanism models spatial position, it is helpful for calculating the MS-SSIM loss and thus achieves a better visual effect. Section 3.5 provides a detailed description of the loss function of our network, where we introduce MS-SSIM [16] to enable the network to produce results that are more consistent with human vision. Finally, the implementation details of our network are shown in Section 3.6.

Network Structure
Unlike models that work at a single spatial resolution, DLRRN enables information in the LR and HR spaces to guide each other. The overall structure of our DLRRN is shown in Figure 2. Specifically, in Figure 2a, CLFFB_L and CLFFB_S represent the LRL information space and HRL information space, respectively. The four colored arrows represent the transfer functions between the LRL and HRL. The purple (f_lr), brown (f_hr) and yellow (f_up) arrows also exist in a conventional RNN and provide information flow from LRL to LRL, HRL to HRL, and LRL to HRL, respectively. To let LRL information access HRL information with more context, this paper adds a green arrow (f_down) to realize the feedback of HRL information.
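To make the four transfer functions concrete, the following PyTorch sketch instantiates one plausible choice for f_lr, f_hr, f_up and f_down; the kernel/stride settings and the 4x scale are our illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

m = 64  # number of feature channels (m = 64 in the paper)

f_lr = nn.Conv2d(m, m, kernel_size=3, padding=1)    # LRL -> LRL (purple arrow)
f_hr = nn.Conv2d(m, m, kernel_size=3, padding=1)    # HRL -> HRL (brown arrow)
f_up = nn.ConvTranspose2d(m, m, kernel_size=8, stride=4, padding=2)   # LRL -> HRL (yellow)
f_down = nn.Conv2d(m, m, kernel_size=8, stride=4, padding=2)          # HRL -> LRL (green, feedback)

lr_feat = torch.randn(1, m, 32, 32)
hr_feat = f_up(lr_feat)       # 1 x m x 128 x 128 for a 4x model
lr_back = f_down(hr_feat)     # back to 1 x m x 32 x 32
```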

The DLRRN can be unfolded into T ordered iterations in time, as shown in Figure 2b. In order to give the DLRRN an early reconstruction ability and to carry output information in the feedback flow, we establish a loss function between each iteration's result and the HR image. The residual branch in each iteration t consists of three parts: a shallow feature extraction part (Conv + SAFEB), the dual-level recurrent residual block (DLRRB) and a dimension reduction block. Each DLRRB is weight-shared in time, while the up-sampled image in each iteration t bypasses the residual branch through a global residual skip connection. Therefore, the purpose of the residual branch in each iteration t is to restore the high-resolution residual image I_Res^t from the input low-resolution image I_LR. In this paper, we use Conv(s, m) and Deconv(s, m) to denote the regular convolution and deconvolution layers, respectively, where s and m denote the size and number of filters. We use Conv(3, 4m) and SAFEB to extract shallow features; in subsequent experiments, we set m to 64 by default. We provide the LR image I_LR to the LR feature extraction part and obtain the shallow feature F_in^t containing LR image information:
F_in^t = f_SAFEB(Conv(3, 4m)(I_LR)),
where F_in^t is the shallow-information input of the t-th DLRRB. The DLRRB of the t-th iteration receives the hidden information F_out^{t-1} of the previous iteration and the shallow feature F_in^t; F_out^t represents the output of the DLRRB in the t-th iteration.
The mathematical formula of the DLRRB is
F_out^t = H_DLRRB(F_out^{t-1}, F_in^t),
where H_DLRRB(·) refers to the DLRRB operation. The DLRRB output feature F_out^t generates a residual image I_Res^t through a dimension reduction block (DRB):
I_Res^t = Conv(F_out^t),
where Conv represents the dimension reduction operation.
The output SR image of the t-th iteration can be expressed as
I_SR^t = H_UP(I_LR) + I_Res^t,
where H_UP represents the up-sampling function; any up-sampling operation can be chosen, and here we use bilinear up-sampling. After T iterations, we obtain a total of T SR images I_SR^1, I_SR^2, ..., I_SR^T; we choose I_SR^T as the final output of our network.
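The recurrence above can be summarized in a short PyTorch sketch. The DLRRB and SAFEB below are stand-in modules (the real blocks are detailed in Sections 3.2-3.4), and the stand-in operates only in LR space with a bilinear upsample of the residual, which is a simplification of the dual-level design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DLRRN(nn.Module):
    """Unrolled recurrence: T weight-shared iterations, each emitting I_SR^t."""

    def __init__(self, m=64, T=4, scale=4):
        super().__init__()
        self.T, self.scale = T, scale
        self.shallow = nn.Conv2d(3, 4 * m, 3, padding=1)   # Conv(3, 4m)
        self.safeb = nn.Conv2d(4 * m, m, 1)                # stand-in for SAFEB
        self.dlrrb = nn.Conv2d(2 * m, m, 3, padding=1)     # stand-in for DLRRB, reused every t
        self.drb = nn.Conv2d(m, 3, 3, padding=1)           # dimension reduction block (DRB)

    def forward(self, lr):
        up = F.interpolate(lr, scale_factor=self.scale,
                           mode='bilinear', align_corners=False)   # H_UP(I_LR), global skip
        f_in = self.safeb(self.shallow(lr))                # F_in^t (same input every t)
        f_out = f_in                                       # initial hidden state
        outputs = []
        for _ in range(self.T):                            # weight sharing = recurrence
            f_out = self.dlrrb(torch.cat([f_out, f_in], dim=1))  # F_out^t from (F_out^{t-1}, F_in^t)
            res = F.interpolate(self.drb(f_out), scale_factor=self.scale,
                                mode='bilinear', align_corners=False)  # I_Res^t (upsampled here)
            outputs.append(up + res)                       # I_SR^t = H_UP(I_LR) + I_Res^t
        return outputs                                     # a loss is attached to every I_SR^t
```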

Dual-Level Recurrent Residual Block
The structure of the DLRRB is shown in Figure 3. The DLRRB of the t-th iteration receives the hidden information F_lr,out^{t-1} (F_sr,out^{t-1}) to correct the low-level representation F_lr,in^t (F_sr,in^t), and then outputs the high-level representation F_lr,out^t (F_sr,out^t) with richer features to the (t+1)-th iteration and the DRB. The DLRRB is composed of G groups of densely connected CLFFBs, and each CLFFB lets HRL features and LRL features interact to generate final SR images with rich details.

Figure 3. The internal structure of the DLRRB.
As can be seen from Figure 3, the DLRRB contains two branches: the SR branch, which generates an HRL feature map with rich details from refined LRL feature maps, and the LR branch, which refines LRL feature maps using detailed HRL feature maps. The two branches guide each other and gradually achieve our final image I_SR^T with rich detail.
At the beginning of the t-th DLRRB, the LR branch receives the input information F_lr,in^t and the output information F_lr,out^{t-1} of the previous iteration, and then concatenates and compresses them by Conv(1, m) to generate a rough input feature map L_0^t:
L_0^t = C_0^l([F_lr,out^{t-1}, F_lr,in^t]).
Similarly, for the SR branch:
H_0^t = C_0^h([F_sr,out^{t-1}, F_sr,in^t]),
where F_lr,in^t is F_in^t in Figure 2, [·,·] refers to concatenation, and C_0^{l(h)} represents the initial dimensionality reduction operation using Conv(1, m) in the LR (SR) branch. L_g^t and H_g^t denote the fused LRL and HRL feature maps of the g-th group of the DLRRB in the t-th iteration, respectively. L_g^t can be expressed as
L_g^t = C_g^l([L_0^t, f_lr,1^t, ..., f_lr,g^t]),
where C_g^l indicates that Conv(1, m) is used for dimension reduction in the g-th feature fusion group of the LR branch, and f_lr,g^t indicates the feature map output by the g-th CLFFB in the t-th iteration (see Figure 3). Similarly,
H_g^t = C_g^h([H_0^t, f_sr,1^t, ..., f_sr,g^t]).
To use the useful information from each group and to correct the input features F_in^{t+1} for the next iteration, we fuse the feature maps of each group (green arrows in Figure 3) to form the output of the DLRRB. For the LRL:
F_lr,out^t = C_FF^l([L_0^t, f_lr,1^t, ..., f_lr,G^t]).
For the HRL:
F_sr,out^t = C_FF^h([H_0^t, f_sr,1^t, ..., f_sr,G^t]),
where F_sr,out^t is F_out^t in Figure 2, and C_FF^{l(h)}(·) represents the feature fusion of the last layer of the t-th DLRRB in the LR (SR) branch, expressed as a Conv(1, m) function.
It is worth mentioning that, in the first DLRRB of the DLRRN, we initialize the hidden states as follows. For the LR branch: F_lr,out^0 = F_lr,in^1. For the SR branch: F_sr,out^0 = f_up(F_lr,in^1), i.e., the shallow LR feature up-sampled to the HR level.
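A minimal PyTorch sketch of one DLRRB pass, following the dense-fusion equations above; the CLFFB stand-in is a plain 3x3 convolution over the concatenated levels, and the inter-level up/down-sampling is omitted (both levels are assumed to share a spatial size here), so this shows only the dense wiring, not the full block:

```python
import torch
import torch.nn as nn

class DLRRB(nn.Module):
    def __init__(self, m=64, G=5):
        super().__init__()
        self.c0_l = nn.Conv2d(2 * m, m, 1)   # C_0^l: concatenate + compress
        self.c0_h = nn.Conv2d(2 * m, m, 1)   # C_0^h
        self.clffb_l = nn.ModuleList([nn.Conv2d(2 * m, m, 3, padding=1) for _ in range(G)])
        self.clffb_h = nn.ModuleList([nn.Conv2d(2 * m, m, 3, padding=1) for _ in range(G)])
        self.cg_l = nn.ModuleList([nn.Conv2d((g + 2) * m, m, 1) for g in range(G)])  # C_g^l
        self.cg_h = nn.ModuleList([nn.Conv2d((g + 2) * m, m, 1) for g in range(G)])  # C_g^h
        self.cff_l = nn.Conv2d((G + 1) * m, m, 1)   # C_FF^l
        self.cff_h = nn.Conv2d((G + 1) * m, m, 1)   # C_FF^h

    def forward(self, f_lr_in, f_lr_prev, f_sr_in, f_sr_prev):
        l0 = self.c0_l(torch.cat([f_lr_prev, f_lr_in], 1))   # L_0^t
        h0 = self.c0_h(torch.cat([f_sr_prev, f_sr_in], 1))   # H_0^t
        l_feats, h_feats = [l0], [h0]
        lg, hg = l0, h0
        for g in range(len(self.clffb_l)):
            # each CLFFB lets the two levels guide each other
            f_lr_g = self.clffb_l[g](torch.cat([lg, hg], 1))
            f_sr_g = self.clffb_h[g](torch.cat([hg, lg], 1))
            l_feats.append(f_lr_g)
            h_feats.append(f_sr_g)
            lg = self.cg_l[g](torch.cat(l_feats, 1))         # dense fusion: L_g^t
            hg = self.cg_h[g](torch.cat(h_feats, 1))         # H_g^t
        f_lr_out = self.cff_l(torch.cat(l_feats, 1))         # F_lr,out^t
        f_sr_out = self.cff_h(torch.cat(h_feats, 1))         # F_sr,out^t
        return f_lr_out, f_sr_out
```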

Cross-Level Feature Fusion Block
Different from the studies of [12,23], which directly fuse low-level and high-level features, we use a cross-level feature gating mechanism to selectively guide the enhancement of spatial details. Therefore, we propose an effective CLFFB (shown in Figure 4) to process the information flow in the network.

Specifically, the input of the CLFFB (taking CLFFB_L as an example) includes two parts. One part is the feature map H_g^t from the SR branch, which is resized to the same size as L_g^t by a convolution operation; together with the previous output L_g^t from the LR branch, it generates the cross-level feature map l_g^t:
l_g^t = Conv([H_g^t↓, L_g^t]),
where H_g^t↓ indicates the down-sampling operation applied to H_g^t. We feed the generated cross-level feature map l_g^t into two branches to refine the LRL features. One branch generates the weight vector α to re-weight the features in the channel direction:
α = Sigmoid(Conv(Conv(AvgPool(l_g^t)))),
where AvgPool(·) represents the global average pooling function, Conv represents Conv(1, m), and Sigmoid refers to the Sigmoid activation function. The other branch is used to generate an attention map β ∈ R^{H×W}:
β = Sigmoid(Conv([Mean(l_g^t), Max(l_g^t)])),
where Mean and Max are the average and maximum pooling functions along the channel axis, and Conv is Conv(1, 1). The generated weight vector α and attention map β are summed and multiplied element-wise with the feature map L_g^t to obtain a fine feature map, which is concatenated with the cross-level feature map H_g^t↓; the output f_lr,g^t of the CLFFB is then obtained through a convolution layer.
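The gating just described can be sketched as follows; the channel reduction ratio, the down-sampling geometry and the final fusion are our reading of the text rather than released code:

```python
import torch
import torch.nn as nn

class CLFFB_L(nn.Module):
    def __init__(self, m=64, down=4):
        super().__init__()
        # geometry chosen for even scale factors; illustrative only
        self.down = nn.Conv2d(m, m, kernel_size=2 * down, stride=down, padding=down // 2)
        self.fuse = nn.Conv2d(2 * m, m, 1)   # build l_g^t from [H_g^t down, L_g^t]
        self.alpha = nn.Sequential(          # channel weights: Sigmoid(Conv(Conv(AvgPool(.))))
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(m, m // 4, 1), nn.PReLU(),
            nn.Conv2d(m // 4, m, 1), nn.Sigmoid())
        self.beta = nn.Sequential(nn.Conv2d(2, 1, 1), nn.Sigmoid())  # Conv(1,1) on [Mean, Max]
        self.out = nn.Conv2d(2 * m, m, 3, padding=1)

    def forward(self, l_g, h_g):
        h_down = self.down(h_g)                               # H_g^t down-sampled to LR size
        lg = self.fuse(torch.cat([h_down, l_g], 1))           # cross-level map l_g^t
        a = self.alpha(lg)                                    # (N, m, 1, 1) channel weights
        b = self.beta(torch.cat([lg.mean(1, keepdim=True),
                                 lg.max(1, keepdim=True).values], 1))  # (N, 1, H, W) spatial map
        fine = l_g * (a + b)                                  # "summed and multiplied element-wise"
        return self.out(torch.cat([fine, h_down], 1))         # cascade with H_g^t down, then conv
```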

Self-Attention Feature Extraction Block
The scale of objects in LR images varies, and single-scale features cannot capture the multi-scale contextual information of different objects. Since the non-salient regions are relatively dispersed, the direct aggregation of multi-scale features may weaken the representation ability of important regions. We therefore separately place self-attention [15] (the structure is shown in Figure 5b) on features of different scales in order to focus more attention on visually important areas; on this basis we constructed SAFEB, as shown in Figure 5a.

We first feed the input low-level feature maps in parallel to dilated convolution layers with different dilation rates to extract rich features, then add a self-attention module [15] (as shown in Figure 5b) to each branch. The input and output of the self-attention block are denoted as F_att,in ∈ R^{m×H×W} and F_att,out ∈ R^{m×H×W}, respectively. The attention map A can be obtained by
A = Softmax(R_1(F_att,in)^T ⊗ R_1(F_att,in)),
where Softmax(·) is the softmax function and R_1(·) reshapes the input feature to R^{C×N}, with N = H × W.
Next, we combine the attention map A with F_att,in to generate enhanced attention feature maps, then add the input feature maps F_att,in to obtain the final output F_att,out:
F_att,out = R_2(R_1(F_att,in) ⊗ A) + F_att,in,
where R_2(·) reshapes the input features back to R^{C×H×W}.
In particular, we do not apply the self-attention module to the global average pooling branch and 1 × 1 convolution branch because these two branches are designed to use the minimum and maximum receptive fields to keep the intrinsic properties of the input.
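A compact sketch of the self-attention block as described: the feature map is flattened to C × N, an N × N attention map A is formed with softmax, and the re-weighted features are added back to the input. Learned query/key/value projections are omitted, matching the plain reshape-and-multiply formulation above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def forward(self, f_in):                       # f_in: (B, C, H, W)
        b, c, h, w = f_in.shape
        flat = f_in.view(b, c, h * w)              # R_1: reshape to C x N
        attn = F.softmax(flat.transpose(1, 2) @ flat, dim=-1)   # A: (B, N, N)
        out = (flat @ attn).view(b, c, h, w)       # R_2: reshape back to C x H x W
        return out + f_in                          # residual addition of F_att,in
```

Note that the N × N map costs O(N^2) memory, which is one reason the paper applies it only to selected branches.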

Loss Function
In deep neural networks, the loss function is the essential part, which determines the direction of our network optimization. We use the L1 loss function and MS-SSIM [16] loss function to optimize our network. The results show that our network can produce a better visual effect without reducing objective performance metrics (PSNR, SSIM), and achieve the balance between perception and objective evaluation metrics.
As image quality evaluation indices, PSNR and SSIM [32] are generally used to assess images generated by networks optimized with the L1 and L2 loss functions. However, L1 and L2 have one thing in common: they are based on per-pixel comparison of differences, without considering human visual perception, so a high PSNR value does not necessarily mean good visual quality. In [16], the structural similarity loss function (SSIM) and multi-scale structural similarity loss function (MS-SSIM) were designed to restore images with better visual quality. The SSIM loss function considers luminance, contrast and structure, thereby taking human visual perception into account. Generally speaking, the results obtained with SSIM are visually better than those obtained with L1 and L2.

SSIM for a certain pixel p is defined as
SSIM(p) = [(2µ_x µ_y + C1) / (µ_x^2 + µ_y^2 + C1)] · [(2δ_xy + C2) / (δ_x^2 + δ_y^2 + C2)] = l(p) · cs(p),
where x and y represent patches of the processed image and the real image centred at p, µ_x(y) is the mean value of x (y), δ^2_x(y) represents the variance of x (y), δ_xy is the covariance of x and y, and C1 and C2 are constants calculated as C1 = (k_1 L)^2 and C2 = (k_2 L)^2, where L is the gray value range of the image ([0, 255] for color images and [0, 1] for gray images). k_1 and k_2 are two constants with default values 0.01 and 0.03. It should not be overlooked that the mean and standard deviation are calculated with a Gaussian filter.
We can learn from [16] that the choice of the Gaussian kernel size used to calculate the mean and variance in SSIM is crucial. If it is too small, the calculated SSIM loss cannot preserve the local structure of the image well and artifacts appear; if it is too large, the network produces noise at the edges of the image. To avoid time-consuming tuning of the Gaussian kernel size, [16] proposed a multi-scale version of SSIM, defined as
MS-SSIM(p) = l_M^α(p) · ∏_{j=1}^{M} cs_j^{β_j}(p),
where l_M and cs_j denote the luminance term l(p) and the contrast-structure term cs(p) evaluated at scales M and j, respectively; for convenience, α and β_j are set to 1. Therefore, the MS-SSIM loss function is
L_MS-SSIM(P) = 1 − MS-SSIM(p̃),
where p̃ is the center pixel of the input image patch P. We combine L1 with MS-SSIM as the loss function of our network, defined as L_DLRRN:
L_DLRRN(ω) = (1/T) Σ_{t=1}^{T} ( ||I_HR^t − I_SR^t||_1 + θ · L_MS-SSIM(I_SR^t, I_HR^t) ),
where θ indicates the trade-off factor, ω denotes the parameters of the DLRRN, I_HR^1 = I_HR^2 = ... = I_HR^T all denote the same HR target, and I_SR^t represents the SR image reconstructed by the t-th iteration.
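The combined objective can be sketched as below. We assume a third-party MS-SSIM implementation (e.g., the pytorch_msssim package) and uniform weighting across the T outputs; both are assumptions rather than the paper's released code:

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim  # pip install pytorch-msssim (assumed dependency)

def dlrrn_loss(sr_list, hr, theta=0.1):
    """sr_list: the T outputs I_SR^1..I_SR^T; hr: the (shared) HR target in [0, 1]."""
    total = 0.0
    for sr in sr_list:
        l1 = F.l1_loss(sr, hr)                             # per-pixel L1 term
        l_msssim = 1.0 - ms_ssim(sr, hr, data_range=1.0)   # L_MS-SSIM term
        total = total + l1 + theta * l_msssim              # theta = 0.1 per Section 4.2
    return total / len(sr_list)                            # average over the T iterations
```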

Network Details
The activation function after each convolution and deconvolution layer is PReLU [32]. As in [12], we set different k in (De)Conv(k, m) according to the scaling factor to achieve up-/down-sampling of the feature maps, as shown in Table 1. We obtain a total of T SR images I_SR^1, I_SR^2, ..., I_SR^T and choose I_SR^T as the final output of our network. Our network can handle both gray and color images; the output channel of the last convolution layer is 1 or 3, accordingly.
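Since Table 1 is not reproduced here, the sketch below uses the kernel/stride/padding triples of SRFBN [12], which this paper states it follows; treat the concrete numbers as assumed values:

```python
import torch.nn as nn

# scale -> (kernel, stride, padding); SRFBN's settings, assumed to match Table 1
DECONV_CFG = {2: (6, 2, 2), 3: (7, 3, 2), 4: (8, 4, 2)}

def make_up_down(scale, m=64):
    """Deconv lifts LR-level features to the HR level; Conv with the same
    geometry brings them back (each followed by PReLU, per Section 3.6)."""
    k, s, p = DECONV_CFG[scale]
    up = nn.Sequential(nn.ConvTranspose2d(m, m, k, stride=s, padding=p), nn.PReLU())
    down = nn.Sequential(nn.Conv2d(m, m, k, stride=s, padding=p), nn.PReLU())
    return up, down
```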

Experimental Section
In this section, we describe the experimental process and analyze the results in detail. The public datasets, evaluation metrics, degradation models, training settings and experimental conditions are described in Section 4.1. Section 4.2 presents the experimental analysis. Firstly, we study the influence of the number of iterations T and the number G of CLFFB_L and CLFFB_S groups on reconstruction performance. Secondly, we analyze the loss function. Finally, we explore the influence of SAFEB on the experimental results. Section 4.3 describes the algorithm comparison and visualization results. We first analyze the network parameters and complexity; the trained models (BI × 2, BI × 3, BI × 4, BD × 3, DN × 3) are then compared with other algorithms.

Implementation Details
We use DIV2K [33] as the training dataset, which contains 800 training images and 100 validation images. To make the trained model more robust, the data are augmented in two ways, as described in [14]: (1) scaling: downscaling the images by factors of 0.8, 0.7, 0.6 and 0.5; (2) rotation and flipping: horizontally flipping and rotating by 90 degrees to expand the training data. We evaluated SR results on five standard benchmark datasets under the PSNR and SSIM [32] metrics: Set5 [25], Set14 [34], BSD100 [35], Urban100 [36] and Manga109 [37]. As in previous work, the experimental results are evaluated quantitatively on the luminance (Y) channel.
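A sketch of the two augmentations with PIL; the exact sampling policy (e.g., per-image probabilities) is our assumption:

```python
import random
from PIL import Image

SCALES = [1.0, 0.8, 0.7, 0.6, 0.5]  # 1.0 keeps the original size

def augment(img: Image.Image) -> Image.Image:
    s = random.choice(SCALES)
    if s != 1.0:  # (1) scaling augmentation
        img = img.resize((int(img.width * s), int(img.height * s)), Image.BICUBIC)
    if random.random() < 0.5:  # (2a) horizontal flip
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    if random.random() < 0.5:  # (2b) 90-degree rotation
        img = img.transpose(Image.ROTATE_90)
    return img
```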
To ensure a fair comparison with previous work, we take bicubic downsampling of the HR image as the standard degradation (denoted BI). To verify the generalization ability of our network over multiple degradation models, we further experimented with two additional degradation models, BD and DN [22]. BD first blurs the HR image with a 7 × 7 Gaussian kernel with a standard deviation of 1.6 and then downsamples it. DN first adds Gaussian noise with a noise level of 30 to the HR image and then obtains the LR image by standard bicubic downsampling. BI/BD/DN × n means that the HR image is degraded by BI/BD/DN with a downsampling factor of n to obtain the LR image; the resulting LR-HR image pairs are used for network training or testing, as shown in Table 2. Table 2. Degradation model experiments conducted in this paper.

BI×2: Under BI degradation, the scaling factor is 2.
BI×3: Under BI degradation, the scaling factor is 3.
BI×4: Under BI degradation, the scaling factor is 4.
DN×3: Under DN degradation, the scaling factor is 3.
BD×3: Under BD degradation, the scaling factor is 3.
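The three degradations can be reproduced approximately as follows; torch's bicubic interpolation stands in for MATLAB-style bicubic resizing, so small numerical differences from the usual SR protocol remain:

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=7, sigma=1.6):
    """7x7 Gaussian kernel with sigma 1.6, as in the BD definition."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    k = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k2d = torch.outer(k, k)
    return k2d / k2d.sum()

def degrade(hr, scale=3, mode="BI"):
    """hr: (B, C, H, W) tensor in [0, 1]; returns the degraded LR tensor."""
    x = hr
    if mode == "BD":  # blur first, then downsample
        c = x.shape[1]
        k = gaussian_kernel().view(1, 1, 7, 7).repeat(c, 1, 1, 1)
        x = F.conv2d(F.pad(x, (3, 3, 3, 3), mode="replicate"), k, groups=c)
    if mode == "DN":  # per the paper: noise (level 30 on a 0-255 scale) added to HR first
        x = (x + torch.randn_like(x) * 30.0 / 255.0).clamp(0, 1)
    return F.interpolate(x, scale_factor=1 / scale, mode="bicubic", align_corners=False)
```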
In our training process, we set the input batch size to 8. In order to make the extracted features contain more LR image context, similar to [12], we set different patch sizes for different scaling factors (Table 3 lists the input patch size settings). We initialized the network parameters using the method in [24] and used Adam [26] as the optimizer. The initial learning rate was 0.0001 and was halved every 150 epochs; we trained for a total of 600 epochs. We implemented our network in the PyTorch framework and trained it on a TITAN RTX GPU.
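The training schedule above maps directly onto a standard PyTorch setup; the loop below reuses the DLRRN and dlrrn_loss sketches from Section 3 and a dummy loader in place of a real DIV2K patch loader:

```python
import torch

model = DLRRN()  # sketch from Section 3.1
# dummy stand-in: one batch of 8 LR/HR patch pairs (real patch sizes follow Table 3)
loader = [(torch.randn(8, 3, 48, 48), torch.randn(8, 3, 192, 192))]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial lr 0.0001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150, gamma=0.5)  # halve every 150 epochs

for epoch in range(600):  # 600 epochs in total
    for lr_img, hr_img in loader:
        optimizer.zero_grad()
        sr_list = model(lr_img)              # the T intermediate SR outputs
        loss = dlrrn_loss(sr_list, hr_img)   # combined L1 + MS-SSIM loss (Section 3.5)
        loss.backward()
        optimizer.step()
    scheduler.step()
```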

Study of T and G
In this subsection, we discuss the effect of the number of iterations (denoted T) and the number of groups (denoted G) of CLFFB_L and CLFFB_S in the DLRRB on the reconstruction results. We first set G = 5 to analyze the effect of T; the experimental results, shown in Figure 6a, highlight that the reconstruction quality increases with T, indicating that CLFFB is effective for the SR task. In addition, we visualized the effect of T on the BI × 4 model (as shown in Figure 7, the first group is the reconstructed RGB image and the second group is its corresponding residual image I_Res^T). Then, we fixed T = 4 to study the influence of G on reconstruction; the convergence curves are shown in Figure 6b. The larger the G value, the better the reconstruction performance, indicating that a deeper network has stronger representation capability. Overall, choosing a larger T or G helps obtain better results; considering both network performance and parameter count, we use DLRRN (T = 4, G = 5) in the following discussion.

Figure 7. In the test of the best model of BI × 4, the first group indicates that the reconstruction performance improves with the increase in T. The second group is its corresponding residual map I_Res^T.


Analysis of Loss Function
We use Equation (22), L_DLRRN, as the loss function to optimize our network. We first explored the influence of the hyperparameter θ on training, using a bisection-style search over the value range of θ, as shown in Figure 8; the experiments show that θ = 0.1 gives the relatively optimal result (i.e., the largest PSNR). At the same time, our network was compared with a network trained with L1 alone, and the L_DLRRN-trained results were slightly higher in the objective evaluation metrics (32.28 vs. 32.26 in Figure 8). Additionally, we showed that when L_DLRRN and L1 training reached the same PSNR, our network produced superior visual effects thanks to MS-SSIM, as shown in Figure 9 (the visual evaluation metrics PI [38] and LPIPS [39] are shown at the bottom of the figure) and in Table 4.





Ablation Analysis of SAFEB
For the ablation analysis of SAFEB, we replaced SAFEB with a Conv(1, m) convolution layer as our baseline. As shown in Table 5, we obtained the following results: within 100 epochs, when SAFEB acted on the network alone, the reconstruction performance increased slightly; at 200 epochs, performance improved by 0.04 dB (32.38 vs. 32.42), which shows that SAFEB is effective. Although SAFEB alone did not significantly improve objective performance, our experiments showed that it improves visual quality. As shown in Table 4, we tested the BI × 4 model on Set5 at PSNR = 32.40 and used PI [38] and LPIPS [39] as visual quality metrics, which demonstrated the effectiveness of SAFEB and MS-SSIM in improving the visual effect.

Network Parameters and Complexity
We compared DLRRN with ten deep-learning-based SR methods: SRCNN [10], VDSR [6], DRRN [8], MemNet [40], EDSR [19], DBPN-S [23], D-DBPN [24], SRFBN [12], USRNet [41] and RFANet [42]. The comparison of network parameters and reconstruction performance (PSNR) is shown in Figure 10. Figure 10a shows that our method achieves a favorable balance between parameter count and reconstruction performance: our network requires only 35% and 8% of the parameters of D-DBPN and EDSR, respectively, while achieving better reconstruction results. Although RFANet performs slightly better than our network, its parameter count is twice ours. Overall, compared with other recent methods, our network is lighter and more efficient.

We also compared the FLOPs of DLRRN with other algorithms, as shown in Figure 10b. Compared with SRFBN, the FLOPs of our algorithm increase by 75%, while performance improves by 0.19 dB; compared with USRNet, the FLOPs are reduced by 68% with comparable performance. Overall, the FLOPs also reflect the effectiveness of our algorithm to some extent. Since our algorithm works in both the LR and HR spaces and adopts a dense structure, the computational complexity of the network is higher; in future work we will try to reduce the complexity substantially without affecting the reconstruction quality.
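Parameter counts like those in Figure 10a are straightforward to reproduce; FLOPs require a profiler (e.g., third-party packages such as thop or fvcore), which we only mention here:

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Trainable parameter count, comparable to the numbers in Figure 10a."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"{count_params(DLRRN()) / 1e6:.2f}M parameters")  # DLRRN sketch from Section 3.1
```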

Results of Evaluation on BI Model
We compare DLRRN with ten recent image SR methods: SRCNN [10], VDSR [6], DRRN [8], SRDenseNet [20], MemNet [40], EDSR [19], D-DBPN [23], SRFBN [12], USRNet [41] and RFANet [42]. The quantitative evaluation results are shown in Table 6. EDSR uses more filters than our method (256 vs. 64), while D-DBPN, USRNet and DRN use more training images (DIV2K + Flickr2K vs. DIV2K); nevertheless, our DLRRN obtains competitive results. Table 6. Quantitative evaluation of comparative algorithms on BI degradation models. Red indicates the best SR reconstruction performance and blue the second best. We show the SR visualization results for BI × 4 in Figure 11. The proposed DLRRN produces more convincing results (as the RFANet code is not open source, we do not have access to its visual results). From the SR results for the "BokuHaSitatakaKun" image in Manga109, the "M" letters reconstructed by DRRN and MemNet are separated, VDSR, EDSR and D-DBPN cannot restore clear texture, the image generated by SRFBN is blurry, and the edges restored by USRNet contain many artifacts. The proposed DLRRN produces clear images, even smoother than the label. In addition, for "img 092" in Urban100, the texture directions of the SR images reconstructed by the comparison methods other than SRFBN and USRNet are all wrong. Our proposed DLRRN lets HRL and LRL information correct each other in the iterative process and is optimized with the L1 and MS-SSIM loss functions, so the obtained SR image is smoother than the ground truth and more consistent with human vision.


Results of Evaluation on BD and DN Models
To verify the generalization ability of our model, the proposed DLRRN was also trained on the BD and DN degradation models and compared with SRCNN [10], VDSR [6], IRCNN_G [43], IRCNN_C [43], SRMD(NF) [44], RDN [22], SRFBN [12] and RFANet [42]. The quantitative evaluation results are shown in Table 7; our algorithm performs well on most datasets. Figure 11. Comparison of the visual effect of the method in this paper with other methods on BI × 4.

We show two groups of SR visual results tested on the BD and DN models in Figure 12. From the visualization results, we can see that our network reduces distortion and recovers SR images with more details. From the overall experimental results, we conclude that our network handles BD and DN degradation robustly and effectively.

Conclusions and Discussion
In this paper, we realize image super-resolution reconstruction by adding an extra level to a feedforward super-resolution network, yielding the dual-level recurrent residual network (DLRRN), which lets HRL information and LRL information guide each other through the iterative process so as to reconstruct better SR images. The proposed CLFFB plays an important role in the iterative process; it effectively fuses the cross-level information flow and enhances features. We use the combination of the L1 and L_MS-SSIM loss functions to trade off objective performance metrics and visual effects. In conclusion, our comprehensive experimental results show that the proposed DLRRN performs well on both objective evaluation metrics and visual effects.
However, the proposed method has the limitation of high complexity compared to a pure feedforward network (in which the high-level feature learning stage works only in the LR space), owing to the dense structure and to working in both the HR and LR spaces. Our experimental results also show (Figure 13) that the SR images generated by our network have a good visual effect in the middle of the image, but the restoration of image edges is not ideal. We find that the calculation of the standard deviation in SSIM(p) needs the support of a pixel neighborhood, so SSIM(p) and its derivatives cannot be computed in some boundary regions of p. Our future work will continue to explore producing satisfying visual effects while recovering better edge information.
In future studies, we will explore lightweight SR network designs and try to introduce a non-parametric attention mechanism or dynamic convolution layers to enhance information extraction in the high-level information learning stage. We will improve the reconstruction block of the network and design a more efficient reconstruction part instead of simply using transposed convolution or sub-pixel convolution. At the same time, we will apply this work to video SR or introduce it into real-world, real-time broadcasting.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
LR (HR): Low-(high-)resolution
DLRRB: Dual-level recurrent residual block
HRL (LRL): HR-level (LR-level)
CLFFB_S/CLFFB_L: Cross-level feature fusion block of HRL (LRL)
SAFEB: Self-attention feature extraction block
CLFFB: Collectively refers to CLFFB_S and CLFFB_L
DRB: Dimension reduction block
BI: Obtaining the LR image by bicubic downsampling of the HR image
BD: First blurring the HR image with a 7 × 7 Gaussian kernel with a standard deviation of 1.6, then downsampling
DN: First adding Gaussian noise with a noise level of 30 to the HR image, then obtaining the LR image by standard bicubic downsampling