A Lightweight Dense Connected Approach with Attention on Single Image Super-Resolution

In recent years, neural networks for single image super-resolution (SISR) have employed ever deeper network structures to extract additional image details, which brings difficulties in model training. To deal with deep model training problems, researchers utilize dense skip connections to promote the model's feature representation ability by reusing deep features of different receptive fields. Benefiting from the dense connection block, SRDensenet has achieved excellent performance in SISR. Although the densely connected structure provides rich information, it also introduces redundant and useless information. To tackle this problem, in this paper, we propose a Lightweight Dense Connected Approach with Attention for Single Image Super-Resolution (LDCASR), which employs the attention mechanism to extract useful information in the channel dimension. In particular, we propose the Recursive Dense Group (RDG), consisting of Dense Attention Blocks (DABs), which obtains more significant representations by extracting deep features with the aid of both dense connections and the attention module, allowing the whole network to focus on learning more advanced feature information. Additionally, we introduce group convolution in the DABs, which reduces the number of parameters to 0.6 M. Extensive experiments on benchmark datasets demonstrate the superiority of our proposed method over five chosen SISR methods.


Introduction
Single image super-resolution (SISR) is an essential technique in computer vision that aims to recover a high-resolution (HR) image from a single low-resolution (LR) counterpart. It has been used in a large number of computer vision applications, such as medical imaging [1], surveillance imaging [2,3], object recognition [4], remote sensing imaging [5] and image registration and fusion [6,7]. For example, image registration requires HR images to provide richer details when transforming different sets of data into one coordinate system. However, as the upscaling factor increases, the complexity of image registration increases. As a result, it is vital to design an appropriate architecture for the SISR technique.
Traditional SISR methods can be divided into three categories: reconstruction-based methods [8,9], interpolation methods [10–12], and learning-based methods [13–17]. Recently, with the rapid development of neural networks, Convolutional Neural Network-based (CNN-based) SR methods have achieved remarkable performances [18–24]. In 2015, Dong et al. [18] proposed the first Super-Resolution Convolutional Neural Network (SRCNN), introducing a three-layer convolution neural network for single image super-resolution. Afterwards, a Fast Super-Resolution Convolutional Neural Network (FSRCNN) [19] was proposed to accelerate SRCNN, speeding it up by more than 40 times with better restoration quality via a modified network structure and smaller filter sizes. To deal with multi-scale SR problems, Lai et al. proposed a progressive upsampling framework named the Laplacian pyramid SR network (LapSRN) [20], which can progressively generate intermediate SR predictions. Since He et al. [25] presented the Residual Network (ResNet) to show that network depth can be of great significance for various computer vision tasks, some researchers began to increase network depth to enhance the SR effect. Kim et al. [21] proposed the very deep convolutional network for SR (VDSR). They extended the depth of the network to 20 layers and achieved higher performance compared with SRCNN and FSRCNN. Soon after, some works [22,23] also successfully demonstrated that deepening CNN networks could further boost SR performance. To alleviate the vanishing-gradient problem brought by the deep network structure, T. Tong et al. [24] proposed a Densely Connected Convolutional Network for SR (SRDensenet) based on dense skip connections, which achieved significant improvement in the image SR task.
SRDensenet exhibited a good performance at ICCV 2017 due to its effective integration of low-frequency and high-frequency features. However, as the network deepens, the number of model parameters grows rapidly, dramatically increasing computational complexity. In particular, SRDensenet wastes computation on low-frequency features. By simply stacking dense blocks, the network lacks the ability to discriminate and learn across feature channels, ignoring the inherent relationships among features. Thus, SRDensenet produces redundant and conflicting information within its rich features, which is useless for reconstruction. To resolve the above issues, in this paper, we propose a lightweight dense connected approach with attention for single image super-resolution, namely LDCASR, which uses the attention mechanism to learn more effective channel-wise features and a group convolution structure to decrease model parameters. Experiments show that the proposed LDCASR achieves better reconstruction performance than the state-of-the-art SISR methods on public SR benchmarks while significantly reducing the model parameters (about one-ninth of the parameters of SRDensenet). In summary, the main contributions of this paper are as follows.

1. We introduce the attention mechanism to the dense connection structure as well as the reconstruction layer, which helps to suppress the less beneficial information during model training. Extensive experiments verify the effectiveness of this attention-based structure.
2. Our model can extract the important features by using a lightweight approach. By introducing group convolution, we reduce the number of parameters to 0.6 M, which is around 1/9 of the original SRDensenet.
The remainder of this paper is organized as follows. In Section 2, we introduce the related work of super-resolution tasks and the attention mechanism. In Section 3, we describe the proposed network LDCASR architecture, including the details of its compositions: Recursive Dense Group (RDG), Dense Attention Block (DAB), and Channel Attention Unit (CAU). The experimental results and analysis on the comparison with other methods are provided in Section 4. Finally, we draw our conclusions in Section 5.

CNN-Based SISR
Dong et al. first introduced a three-layer CNN framework into SISR and proposed the super-resolution convolutional neural network (SRCNN) [18], which exhibited a remarkable performance compared to the traditional works [8–17] and opened the way for neural network-based SR research. After that, plenty of approaches based on convolution neural networks were proposed. The fast super-resolution convolutional neural network (FSRCNN) [19] introduced the deconvolution operation into the CNN model; it not only accelerated SRCNN but also enhanced its performance. To further speed up SR, Shi et al. [26] presented an effective method called the efficient subpixel convolutional neural network (ESPCN), which performs the upscaling step with the aid of subpixel convolution. Lai et al. [20] proposed a progressive upsampling framework called the Laplacian pyramid network (LapSRN) to increase image size gradually. By further deepening the network structure, the very deep super-resolution network VDSR [21] achieved a better result using a deeper network of almost 20 convolution layers. Afterwards, DRCN [22] and DRRN [23] also realized deep networks by employing recursive learning and parameter sharing. However, with the deepening of the network, the issue of gradient vanishing appeared. Researchers found that the skip connection [25] is a handy way to address gradient vanishing. Building on this idea, Ledig et al. [27] proposed a residual neural network for SR (SRResNet); they adopted the generator part of SRGAN as the model structure and employed residual connections between layers. After that, Lim et al. proposed two even deeper and wider networks: the enhanced deep SR network (EDSR) [28] and the multi-scale deep SR network (MDSR) [28]. These deep SISR networks improve performance by simply stacking different blocks.
However, they ignore channel-wise feature information. In fact, in addition to the height and width dimensions, the channel is another crucial dimension of an image. Channel attention assigns different weights to each channel, helping the network attend to important features and suppress unimportant ones. With channel attention, model performance can be improved with only a small amount of extra computation.

Dense Skip Connections in SISR
To deal with deep model training problems, researchers utilized dense skip connections to promote the model's feature representation ability by reusing deep features of different receptive fields. Dense skip connections were first proposed in DenseNet [29], which won the best paper award at CVPR 2017. Afterward, SRDenseNet [24] exhibited a good performance in SISR by introducing dense skip connections. Subsequently, many networks employed the densely connected structure in the SR task and exhibited remarkable performances. However, the densely connected structure simultaneously introduces redundant and useless information, which harms super-resolution reconstruction. Different from these methods, we combine the densely connected blocks with the attention mechanism to focus on learning important information.

Attention Mechanism
The attention mechanism was derived from the study of human vision and was first proposed in the field of visual images. The Google DeepMind team [30] proposed Recurrent Models of Visual Attention in 2014, applying the attention mechanism to image classification with an RNN model. In recent years, attention-based methods have yielded attractive results in various tasks, for instance, image recognition [31] and natural language processing [32]. Researchers found that the attention mechanism can not only reduce useless information by discriminating effective feature information but also emphasize importance along various dimensions. Wang et al. [33] designed a stackable network structure with a trunk-and-mask attention mechanism for image classification tasks. Hu et al. presented a novel block called squeeze-and-excitation (SE) [34], which models channel-wise associations using average-pooled features to increase the representational power of a CNN and ensure accurate image classification. The SE block was subsequently introduced into deep convolutional neural networks to further enhance performance [35]. Recently, Dai et al. [36] proposed a novel module called second-order channel attention (SOCA) to obtain more useful feature expressions as well as feature correlation learning. All the methods mentioned above obtained significant results. Inspired by these works, we introduce the attention mechanism to reinforce our network and improve reconstruction quality.

Our Model
In this section, we first describe the entire architecture of the proposed LDCASR. Then, we introduce the details of different components of the proposed network, including the Recursive Dense Group (RDG) and Dense Attention Block (DAB) with a Channel Attention Unit (CAU).

Network Architecture
As illustrated in Figure 1, our LDCASR mainly includes three modules: the feature extraction module, the upscale module, and the reconstruction module. We define the original LR input as I_LR and the output as I_SR.

Feature Extraction Module
The feature extraction module includes a 3 × 3 convolutional layer for shallow feature extraction and several Recursive Dense Groups (RDGs) for deep feature extraction. First, the original low-resolution image is fed directly into the 3 × 3 convolutional layer to extract the shallow feature; this part transfers the input from color space to feature space. The process can be expressed as:

F_0 = H_SF(I_LR),

where H_SF(·) represents the shallow feature extraction function, consisting of only one 3 × 3 convolution layer. Then, the extracted shallow feature F_0 passes through the stacked Recursive Dense Groups, followed by a 1 × 1 convolution layer that adjusts the number of channels. This process produces the deep image feature, denoted as:

F_DF = H_DF(H_RDG(F_0)),

where H_RDG(·) represents the feature extraction operation of the RDGs, and H_DF(·) represents the 1 × 1 convolution operation. Each RDG applies multiple Dense Attention Blocks (DABs) with Channel Attention to suppress redundant information (see Figures 2 and 3), which will be discussed in the following subsection.

Upscale Module
The upscale module is placed after the feature extraction module to upscale the feature maps from the LR size to the size of the ground truth. Instead of merely performing deconvolution or subpixel convolution [26] for upscaling, as in existing methods, we alternate channel attention units (shown in Figure 3) with deconvolution operations to better capture high-frequency information. Specifically, we used one deconvolution operation in the ×2 experiment, two in the ×4 experiment, and three in the ×8 experiment.
The low-level features contain much of the original image information, a large part of which is lost during forward propagation. Thus, it is important to combine the low-level features obtained by bicubic interpolation with the high-level information to obtain the final result. To fuse this extra original image information, the upscaled features are added to the bicubic-interpolated LR image to acquire the final output of the upscale layer, F_up.
The process is expressed as:

F_up = H_up(F_DF) + Bicubic(I_LR),

where the upscale operation is marked as H_up(·), F_DF denotes the deep feature produced by the feature extraction module, and Bicubic(·) denotes bicubic interpolation.
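The global skip described above can be sketched in a few lines of numpy. This is a minimal illustration, not the trained network: nearest-neighbour upsampling (via `np.kron`) stands in both for the learned deconvolution and for bicubic interpolation, and the function names are hypothetical.

```python
import numpy as np

def upscale_nn(x, factor):
    """Nearest-neighbour upsampling, a stand-in here for both the learned
    deconvolution and the bicubic interpolation of the LR input."""
    return np.kron(x, np.ones((factor, factor)))

def upscale_module(deep_features, lr_image, factor):
    """Sketch of the upscale module's global skip connection: the upscaled
    deep features are added to an interpolated copy of the LR input."""
    f_up = upscale_nn(deep_features, factor)   # stands in for H_up(F_DF)
    skip = upscale_nn(lr_image, factor)        # stands in for Bicubic(I_LR)
    return f_up + skip

lr = np.ones((4, 4))       # toy single-channel LR image
feat = np.zeros((4, 4))    # toy deep feature map
out = upscale_module(feat, lr, 2)
assert out.shape == (8, 8)  # ×2 upscaling
```

The skip path means the network only has to learn the high-frequency residual on top of a plain interpolation, which is why the model performs reasonably from the very first epochs.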

Reconstruction Module
The final SR image I_SR is obtained by the reconstruction module, which contains only one 3 × 3 convolution layer. This layer recovers the image from the feature space back to the color space. The reconstruction process can be expressed as:

I_SR = H_R(F_up) = H_LDCASR(I_LR),

where H_R(·) denotes the reconstruction layer and H_LDCASR(·) denotes the whole LDCASR pipeline. The network is optimized by the absolute difference between I_SR and I_HR.

Dense Attention Block (DAB) with Channel Attention Unit (CAU)
As mentioned before, the recursive dense groups (RDGs) are an essential component of our model. Each RDG consists of stacked Dense Attention Blocks (DABs) joined by dense connections. It was verified in [24] that a large number of dense blocks is beneficial for forming a deep CNN. However, stacked dense blocks introduce redundant and conflicting information, causing longer training times and unsatisfactory reconstruction results. Inspired by attention-based methods, we employ a channel-dimension attention mechanism to learn the high-frequency features and propose the Dense Attention Block (DAB) (see Figure 2), which contains two 3 × 3 convolution operations and a Channel Attention Unit (CAU). As a result, with the aid of DABs, our model is able to focus on acquiring more important and useful information. The process of a DAB can be expressed as:

F_n^h = f_cau(f_conv(f_relu(f_conv(f_cat([F_n^0, F_n^1, ..., F_n^{h−1}]))))),

where F_n^h and F_n^{h−1} denote the output and input of the h-th DAB in the n-th RDG, respectively, and f_conv(·), f_relu(·), f_cat(·), and f_cau(·) denote the convolution, ReLU, concatenation, and CAU operations, respectively.
We also denote the input of the CAU as F_in and the output as F_out. The specific formula is as follows:

F_out = F_in ⊗ f_sigmoid(f_conv(f_relu(f_conv(f_avgpool(F_in))))),

where f_avgpool(·) represents the average pooling operation, f_sigmoid(·) represents the sigmoid function, and ⊗ denotes the element-wise product.
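The CAU follows the squeeze-and-excitation pattern, which can be sketched in numpy as below. This is an illustrative sketch under assumptions: the 1 × 1 convolutions on the pooled vector are written as plain matrix multiplies, and the weight shapes (including a reduction ratio of 4) are hypothetical, since the paper does not state them.

```python
import numpy as np

def channel_attention_unit(f_in, w1, w2):
    """SE-style channel attention: squeeze by global average pooling,
    excite with two 1x1 'convolutions' (matmuls on the pooled vector),
    then rescale the input channels. f_in has shape (C, H, W)."""
    squeeze = f_in.mean(axis=(1, 2))                  # f_avgpool -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)            # f_conv + f_relu
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # f_conv + f_sigmoid
    return f_in * weights[:, None, None]              # element-wise product

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 16, 16))   # toy feature map, C = 8
w1 = rng.standard_normal((2, 8))       # hypothetical reduction ratio 4
w2 = rng.standard_normal((8, 2))
out = channel_attention_unit(f, w1, w2)
assert out.shape == f.shape
```

Because the sigmoid outputs lie in (0, 1), each channel is attenuated rather than amplified, which is how the unit suppresses redundant channels while passing informative ones through nearly unchanged.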

Group Convolution
Group convolution first appeared in the AlexNet [37] architecture, as shown in Figure 4. Unlike the standard convolution operation, group convolution divides the R input channels into G groups, so each group operates on only R/G channels. The group outputs are then concatenated to form the final output of the convolutional layer. To reduce the number of network parameters, we introduce group convolution in each dense attention block (DAB) to achieve a lightweight network.
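The parameter saving is easy to verify by counting weights: a grouped k × k convolution has G times fewer weights than a standard one, since each group maps only its own slice of channels. The sketch below (with hypothetical channel counts) shows the arithmetic.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a (possibly grouped) k x k convolution:
    each of the `groups` groups maps c_in/groups input channels
    to c_out/groups output channels, ignoring biases."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k * k

standard = conv_params(64, 64, 3)            # 64*64*3*3 = 36,864 weights
grouped = conv_params(64, 64, 3, groups=4)   # 4*16*16*3*3 = 9,216 weights
assert standard == 4 * grouped               # a G-fold reduction
```

Applied throughout the DABs, this G-fold reduction per convolution is what brings the whole model down to roughly 0.6 M parameters.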

Loss Functions
In SISR tasks, loss functions are used to measure reconstruction error and lead the model optimization direction.
In earlier work, many methods [18–20,22,23,25] employed the L2 loss, also named the mean square error (MSE) loss. However, researchers found that the L2 loss cannot measure reconstruction quality precisely, and the resulting reconstructions are often unsatisfactory. Consequently, more and more researchers turned to the L1 loss (mean absolute error), which achieves a better reconstruction effect. Furthermore, some researchers train their models with the L1 Charbonnier loss, which was used in LapSRN. Marking the original high-resolution image as I_HR, the loss functions can be expressed as:

L_2 = (1/N) Σ_i (I_HR(i) − I_SR(i))²,
L_1 = (1/N) Σ_i |I_HR(i) − I_SR(i)|,
L_Charbonnier = (1/N) Σ_i √((I_HR(i) − I_SR(i))² + ε²),

where h, w, and c represent the height, width, and number of channels of the feature maps, respectively, N = h × w × c, the sums run over all N elements, and ε is a small constant for numerical stability.
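The three losses can be written directly in numpy. This is a minimal sketch; the value of ε is an assumption (LapSRN uses 1e-3, which we adopt here), as the paper does not state it.

```python
import numpy as np

def l2_loss(hr, sr):
    """Mean square error (MSE): penalizes large errors quadratically."""
    return np.mean((hr - sr) ** 2)

def l1_loss(hr, sr):
    """Mean absolute error: more robust to outlier pixels than L2."""
    return np.mean(np.abs(hr - sr))

def charbonnier_loss(hr, sr, eps=1e-3):
    """Smooth variant of L1; eps**2 inside the square root keeps the
    gradient finite where the error is zero."""
    return np.mean(np.sqrt((hr - sr) ** 2 + eps ** 2))

hr = np.array([1.0, 2.0, 3.0])
sr = np.array([1.0, 2.0, 2.0])   # one pixel off by 1
assert l2_loss(hr, sr) == 1.0 / 3.0
assert l1_loss(hr, sr) == 1.0 / 3.0
```

Note how the Charbonnier loss converges to L1 away from zero error but stays differentiable at zero, which is the robustness property exploited in the training-curve comparison of Section 4.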
To compare different loss functions, we employed the three loss functions mentioned above to train the LDCASR. The comparison results are displayed in Section 4.

Training and Testing Datasets
Similar to previous works, we use DIV2K [38] as our training dataset. It consists of 800 training RGB images, 100 validation RGB images, and 100 test images. Following previous works, we employed the 800 training images as our training set, augmented with 90° rotations and horizontal flips. Furthermore, we used the bicubic kernel to down-sample the ground-truth images and generate the LR counterparts. The training data were generated with Matlab bicubic interpolation; training files were created with https://github.com/wxywhu/SRDenseNet-pytorch/tree/master/data, accessed on 11 April 2021.
We followed the previous works by converting the test images from RGB color space to YCbCr color space for evaluation. After that, we employed the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) to evaluate the performance of our structure only on the Y channel.
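The Y-channel PSNR metric used above is straightforward to compute. The sketch below assumes 8-bit intensity values (peak value 255); the function name is our own.

```python
import numpy as np

def psnr(hr, sr, max_val=255.0):
    """Peak signal-to-noise ratio in dB on a single channel (the paper
    evaluates only on the Y channel of YCbCr)."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

hr = np.full((8, 8), 128.0)
sr = np.full((8, 8), 130.0)   # uniform error of 2 -> MSE = 4
value = psnr(hr, sr)          # 10 * log10(255^2 / 4), about 42.1 dB
assert value > 40.0
```

Since PSNR is a log of the inverse MSE, the roughly 0.47 dB gain over SRDensenet reported in Section 4 corresponds to a noticeable reduction in mean squared reconstruction error.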

Implementation Details
The numbers of RDGs and DABs were both set to 8, identical to the numbers of dense blocks and sub-blocks in SRDensenet. Our model was trained with the ADAM optimizer (β_1 = 0.9, β_2 = 0.999); the batch size was set to 32, the initial learning rate was 0.0001, and the learning rate was decayed from its initial value every 30 epochs. We conducted experiments with scaling factors of ×2, ×4, and ×8 between the HR and LR images. We implemented our models in PyTorch on a single Titan Xp GPU.
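The step-decay schedule can be sketched as follows. The decay factor of 0.1 is an assumption for illustration; the paper states only that the rate is decayed every 30 epochs, not by how much.

```python
def step_lr(initial_lr, epoch, decay_every=30, gamma=0.1):
    """Step-decay schedule: multiply the learning rate by `gamma`
    (an assumed factor) once every `decay_every` epochs."""
    return initial_lr * gamma ** (epoch // decay_every)

# With the paper's initial rate of 1e-4:
assert step_lr(1e-4, 0) == 1e-4                 # epochs 0-29
assert abs(step_lr(1e-4, 30) - 1e-5) < 1e-12    # epochs 30-59
```

In PyTorch this schedule corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=30`.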

Different Loss Function Analysis
We show the training results over 60 epochs to compare the convergence curves of the L1 loss, L2 loss, and L1 Charbonnier loss. During gradient descent, the L1 loss was more robust than the L2 loss and was less easily affected by extreme feature points. This is because the L2 loss is based on the squared error, so large errors are penalized quadratically, making it more sensitive to outliers; as a result, the L1 loss is more stable than the L2 loss. Additionally, the L1 Charbonnier loss adds a small constant ε² inside the square root compared with the L1 loss, making it more robust still. Thus, the learning curve of the L1 Charbonnier loss yields the best result. As shown in Figures 5 and 6, the training performance using the L1 Charbonnier loss is better than the others: its curve is more stable and its average PSNR and SSIM are higher. Therefore, we chose the L1 Charbonnier loss to train our models in the following experiments. The definitions of the loss functions are given in the Loss Functions subsection.
In Table 1, we compare our model with the state-of-the-art methods, including SRCNN, VDSR, LapSRN, and SRDensenet; all the mentioned methods were trained under the same conditions. Bold indicates the best performance. As shown in Table 1, our model was more effective than the other methods under the scale factors of ×2, ×4, and ×8 on the four benchmark datasets. Notably, under the ×2 scale factor, the PSNR on the Urban100 dataset increased by almost 0.47 dB compared to SRDensenet. In Table 2, we compare the average computational time of different methods. As can be seen, our LDCASR achieved the second fastest performance for the ×2, ×4, and ×8 SR experiments, preceded only by SRCNN, which is a simple model with only three convolution layers. Furthermore, we present the performance curve on Set5 for ×4 SR in Figure 7, which intuitively shows that our LDCASR is superior to SRDensenet. Thanks to the bicubic-interpolation skip, our model exhibited a good performance from the beginning. Subsequently, as training continued, the PSNR of LDCASR stayed at around 32, while the PSNR of SRDensenet stayed at approximately 31.54. In addition, we provide comparisons in terms of visual quality, which are displayed in Figures 8 and 9.

Model Size Comparison
The comparisons of model size and performance are illustrated in Figure 10, which shows the model size and performance of different state-of-the-art methods. The abscissa refers to the number of model parameters, the ordinate denotes the average PSNR obtained by each model, and each point represents a model; the red point represents our proposed network. As can be seen, SRCNN and VDSR were lightweight but exhibited slightly lower performance. LapSRN exhibited a better performance than SRDensenet with fewer parameters. Our LDCASR had fewer parameters still and a relatively better performance, indicating that our model achieves a better trade-off between parameter scale and performance.

Conclusions
In this paper, in order to address the deficiencies of dense networks in the SR task, a lightweight dense connected approach with attention is proposed for SISR, in which dense attention blocks (DABs) capture important information in the channel dimension via a channel attention unit (CAU). The design of the DABs makes the whole network focus on high-frequency details and successfully suppresses useless information in smooth areas.
In addition, our model is lightweight, with fewer parameters, while delivering comparatively superior performance. Extensive experiments on four benchmark datasets demonstrate that our LDCASR achieves better results than other state-of-the-art SR methods in terms of both objective evaluation and subjective visual quality. In future work, we will explore more advanced model structures, such as models based on improved spatial attention or multi-attention mechanisms. Moreover, we will study the application of our model in other fields.

Conflicts of Interest:
The authors declare no conflict of interest.