Remote Sensing Image Super-Resolution Based on Dense Channel Attention Network

Abstract: In recent years, convolutional neural network (CNN)-based super-resolution (SR) methods have been widely used in the field of remote sensing. However, complicated remote sensing images contain abundant high-frequency details, which are difficult to capture and reconstruct effectively. To address this problem, we propose a dense channel attention network (DCAN) to reconstruct high-resolution (HR) remote sensing images. The proposed method learns multi-level feature information and pays more attention to important and useful regions in order to better reconstruct the final image. Specifically, we construct a dense channel attention mechanism (DCAM), which densely reuses the feature maps from the channel attention blocks via skip connections. This mechanism makes better use of multi-level feature maps, which contain abundant high-frequency information. Further, we add a spatial attention block, which gives the network more flexible discriminative ability. Experimental results demonstrate that the proposed DCAN method outperforms several state-of-the-art methods in both quantitative evaluation and visual quality.


Introduction
High-resolution (HR) remote sensing images provide detailed geometric information about land cover. Thus, HR remote sensing images are essential for many applications, such as object detection [1,2], urban planning [3], building extraction [4][5][6] and so on [7][8][9][10][11][12][13][14][15]. However, the spatial resolution of remote sensing images is influenced by the limitations of hardware and environmental factors [16][17][18]. Compared to current physical imaging technology, super resolution (SR), which recovers HR images from low-resolution (LR) images, is more convenient and less costly. Thus, SR has become an alternative approach in the remote sensing field.

SR can be classified into single-image super resolution (SISR) and multiple-image super resolution (MISR) [19] according to the number of input images. MISR methods utilize multiple LR images of the same area to provide more information and better reconstruct high-spatial-frequency details and texture [20]. Nevertheless, it is difficult to obtain multiple remote sensing images of the same scene; thus, this paper studies SISR methods.

Traditional SISR methods [21] are mainly grouped into interpolation-based methods and reconstruction-based methods [22]. Interpolation-based methods predict unknown pixels using a simple linear or non-linear interpolation [23] operation. Although interpolation-based methods are convenient, their performance is limited for images that contain rich details. Reconstruction-based methods utilize certain kinds of prior information to produce better results, such as local, nonlocal and sparse priors [23,24]. Although reconstruction-based methods [25] are flexible and allow consideration of different prior constraints [26][27][28][29][30][31], they also struggle with complicated remote sensing images. Recently, deep learning-based methods have attracted a lot of attention in the remote sensing image super-resolution task.
In 2015, the super-resolution convolutional neural network (SRCNN) [32] was first proposed by Dong et al. to achieve super resolution of natural images. As the pioneer, SRCNN learned the mapping between HR images and the corresponding LR images using a three-layer network. While SRCNN outperformed traditional methods, its bicubic-upsampled LR inputs forced the network to operate in a high-dimensional space and largely increased the computational cost. To alleviate this problem, the fast super-resolution convolutional neural network (FSRCNN) [33] and the efficient subpixel convolutional network (ESPCN) [34] were proposed, which used a deconvolutional layer and a subpixel convolutional layer, respectively, to reconstruct directly from the low-dimensional space and save computation. These networks were shallow, and their performance was limited by the network depth. However, simply increasing the network depth leads to vanishing and exploding gradients. To handle this problem, the skip connection operation [35] was proposed by He et al., which combines low-level features and high-level features to effectively alleviate gradient vanishing. Thus, the skip connection operation was gradually adopted in SR networks. Among these networks, the very deep super-resolution convolutional network (VDSR) [36], which used a global residual connection, was proposed by Kim et al. to propagate the LR information to the network end. It was the first network to introduce residual learning to SR and succeeded in training a 20-layer network. Besides, enhanced deep super resolution (EDSR) [37] and SRResNet [38] also used global residual connections. In addition, EDSR and SRResNet employed residual blocks as the basic network module, introducing local residual connections to ease the training of deep networks. Later, Zhang et al. [39] constructed a residual-in-residual structure where residual blocks compose residual groups using short and long skip connections.
Further, the cascading residual network (CARN) [40], the dense deep back-projection network (D-DBPN) [41], SRDenseNet [42], and the residual dense network (RDN) [43] employed dense or multiple skip connections to improve training.
Deep learning-based SR methods in the field of remote sensing have also developed rapidly in recent years. In 2017, a local-global combined network (LGC) [19] was first proposed by Lei et al. to enhance the spatial resolution of remote sensing images.
LGC learns multi-level information, including local details and global priors, using the skip connection operation. In 2018, a residual dense back-projection network (RDBPN) [22] was proposed by Pan et al., which consists of several residual dense back-projection blocks containing an up-projection module and a down-projection module. In 2020, Zhang et al. proposed a scene-adaptive method [44] via a multi-scale attention network to enhance SR reconstruction details under different remote sensing scenes. Recently, the dense-sampling super-resolution network (DSSR) [45] presented a dense sampling mechanism that reuses an upscaler for upsampling to overcome the large-scale SR reconstruction problem for remote sensing images. However, the complex spatial distribution of remote sensing images needs more attention. In 2020, a second-order multi-scale super-resolution network (SMSR) [46] was proposed by Dong et al. to reapply the learned multi-level information to the high-frequency regions of remote sensing images. The multi-perception attention network (MPSR) [47] and the multi-scale residual neural network (MRNN) [48] also exploit multi-scale information. In addition, generative adversarial network (GAN)-based SR methods are used to generate visually pleasing remote sensing images. In 2019, Jiang et al. presented an edge-enhancement generative adversarial network (EEGAN) [49], which introduces an edge enhancement module to improve SR performance on remote sensing images. In 2020, Lei et al. proposed a coupled-discriminated generative adversarial network (CDGAN) [50] to solve the discrimination-ambiguity problem for the low-frequency regions of remote sensing images.
Although the above-mentioned methods achieve good performance, their results can be further improved. First, the distributions of remote sensing images are very complex; therefore, more high-frequency details and texture are needed to better reconstruct HR images. Secondly, redundant feature information is not beneficial for recovering details and increases the computational cost. We therefore propose a dense channel attention network (DCAN), which learns multi-level feature information and pays more attention to important and useful regions in order to better reconstruct the final image. The major contributions are as follows:
(1) We propose a DCAN for SR of single remote sensing images, which makes full use of the features learned at different depths by densely using multi-level feature information and paying more attention to high-frequency regions. Both quantitative and qualitative evaluations demonstrate the superiority of DCAN over state-of-the-art methods.
(2) A dense channel attention mechanism (DCAM) is proposed to utilize the channel attention blocks in a dense skip connection manner. This mechanism increases the flow of information through the network and improves its representation capacity.
(3) A spatial attention block (SAB) is added to the network. It gives the network more flexible discriminative ability for different local regions and for the global structure, helping it focus on high-frequency information in the spatial dimension, which contributes to reconstructing the final image.

Network Architecture
The architecture of our proposed DCAN is illustrated in Figure 1. It consists of three parts: shallow feature extraction, deep feature extraction, and reconstruction. The network first extracts shallow features from the input LR image. Then, the second part extracts deep features and increases the weights of important feature maps. Finally, the features, which contain abundant useful information, are sent to the third part of the network for reconstruction. The network details are introduced below.

(1) Shallow Feature Extraction: Let Conv(k, f, c) be a convolutional layer, where k, f, c represent the filter kernel size, the number of filters, and the number of filter channels, respectively. We use a 3 × 3 convolutional layer to extract the shallow feature F_0 from the input image I_LR, which contributes to the subsequent feature extraction. The shallow feature extraction operation f_SF(·) can be formulated as follows:

F_0 = f_SF(I_LR)

(2) Deep Feature Extraction: After the shallow feature extraction, the backbone, which contains a series of dense channel attention blocks (DCABs) and a spatial attention block (SAB), is designed to extract deep features. The main block, the DCAB, receives the feature maps of each preceding DCAB as input; its structure will be given in Section 2.2. Then, the feature F_G, which is the sum of the output features of the G DCABs, is sent to a convolutional layer:

F_GF = f_conv(F_G)

where F_GF is generated by the convolution operation f_conv and G denotes the number of DCABs. Then, F_GF is sent to an SAB as follows:

F_S = H_SAB(F_GF)

where H_SAB denotes the operation of the SAB, whose details will be described in Section 2.3.

(3) Reconstruction: The reconstruction part contains two convolutional layers and a deconvolution layer. The SR image I_SR is generated as follows:

I_SR = f_conv(H_up(f_conv(F_S)))

where H_up denotes the operation of the deconvolution layer.
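As a rough illustration of this three-part layout, the following PyTorch sketch wires together shallow extraction, a chain of placeholder blocks standing in for the DCABs, a fusion convolution, and a deconvolution-based reconstruction head. All class and attribute names here are hypothetical, and the block internals are deliberately simplified (the dense connections and attention are omitted); this is a layout sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DCANSketch(nn.Module):
    """Layout sketch of the three parts described above (names assumed)."""
    def __init__(self, n_c=3, n_f=64, n_blocks=10, scale=4):
        super().__init__()
        # (1) shallow feature extraction: a single 3x3 convolution
        self.shallow = nn.Conv2d(n_c, n_f, 3, padding=1)
        # (2) deep feature extraction: plain convs stand in for the DCABs/SAB
        self.blocks = nn.ModuleList(
            [nn.Conv2d(n_f, n_f, 3, padding=1) for _ in range(n_blocks)])
        self.fuse = nn.Conv2d(n_f, n_f, 3, padding=1)
        # (3) reconstruction: deconvolution (kernel = stride = scale) then conv
        self.up = nn.ConvTranspose2d(n_f, n_f, scale, stride=scale)
        self.out = nn.Conv2d(n_f, n_c, 3, padding=1)

    def forward(self, x):
        f = self.shallow(x)
        for block in self.blocks:
            f = torch.relu(block(f))
        f = self.fuse(f)
        return self.out(self.up(f))     # upsampled by the scale factor
```

With kernel size equal to the stride, the transposed convolution enlarges an H × W feature map exactly to (scale·H) × (scale·W), which is what the reconstruction stage needs.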

Dense Channel Attention Mechanism
As discussed in Section 1, most existing super-resolution models do not make full use of the information from the input LR image. We therefore propose a novel dense channel attention mechanism to solve this problem. It concentrates on high-frequency information and weakens useless information. Figure 2 shows the DCAM, in which the G-th DCAB computes the feature map F_G from the outputs of DCAB_{G−1}, DCAB_{G−2}, ..., DCAB_1 as follows:

F_G = H_DCAB_G([F_{G−1}, F_{G−2}, ..., F_1])

where H_DCAB_G denotes the operation of the G-th DCAB and F_{G−1}, F_{G−2}, ..., F_1 denote the outputs of DCAB_{G−1}, DCAB_{G−2}, ..., DCAB_1. The purpose of the DCAM is to focus on high-frequency components and make better use of the information.

As shown in Figure 2, the DCAB is the basic block of the proposed DCAM. It mainly contains two convolutional layers and a CA (channel attention) block. To be specific, in Figure 3, the first convolutional layer, before the ReLU, consists of n_f filters of size n_c × 3 × 3. The second convolutional layer, after the ReLU, contains n_f filters of size n_f × 3 × 3. Let F_{i−1} be the input of DCAB_i (the i-th DCAB); then

X_i = W_(3,n_f,n_f) ∗ σ(W_(3,n_f,n_c) ∗ F_{i−1})

where X_i is an intermediate feature containing n_f feature maps, W_(3,n_f,n_f) and W_(3,n_f,n_c) denote the weight matrices of n_f filters of size n_f × 3 × 3 and n_f filters of size n_c × 3 × 3, respectively, and σ(x) = max(0, x) indicates the ReLU activation. Then, CA is used to improve the discriminative learning ability. As shown in Figure 4, the mechanism of CA can be formulated as follows:

X'_i^j = S_i^j ⊗ X_i^j,  j = 1, ..., n_f

where X'_i is the output of the CA block, X_i^1, ..., X_i^{n_f} are the feature maps of X_i, S_i^1, ..., S_i^{n_f} are the elements of S_i, and ⊗ indicates the element-wise product. S_i is an n_f-dimensional channel statistical descriptor, which is used to update X_i. S_i is obtained by

S_i = f_sigmoid(f_cu(σ(f_cd(f_gap(X_i)))))

which applies several operations to X_i: global average pooling f_gap(·), a channel-down convolution f_cd(·) and a channel-up convolution f_cu(·) with reduction ratio r (r is set to 16), and the sigmoid function f_sigmoid(·).
The channel statistical descriptor helps express the different information among the feature maps of X_i. As shown in Figure 3, the input of the i-th DCAB comes from the outputs of the 1st, ..., (i−1)-th DCABs. In general, the complete operation of the DCAB can be formulated as follows:

F_i = H_CA(W_(3,n_f,n_f) ∗ σ(W_(3,n_f,n_c) ∗ [F_{i−1}, ..., F_1]))

where H_CA denotes the operation of the CA block. This operation increases the flow of information through the network and the representation capacity of the network.
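To make the block structure concrete, here is a minimal PyTorch sketch of a CA block and a DCAB. The class names, the `n_in` parameter (the channel count of the concatenated inputs), and the squeeze-and-excitation-style layout of the descriptor S_i are our assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CA block sketch: global average pooling, channel-down / channel-up
    1x1 convs with reduction ratio r, and a sigmoid gate (descriptor S_i)."""
    def __init__(self, n_f=64, r=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),       # f_gap: one statistic per channel
            nn.Conv2d(n_f, n_f // r, 1),   # channel-down conv
            nn.ReLU(inplace=True),
            nn.Conv2d(n_f // r, n_f, 1),   # channel-up conv
            nn.Sigmoid())                  # f_sigmoid -> S_i in (0, 1)

    def forward(self, x):
        return x * self.gate(x)            # element-wise product S_i * X_i

class DCAB(nn.Module):
    """DCAB sketch: concatenate outputs of all earlier DCABs, apply two
    3x3 convs with a ReLU between them, then channel attention."""
    def __init__(self, n_in, n_f=64):
        super().__init__()
        self.conv1 = nn.Conv2d(n_in, n_f, 3, padding=1)
        self.conv2 = nn.Conv2d(n_f, n_f, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.ca = ChannelAttention(n_f)

    def forward(self, prev_feats):         # list [F_1, ..., F_{i-1}]
        x = torch.cat(prev_feats, dim=1)   # dense skip connections
        return self.ca(self.conv2(self.relu(self.conv1(x))))
```

Because each DCAB consumes the concatenation of all earlier outputs, `n_in` grows with the block index; a 1×1 bottleneck convolution is one common way to keep the channel count bounded.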

Spatial Attention Block
Considering the complicated spatial information and distribution of remote sensing images, we add an SAB to increase the discriminative ability of the network. It helps the network discriminate between different local regions and pay more attention to the regions that are more important and more difficult to reconstruct. As shown in Figure 5, the operation of the SAB can be formulated as follows:

F_SAB = F_input ⊗ f_sigmoid(f_conv(f_concat(f_Avgpooling(F_input), f_Maxpooling(F_input))))

where F_SAB is obtained by several operations: average pooling f_Avgpooling(·), max pooling f_Maxpooling(·), concatenation f_concat(·), a channel-down convolution f_conv(·) = Conv(1, 1, 2), the sigmoid function f_sigmoid(·), and the element-wise product ⊗. The SAB focuses on the local regions of F_input that are useful for reconstruction.
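A minimal PyTorch sketch of this operation follows, assuming (as the Conv(1, 1, 2) notation suggests) that the average and max pooling act across the channel dimension, producing two single-channel maps that are concatenated and reduced to one attention map. The class name is ours.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """SAB sketch: channel-wise average and max pooling are concatenated,
    reduced to one map by a Conv(1, 1, 2) layer, passed through a sigmoid,
    and multiplied element-wise with the input features."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)  # channel-down f_conv

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                 # f_Avgpooling
        mx = x.max(dim=1, keepdim=True).values            # f_Maxpooling
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask          # emphasize informative spatial regions
```

Since the sigmoid mask lies in (0, 1), the block can only attenuate features region by region; it reweights the spatial layout without changing the channel count.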

Loss Function
We use the L1 loss as the total loss because the L1 loss supports better convergence. The L1 loss can be described as follows:

L(θ) = (1/n) Σ_{i=1}^{n} ‖ H_DCAN(I_LR^i; θ) − I_HR^i ‖_1

where θ represents the whole set of parameters of the DCAN network and n represents the number of training images. The purpose of the L1 loss function is to make the reconstructed HR image similar to its corresponding ground-truth image I_HR^i.
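In PyTorch terms, this objective is simply the mean absolute error between the super-resolved batch and the ground-truth batch, which `torch.nn.L1Loss` also implements:

```python
import torch

def l1_loss(sr, hr):
    """Mean absolute error between super-resolved and ground-truth images."""
    return (sr - hr).abs().mean()
```

In a training loop this would be called on the network output and the HR target, e.g. `loss = l1_loss(model(lr_batch), hr_batch)`, before `loss.backward()`.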

Experimental Settings
The relevant experimental settings on the experimental datasets, degradation method, and evaluation metrics are detailed in this section.
(1) Datasets: We use the UC Merced dataset and the RSSCN7 dataset for qualitative and quantitative analysis. The UC Merced dataset [51] is a classification dataset of remote sensing images containing 21 land-use classes, and each class contains 100 images of size 256 × 256 with RGB channels. Figure 6 shows some examples. The UC Merced dataset is mainly used as the experimental data and is divided into two sections. The first section includes images of all 21 classes, such as agricultural, baseball diamond, beach, buildings and so on; for each class, 90 images are taken to create the training set. The second section includes the remaining ten images of each class, which form the test set. This allows us to validate the performance of our model for each class. The results are discussed in Section 3.2.
In addition, the other dataset, RSSCN7 [52], is also used to train our method and verify its effectiveness. The RSSCN7 dataset is a classification dataset of remote sensing images containing 7 land-use classes, and each class contains 400 images of size 400 × 400 with RGB channels. Figure 7 shows some examples. This dataset is divided into two sections. The first section contains images of all 7 classes; for each class, 360 images are used to train the model. The second section includes the remaining 40 images of each class, which form the test set. The test results are discussed in Section 3.2.

Figure 6. Some images in the UC Merced dataset: 21 land-use classes, including buildings, agricultural, airplane, baseball diamond, beach, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis court. The images correspond to these categories, respectively.

Figure 7. Some images in the RSSCN7 dataset: 7 land-use classes, including grass, field, industry, river lake, forest, resident and parking. The images correspond to these categories, respectively.
To validate the robustness of the proposed method, real-world data from GaoFen-2 in Ningxia, China, are used to test our model in Section 3.4. Experiments are designed with the ×4 and ×8 scale factors, respectively. We obtain LR images by downsampling HR images using the bicubic operation.
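The degradation step can be sketched with `torch.nn.functional.interpolate` as below. Note that PyTorch's bicubic mode (without antialiasing) only approximates the bicubic resampling of tools such as MATLAB or PIL, so results may differ slightly depending on which implementation the authors actually used.

```python
import torch
import torch.nn.functional as F

def degrade(hr, scale=4):
    """Produce an LR input by bicubic downsampling of an HR batch (N, C, H, W)."""
    return F.interpolate(hr, scale_factor=1.0 / scale, mode="bicubic",
                         align_corners=False)
```

For example, a 256 × 256 UC Merced image becomes a 64 × 64 LR input at ×4 and a 32 × 32 input at ×8.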
All parameter settings are the same for the UC Merced and RSSCN7 datasets. Training is performed on the three channels of the RGB space. The channel number of the input image n_c is 3, and the filter size k is set to 3. The Adam [53] optimizer with β_1 = 0.9 and β_2 = 0.999 is used to train the proposed models with a batch size of 16. The weights of the model are initialized using the method in [54]. We initialize the learning rate to 10^−4 and halve it every 2 × 10^5 batch updates. We implement the proposed algorithm in the PyTorch [55] framework and train the DCAN model on one NVIDIA Tesla V100 GPU.
(2) Evaluation Metrics: Two widely used image quality assessment metrics, the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM), are used to evaluate the quality of the reconstructed HR images. Given the ground-truth HR image x and its corresponding super-resolved image y, the PSNR of y is computed as follows:

PSNR(x, y) = 10 log_10 (255^2 / MSE(x, y)),  MSE(x, y) = (1/n_s) Σ_{i=1}^{n_s} (x_i − y_i)^2

where x_i and y_i denote the i-th pixel values in x and y, respectively, and n_s represents the number of image pixels. A higher PSNR indicates that the quality of the SR image is better.
In addition to PSNR, we also use SSIM as an image quality assessment; the SSIM of the super-resolved image y is defined as follows:

SSIM(x, y) = ((2 μ_x μ_y + C_1)(2 σ_xy + C_2)) / ((μ_x^2 + μ_y^2 + C_1)(σ_x^2 + σ_y^2 + C_2))

where μ_x and μ_y denote the mean values of x and y, σ_x and σ_y denote the standard deviations of x and y, σ_xy denotes the covariance of x and y, and C_1 and C_2 are constants. A higher SSIM indicates that the quality of the SR image is better.

Tables 1 and 2 show the PSNR and SSIM of the reconstructed HR images of the UC Merced test set at ×4 and ×8 enlargement, respectively. Our proposed method is better than the other methods in the quantitative evaluation. Tables 3 and 4 show the PSNR and SSIM of the reconstructed HR images of the RSSCN7 test set at ×4 and ×8 enlargement, respectively. The experimental results demonstrate the effectiveness of our method.
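The two metrics above can be sketched in NumPy as follows. For simplicity, the SSIM here is the global (single-statistics) form of the formula; practical evaluations usually compute it over local windows (e.g., an 11 × 11 Gaussian window) and average the resulting map, so values from standard toolboxes will differ.

```python
import numpy as np

def psnr(x, y):
    """PSNR for 8-bit images, following the formula above."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Global SSIM with the usual constants C1, C2 for 8-bit images."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()   # covariance sigma_xy
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Both metrics reach their best values (infinite PSNR, SSIM of 1) when the super-resolved image equals the ground truth.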

Model Analysis
In this section, we use three groups of experiments on the UC Merced dataset to analyze the proposed method, covering the number of DCABs, the dense channel attention mechanism, and the SAB. We use PSNR as the quantitative evaluation metric.
(1) DCAB: A small DCAN containing four DCABs was used to study the effect of n_f. The experimental results are presented in Table 5. We use three widely used values: 32, 64 and 128. When n_f = 64, the average PSNR on the test dataset is 0.27 dB higher than with n_f = 32 and 0.16 dB higher than with n_f = 128. So, we set n_f = 64 in the rest of the experiments.
(2) Number of DCABs: The results are presented in Table 6. We set the number of DCABs to 4, 8 and 10, respectively. As shown in the table, the PSNR with ten DCABs is higher than with 4 or 8.
(3) SAB: The results are presented in Table 7. It can be seen that the PSNR of DCAN-s is higher than that of DCAN, which demonstrates that the SAB improves the network performance. In addition, we explore the relationship between PSNR and the number of training epochs. As shown in Figure 16, the PSNR begins to converge at around 350 epochs with a scale factor of ×4. As shown in Figure 17, it begins to converge at around 50 epochs with a scale factor of ×8. We conclude that our model easily achieves high performance.

Super-Resolving the Real-World Data
In this section, we use real data to validate the robustness of our proposed method. The model is trained on the UC Merced dataset and tested on remote sensing images from GaoFen-2 in Ningxia, China, as the real-world data. The size and band number of the real-world images are 400 × 400 and 3, respectively. Figures 18 and 19 show one LR image at ×4 and ×8 enlargement, respectively. The reconstructed results show that the DCAN method achieves good results in terms of visual quality.

Discussion
The experimental results prove that the proposed DCAN method performs well. In Section 3.2, our method outperforms several state-of-the-art methods in both quantitative evaluation and visual quality. In Section 3.3, the increase in PSNR after adding the DCAB and SAB demonstrates the effectiveness of our approach. In Section 3.4, we process real satellite images from GaoFen-2 with the DCAN model and obtain satisfactory SR results. Below, we discuss the experimental results in combination with theoretical analysis.
(1) Effect of the Dense Channel Attention Block: As shown in Table 6, the PSNR of super-resolved images increases after we add DCABs. When the number of DCABs increases from 4 to 10, the performance of the network improves. This proves that appropriately increasing the number of DCABs improves the capacity of the network.
(2) Effect of the Spatial Attention Block: As shown in Table 7, the PSNR value increases from 28.63 to 28.70 after we add the SAB. Thus, we conclude that the SAB improves the performance of the network and gives it more flexible discriminative ability for the global structure and a stronger focus on high-frequency information in the spatial dimension.
(3) Effect of the Scale Factor: As shown in Tables 1 and 2, as the scale factor increases, the improvement of our proposed DCAN method decreases. This indicates that large-scale super resolution is still a hard problem.

Conclusions
This article develops the DCAN network, which achieves good performance in super-resolving remote sensing images with complicated spatial distributions. Specifically, we design a network that densely uses multi-level feature information and strengthens the effective information. In addition, we propose a dense channel attention mechanism that makes better use of multi-level feature maps containing abundant high-frequency information. Further, we add a spatial attention block to pay more attention to the regions that are more important and more difficult to reconstruct. Results of extensive experiments demonstrate the superiority of our method over the compared algorithms.