Deep Residual Dual-Attention Network for Super-Resolution Reconstruction of Remote Sensing Images

Abstract: Super-resolution (SR) reconstruction of remote sensing images is becoming a highly active area of research. With increasing upscaling factors, richer and more abundant details can progressively be obtained. However, in comparison with natural images, the complex spatial distribution of remote sensing data increases the difficulty of its reconstruction. Furthermore, most SR reconstruction methods suffer from low feature-information utilization and from processing all spatial regions of an image equally. To improve the performance of SR reconstruction of remote sensing images, this paper proposes a deep convolutional neural network (DCNN)-based approach, named the deep residual dual-attention network (DRDAN), which achieves the fusion of global and local information. Specifically, we have developed a residual dual-attention block (RDAB) as a building block in DRDAN. In the RDAB, we firstly use the local multi-level fusion module to fully extract and deeply fuse the features of the different convolution layers. This module can facilitate the flow of information in the network. After this, a dual-attention mechanism (DAM), which includes both a channel attention mechanism and a spatial attention mechanism, enables the network to adaptively allocate more attention to regions carrying high-frequency information. Extensive experiments indicate that the DRDAN outperforms other comparable DCNN-based approaches in both objective evaluation indexes and subjective visual quality.


Introduction
With the rapid progress and development of modern aerospace technology, remote sensing images have been widely used in military and civil fields, including agriculture and forestry inspection, military reconnaissance, and urban planning. However, due to hardware limitations and the large detection distance, there is still room for improvement in the resolution and clarity of remote sensing images. Considering the high research cost and long hardware iteration development cycle required to physically improve imaging sensors, it is increasingly important to improve the algorithms used for super-resolution (SR) reconstruction [1] of remote sensing images.
Single-image super-resolution (SISR) technology aims to reconstruct a high-resolution (HR) image from a corresponding low-resolution (LR) image. For aerial remote sensing images, SISR technology can provide richer spatial details by increasing the resolution of LR images. In the past few decades, numerous SISR approaches based on machine learning have been proposed, including methods based on neighbor embedding [2], sparse representation [3,4], and local linear regression [5,6]. However, most of these methods use the low-level features of images for SR reconstruction, and the limited representational ability of these features greatly restricts the achievable reconstruction quality.
With the rapid progress and development of big data and graphics processing unit (GPU) computing capacity, deep convolutional neural networks (DCNNs) have become the dominant approach for achieving success in image processing [7][8][9]. Methods based on DCNNs have shown powerful abilities in the automatic extraction of high-level features from images. The main contributions of this paper are summarized as follows:
(1) We propose a novel approach to SR reconstruction of remote sensing images, DRDAN. This achieves a convenient and effective end-to-end training strategy.
(2) DRDAN achieves the fusion of global and local residual information, which facilitates the propagation and utilization of image features, providing more feature information for the final reconstruction.
(3) We propose a modified residual block named RDAB, which contains a local multi-level fusion (LMLF) module and a dual-attention mechanism (DAM) module. The LMLF module fuses different-level features with the input in the current RDAB. In the DAM module, the channel attention mechanism (CAM) submodule exploits the interdependencies among feature channels and adaptively obtains the weighting information of the different channels; the spatial attention mechanism (SAM) submodule pays attention to the areas carrying high-frequency information and encodes which regions to emphasize or suppress; and a local residual learning (LRL) strategy is used to alleviate the model-degradation problem caused by the deepening of the network, which improves the learning ability.
(4) Through comparative experiments on remote sensing datasets, it is clear that, compared with other SISR algorithms, DRDAN shows better performance, both numerically and qualitatively.
The remainder of this paper is organized as follows: Section 2 presents a detailed description of DRDAN, Section 3 verifies its effectiveness by experimental comparisons, and Section 4 draws some conclusions.

Methodology
In this section, we describe the overall architecture and specific details of our proposed DRDAN, including the internal network structure and its mathematical expressions. In particular, each component of the RDAB is illustrated in detail. We then give the loss function that is optimized during training. Throughout, we use I_LR and I_HR to represent an LR image and an HR image, respectively, and we define I_SR as the output of our DRDAN.

Network Architecture
As shown in Figure 1, the DRDAN includes two branches: the GRL branch and the MRN branch. In the GRL branch, we apply bicubic interpolation [25] to make our network learn global residuals, inspired by information distillation networks [26]. This process can be formulated as:

I_bicubic = H_bicubic(I_LR),

where H_bicubic(·) denotes the upsampling operator using bicubic interpolation and I_bicubic denotes the image output by the interpolation upsampling operation. The MRN branch consists of four main parts: shallow feature extraction, deep feature extraction, upsampling, and reconstruction. As in EDSR, we extract the shallow features F_0 from the LR input by adopting only one convolutional layer:

F_0 = H_SF(I_LR),

where H_SF(·) denotes a convolutional layer with a kernel size of 3 × 3. The resulting F_0 is then used as the input of the deep feature-extraction part, which is built from our RDABs. Supposing there are N RDABs, the output feature maps F_b,n of the n-th RDAB can be calculated by:

F_b,n = H_RDAB,n(F_b,n−1),

where H_RDAB,n(·) denotes the operation of the n-th RDAB, and F_b,n−1 and F_b,n are the input and output of the n-th RDAB, respectively. The H_RDAB,n(·) operation enables the network to pay more attention to useful features and suppress useless features, and thus the network can be deepened effectively. The output F_b,n is then used as the input of the next part.
After obtaining the deep features of the LR images, we apply an upsampling operation to enlarge the LR feature maps to HR feature maps. Previous methods such as EDSR and RCAN have shown that a pixel shuffle [12] operation has lower computational complexity and higher reconstruction performance than bicubic interpolation. Considering this, we use a pixel shuffle operation as our upsampling part, which can be expressed as:

F_up = H_PixelShuffle[H_A(F_b,N)],

where H_A(·) denotes a convolutional layer with a kernel size of 3 × 3, H_PixelShuffle[·] denotes the upsampling operation by pixel shuffle, F_b,N is the output of the last RDAB, and F_up is the upscaled feature maps.
To guarantee that the outputs of the MRN branch and the interpolation upsampling branch have the same number of channels, the upscaled features are then reconstructed via:

I_res = H_rec(F_up),

where H_rec(·) denotes a convolution operation with three output channels and a kernel size of 3 × 3, and I_res denotes the output of the MRN branch. Finally, the output of DRDAN, I_SR, is estimated by combining the residual image I_res with the interpolated image I_bicubic using an element-wise summation, which can be formulated as:

I_SR = I_res + I_bicubic.
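The two-branch data flow above can be sketched in PyTorch. This is a minimal illustrative skeleton, not the authors' released code: the class name is ours, and the RDABs are stubbed with plain residual convolutions so that only the branch structure (bicubic GRL branch, shallow features, stacked blocks, pixel shuffle, reconstruction, element-wise sum) is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRDANSketch(nn.Module):
    """Skeleton of the two-branch DRDAN forward pass (RDAB internals stubbed)."""
    def __init__(self, n_blocks=20, channels=64, scale=2):
        super().__init__()
        self.scale = scale
        self.head = nn.Conv2d(3, channels, 3, padding=1)          # H_SF: shallow features
        self.body = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(n_blocks)]
        )                                                         # stand-ins for the RDABs
        self.up = nn.Sequential(                                  # H_A conv + pixel shuffle
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.rec = nn.Conv2d(channels, 3, 3, padding=1)           # H_rec: 3-channel output

    def forward(self, lr):
        # GRL branch: bicubic upsampling of the LR input (I_bicubic)
        i_bicubic = F.interpolate(lr, scale_factor=self.scale,
                                  mode="bicubic", align_corners=False)
        # MRN branch: shallow features -> stacked blocks -> upsample -> reconstruct
        f = self.head(lr)
        for block in self.body:
            f = f + F.relu(block(f))          # local residual stand-in for one RDAB
        i_res = self.rec(self.up(f))          # I_res
        return i_bicubic + i_res              # I_SR = I_res + I_bicubic
```

With the ×2 settings from the paper, a 48 × 48 LR patch produces a 96 × 96 output.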

Residual Dual-Attention Block
In this section, we describe the overall structure of an RDAB. Residual learning strategies can be roughly divided into two types, namely global and local residual learning. Global residual learning only learns the residuals between the input and the output; it thus avoids learning the complex transformation from one complete image to another, and this effectively reduces the difficulty of model training. As noted in Section 1, VDSR is a classical SISR network based on GRL. Local residual learning means that residual learning is used in stacked convolution layers, and this helps to retain a large amount of image detail information. As shown in Figure 2, we compared our RDAB with some existing residual blocks. Figure 2a-c shows the structures of the residual block (RB) in EDSR, the residual channel attention block (RCAB) in RCAN, and the RDAB in DRDAN, respectively. Our RDAB is developed using the LMLF module and the DAM module.
Remote Sens. 2021, 13, 2784

The feature-extraction module plays an important role in image SR. Extracting and fusing local features with different perceptual scales can obtain more contextual information. Drawing on this idea, we propose an LMLF module to learn more diverse feature information and detailed textures in the RDABs, so that the network can learn richer details and enhance feature utilization. The details of this module are shown in Figure 2c. We denote the input of the n-th RDAB as F_b,n−1. The LMLF module can be formulated as:

F_1 = ReLU(f_1,3(F_b,n−1)),
F_2 = ReLU(f_2,3(F_1)),
F_LMLF = f_3,1(f_concat(F_b,n−1, F_1, F_2)),

where: f_m,n[·] represents the convolution operator, in which m denotes the m-th convolutional layer and n denotes the size of the filters; f_concat(·) represents a concatenation operator; ReLU(·) denotes a rectified linear unit activation function; F_1 and F_2 denote the feature maps generated by the first and second convolutional layers in the n-th RDAB, respectively; and F_LMLF is the final output of the LMLF module, which is used as the input of the DAM module.
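A minimal PyTorch sketch of this module follows. The exact layer shapes are our reading of the f_{m,n} notation (two 3 × 3 convolutions followed by a 1 × 1 fusion convolution over the concatenated features), so treat them as assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LMLF(nn.Module):
    """Local multi-level fusion: concatenate the block input with the outputs
    of two successive 3x3 convolutions, then fuse with a 1x1 convolution."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)   # f_{1,3}
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)   # f_{2,3}
        self.fuse = nn.Conv2d(3 * channels, channels, 1)           # f_{3,1} fusion

    def forward(self, x):
        f1 = F.relu(self.conv1(x))                 # F_1
        f2 = F.relu(self.conv2(f1))                # F_2
        # fuse input, level-1, and level-2 features along the channel axis
        return self.fuse(torch.cat([x, f1, f2], dim=1))            # F_LMLF
```

The output keeps the input's channel count, so the module can sit inside a residual block without reshaping.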

Dual-Attention Mechanism Module
Early CNN-based SR methods mainly focus on increasing the depth and width of the network, and the features extracted by the network are treated equally in all channels and spatial areas. Such methods lack the necessary flexibility for different feature-mapping networks, thus greatly wasting computing resources in practical engineering tasks. The attention mechanism enables the network to pay more attention to information features that are more useful to the target task and suppress useless features. In this way, computing resources can be more scientifically allocated in the feature-extraction process, and the network can be effectively deepened.
The application of attention mechanisms to SISR tasks has been explored using some network architectures such as RCAN and second-order attention network [27], and this has greatly improved the SISR effect. In this paper, we further strengthen the SISR effect by fusing a SAM and a CAM to construct a DAM, which is shown in Figure 2c.
Reference [20] shows that, in neural networks, the feature maps extracted by the convolution kernels of different channels have different abilities to recover high-frequency detail information. Considering this, we adopt the CAM, in which the representation ability can be improved by explicitly modeling the interconnection of feature channels, adaptively correcting the feature responses of channels, and discriminating between information of different levels of importance. As shown in Figure 3a, we let F_LMLF = (F_LMLF^1, ..., F_LMLF^k, ..., F_LMLF^C) denote the input feature maps with C channels. The channel feature descriptor D_channel^k of the k-th feature map F_LMLF^k is determined by global average pooling:

D_channel^k = (1 / (W × H)) Σ_{i=1..W} Σ_{j=1..H} F_LMLF^k(i, j),

where F_LMLF^k(i, j) denotes the value at position (i, j) of F_LMLF^k, and W and H denote the width and height of the feature map, respectively. By computing the global average pooling, we obtain C channel feature descriptors D_channel = (D_channel^1, ..., D_channel^k, ..., D_channel^C) corresponding to F_LMLF = (F_LMLF^1, ..., F_LMLF^k, ..., F_LMLF^C), respectively. The descriptor D_channel describes the importance of the different channel features. After this aggregation by global average pooling, we introduce a two-layer perceptron network to fully explore the channel-wise correlation dependencies.
The calculation process of the perceptron network is:

A_channel = σ(W_1 ReLU(W_0 D_channel)),

where: W_0 denotes the weight matrix of a convolution layer that downscales the channel dimension by ratio r; W_1 denotes the weight matrix of a convolution layer that upscales the channel dimension by ratio r; σ[·] and ReLU(·) denote the sigmoid function and ReLU function, respectively; and the output A_channel = (A_channel^1, ..., A_channel^k, ..., A_channel^C), in which A_channel^k denotes a real value representing the weight of the k-th channel of F_LMLF. The final output of the CAM is F_CA = (F_CA^1, ..., F_CA^k, ..., F_CA^C), where F_CA^k is calculated as:

F_CA^k = A_channel^k ⊗ F_LMLF^k,

where ⊗ denotes element-wise multiplication and F_CA denotes the feature maps with channel attention.
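The CAM described above follows the familiar squeeze-and-excitation pattern, so it can be sketched in a few lines of PyTorch. The class name and the use of 1 × 1 convolutions for W_0 and W_1 are our assumptions; the pooling, reduction ratio r, and sigmoid gating follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAM(nn.Module):
    """Channel attention: global average pooling -> 1x1 conv (downscale by r)
    -> ReLU -> 1x1 conv (upscale) -> sigmoid -> channel-wise rescaling."""
    def __init__(self, channels=64, r=16):
        super().__init__()
        self.down = nn.Conv2d(channels, channels // r, 1)   # W_0
        self.up = nn.Conv2d(channels // r, channels, 1)     # W_1

    def forward(self, x):
        d = F.adaptive_avg_pool2d(x, 1)                     # D_channel, shape (B, C, 1, 1)
        a = torch.sigmoid(self.up(F.relu(self.down(d))))    # A_channel in (0, 1)
        return x * a                                        # F_CA: broadcast per channel
```

Because the attention weights have shape (B, C, 1, 1), the final multiplication rescales each channel uniformly across all spatial positions.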
The LR images contain a great deal of low-frequency information and a small amount of high-frequency information. The low-frequency information is generally located in smooth areas, which are easy to recover. The high-frequency information, such as edges and contours, is hard to recover. As discussed in [20], different spatial locations carry different texture detail information. However, existing CNN-based methods usually assign the same weight to all spatial locations, which tends to weaken the importance of high-frequency information. Therefore, this work builds a SAM, which emphasizes the attention paid to high-frequency information areas, thus obtaining a better SR reconstruction effect. As shown in Figure 3b, along the channel axis of F_CA, we generate two 2D spatial feature descriptors, D_spatial,avg and D_spatial,max. These are calculated as:

D_spatial,avg(i, j) = (1 / C) Σ_{k=1..C} F_CA^k(i, j),
D_spatial,max(i, j) = max_{k=1..C} F_CA^k(i, j),

where: D_spatial,avg(i, j) and D_spatial,max(i, j) denote the average- and maximum-pooled spatial descriptors at position (i, j), respectively; F_CA^k(i, j) denotes the value at position (i, j) of the k-th feature F_CA^k in F_CA; and C is the number of channels in the feature map. The concatenated spatial feature descriptor D_spatial is then calculated as:

D_spatial = f_concat(D_spatial,avg, D_spatial,max),

where f_concat(·) denotes the concatenation operator. The concatenated feature maps D_spatial are convolved by a standard convolution layer, producing the spatial attention map A_spatial, which can be formulated as:

A_spatial = σ(W_2 * D_spatial),

where σ[·] denotes the sigmoid function and W_2 denotes the weight matrix of a convolution layer that compresses the number of channels of the spatial features to one. The output A_spatial ∈ R^{H×W} has H × W positions, and A_spatial(i, j) represents the weight of the feature value at position (i, j) of F_CA. The final output of the SAM is F_SA = (F_SA^1, ..., F_SA^k, ..., F_SA^C), where F_SA^k is calculated as:

F_SA^k = A_spatial ⊗ F_CA^k,

where ⊗ denotes element-wise multiplication and F_SA denotes the feature maps with spatial attention.
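A corresponding PyTorch sketch of the SAM is below. The text does not specify the kernel size of W_2, so the 7 × 7 kernel here is an assumption (a common choice in spatial-attention designs); everything else mirrors the formulas above.

```python
import torch
import torch.nn as nn

class SAM(nn.Module):
    """Spatial attention: average- and max-pool along the channel axis,
    concatenate the two 2D descriptors, convolve down to one channel,
    apply sigmoid, and rescale every spatial position."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)  # W_2

    def forward(self, x):
        d_avg = x.mean(dim=1, keepdim=True)            # D_spatial,avg: (B, 1, H, W)
        d_max = x.max(dim=1, keepdim=True).values      # D_spatial,max: (B, 1, H, W)
        d = torch.cat([d_avg, d_max], dim=1)           # D_spatial: (B, 2, H, W)
        a = torch.sigmoid(self.conv(d))                # A_spatial: (B, 1, H, W)
        return x * a                                   # F_SA: broadcast over channels
```

In contrast to the CAM, the attention map here has shape (B, 1, H, W), so every channel shares the same per-position weight.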
Local residual learning alleviates the model-degradation problem of deep networks and improves the learning ability. Furthermore, it also makes the main part of the network pay more attention to the high-frequency information of the LR features. Additionally, the short skip connection can propagate features more naturally from the early layers to the later layers, which enables better prediction of the pixel density values. To promote the effective delivery of feature information and enhance feature utilization, the final output of the RDAB is formulated as:

F_b,n = F_b,n−1 + F_SA.

For a more intuitive understanding of the RDAB, Table 1 shows the network parameter settings of the RDAB, in which: H and W denote the height and width of the feature maps, respectively; Conv3×3 and Conv1×1 denote convolution layers with kernel sizes of 3 × 3 and 1 × 1, respectively; ReLU denotes the rectified linear unit; Sigmoid denotes the sigmoid activation function; AvgPool denotes the global average pooling layer; Mean and Max denote the mean and maximum operations of each point on the feature maps in the channel dimension, respectively; and Multiple and Sum denote the pixel-by-pixel multiplication and addition operations of the feature map, respectively. It should be noted that C is set to 64 in line with EDSR, and the reduction ratio r is set to 16 in line with RCAN; thus, the channel-downscaling convolution layer has four filters.
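Pulling the pieces together, one RDAB can be sketched end to end. Layer widths follow Table 1 (C = 64, r = 16, so the channel-downscaling convolution has four filters); the 1 × 1 LMLF fusion and the 7 × 7 SAM kernel are our assumptions, as in the module sketches above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RDAB(nn.Module):
    """One residual dual-attention block: LMLF -> CAM -> SAM -> local residual."""
    def __init__(self, c=64, r=16):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1)     # f_{1,3}
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)     # f_{2,3}
        self.fuse = nn.Conv2d(3 * c, c, 1)             # f_{3,1} LMLF fusion
        self.ca_down = nn.Conv2d(c, c // r, 1)         # W_0 (4 filters for c=64, r=16)
        self.ca_up = nn.Conv2d(c // r, c, 1)           # W_1
        self.sa_conv = nn.Conv2d(2, 1, 7, padding=3)   # W_2 (7x7 kernel assumed)

    def forward(self, x):
        # LMLF: multi-level feature fusion
        f1 = F.relu(self.conv1(x))
        f2 = F.relu(self.conv2(f1))
        f = self.fuse(torch.cat([x, f1, f2], dim=1))               # F_LMLF
        # CAM: channel-wise rescaling
        w = torch.sigmoid(self.ca_up(F.relu(self.ca_down(
            F.adaptive_avg_pool2d(f, 1)))))
        f = f * w                                                  # F_CA
        # SAM: spatial rescaling
        d = torch.cat([f.mean(1, keepdim=True),
                       f.max(1, keepdim=True).values], dim=1)
        f = f * torch.sigmoid(self.sa_conv(d))                     # F_SA
        # local residual learning
        return x + f                                               # F_b,n
```

Because input and output shapes match, N such blocks can be stacked directly in the deep feature-extraction part.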

Table 1. Network parameter settings of the RDAB. Columns: Structure, Component, Layer, Input, Output.

Loss Function
The most widely used loss functions in the field of SR image reconstruction are the L1 and L2 loss functions. The L1 loss function can prevent image distortion and obtain higher test metrics. Following EDSR, we employ an L1 loss function in our network. We suppose that the given training dataset is {I_LR^i, I_HR^i}_{i=1..N}, where N denotes the number of training samples. The loss function minimized during network optimization is then expressed as:

L(Θ) = (1/N) Σ_{i=1..N} || H_DRDAN(I_LR^i) − I_HR^i ||_1,

where H_DRDAN(·) denotes the SR result from the DRDAN network, and Θ = {W_i, b_i} denotes the DRDAN parameter set.
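In PyTorch this objective is simply the built-in L1 loss (mean absolute error) between the SR output and the HR target, shown here on toy values rather than real images:

```python
import torch
import torch.nn as nn

# L1 objective: mean absolute error between SR output and HR target.
l1 = nn.L1Loss()  # default reduction="mean" averages over all elements

sr = torch.tensor([[0.2, 0.8], [0.5, 0.1]])   # toy "SR" values
hr = torch.tensor([[0.0, 1.0], [0.5, 0.3]])   # toy "HR" targets
loss = l1(sr, hr)  # = (0.2 + 0.2 + 0.0 + 0.2) / 4 = 0.15
```

During training, `loss.backward()` on this scalar drives the parameter updates for Θ.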

Experiments and Results
In this section, we report experiments using remote sensing datasets to evaluate the performance of DRDAN.

Dataset Settings
To verify the effectiveness and robustness of our proposed DRDAN, we used 10,000 images from the Aerial Image Dataset (AID) [28] to construct an experimental training dataset. To fully utilize the dataset, the training dataset was augmented via three image-processing methods.

Evaluation Metrics for SR
We adopted the peak signal-to-noise ratio (PSNR) [31] and the structural similarity (SSIM) [31] as objective evaluation indexes to measure the quality of SR image reconstruction. The PSNR is one of the most widely used standards for evaluating image quality, and it is generally defined via the mean square error (MSE):

MSE = (1 / (W × H)) Σ_{i=1..W} Σ_{j=1..H} (X(i, j) − Y(i, j))^2.

The PSNR is then expressed as:

P_PSNR = 10 log_10(I_max^2 / MSE),

where X denotes an SR image of size W × H, Y denotes the original HR image of size W × H, and I_max denotes the maximum pixel value in the image. The unit of PSNR is dB, and larger P_PSNR values indicate lower distortion and a better SR image reconstruction effect. The SSIM is another widely used measurement index in SR image reconstruction. It is based on the luminance (l), contrast (c), and structure (s) of samples x and y:

l(x, y) = (2μ_x μ_y + c_1) / (μ_x^2 + μ_y^2 + c_1),
c(x, y) = (2σ_x σ_y + c_2) / (σ_x^2 + σ_y^2 + c_2),
s(x, y) = (σ_xy + c_3) / (σ_x σ_y + c_3),
SSIM(x, y) = [l(x, y)]^α [c(x, y)]^β [s(x, y)]^γ,

where μ_x and μ_y denote the average values of x and y, σ_x and σ_y denote the standard deviations of x and y, σ_xy represents the covariance of x and y, and c_1, c_2, and c_3 are small constants that stabilize the division. In general, α = β = γ = 1 is set. The range of SSIM is [0, 1]; the closer its value is to 1, the greater the similarity between the reconstructed image and the original image, and the higher the quality of the reconstructed image.
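Both metrics are easy to compute directly in NumPy. The sketch below uses the common stabilizing constants c_1 = (0.01 I_max)^2 and c_2 = (0.03 I_max)^2 (an assumption; the paper does not state its constants), and it evaluates SSIM over the whole image in a single window, whereas standard implementations average a sliding local window.

```python
import numpy as np

def psnr(x, y, i_max=255.0):
    """PSNR in dB from the mean square error between SR image x and HR image y."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(i_max ** 2 / mse)

def ssim_global(x, y, i_max=255.0):
    """Single-window SSIM over the whole image, illustrating the formula
    (with alpha = beta = gamma = 1 and c_3 = c_2 / 2, the l, c, s product
    collapses to the usual two-factor expression below)."""
    c1, c2 = (0.01 * i_max) ** 2, (0.03 * i_max) ** 2
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2) /
            ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))
```

For identical images SSIM is exactly 1, and a uniform error of 1 gray level gives a PSNR of 20 log_10(255) ≈ 48.13 dB.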

Experimental Details
In line with EDSR, we set the number of RDABs to 20. The input LR images were randomly cropped into patches of size 48 × 48, and the corresponding HR images, with sizes of 96 × 96, 144 × 144, and 192 × 192, were cropped according to the upscaling factors ×2, ×3, and ×4, respectively. To avoid size mismatch during training, a zero-padding method was used to ensure that the image size remained consistent during feature delivery. The parameter settings used during training are shown in Table 2. All experiments used the PyTorch deep-learning framework on the Ubuntu 18.04 operating system. Four Nvidia GTX-2080Ti GPUs were used to accelerate training. The software used included the Python programming language, CUDA 10.1, and cuDNN 7.6.1. The learning rate (LR) was initialized to 10^−4 and decreased by a factor of 10 every 500 epochs.
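The step-decay schedule above maps directly onto PyTorch's `StepLR`. The optimizer choice (Adam) and the placeholder module are our assumptions for illustration; only the initial rate, step size, and decay factor come from the text.

```python
import torch

# Step decay: start at 1e-4 and divide by 10 every 500 epochs.
model = torch.nn.Conv2d(3, 3, 3, padding=1)   # placeholder module (assumed)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=500, gamma=0.1)

for epoch in range(1000):
    # ... one training epoch (forward, L1 loss, backward, opt.step()) ...
    opt.step()      # placeholder so the scheduler sees an optimizer step
    sched.step()
# after 1000 epochs the learning rate has been decayed twice: 1e-4 -> 1e-5 -> 1e-6
```

Calling `sched.step()` once per epoch keeps the decay aligned with the 500-epoch milestones.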

Effect of RDAB
The RDAB is the core of our proposed DRDAN. To further verify the effectiveness of the internal RDAB modules, ablation experiments were implemented on the NWPU VHR-10 and COWC datasets. Tables 3 and 4 show the effects of the LMLF module, the CAM module, and the SAM module on SR reconstruction with scale factor ×2. Figure 2a shows the structure of the baseline residual block used in the ablation experiments. It can be concluded from the tables that the best performance is achieved by the model containing the LMLF module, the CAM module, and the SAM module together.
The LMLF module aggregates diverse features to enhance the feature utilization of our deep network. To demonstrate the effect of this module, we added the LMLF module to the baseline residual block. The second and third rows of Tables 3 and 4 indicate that this LMLF component achieves gains of 0.12581 dB and 0.23789 dB on the NWPU VHR-10 and COWC datasets, respectively. This is mainly because the LMLF strengthens the representation ability of the network.
The DAM consists of both a CAM and a SAM. The CAM explicitly models the interconnections of feature channels, adaptively corrects the feature response of channels, and discriminates between information of different levels of importance. The second and fourth rows of Tables 3 and 4 indicate that the CAM can achieve gains of 0.07128 dB and 0.17797 dB for the NWPU VHR-10 and COWC datasets, respectively. The SAM enhances the attention paid to high-frequency information areas. The second and the fifth rows of Tables 3 and 4 indicate that the SAM can achieve gains of 0.07769 dB and 0.15993 dB for the NWPU VHR-10 and COWC datasets, respectively. The third and the last rows of Tables 3 and 4 indicate that the greatest improvement is achieved when the CAM and SAM are applied together. These comparisons firmly demonstrate the effectiveness of the DAM.
Furthermore, the third, fourth, and sixth rows of Tables 3 and 4 indicate that 'LMLF + CAM' achieves better results than only using LMLF or CAM, respectively. The third, fifth, and seventh rows of Tables 3 and 4 indicate that 'LMLF + SAM' achieves better results than only using LMLF or SAM, respectively. In summary, the experiments show that our RDAB is structured in a rational and efficient way.

Effect of number of RDABs
The RDABs are stacked in the deep feature-extraction part to obtain better feature utilization. We configured the DRDAN with different depths and compared their performance. Specifically, numbers of RDABs ranging from 5 to 25 were used. Tables 5 and 6 show the performance with different numbers of RDABs on the NWPU VHR-10 and COWC datasets, respectively. It can be clearly observed that the performance of our DRDAN improves as the number of RDABs increases. This demonstrates that RDAB can be used as a block to train a deep SR reconstruction network.

Effect of GRL
Global residual learning makes the network avoid learning the complex transformation from one complete image to another; only the residual information needs to be learnt to recover the lost high-frequency details. We now examine the effect of the GRL branch of DRDAN. For rapid testing, we randomly selected ten images from the NWPU VHR-10 dataset to construct a new dataset named FastTest10. Figure 5 shows the performance curves for networks with and without GRL on the FastTest10 dataset over epochs 0 to 100. As can be seen from Figure 5, the DRDAN with GRL has a higher PSNR curve that rises faster, which indicates that GRL makes the network converge much faster.

Comparison with Other Approaches
To further verify the advantages and effectiveness of the proposed method, we compared DRDAN with bicubic interpolation, SRCNN, VDSR, local-global combined networks (LGCNet), the Laplacian pyramid SR network (LapSRN) [33], EDSR, and wide activation SR (WDSR) [34]. Bicubic interpolation is a representative interpolation algorithm; SRCNN applies a CNN to the image SR task; VDSR adopts residual learning to build a deep network; LGCNet combines global and local features to fully extract multi-level representations of remote sensing images; LapSRN builds a deep CNN within a Laplacian pyramid framework for accurate SR; and EDSR and WDSR are representative deep network architectures built from residual blocks. The number of convolution filters in all the methods was set to 64, and the number of residual blocks in EDSR, WDSR, and DRDAN was set to 20 to make a fair comparison. Table 7 shows the average PSNR and SSIM results of our DRDAN and the compared methods. It can be clearly observed that the proposed DRDAN always yields the best performance. On the NWPU VHR-10 dataset, the DRDAN outperformed the second-best model, WDSR, under factors of ×2, ×3, and ×4 with PSNR gains of 0.12776, 0.10049, and 0.07795 dB, respectively. On the COWC dataset, the average PSNR values obtained by the DRDAN under factors of ×2, ×3, and ×4 were 0.22263, 0.23744, and 0.14659 dB higher than those of WDSR. As for SSIM, the super-resolved results from the DRDAN obtained the highest scores. On the NWPU VHR-10 dataset, the DRDAN outperformed the second-best model, WDSR, in SSIM under factors of ×2, ×3, and ×4 by 0.0019, 0.0024, and 0.0024, respectively. On the COWC dataset, the average SSIM values obtained by the DRDAN under factors of ×2, ×3, and ×4 were 0.0018, 0.0038, and 0.0034 higher than those of WDSR, respectively.

Visual Results
In addition to using objective indicators to evaluate the DRDAN, we also examined the reconstruction results qualitatively. Figure 6 shows the reconstructed visual results obtained using DRDAN and the other approaches on COWC test images with three scales, ×2, ×3, and ×4. For a clearer comparison, a small patch marked by a red rectangle is enlarged and shown for each SISR method. As can be observed from the locally enlarged image of Figure 6a, the edges of the red lines obtained using DRDAN are clearer and closer to those in the real image than all of the compared approaches. Figure 6b demonstrates that the DRDAN obtains better perceptual performance with more details and structural textures. Figure 6c shows that the reconstructed vehicle results obtained using DRDAN recover more high-frequency details and obtain sharper edges. It can also be seen from Figure 6 that DRDAN achieves the highest PSNR and SSIM when compared with the other SISR methods. Overall, the DRDAN outperforms other comparative approaches in both objective evaluation indexes and subjective visual quality.

Model Size Analyses
Model size is a critical issue in practical applications, especially in devices with low computing power. For the scaling factor ×2, Figure 7 shows the relationship between the number of parameters of different network structures and the mean PSNR using the COWC test set, where M represents the number of parameters in millions. As we can see from Figure 7, the number of model parameters of DRDAN is less than half of that of EDSR, but the DRDAN performs the best in terms of the PSNR. This finding indicates that our model is structured in a rational and efficient way to achieve a better balance between performance and model size.

Conclusions
Existing SR image reconstruction methods suffer from low feature-information utilization and from processing all spatial regions of an image equally. Inspired by the idea of residual learning, and combining it with an attention mechanism, this paper proposes a deep residual dual-attention network for SR reconstruction of aerial remote sensing images. The main contribution of this paper is the residual dual-attention block, which is constructed as the building block of the deep feature-extraction part of the DRDAN. In the RDAB, we firstly use the local multi-level fusion module to fully extract and deeply fuse the features of the different convolution layers. This module can facilitate the flow of information in the network. After that, the DAM, which includes both a CAM and a SAM, enables the network to adaptively allocate more attention to regions carrying high-frequency information. Extensive experiments indicate that: (1) RDAB is structured in a rational and efficient way, and it can be used as a building block for deep SR reconstruction networks; (2) the global residual learning branch effectively reduces the difficulty of model training and makes the network converge much faster; (3) DRDAN outperforms other comparable DCNN-based approaches, and it can achieve better results with fewer parameters in both objective evaluation indexes and subjective visual quality.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.