Remote Sensing Single-Image Resolution Improvement Using A Deep Gradient-Aware Network with Image-Specific Enhancement

Super-resolution (SR) can improve the spatial resolution of remote sensing images, which is critical for many practical applications such as fine urban monitoring. In this paper, a new single-image SR method, the deep gradient-aware network with image-specific enhancement (DGANet-ISE), is proposed to improve the spatial resolution of remote sensing images. First, DGANet is proposed to model the complex relationship between low- and high-resolution images. A new gradient-aware loss is designed for the training phase to preserve more gradient details in the super-resolved remote sensing images. Then, the ISE approach is applied in the testing phase to further improve the SR performance. By using the specific features of each test image, ISE further boosts the generalization capability and adaptability of our method on inexperienced data sets. Finally, three data sets were used to verify the effectiveness of our method. The results indicate that DGANet-ISE outperforms 14 other methods in remote sensing image SR, and the cross-database test results demonstrate that our method exhibits satisfactory generalization performance in adapting to new data.


Introduction
Remote sensing (RS) has become an indispensable technology for various applications, including agricultural survey, global surface monitoring, and climate change detection. However, owing to the limitations of RS devices, atmospheric disturbances, and other uncertain factors, it is hard to obtain images at the desired resolution [1]. Low-resolution RS images are gradually becoming an obstacle to many advanced tasks, such as finer-scale land cover classification [2], object recognition [3], and precise road extraction [4].
Super-resolution (SR) methods are devoted to improving image resolution beyond the acquisition equipment limits [5]. SR has the advantages of low cost, easy implementation, and high efficiency compared to updating image acquisition devices. Remote sensing super-resolution (RS-SR) approaches can be roughly divided into two categories: single-image SR (SISR) and multi-image SR (MISR). The former requires only one image of the target scene to generate the high-resolution output [5], while the latter requires multiple images that differ in terms of their satellites, angles, and sensors. For example, the authors of [6] improved the spatial resolution of Landsat images by fusion with SPOT5 images, and the authors of [7] fused the information of multi-angle images for super-resolution.
In the proposed framework, DGANet is first applied to generate a primary HR image. Then, the HR image is further improved via the ISE method. No additional information is needed in this module, which focuses on the specific characteristics of the single input image to realize adaptive enhancement. In summary, our contributions are fourfold: (1) A new SISR method, DGANet-ISE, which includes a deep gradient-aware network and an image-specific enhancement approach, is proposed to improve the spatial resolution of RS images. (2) A deep gradient-aware network is proposed to model the complex relationship between LR and HR images. A new gradient-aware loss is designed for the training process to preserve more image gradient information in the super-resolved RS images. (3) This paper proposes an image-specific enhancement approach to further improve the SISR performance on RS images and to enhance the generalization capability and adaptability of our method when facing inexperienced images. (4) Three data sets are used to evaluate the performance of DGANet-ISE. Compared with 14 methods, the experimental results indicate the superiority of DGANet-ISE.

Materials and Methods
The objective of SISR is to construct an HR image from an LR image I_LR. Representing the target HR image and the estimated image as I_HR and I_SR, respectively, the more similar I_HR and I_SR are, the better the SR effect is. The images have C color channels, W is the width and H is the height of the images, and s refers to the upscaling factor. In this section, a detailed description of the proposed method is presented. In addition, the evaluation criteria and comparison methods are introduced.

Overview of the Proposed Method
In this work, a new RS-SR method, DGANet-ISE, is proposed, as illustrated in Figure 1. The method mainly involves training and testing phases. The training phase aims at learning the complex relationship between LR and corresponding HR images. DGANet is the core element of this phase, and it is based on the enhanced residual network. This model employs residual and skip connections to devise a deep architecture, and it exhibits strong feature representation and nonlinear fitting abilities. As geographic information such as terrain and edges is highly important for RS image interpretation, the precise preservation of more geographic details in the super-resolved images should be a focus of research. However, L1 and L2 losses generate blurry images in generation problems [44]. Therefore, we innovatively propose a new gradient-aware loss to alleviate this problem. The gradient-aware loss imposes gradient constraints to focus on the high-frequency signals in RS images, such as boundaries, edges, and terrain. As such, the proposed model can generate HR images with more detailed geographic information.
Figure 1. The architecture of the deep gradient-aware network with image-specific enhancement (DGANet-ISE). The training phase aims at learning a complex relationship between low-resolution (LR) and corresponding high-resolution (HR) images. DGANet is the core element of the training phase, and the gradient-aware loss is proposed to facilitate the preservation of more high-frequency information in the estimated HR images. In addition, the ISE approach is first proposed in the testing phase. By using the specific information of each test LR image, image-specific enhancement (ISE) can further boost the generalization of DGANet-ISE on inexperienced data sets.
In the testing phase, the trained DGANet model is applied to generate primary HR results. However, if the testing images differ substantially from the training set (e.g., the images collected from different satellites), the performances of existing supervised learning methods will be strongly affected. To address this problem, an unsupervised learning-based enhancement approach, ISE, is introduced to further improve the performance and generate final HR images. ISE uses each input LR image to supplement more global information, and it has several advantages: (a) no additional supplementary data are required; (b) it focuses on the specific information and features of each test image, i.e., image-specific enhancement; (c) it boosts the generalization performance on inexperienced data sets.

Structure of DGANet
DGANet is proposed to extract the deep features of an LR image and upscale the LR feature maps into the HR output, as shown in Figure 2. The network contains different types of layers, including convolutional (Conv) layers, rectified linear unit (ReLU) layers, and pixel-shuffle layers. Based on these layers, residual blocks (ResBlocks) and upsample blocks (UpBlocks) are constructed. The ResBlocks are employed to construct a deep network, and the UpBlocks efficiently transform the low-resolution feature maps to the high-resolution size.
ResBlock: Each ResBlock contains two Conv layers and one ReLU layer. The output is obtained by summing the input and the result obtained through the Conv and ReLU layers, as shown in Figure 2.
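The ResBlock forward pass can be sketched in plain NumPy. This is a toy single-channel version for illustration only: the actual network operates on multi-channel feature maps in a deep learning framework, and the kernels `k1` and `k2` here stand in for learned Conv weights.

```python
import numpy as np

def conv2d_same(x, k):
    """Toy single-channel 3x3 'same' convolution with zero padding."""
    kh, kw = k.shape
    xp = np.pad(x.astype(float), ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def resblock(x, k1, k2):
    """y = x + Conv(ReLU(Conv(x))) -- the residual pattern of Figure 2."""
    h = conv2d_same(x, k1)
    h = np.maximum(h, 0.0)           # ReLU
    return x + conv2d_same(h, k2)    # skip connection
```

Because of the skip connection, the block only has to learn a residual correction to its input, which is what makes stacking many such blocks trainable.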
UpBlock: To transform the input feature maps to the target HR image, the UpBlocks are constructed based on one Conv layer and one pixel-shuffle layer, where r represents the upscale factor of a single UpBlock. The Conv layer transforms the feature maps to r²C channels, and the pixel-shuffle layer is a periodic shuffling operator that rearranges the elements of an H × W × r²C tensor into a tensor of shape rH × rW × C (Figure 3). Each UpBlock upscales by a factor of 2. Therefore, if the upscale factor is 2, one UpBlock is used; if the upscale factor is larger, multiple UpBlocks are used to increase the size gradually. According to Figure 2, the whole network mainly consists of four modules: low-level features are extracted from the input image with Conv layers in the first module; the ResBlocks in the second module are used to learn more complex and deeper features; the third module transforms the feature maps from the LR domain to the HR domain; and the last module is the global residual block, in which the input image is interpolated to high resolution directly to provide a large number of global signals from the input image.
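The pixel-shuffle rearrangement can be demonstrated in a few lines of NumPy; the function name and channel-last layout are our own choices for illustration, and the exact channel-to-position mapping may differ from a given framework's convention:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange an (H, W, r*r*C) tensor into (r*H, r*W, C)."""
    H, W, rrC = x.shape
    C = rrC // (r * r)
    # split the channel axis into (r, r, C), then interleave into space
    x = x.reshape(H, W, r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)      # (H, r, W, r, C)
    return x.reshape(H * r, W * r, C)
```

For example, a 32 × 32 × 256 feature map with r = 2 becomes a 64 × 64 × 64 map: spatial resolution doubles while the channel count drops by r², so no pixel values are created or discarded, only rearranged.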
Taking an upscale factor of 8 as an example (the size of the input LR image is 32 × 32), Table 1 gives an exact representation of the components of DGANet. The kernel size of each Conv layer is 3 × 3. Five ResBlocks are used to sufficiently extract the deep features in this network. For the global residual block, the bilinear interpolation algorithm is used to directly transform the input image into an image with the target resolution. In addition, the Adam optimizer was used to train the model, and the loss function is introduced in the next section. The learning rate was 0.0001 and was halved every 40 epochs, and the batch size in the training phase was 32.

Gradient-Aware Loss
As images usually change quickly at the boundaries between objects, the image gradient is significant in boundary detection [41]. Therefore, gradient information is usually used to extract object edges [43] and to reflect changes in surface topography [42] from RS images. The L1 and L2 losses are widely used in deep learning applications and are expressed as
L1 = (1/N) Σ |I_HR − I_SR|,  L2 = (1/N) Σ ||I_HR − I_SR||²,
where N is the number of samples in one batch. The batch size was 32 in this work.
Although they can accurately represent the global pixel difference between images, they usually lead to smooth and blurred images [45], and little attention is paid to the important structure and edge information of RS images. Therefore, we propose a new gradient-aware loss (L_GA) that facilitates the preservation of more edge and structure information and generates sharper HR images. The Sobel operator (S) [46] can effectively detect edges, and it enhances high spatial frequency details. Therefore, this operator is applied to generate the gradient maps of the target image and the predicted images.
The gradient-aware loss is defined as the mean over the N samples. The mean absolute error is used to evaluate the difference between the gradient maps S(I_HR) and S(I_SR), as shown below:
L_GA = (1/N) Σ |S(I_HR) − S(I_SR)|.
To maintain the balance between the global pixel error and the gradient error, a hyperparameter λ is introduced. From the experiments, we found that λ works best at 0.1 (as shown in Appendix A). Therefore, λ = 0.1 was used in this paper, and the overall loss function can be formulated as
L = L1 + λ · L_GA.
By using the gradient-aware loss, more high-frequency information, such as edges and terrain, is preserved in the super-resolution process, and sharper HR images with higher accuracy are generated.
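The loss can be sketched in NumPy for a single image pair. This is an illustrative assumption-laden version: we take the pixel term to be an L1 (MAE) loss, operate on one single-channel image instead of a training batch, and apply the Sobel filters with simple zero padding.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def _filter(img, k):
    """3x3 'same' correlation with zero padding (single channel)."""
    p = np.pad(img.astype(float), 1)
    return sum(k[a, b] * p[a:a + img.shape[0], b:b + img.shape[1]]
               for a in range(3) for b in range(3))

def sobel_gradient(img):
    """Gradient magnitude map produced by the Sobel operator S."""
    return np.hypot(_filter(img, SOBEL_X), _filter(img, SOBEL_Y))

def gradient_aware_loss(y_true, y_pred, lam=0.1):
    """L = L1 + lam * L_GA: pixel MAE plus MAE between Sobel gradient maps."""
    l1 = np.mean(np.abs(y_true - y_pred))
    l_ga = np.mean(np.abs(sobel_gradient(y_true) - sobel_gradient(y_pred)))
    return l1 + lam * l_ga
```

A prediction that matches the target pixel-wise but smears its edges incurs a nonzero L_GA term, which is exactly the penalty that pushes the network toward sharper boundaries.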

Image-Specific Enhancement
The performance of supervised methods largely depends on the training data set. If the test images differ substantially from the training set, the performance on inexperienced data will be greatly affected. As for RS images, they differ in terms of sensors, times, places, colors, types, and resolutions. It is hard to collect training samples that cover all scenarios, resulting in the insufficient generalization ability of many supervised SR methods. In this paper, the ISE algorithm is proposed to provide an effective solution to improve the generalization and adaptability of our method.
The core strategy of ISE is to back-project the error between the emulated and actual LR image onto the SR image and to update it iteratively. This approach is inspired by the iterative back-projection algorithm (IBP) [47]. The specific procedure of ISE is presented in Algorithm 1, where the iteration number is t and T is the largest iteration number. The inputs of ISE are the original LR input I_LR and the predicted HR image I_SR obtained from DGANet. I_SR is first downsampled to the LR domain to obtain I'_LR. The difference image D = I_LR − I'_LR is then calculated. Next, the difference image is upsampled to the HR size, and I_SR is updated by adding the upsampled difference to the predicted image. This process is repeated iteratively until the difference is sufficiently small or T has been reached. The bilinear interpolation algorithm was used for the upsampling and downsampling operations in this paper.
ISE is based on the assumption that if the estimated HR image is closer to the target image, then the LR image derived from the estimated HR image should be more similar to the input LR image. By back-projecting the difference image between I_LR and I'_LR onto the super-resolved HR image, more of these differences can be accounted for in the estimated HR image, thereby yielding better SR results.
Algorithm 1. ISE
Input: LR input I_LR, predicted HR image I_SR;
Output: Enhanced HR image I_SR;
t = 0
While t < T and D is not sufficiently small:
    I'_LR = Downsample(I_SR)
    D = I_LR − I'_LR
    I_SR = I_SR + Upsample(D)
    t = t + 1
Return I_SR
The main advantage of ISE is that it requires no additional images for training. Furthermore, as ISE focuses on the characteristics of every single image, the image-specific information is used to further improve the quality of the super-resolved HR image. In this way, the limitations imposed by the training data set can be effectively alleviated.
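A minimal NumPy sketch of Algorithm 1, assuming integer upscale factors and substituting area-average downsampling and nearest-neighbor upsampling for the bilinear operators used in the paper:

```python
import numpy as np

def downsample(img, s):
    """Area-average downsampling by integer factor s (stand-in for bilinear)."""
    H, W = img.shape
    return img[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s).mean(axis=(1, 3))

def upsample(img, s):
    """Nearest-neighbor upsampling by integer factor s (stand-in for bilinear)."""
    return np.repeat(np.repeat(img, s, axis=0), s, axis=1)

def ise(lr, sr, s, max_iter=10, tol=1e-6):
    """Iteratively back-project the LR-domain error onto the SR estimate."""
    sr = sr.astype(float).copy()
    for _ in range(max_iter):
        diff = lr - downsample(sr, s)      # D: error in the LR domain
        if np.abs(diff).mean() < tol:
            break
        sr += upsample(diff, s)            # project the error back to HR
    return sr
```

With these particular operators, one back-projection step already makes the downsampled estimate reproduce the input LR image exactly; with bilinear resampling, several iterations are typically needed.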

Evaluation Criteria
The mean squared error (MSE), peak signal-to-noise ratio (PSNR) [48], and structural similarity index (SSIM) [49] are used to evaluate the performance of the models. They are expressed as follows:
MSE = (1/(W·H)) Σ_i Σ_j (I_HR(i,j) − I_SR(i,j))²
PSNR = 20 × log10(MAX_I / RMSE)
SSIM = ((2·μ_HR·μ_SR + c1)(2·σ_HR,SR + c2)) / ((μ_HR² + μ_SR² + c1)(σ_HR² + σ_SR² + c2))
where I_HR is the target high-resolution image; I_SR is the super-resolved image generated from the low-resolution image; W and H are the width and the height of the HR image, respectively; MAX_I represents the maximum pixel value in the original image; RMSE is the root mean squared error; μ_HR and μ_SR are the average pixel values of I_HR and I_SR, respectively; σ_HR² and σ_SR² are the variances of I_HR and I_SR, respectively; σ_HR,SR is the covariance of I_HR and I_SR; and c1 and c2 are small constants that stabilize the division. MSE is commonly used to measure the error of super-resolved images; an MSE closer to 0 implies a higher model accuracy.
PSNR is measured in decibels (dB). SSIM is used to measure the similarity between two images [5]; the larger the values of PSNR and SSIM, the better the SR performance.
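The three criteria can be computed with NumPy as follows. Note that the SSIM here is a simplified single-window (global) form; standard implementations average SSIM over local sliding windows, so the numbers differ slightly.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between target and super-resolved images."""
    return np.mean((y.astype(float) - y_hat.astype(float)) ** 2)

def psnr(y, y_hat, max_val=255.0):
    """PSNR = 20 * log10(MAX_I / RMSE), in dB."""
    e = mse(y, y_hat)
    return float('inf') if e == 0 else 20 * np.log10(max_val / np.sqrt(e))

def ssim_global(y, y_hat, max_val=255.0):
    """Single-window SSIM with the conventional c1, c2 constants."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_y, mu_x = y.mean(), y_hat.mean()
    var_y, var_x = y.var(), y_hat.var()
    cov = ((y - mu_y) * (y_hat - mu_x)).mean()
    return ((2 * mu_y * mu_x + c1) * (2 * cov + c2)) / \
           ((mu_y ** 2 + mu_x ** 2 + c1) * (var_y + var_x + c2))
```

Identical images give MSE = 0, PSNR = ∞, and SSIM = 1, which is the upper bound each metric approaches as the SR result improves.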

Methods to be Compared
Fourteen widely used SISR methods, which are listed in Table 2, were compared with DGANet-ISE. These SR methods use three-channel LR input images to generate super-resolved images; the input and output schemes were the same as those of the proposed DGANet-ISE. The theory and characteristics of these methods are described in the corresponding references shown in Table 2.

Results
In this section, the data sets and the implementation details are introduced first. Subsequently, the performance of DGANet-ISE is verified by comparison with 14 different SR methods, and cross-database tests are performed to evaluate the generalization ability of our method.

Data Sets
Three different data sets were employed to verify the effectiveness and superiority of DGANet-ISE.
RSI-CB256 [55] contains 35 categories and about 24,000 images, which were collected for scene classification. This data set is rather challenging, as the scenes differ widely. The pixel size of the images is 256 × 256, with spatial resolutions of 0.3-3 m. Figure 4 shows samples from 10 categories in the RSI-CB256 data set.
UCMerced [56] consists of 21 classes of land use images, including agricultural, airplane, beach, buildings, etc. Each class has 100 images of 256 × 256 pixels, and the spatial resolution is 1 foot (≈0.3 m).
Landsat-test: Landsat imagery is widely used, and the super-resolution of these images is of great value for many applications, such as finer land cover monitoring. The test images used in this paper were Landsat 5 TM data of band 7 (2.08-2.35 µm), band 4 (0.76-0.90 µm), and band 2 (0.52-0.60 µm). The data set was downloaded from the Google Earth Engine (https://developers.google.com/earthengine/datasets/catalog/LANDSAT_LT05_C01_T1). The spatial resolution is 30 m. During the experiment, we cut the imagery into approximately 400 small images of 256 × 256 pixels. This data set differs from the RSI-CB256 and UCMerced data sets in two aspects: (1) the spatial resolution of the Landsat data set is much coarser than that of the other two data sets, resulting in very different geographic information and scene content within the same image size; (2) the first two data sets were artificially curated into categories, while the Landsat-test data set consists of real, unselected scenes.

Implementation Details
As there are not enough corresponding LR-HR image pairs in reality, we downsampled the images using the bicubic interpolation (BCI) algorithm with a factor of s ∈ {2, 4, 8} to obtain LR images at different scales. Many studies, such as [29,35], have used BCI to generate LR-HR image pairs. The RSI-CB256 data set was used to compare our model with commonly used SISR methods, and all images were randomly divided into training samples (80%) and test samples (20%). Furthermore, in order to evaluate the generalization ability of our model, the Landsat-test data set was utilized for a cross-database test. That is, the models trained on RSI-CB256 were directly applied to the SR experiments on Landsat-test without any tuning. This is difficult because the data were collected from different sensors and have different spatial resolutions.
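A sketch of the LR-image generation step, using simple area averaging as a stand-in for the bicubic interpolation used in the paper (`make_lr` is a hypothetical helper name):

```python
import numpy as np

def make_lr(hr, s):
    """Generate a pseudo-LR image from an HR image by block averaging."""
    H, W = hr.shape[:2]
    hr = hr[:H - H % s, :W - W % s]          # crop to a multiple of s
    if hr.ndim == 2:                          # single-band image
        return hr.reshape(hr.shape[0] // s, s, hr.shape[1] // s, s).mean(axis=(1, 3))
    # multi-band image: average each s x s block per channel
    return hr.reshape(hr.shape[0] // s, s, hr.shape[1] // s, s,
                      hr.shape[2]).mean(axis=(1, 3))
```

Applied to the 256 × 256 training images, s ∈ {2, 4, 8} yields LR inputs of 128, 64, and 32 pixels per side, matching the three experimental scales.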
In addition, the UCMerced data set was randomly split into two balanced halves for training and testing, according to [35], which is consistent with other RS-SR methods, including CNN-7 [25], LGCNet [25], DCM [29], etc. Therefore, we compared our method with these RS-SR methods using the UCMerced data set.
The interpolation-based and deep learning-based methods were implemented in Python, and the machine learning-based methods were implemented in MATLAB. In addition, the models were used with the default settings suggested by the authors. The generation of LR images and the calculation of the evaluation criteria were implemented in the same Python environment to ensure the consistency and accuracy of the results.

Comparison with Baselines
In this section, we report experiments on the RSI-CB256 and UCMerced data sets. UCMerced is widely used in RS-SR; therefore, we compared our method with the RS-SR methods available in the literature [29], which were also evaluated on the UCMerced data set. The upscale factors were 2 and 4, and the PSNR and SSIM results are shown in Table 3. In addition, Figure 5 provides a more vivid presentation of the comparison of the PSNR and SSIM results.
Additionally, another experiment was conducted on the RSI-CB256 data set, and DGANet-ISE was compared with three kinds of SISR methods: interpolation-based methods (nearest-neighbor interpolation (NNI), bilinear interpolation (BLI), and BCI), machine learning-based methods (simple functions (SF), neighbor embedding (NE), and classical sparsity-based super-resolution (SCSR)), and deep learning-based methods (super-resolution convolutional neural network (SRCNN), efficient sub-pixel convolutional network (ESPCN), and enhanced deep super-resolution network (EDSR)). The experiments were conducted with three different upscale factors, i.e., s ∈ {2, 4, 8}. The detailed results are presented in Table 4. In addition to the quantitative assessments, visual results are provided for a qualitative and intuitive evaluation. Three test images of a parking lot, a residence, and a dam were chosen as examples, and Figures 6-8 show the super-resolved HR images for s ∈ {2, 4, 8}, respectively.
From a global perspective, it is obvious that as the upscale factor increases, the accuracy gradually decreases, because achieving super-resolution from a lower-resolution image is much more difficult. Furthermore, according to Table 4, DGANet-ISE significantly outperforms the baselines. Excluding DGANet-ISE, the results of EDSR are the best among the remaining methods. The PSNR values of DGANet-ISE are 0.33 dB, 0.26 dB, and 0.18 dB larger than those of EDSR when s is 2, 4, and 8, respectively.
BCI, SCSR, and DGANet-ISE are the most prominent within the interpolation, machine learning, and deep learning categories, respectively. For example, when s = 2, the MSE values of BCI and SCSR are 69.74 and 48.01, respectively, much smaller than those of the other methods in the same categories. With regard to the deep learning methods, our proposed model exhibits superior performance on the SISR task. The MSE and PSNR results of our method are the best among all of the methods. For instance, when s = 2, the error of our proposed method is the smallest (MSE = 31.26), while its PSNR = 37.92 dB and SSIM = 0.9477 are the largest.
Focusing on the visual results (Figures 6-8), the HR images super-resolved by the different methods vary in terms of their features. For example, the images obtained from NNI and NE contain regular dense squares, similar to mosaics, which cause very blurred edges in these super-resolved images. The reason is that both algorithms rely heavily on the values of neighboring pixels or patches and ignore other significant structural details. NE performs better, as it considers that the LR patches and the corresponding HR patches have similar local geometries.
In addition, the images super-resolved via BLI and SCSR have very blurred boundaries (Figure 8d,h). We conducted a thorough analysis and found that both the BLI and SCSR algorithms use linear features for SISR, which results in the emergence of stripes as s increases. The basic idea of SCSR supposes that a signal can be represented as a sparse linear combination with respect to an overcomplete dictionary. A linear feature extraction operator behaves like a high-pass filter for the feature representation of LR image patches [52]. In this way, the SCSR images have sharper edges than those of NE and BLI.
The HR images (Figure 7i-l) obtained by the deep learning methods generate smoother textures than the other two kinds of methods. DGANet-ISE yields more competitive visual results, because they are the most similar to their ground-truth HR counterparts. Our proposed DGANet-ISE recovers more texture and structure details, such as the edges of the dam in Figure 8. The edges of the dam generated by our method are very clear, whereas the edges are blurred or even missing in other images. This is because the gradient-aware loss proposed in our method can capture more gradient information of the RS images, which can facilitate the generation of sharper edges and textures. In summary, the proposed method performs the best compared to other methods.

Cross-Database Test
To further evaluate the robustness and generalization capability of DGANet-ISE, the Landsat-test data set was applied for a cross-database test, i.e., the models trained on the RSI-CB256 data set were directly used to super-resolve the Landsat-test data set without any fine-tuning or modification.
The results of the comparison between the different approaches under the three upscale factors are presented in Table 5, and Figure 9 shows the visual results of the different methods on an example image. The Landsat-test data set differs substantially from the training samples: the spatial resolution of the training samples is 0.3-3 m, while that of the Landsat-test images is 30 m, and the scene contents within the same image size are very different. Therefore, directly applying these supervised learning models to such test images poses a great challenge.
According to the results, our proposed method shows a reasonably satisfactory performance compared to the other methods. For instance, when s = 2, the PSNR is 40.70 dB, which is 0.64 dB larger than that of EDSR and 1.57 dB larger than that of SCSR. This is because the ISE algorithm focuses on the personalized features of each test image and provides a fast additional improvement of the super-resolved HR image obtained from the deep gradient-aware network. In summary, DGANet-ISE has strong robustness and generalization capability on new data sets.

Discussion
In this section, ablation studies were conducted to analyze the performance of the proposed gradient-aware loss and the impact of the ISE algorithm.

Dependency on the Type of Loss Functions
To evaluate the effectiveness of the gradient-aware loss, we trained the same network with the L1, L2, and gradient-aware losses on the training samples of the RSI-CB256 data set. The detailed results on the test samples of the RSI-CB256 data set are presented in Table 6. As we can see, the PSNR results of the L1 loss are slightly better than those of the L2 loss, and our proposed loss function yields the best results. When the upscale factor is 2, the PSNR obtained with the gradient-aware loss (37.92 dB) is 0.2 dB better than that of the L1 loss (37.72 dB) and 0.26 dB better than that of the L2 loss (37.66 dB).
To assess the detail-preservation capability of the gradient-aware loss, an image belonging to the "Residence" class was selected for further analysis. Figure 10 presents the reconstructed HR images obtained using the three loss functions with an upscale factor of 8. The PSNR result of the proposed loss is 25.54 dB, which is 0.15 dB larger than that of the L1 loss. The shape and direction of the swimming pool obtained with the gradient-aware loss are much closer to reality. In addition, the PSNR and SSIM results obtained with our gradient-aware loss are much better than those of the L1 and L2 losses. In summary, our proposed gradient-aware loss is able to generate HR images more accurately.

Impact of ISE
Since ISE aims to improve the SR results on inexperienced data, in this part, the cross-data set test performances of the method with and without ISE were compared. The DGANet and EDSR models trained on the RSI-CB256 data set were directly applied to the Landsat-test data. The results are presented in Table 7, and Figure 11 gives a visual comparison. The results in Table 7 illustrate that the ISE approach can further improve the SR performance. For instance, when the upscale factor is 2, the PSNR of EDSR and DGANet is improved by 0.46 dB and 0.23 dB, respectively. By using the specific information of the test image, ISE can further enhance the HR results of DGANet. As ISE is an unsupervised approach, it can greatly improve the generalization and adaptability of the models and reduce their dependence on the training set.

Conclusions
A new SISR method, DGANet-ISE, was proposed in this work to increase the spatial resolution of RS images. Specifically, the deep gradient-aware network and image-specific enhancement are the two main components of DGANet-ISE. The first part extracts features and learns the precise mapping between the LR and HR domains. The gradient-aware loss was proposed in this part to preserve the important gradient information of RS images. Image-specific enhancement was used to improve the robustness and adaptability of our method, thus further improving the performance. Based on the above, three different data sets were used as experimental data. The proposed DGANet-ISE was applied to super-resolve RS images with three upscale factors and yielded remarkable results. Compared to other SISR methods, DGANet-ISE is better from both quantitative and visual perspectives. Moreover, the results of the cross-database test demonstrate that the proposed approach exhibits strong generalization and robustness, and provides an excellent opportunity for practical applications.

Acknowledgments: Thanks to all the anonymous reviewers for their constructive and valuable suggestions on the earlier drafts of this manuscript.

Conflicts of Interest:
The authors declare no conflicts of interest.
Appendix A
To analyze the influence of the hyperparameter λ on the model, we plotted the trends of the average PSNR and SSIM values for λ ∈ {1, 0.1, 0.01, 0.001, 0.0001, 0.00001} versus the number of iterations (up to 200 epochs) in Figure A1. The experiment was conducted with an upscale factor of 2. When λ = 1, the PSNR is lower than in the other cases. In terms of the SSIM results, λ = 1 and λ = 0.1 performed better than the others. Therefore, in this paper, we used λ = 0.1 as the final weight of the gradient-aware loss.
Figure A1. Comparison of the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) between the different λ values.