Deep Residual Squeeze and Excitation Network for Remote Sensing Image Super-Resolution

Recently, deep convolutional neural networks (DCNN) have obtained promising results in single image super-resolution (SISR) of remote sensing images. However, because the distribution of remote sensing images is highly complex, most existing methods still fall short on remote sensing image super-resolution. Enhancing the representation ability of the network is one of the critical factors for improving performance on this task. To this end, we propose a new SISR algorithm called the Deep Residual Squeeze and Excitation Network (DRSEN). Specifically, we propose a residual squeeze and excitation block (RSEB) as the building block of DRSEN. The RSEB fuses the input of the current block with its internal features, and models the interdependencies between channels to enhance the representation power. At the same time, we improve the up-sampling module and the global residual pathway to reduce the number of network parameters. Experiments on two public remote sensing datasets (UC Merced and NWPU-RESISC45) show that DRSEN achieves better accuracy and visual quality than most state-of-the-art methods. DRSEN thus contributes to progress in the field of remote sensing image super-resolution.


Introduction
High-resolution (HR) images with rich detailed textures and critical information play an essential part in later remote sensing image analysis, such as target detection, object recognition and land cover classification. However, due to hardware limitations and large detection distances, the satellite imagery available to ordinary civilian applications often has low resolution (LR). Instead of enhancing the physical imaging technology, many researchers aim to reconstruct a visually pleasing HR remote sensing image from existing LR observations, which is called single image super-resolution (SISR) [1].
In the past few years, a series of SR techniques based on the sparsity prior of image statistics have been proposed to recover HR remote sensing images. A dictionary of image edges and contours was utilized by Yang et al. [2] and Dong et al. [3]. The compressive sensing and structural self-similarity of remote sensing images were used for the super-resolution task by Pan et al. [4]. The sparse properties in the spectral and spatial domains were explored by Li et al. [5] to recover the resolution of hyperspectral images. However, these methods all rely on low-level features of the remote sensing images.
With the immense popularity of deep learning, the convolutional neural network (CNN) stands out as a powerful basis for image super-resolution. Deep learning methods learn high-level feature representations automatically from data and provide significantly improved resolution restoration performance. Among them, Dong et al. first introduced a three-layer CNN, named SRCNN [6], into image SR and achieved considerable improvement. Kim et al. increased the network depth in VDSR [7] and DRCN [8], achieving notable improvements over SRCNN. Tai et al. [9] later introduced recursive blocks in DRRN to build deeper networks. These methods first have to interpolate the original LR images to the desired size and then feed them into the neural network. This pre-processing greatly increases computation and inevitably loses some details of the original LR inputs.
To deal with the pre-processing problem above, FSRCNN [10] extracts the features from the original LR images and then introduces a transposed convolution layer to up-scale the spatial resolution at the tail of the network. An efficient sub-pixel convolution layer was proposed in ESPCN [11] to up-scale the final LR feature maps into the HR output. This sub-pixel convolutional layer then became the main choice for deep architectures. SRResNet [12] took advantage of residual learning to construct a deeper network and achieved better performance. Lim et al. [13] proposed EDSR and MDSR by removing unnecessary modules from conventional residual networks, namely the batch normalization layers in SRResNet, which achieved significant improvement.
In order to make the images more visually pleasing, many models based on generative adversarial networks (GAN) [14] were proposed for single image super-resolution. Ledig et al. first introduced the GAN framework and the perceptual loss [15] into SRGAN [12], which produces more visually pleasing images. Compared with supervision by the L1 or L2 loss function, a GAN can produce visually plausible samples, but the accuracy measured by peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [16] decreases.
With regard to the super-resolution of remote sensing images, Lei et al. proposed an algorithm named local-global combined networks (LGCNet) [17] to learn multilevel representations of remote sensing images. Haut et al. [18] learned the distribution of the image with a GAN and proposed an unsupervised SISR method. Xu et al. [19] proposed a method named deep memory connected network (DMCN) to reduce the time needed to reconstruct the resolution of remote sensing images.
Recently, many super-resolution methods have tended to increase the network depth for better performance. As the depth grows, the features in a deep neural network become hierarchical, with different receptive fields. In addition, objects in remote sensing images have different scales due to factors such as the angle of view and the zoom level. Therefore, hierarchical features from the network provide more clues for image resolution reconstruction. However, most neural network based methods neglect hierarchical features and blindly increase network depth for super-resolution. The highly complex spatial distribution of remote sensing images indicates that higher-level abstraction and better data representation are essential for applications. Meanwhile, the ground objects in remote sensing images usually span a wider range of scales; that is, an object and its surrounding environment are mutually coupled in the joint distribution of their image patterns [17], which is highly different from natural images. Therefore, for the super-resolution of remote sensing images, a more powerful representational ability is of crucial importance for achieving better performance.
Aiming to advance remote sensing image super-resolution, we design a novel network structure named deep residual squeeze and excitation network (DRSEN), inspired by some newly emerging concepts in deep learning. Based on these concepts, we have designed an efficient network structure and achieved satisfactory results. The whole network can be divided into an identity branch and a residual branch, which can be regarded as a variant of residual learning applied at the global level. The identity branch directly takes an LR RGB image as input and outputs the HR counterpart. Unlike the global residual path in EDSR, which is a linear stack of several convolutional layers, our identity branch uses a single convolution layer as the feature extraction part and then uses the ESPCN [11] structure to output the image. Such an identity branch reduces network parameters by absorbing some redundant convolution layers while maintaining the same accuracy.
For the residual branch, the residual squeeze and excitation block (RSEB) is proposed as the basic building block of our network. With RSEB, a high-performance network can be built in a simple but effective manner. The RSEB can be regarded as a variant of the ResNet [20] building block. RSEB consists of three parts: the feature extraction part, the local feature fusion (LFF) part and the squeeze and excitation (SE) part. The feature extraction part consists of two convolutional layers. By concatenating the state of the preceding RSEB with some preceding layers within the current RSEB, LFF extracts local features inside each block and aggregates them. We use the local feature fusion module to fuse features from different layers and explore the most appropriate fusion method. The LFF part can make use of these features and adaptively preserve information to improve the network representation ability. The squeeze and excitation module was first proposed in SENet [21] to improve network representation capabilities and has since been rapidly adopted in image super-resolution. The authors of [22][23][24][25] have verified that the attention mechanism contributes to image super-resolution tasks. The output of one RSEB has direct access to the next RSEB. Furthermore, in contrast to EDSR, we remove the redundant convolutional layers after ESPCN to reduce network parameters without affecting the final performance.
In summary, the main contributions of this work are as follows:
1. We propose a deep residual squeeze and excitation network (DRSEN) for SR reconstruction of remote sensing satellite images. Our DRSEN is trained in a convenient and effective end-to-end manner and obtains better accuracy and visual super-resolution performance.
2. We propose a modified residual block named RSEB, which contains a local feature fusion (LFF) module and a squeeze and excitation (SE) module based on some common concepts. The LFF module fuses the features from different levels in the current block, and the SE module adaptively rescales features by considering interdependencies among feature channels. Both the LFF module and the SE module improve the representation ability of the network.
3. We propose an identity branch to replace the global residual pathway, together with a simplified up-sampling module. These strategies reduce network parameters and computation without compromising the accuracy of the results.
The remainder of this paper is organized as follows. Section 2 introduces the details of our proposed method. Section 3 verifies the effectiveness of DRSEN by performing comparisons with the state-of-the-art image super-resolution methods. In Section 4, we discuss the issues of our network according to the experimental results. Section 5 concludes the discussions of the study.

Proposed Method
In the following, we describe the architecture of the proposed DRSEN, including its interior structure and mathematical expressions. Then, the local feature fusion module and the squeeze and excitation module within the RSEB are illustrated in detail. Finally, the implementation details of our network are introduced.

Network Architecture
As shown in Figure 1, our DRSEN consists of a residual branch and an identity branch. The residual branch consists of three parts: a shallow feature extraction module, residual squeeze and excitation blocks (RSEBs) and an up-sampling module. Let $I_{LR}$ and $I_{SR}$ denote the input and the output of DRSEN. As in EDSR, we use only one convolutional layer to extract the shallow feature $F_0$ from the LR input:
$$F_0 = H_{SF}(I_{LR}),$$
where $H_{SF}(\cdot)$ denotes the convolutional operation of the shallow feature extraction layer. $F_0$ is then used for deep feature extraction with our RSEBs. Supposing there are $N$ residual blocks, the output $F_i$ of the $i$th ($1 \le i \le N$) RSEB can be obtained by:
$$F_i = H_{RSEB,i}(F_{i-1}),$$
where $H_{RSEB,i}$ denotes the operations of the $i$th RSEB. $H_{RSEB,i}$ can be a composite function of operations. More details about RSEB will be given in Section 2.2.
After extracting the features in the LR space, we use a convolutional layer to adjust the depth of the feature maps to $3 \times S^2$, where $S$ is the up-scaling factor. Then, we up-scale these features via an up-sampling module. Inspired by EDSR, we also use ESPCN as our up-scaling part. In contrast to EDSR, no extra convolutional layers are inserted after up-sampling in our network, which improves the speed and reduces the parameters. The output of the residual branch can be formulated as:
$$I_{RBUP} = H_{UP}(H_A(F_N)),$$
where $H_{UP}(\cdot)$ and $I_{RBUP}$ denote the ESPCN and the up-scaled image of the residual branch, and $H_A$ and $F_N$ denote the convolutional layer before the up-scaling module and the output of the last RSEB, respectively. For the identity branch, we use a single convolutional layer and an ESPCN module to directly output the corresponding high-resolution image, which removes the convolution operations that follow the up-sampling layer in EDSR. According to our experimental results, removing these convolutional layers has essentially no effect on the performance of the network while improving the speed. The process can be formulated as:
$$I_{IBUP} = H_{UP}(H_{identity}(I_{LR})),$$
where $H_{identity}$ and $I_{IBUP}$ denote the single convolutional layer and the up-scaled image of the identity branch.
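The sub-pixel rearrangement that ESPCN performs on the $3 \times S^2$ feature maps can be illustrated with a small sketch. The function name `pixel_shuffle`, the nested-list layout and the channel ordering (the one commonly used by sub-pixel layers in deep learning frameworks) are assumptions made for illustration; this is not the authors' implementation.

```python
def pixel_shuffle(features, scale):
    """Rearrange C*scale^2 low-resolution channels into C channels at scale-times resolution.

    features: list of channels, each an H x W nested list.
    Assumed ordering: input channel c*scale^2 + dy*scale + dx feeds
    output pixel (y*scale + dy, x*scale + dx) of output channel c.
    """
    r2 = scale * scale
    if len(features) % r2 != 0:
        raise ValueError("channel count must be a multiple of scale**2")
    out_channels = len(features) // r2
    height, width = len(features[0]), len(features[0][0])
    out = [[[0.0] * (width * scale) for _ in range(height * scale)]
           for _ in range(out_channels)]
    for c in range(out_channels):
        for dy in range(scale):
            for dx in range(scale):
                src = features[c * r2 + dy * scale + dx]
                for y in range(height):
                    for x in range(width):
                        out[c][y * scale + dy][x * scale + dx] = src[y][x]
    return out


# Toy input: 3 * 2**2 = 12 feature maps of size 2 x 2, map k filled with the value k.
lr_features = [[[float(k)] * 2 for _ in range(2)] for k in range(12)]
hr_image = pixel_shuffle(lr_features, scale=2)  # 3 channels of size 4 x 4
```

Because the rearrangement has no weights, all learnable parameters of the up-scaling step sit in the convolution that produces the $3 \times S^2$ maps, which is consistent with the parameter saving the text attributes to dropping the convolutions after up-sampling.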
The outputs of the residual branch and the identity branch are finally combined via an element-wise summation to estimate the HR image, which can be formulated as:
$$I_{SR} = I_{RBUP} + I_{IBUP} = H_{DRSEN}(I_{LR}),$$
where $H_{DRSEN}$ denotes the function of our DRSEN. The proposed DRSEN is optimized with the $L_1$ loss function, which has been demonstrated to be powerful for SR [26]. Given a training set $\{I^i_{LR}, I^i_{HR}\}_{i=1}^{n}$, which contains $n$ LR inputs and their HR counterparts, the goal of training DRSEN is to minimize the $L_1$ loss function
$$L(\Theta) = \frac{1}{n} \sum_{i=1}^{n} \left\| H_{DRSEN}(I^i_{LR}) - I^i_{HR} \right\|_1,$$
where $\Theta = \{W_i, b_i\}$ denotes the parameter set of our proposed network. The loss is averaged over the training set. More training details are given in Section 3.1.3.
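As a small worked example of the branch summation and the $L_1$ objective, the sketch below treats each image as a flat list of pixel values. The helper names `combine_branches` and `l1_loss` are hypothetical, and the loss sums absolute differences per image before averaging over images, which is one reading of the norm in the objective; deep learning frameworks often average over pixels as well.

```python
def combine_branches(residual_out, identity_out):
    """Element-wise sum of the two up-scaled branch outputs (flat pixel lists)."""
    return [r + i for r, i in zip(residual_out, identity_out)]


def l1_loss(sr_images, hr_images):
    """Sum of absolute pixel errors per image, averaged over the n images."""
    total = 0.0
    for sr, hr in zip(sr_images, hr_images):
        total += sum(abs(s - h) for s, h in zip(sr, hr))
    return total / len(sr_images)


# Two toy "images" with three pixels each.
sr_1 = combine_branches([0.25, -0.5, 0.0], [0.5, 0.5, 0.5])  # [0.75, 0.0, 0.5]
sr_2 = combine_branches([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
loss = l1_loss([sr_1, sr_2], [[1.0, 0.0, 0.5], [0.5, 0.5, 0.5]])  # (0.25 + 1.5) / 2
```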

Residual Squeeze and Excitation Block
Our residual squeeze and excitation block (RSEB) is built upon the local feature fusion (LFF) module and the squeeze and excitation (SE) module. The details of our proposed RSEB are illustrated in Figure 2.

Local Feature Fusion Module
The network representation power is significant for image SR. Extracting and aggregating features across the whole network makes full use of them and enhances the network representation power. Thus, we propose a local feature fusion (LFF) module in the RSEB. The states from the preceding RSEB and some layers within the current RSEB are adaptively fused by the LFF module. Because the features of the $(i-1)$th RSEB are introduced directly into the $i$th RSEB in series, it is indispensable to reduce the number of features. We introduce a 1 × 1 convolutional layer to adaptively control the output information. This operation can be formulated as:
$$F_{i,LFF} = H^i_{LFF}([F_{i-1}, F_{i,\sigma}, F_{i,conv2}]),$$
where $F_{i,LFF}$ is the fused feature and $H^i_{LFF}$ denotes the function of the 1 × 1 convolutional layer in the $i$th RSEB. $F_{i-1}$ is the output of the $(i-1)$th RSEB. $F_{i,\sigma}$ and $F_{i,conv2}$ refer to the feature maps produced by the activation function and the second convolutional layer in the $i$th RSEB. The symbol $[\cdot]$ denotes the concatenation of the feature maps.
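The fusion step, a channel-wise concatenation followed by a 1 × 1 convolution, amounts to a per-pixel linear mix of the concatenated channels. The sketch below shows this on toy data; the helper names, channel counts and weights are illustrative assumptions rather than values from the trained network.

```python
def concat_channels(*feature_groups):
    """Concatenate several lists of H x W channels along the channel dimension."""
    merged = []
    for group in feature_groups:
        merged.extend(group)
    return merged


def conv1x1(features, weights, biases):
    """1 x 1 convolution: weights[o][c] mixes input channel c into output channel o."""
    height, width = len(features[0]), len(features[0][0])
    out = []
    for w_row, b in zip(weights, biases):
        channel = [[b + sum(w * features[c][y][x] for c, w in enumerate(w_row))
                    for x in range(width)]
                   for y in range(height)]
        out.append(channel)
    return out


def constant_map(value, size=2):
    """Toy feature map filled with a single value."""
    return [[value] * size for _ in range(size)]


f_prev = [constant_map(1.0)]                      # stands in for the previous block output
f_sigma = [constant_map(2.0), constant_map(3.0)]  # stands in for the activated features
f_conv2 = [constant_map(4.0)]                     # stands in for the second conv output

stacked = concat_channels(f_prev, f_sigma, f_conv2)  # 4 channels
fused = conv1x1(stacked,
                weights=[[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]],
                biases=[0.0, 0.5])                    # reduced to 2 channels
```

The 1 × 1 layer is what keeps the channel count from growing block after block: however many maps are concatenated, the number of rows in `weights` fixes the width handed to the next stage.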

Squeeze and Excitation Module
The squeeze and excitation module enhances the network representation ability by exploiting channel dependencies. The details of the SE module are illustrated in Figure 3. Through this module, important features are emphasized among the channels while useless features are suppressed. The squeeze function in RSEB is:
$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j),$$
where $z_c$ is the $c$th element of the squeezed channels and $F_{sq}$ denotes the squeeze function. $u_c$ is the $c$th channel of the input. $H$ and $W$ denote the height and width of the input.
Then, an excitation function follows the squeeze operation, which aims to fully capture the channel-wise dependencies. The excitation function $F_{ex}$ takes as input the squeezed signal $z$ produced by the squeeze step.
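A minimal sketch of the squeeze and excitation steps is given below. The squeeze is the global average over each channel. For the excitation, the sketch assumes the form used in SENet [21], a channel-reducing linear map with ReLU followed by a channel-restoring linear map with a sigmoid gate; the weight names `w1` and `w2` and all numeric values are hypothetical. On a pooled vector, a 1 × 1 convolution is equivalent to such a linear map.

```python
import math


def squeeze(features):
    """Global average pooling: one scalar z_c per H x W channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]


def excite(z, w1, w2):
    """Assumed SENet-style gate: sigmoid(w2 . relu(w1 . z)), one weight per channel."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]


def se_rescale(features, w1, w2):
    """Scale every channel by its excitation weight."""
    s = excite(squeeze(features), w1, w2)
    return [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, features)]


features = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0, mean 4.0
            [[0.0, 0.0], [0.0, 0.0]]]   # channel 1, mean 0.0
w1 = [[0.0, 0.0]]        # 2 channels -> 1 hidden unit (toy reduction)
w2 = [[1.0], [1.0]]      # 1 hidden unit -> 2 channel gates
z = squeeze(features)
rescaled = se_rescale(features, w1, w2)  # zero hidden activation gives a gate of 0.5
```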

Implementation Details
We now introduce the implementation details of our proposed DRSEN. We build the residual branch by stacking 20 RSEBs. In the residual branch, all convolutional kernels are 3 × 3 except in the SE module and the LFF module, where the kernel size is 1 × 1. Zero-padding is used to keep the spatial size fixed. To save memory and reduce computation, the shallow feature extraction layer of the residual branch has 32 filters. Within the RSEB, the number of channels is increased to 64 by the first convolution layer and then decreased to 32 by the second convolution layer. The features are then concatenated along the channel dimension and followed by a 1 × 1 convolution layer. In the SE module, a convolutional layer first reduces the number of channels to 8, and the number of channels is then increased to 64. We use ESPCN to up-scale the coarse-resolution features to fine ones. For the identity branch, we use a 3 × 3 convolution layer with a dilation of 2 to increase the number of channels and then up-scale the resolution with ESPCN.
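To give a feel for where parameters are spent, the short sketch below counts the weights of the two 3 × 3 feature extraction convolutions of one RSEB using the channel widths stated above (32 to 64, then 64 to 32). The helper `conv_params` is hypothetical, and the presence of a bias term in each layer is an assumption. The LFF and SE layers are left out because their input widths are not fully specified in the text.

```python
def conv_params(in_channels, out_channels, kernel, bias=True):
    """Number of learnable parameters of a square 2-D convolution layer."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


first_conv = conv_params(32, 64, 3)    # 3*3*32*64 + 64
second_conv = conv_params(64, 32, 3)   # 3*3*64*32 + 32
per_block = first_conv + second_conv   # feature extraction part of one RSEB
all_blocks = 20 * per_block            # 20 stacked RSEBs, feature extraction only
```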

Settings
In this section, we first introduce the two datasets, the degradation model and the details of our training set. Then, we perform some ablation experiments to verify the effectiveness of the local feature fusion module and the SE module. Finally, quantitative and visual results of our method and the state-of-the-art methods are shown.

Datasets and Evaluation Metrics
We choose two datasets with different spatial resolutions to verify the robustness of our proposed method. Some of the training images are shown in Figure 4.
(1) UC Merced [27]: The UC Merced land-use dataset is composed of 2100 land-use scene images measuring 256 × 256 pixels with high spatial resolution (0.3 m/pixel) in the RGB color space. This dataset is commonly adopted for the super-resolution of remote sensing images. We randomly select 1700 images of the dataset for training and use the other 400 images as the testing set.
(2) NWPU-RESISC45 [28]: This dataset is a public benchmark created by Northwestern Polytechnical University (NWPU), which contains images with spatial resolutions varying from 30 m to 0.2 m per pixel. This dataset contains 45 scenes with a total of 31,500 images, 700 per class. The size of each image is 256 × 256 pixels. We randomly select 4500 images for training and 180 images for testing.
We use the peak signal-to-noise ratio (PSNR) [dB] and structural similarity index measure (SSIM) as criteria to evaluate the performance of our proposed model.
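For reference, PSNR can be computed from the mean squared error between a reference image and its reconstruction, as in the sketch below. The function name, the flat-list image layout and the peak value of 255 (8-bit images) are assumptions; the text does not state whether the metric is evaluated on RGB channels or on luminance only.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak * peak / mse)


reference = [100.0, 150.0, 200.0, 250.0]
estimate = [v - 25.5 for v in reference]   # constant error of 25.5 grey levels
value_db = psnr(reference, estimate)        # MSE = 650.25, i.e. peak^2 / MSE = 100
```

Since PSNR is logarithmic in the MSE, a gain of a few tenths of a dB corresponds to a reduction of the MSE by several percent.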

Degradation Model
In order to demonstrate the effectiveness of our proposed DRSEN, we simulate LR images by bicubic down-sampling, using the Matlab function imresize with the option bicubic (denoted as BI for short). We use this BI degradation model to simulate LR images with scaling factors ×2, ×3 and ×4.

Training Settings
Following the settings of EDSR, in each training batch we randomly extract LR RGB patches of size 48 × 48 as inputs. We randomly augment the patches by flipping them horizontally and rotating them by 90°, 180° and 270°. The batch size is set to 16. Our models are implemented with the PyTorch [29] framework and optimized with Adam [30] by setting $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. We set the initial learning rate to $1 \times 10^{-4}$ and halve it every 100 epochs. All convolutional filters are initialized with the method of He et al. [31]. Training a DRSEN for 200 epochs takes roughly 12 h on an NVIDIA Tesla P100 GPU (NVIDIA, United States).
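The learning-rate schedule and the patch augmentation described here can be sketched as follows. The function names, zero-based epoch counting, the 50% flip probability and the inclusion of a "no rotation" case are assumptions. When training on pairs, the same transform has to be applied to the LR patch and its HR counterpart.

```python
import random


def learning_rate(epoch, base_lr=1e-4, step=100):
    """Halve the learning rate every `step` epochs (epochs counted from 0)."""
    return base_lr * (0.5 ** (epoch // step))


def rot90(patch):
    """Rotate an H x W nested list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*patch)][::-1]


def augment(patch, rng=random):
    """Random horizontal flip followed by a random rotation of 0, 90, 180 or 270 degrees."""
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]
    for _ in range(rng.randrange(4)):
        patch = rot90(patch)
    return patch


patch = [[1, 2], [3, 4]]
rotated = rot90(patch)                          # [[2, 4], [1, 3]]
augmented = augment(patch, random.Random(0))    # some flip/rotation of the patch
```

With 200 training epochs, this schedule uses the initial rate for epochs 0 to 99 and half of it for epochs 100 to 199.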

Residual Squeeze and Excitation Block Analysis
The residual squeeze and excitation block is the most critical component of our proposed network. To demonstrate the effect of each component in the block and verify the usefulness of this block, we carry out ablation experiments on the super-resolution task using the UC Merced dataset. In order to reduce the randomness of network training and the influence of noise, the network is trained 10 times under the same parameters and hardware conditions. Our experimental results are the mean of these training runs. Table 1 shows the ablation study on the effects of the local feature fusion (LFF) module and the SE module. For fairness, the number of blocks and features are the same. The results are measured by the mean PSNR (dB) on the testing dataset. We can conclude from the results that the model with both the LFF and SE modules achieves the best performance.
Local Feature Fusion Module. The LFF module adaptively aggregates the features to improve the representation capability of our deep network. To demonstrate the effect of this module, we remove the concatenation operator and the convolutional layer that follows it. The second and the last rows of Table 1 demonstrate that the LFF component improves the performance of the baseline by about 0.239 dB. This is mainly because LFF contributes to the representation ability of the network.
As shown in Figure 5, we investigated three different structures of the local feature fusion module. Structure (a) combines only the input and output features of our block. Structure (b) combines the input features and the output features of each convolutional layer in the RSEB. Structure (c) differs slightly from structure (b): it uses the features activated by the activation function in place of the output of the first convolutional layer. The experimental result is the mean of multiple training runs.
The experimental results are shown in Table 2. By comparing the results of structure (a) with the other two structures, we find that fully utilizing the features in the block is more conducive to improving performance. Comparing the results of structure (b) and structure (c), we conclude that using the features activated by the activation function yields higher accuracy.
Squeeze and Excitation Module. In order to evaluate the effect of the squeeze and excitation module, we run an ablation study on this component. Comparing the first and third columns of the results shown in Table 1, the SE module improves the performance from 34.451 dB (last row) to 34.687 dB (third row). These comparisons demonstrate the effectiveness of the SE module and indicate that recalibrating the channel importance of features enhances the performance.

Quantitative Results
We evaluate the performance of our proposed remote sensing image SR network DRSEN on the two test datasets with three different up-sampling factors: ×2, ×3 and ×4. We compare our method with Bicubic [32] and five other state-of-the-art methods: SRCNN [6], FSRCNN [10], VDSR [7], LGCNet [17] and EDSR [13]. For a fair and convincing comparison, we retrain these methods on our experimental datasets. For EDSR, in order to compare the networks fairly, we reduce the number of residual blocks to the same number as in our DRSEN and set the number of convolution filters to 64. The parameter settings for the other methods are the same as in their original papers. Table 3 presents the final mean PSNR and SSIM over the test images of the two datasets. For the rigor and credibility of the experiment, we independently train all these networks 10 times under the same conditions and use the average of the runs as the final result. Figure 6 shows the results of the individual runs of the proposed DRSEN. As illustrated in Table 3, our method achieves the best performance, with the highest PSNR and SSIM.
For the UC Merced dataset, our proposed method outperforms EDSR by an average of 0.260 dB in PSNR over the three scale factors. Since the spatial resolution of the original NWPU-RESISC45 images is lower, the reconstruction performance is slightly worse than on the UC Merced dataset. The PSNR of our method is 0.178 dB higher than that of EDSR. Moreover, our network has far fewer parameters than EDSR: 8.6 M for our DRSEN versus 16.6 M for EDSR. Figure 7 shows the number of network parameters for different CNN-based methods together with the reconstruction quality for scaling factor ×2 on our dataset. With roughly half the parameters of EDSR, our performance indicators are even better.

Visual Results
In order to demonstrate the effectiveness of our approach more fully, we also show some visual comparisons for the three scales ×2, ×3 and ×4 in Figures 8-10. We observe that, across the different scale factors, our proposed DRSEN achieves better results, with reduced sawtooth and ringing artifacts and better reconstruction of the structure of the objects in the picture.
Figure 8. Super-resolution results of "overpass26" (UC Merced) and "roundabout132" (NWPU-RESISC45) with scale factor ×2. The edges of the overpass and the lane lines in our results are clearer. We refer to FSRCNN as FSR for short.

Figure 9. Super-resolution results of "airplane76" (UC Merced) (a) and "runway512" (NWPU-RESISC45) (b) with scale factor ×3. The texture of the airplane and the lines in the sidewalk are visible in our results, while the other methods suffer from blurring artifacts. We refer to FSRCNN as FSR for short.
As shown in Figure 10, at the large scale factor, the bicubic up-sampling strategy results in a loss of texture and structure and produces blurry SR results. Methods such as SRCNN, VDSR and LGCNet, which take such bicubic up-sampling results as network inputs, produce erroneous structural and texture information and fail to recover more details, ultimately resulting in poor SR image quality. Although EDSR uses the original LR image as the input to the network, it cannot restore the correct texture structure, while our DRSEN recovers more of the structural and texture information of the corresponding HR image. This comparison clearly shows that our network has more powerful representation capabilities and can extract complex features from the LR space.
Our network is fully convolutional, so the input image size can be arbitrary. During training, the neural network learns the mapping between low-resolution image patches, cropped from the original low-resolution input, and the corresponding high-resolution image patches. In the testing phase, we only need to feed the image to be processed into the network to obtain the corresponding super-resolution result. The upper limit of the input image size depends only on the memory of the device. If there is not enough memory, we can also divide a large image into several small images and stitch the results together to get the final output.
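The divide-and-stitch strategy for memory-limited inference can be sketched as below. The trained network is replaced by a nearest-neighbour up-scaler (`upscale_nearest`) purely as a stand-in, and the function names, the single-channel nested-list layout and the use of non-overlapping tiles are all assumptions made for illustration.

```python
def upscale_nearest(tile, scale):
    """Stand-in for the trained network: nearest-neighbour up-scaling of one tile."""
    return [[v for v in row for _ in range(scale)]
            for row in tile for _ in range(scale)]


def super_resolve_tiled(image, scale, tile, model=upscale_nearest):
    """Split an H x W image into tiles, up-scale each tile, and stitch the outputs."""
    height, width = len(image), len(image[0])
    out = [[None] * (width * scale) for _ in range(height * scale)]
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            patch = [row[left:left + tile] for row in image[top:top + tile]]
            sr = model(patch, scale)
            for dy, sr_row in enumerate(sr):
                start = left * scale
                out[top * scale + dy][start:start + len(sr_row)] = sr_row
    return out


image = [[r * 5 + c for c in range(5)] for r in range(3)]   # 3 x 5 toy image
tiled = super_resolve_tiled(image, scale=2, tile=2)         # processed in 2 x 2 tiles
whole = upscale_nearest(image, 2)                           # processed in one pass
```

For this stand-in the tiled and whole-image outputs coincide exactly. With a convolutional network they can differ near tile borders, because zero padding replaces the true neighbouring pixels there; processing overlapping tiles and discarding the borders is a common remedy, although the text does not say whether it was used.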

Discussion
According to the experimental results and analysis in Sections 3.3 and 3.4, the proposed algorithm performs better than the other methods. However, some limitations remain. Our DRSEN is a PSNR-oriented method, which tends to output over-smoothed results without sufficient high-frequency details at large scaling ratios, as illustrated in Figure 10.
Although our DRSEN achieves competitive PSNR and SSIM, there is still a gap between its visual results and human visual perception. A possible reason is that the PSNR metric fundamentally disagrees with the subjective evaluation of human observers. In addition, training the network requires paired low-resolution observations and high-resolution remote sensing images. In practice, however, it is often difficult to obtain pairs of high-resolution and degraded images. In the future, we will explore the perceptual loss to strike a balance between PSNR and visual quality. Additionally, we will investigate the cycle-consistency concept of CycleGAN [33] to train the super-resolution network with unpaired data. In this case, the network can learn to convert a low-resolution remote sensing image into a high-resolution image by observing only the degraded image.

Conclusions
In this paper, we propose a novel network named DRSEN to improve the representation ability of deep networks and achieve better performance in remote sensing image super-resolution. Specifically, the RSEB is proposed as the building module of our deep SR network. We use the local feature fusion module in the RSEB to make use of both the block input and the features inside the block. Such a module can improve the network representation capability and stabilize the training. Meanwhile, the squeeze and excitation module is used to adaptively recalibrate channel-wise feature responses by explicitly modelling interdependencies between channels, which further improves the ability of our network. We modify the global residual pathway and remove some redundant convolutional layers to decrease the parameters and computation. Our model is trained on two public benchmark remote sensing datasets with various spatial resolutions. The experimental results demonstrate that our proposed method obtains accurate results with fewer parameters and outperforms most of the state-of-the-art methods in quality and accuracy. In the future, we will continue to focus on improving super-resolution quality and addressing the problem of the paired dataset.

Figure 1. Comparison of the network structures of EDSR and our DRSEN. (a) Network architecture of EDSR; (b) network architecture of our deep residual squeeze and excitation network (DRSEN).

Figure 2. Comparison of the original residual block and our RSEB. (a) Residual block structure of EDSR; (b) residual squeeze and excitation block (RSEB) architecture.

Figure 4. Examples of images in the two datasets. The first row is from the UC Merced dataset, and the second row is from the NWPU-RESISC45 dataset.

Figure 5. The three structures of the local feature fusion module that we investigated: (a) has no connections inside the block, (b) adopts a long-distance skip connection and (c) uses a short-path connection.

Figure 6. The index values of the 10 training runs. The results are evaluated on the UC Merced test dataset for ×2 SR.

Figure 7. The number of network parameters versus performance. The results are evaluated on the UC Merced test dataset for ×2 SR. Our proposed models achieve better performance with relatively fewer parameters.

Figure 10. Super-resolution results of "denseresident191" (UC Merced) (a) and "railwaystation565" (NWPU-RESISC45) (b) with scale factor ×4. The outline of the car is distinct and the lattices of the building roof are closer to the original image. We refer to FSRCNN as FSR for short.

Table 1. Comparative experiments of our model on UC Merced for ×2 SR. Removing each component will degrade the final performance.

Table 2. Comparative experiments of the local feature fusion module with three different structures on UC Merced for ×2 SR.

Table 3. Evaluation of state-of-the-art SR methods on the remote sensing datasets UC Merced and NWPU-RESISC45. Average PSNR (dB) and SSIM for scales ×2, ×3 and ×4. Bold numbers indicate the best performance. For each scale, the upper row is the PSNR and the bottom row is the SSIM.