In the field of remote sensing, high-resolution (HR) images contain many detailed textures and critical information, which are essential for object classification and detection tasks. Given the limitations of hardware such as chips and sensors, and the high production costs, super-resolution (SR) is regarded as one of the most effective approaches to obtain high-spatial-resolution images from single or multiple low-resolution (LR) images [1]. Multi-frame methods establish the relation between a targeted HR image and several LR images of the same scene acquired under different conditions to create a higher-resolution result. Single-image SR algorithms, in contrast, have to rely solely on one given input image, which is crucial when no additional data are available. Single-image SR methods can be efficiently used as pre-processing operations for further manual or automatic processing steps, such as classification or object extraction. However, because high-frequency detail is lost and a single LR image admits multiple plausible HR targets, the SR task is an ill-posed inverse problem.
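One way to see this ill-posedness is to simulate the degradation: blurring by the imaging system followed by subsampling discards information, so distinct HR signals can map to the same LR observation. A minimal 1-D sketch (the kernel weights and scale factor here are illustrative assumptions, not values from this paper):

```python
def degrade(hr, psf, scale):
    """Simulate LR acquisition: convolve an HR signal with the PSF,
    then subsample by the scale factor.

    hr: list of intensities, psf: odd-length kernel, scale: int.
    """
    k = len(psf) // 2
    blurred = []
    for i in range(len(hr)):
        acc = 0.0
        for j, w in enumerate(psf):
            idx = min(max(i + j - k, 0), len(hr) - 1)  # replicate borders
            acc += w * hr[idx]
        blurred.append(acc)
    return blurred[::scale]  # subsample

# A sharp HR edge becomes a smeared, shorter LR signal:
lr = degrade([0, 0, 0, 1, 1, 1], [0.25, 0.5, 0.25], 2)  # -> [0.0, 0.25, 1.0]
```

Inverting this many-to-one mapping is the SR task.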
The detail of a physical object that a conventional optical system can reproduce in an image is limited by diffraction, which constrains the resolving power of optical systems. Harris [3] and Goodman [4] established the theoretical foundation for the SR problem by introducing theorems on overcoming diffraction in an optical system, and coined the term SR for reconstructing an HR image from a single LR image. When imaging objects with optical fields propagating to the far field, the basic constraint is the diffraction of light, which limits a conventional optical system to a spatial resolution comparable to the wavelength of light. Optical diffraction by the imaging system transforms all radiation sources into blurred spatial distributions. In the field of remote sensing, a point in the HR domain is blurred in the LR space during the acquisition process, as specified by the point spread function (PSF). Hence, SR can be seen as inverting the degradation generated by the imaging system to obtain an HR image. Tsai and Huang [5] proposed using multiframe LR images for reconstruction to improve the spatial resolution of Landsat TM images. Approaches for remotely sensed images can be categorized as optics-based methods, interpolation methods, and machine-learning methods [6]. Optics-based methods such as dielectric cube terajet generation, wide-aperture lenses, and the solid-immersion technique have been proposed to enhance the resolution of imaging systems [10]. In contrast, machine-learning methods learn the potential relationships between the low-resolution and high-resolution domains from an external training set, then generate the final super-resolved image using this prior knowledge; they can improve reconstruction quality in parallel with optics-based methods. Among machine-learning methods, deep learning has achieved the best performance. Dong et al. [15] first proposed SRCNN, a three-layer neural network that learns an end-to-end mapping between LR and HR patches.
Recently, many methods (e.g., VDSR [16], EDSR [17], WDSR [18]) based on very deep neural networks outperformed the relatively shallow CNN model [15]. Among these methods, two main strategies for the design of the SR model can be observed, as shown in Figure 1. However, deeper CNN models (built by adding more convolutional layers) give rise to enormous numbers of parameters and a difficult training procedure, with a growing risk of overfitting [21]. As very deep networks with enormous parameters are highly likely to overfit and demand more storage space, DRCN [22] and DRRN [23] used recursive learning, repeatedly applying the same convolutional layer or residual units to reduce the model parameters and keep the model compact. To address these problems, various techniques have been introduced into SR neural networks; we review them under three groups, as shown in Figure 2.
Since there is a high correlation between the input image and the target HR image, many methods such as VDSR [16], EDSR [17], WDSR [18], SRResNet [25], SRDCN [26], and DRRN [23] use the local residual path from ResNet [24] and a global residual path to propagate information from the shallow layers to the final reconstruction layer. Several methods based on DenseNet [27], such as SRDenseNet [28] and RDN [29], use a concatenation strategy to combine preceding features into a bottleneck layer for reconstruction. MemNet [30], CARN [31], RDN [29], ESRGAN [32], and DBPN [33] also adopt dense connections to alleviate vanishing gradients and reuse features from shallow layers. To control the number of parameters while achieving a large receptive field, DRCN [22] repeatedly applied the same convolutional layer for 16 recursions to reach a receptive field of 41 × 41. DRRN [23] proposes a recursive block consisting of several residual units and shares the weights among these residual units to further improve performance.
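The receptive-field arithmetic behind such recursive designs is simple: with stride-1 convolutions, each k × k layer widens the receptive field by k − 1 pixels. A short sketch (the split of DRCN's convolutional applications into shared and non-shared layers below is our illustrative reading, not taken from this paper):

```python
def receptive_field(num_convs, kernel=3):
    # Stride-1 convolutions: each layer widens the field by (kernel - 1).
    field = 1
    for _ in range(num_convs):
        field += kernel - 1
    return field

# Reaching 41 x 41 requires 20 stacked 3x3 convolutions in total; if 16 of
# them are weight-shared recursions, only a handful of distinct weight sets
# need to be stored.
print(receptive_field(16 + 4))  # -> 41
```

This is why recursion grows the effective depth (and receptive field) without growing the parameter count.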
In the SR task, recurrent neural networks are usually applied in video SR to capture long-term dependencies across neighboring frames [34]. BRCN [35] consists of three parts: a feedforward convolutional layer to capture the spatial dependence between LR and HR, a bidirectional recurrent convolutional network to capture the temporal dependency between successive frames, and a conditional convolutional layer to further capture spatial-temporal dependency. STCN [37] proposes a bidirectional LSTM structure to capture spatial-temporal information for video frame reconstruction. For single-image SR, building on the idea of viewing a ResNet as an unrolled RNN [38], DSRN [40] proposes a dual-state recurrent network in which each state operates at the LR and HR spatial resolution separately, exploring the connection between LR and HR pairs and providing information flow from LR to HR at every recursion.
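For reference, the ConvLSTM cell underlying these spatio-temporal models (and the BiConvLSTM layer used later in this paper) replaces the matrix multiplications of a standard LSTM with convolutions; in the common formulation (peephole terms omitted), with $*$ denoting convolution and $\circ$ the Hadamard product:

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right),\\
f_t &= \sigma\!\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right),\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\!\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right),\\
o_t &= \sigma\!\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right),\\
H_t &= o_t \circ \tanh\!\left(C_t\right).
\end{aligned}
```

A bidirectional variant runs one such cell forward and one backward over the sequence and combines the two hidden states.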
In the remote sensing area, Hua et al. [41] proposed a novel RNN model for hyperspectral image classification that effectively analyzes hyperspectral pixels as sequential data. Convolutional LSTM has been utilized to address the spectral-spatial feature learning problem and extract more discriminative and abstract features for hyperspectral image classification [42]. Besides that, Mou et al. [45] proposed a recurrent convolutional network architecture to effectively analyze temporal dependence in bitemporal images for multitemporal remote sensing image analysis.
In this paper, we propose a BiConvLSTM SR network (BCLSR) for remote sensing images. Our intuition was that, to reduce the model parameters while increasing the receptive field of our network, we should build a recursive inference block with dense connections to extract hierarchical features. Within the recursive inference block, paths are created between a layer and every preceding layer to strengthen the flow of information through the deep network and reuse the features extracted in previous layers. Since there is redundancy and complementarity between recursions, we inserted a BiConvLSTM layer to effectively learn the correlations between the different levels and select complementary information for the reconstruction layer. To fuse the hierarchical features extracted from the recursive inference block, we concatenate the outputs of the recursions into a temporal sequence, in recursion order, and pass this sequence through the BiConvLSTM layer to extract complementary information from the low-level features. Due to the high correlation between the LR and HR images, we construct a global residual path by upsampling the LR image with nearest-neighbor interpolation to the size of the HR image, while the other path learns the high-frequency details.
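The global residual path described above can be sketched as follows. This is a toy illustration of the structure with illustrative function names, not the paper's implementation; the learned branch is stood in for by a given high-frequency map:

```python
def nearest_upsample(img, scale):
    """Nearest-neighbor upsampling of a 2-D image (a list of rows)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(scale)]   # repeat columns
        out.extend(list(wide) for _ in range(scale))    # repeat rows
    return out

def reconstruct(lr, high_freq, scale):
    # Global residual path: the interpolated LR image carries the low
    # frequencies; the network branch (here a given 'high_freq' map of
    # HR size) only has to supply the missing high-frequency detail.
    up = nearest_upsample(lr, scale)
    return [[u + h for u, h in zip(ur, hr)] for ur, hr in zip(up, high_freq)]
```

For example, `nearest_upsample([[1, 2], [3, 4]], 2)` repeats each pixel twice in both directions; because the interpolated path already matches the HR size, the two branches can be summed elementwise at the reconstruction layer.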
In summary, our contributions, illustrated in Figure 3, are as follows.
(1) We propose a novel recursive inference block that reuses local low-level features and widens the receptive field without additional parameters. By recursively applying the block with shared weights, our deeper model, with more nonlinearities, can model more complex mapping functions to further improve performance.
(2) We introduce a BiConvLSTM layer to fuse the hierarchical features by exploiting the dependencies and correlations among different-level features. The BiConvLSTM layer adaptively extracts complementary information from the low-level features to improve performance.
(3) Our BCLSR achieves an improvement of about 0.9 dB over state-of-the-art results on multispectral satellite images, panchromatic satellite images, and natural high-resolution remote sensing images, while needing fewer parameters. Cross-validation experiments and a comparison of the parameters-to-PSNR relationship further demonstrate the effectiveness of our method.
The rest of this paper is organized as follows. Section 2 describes the generation of our dataset and presents our methods in detail. Section 3 provides extensive experiments to verify our methods and a discussion of the experimental results.
(1) Difference from SRCNN and VDSR: First, SRCNN [15] and VDSR [16] need to upsample the original LR image to the desired size using bicubic interpolation. This results in feature extraction and reconstruction in the HR space, while BCLSR extracts hierarchical features from the original LR image, significantly reducing computational complexity and improving performance. Second, SRCNN [15] and VDSR [16] used the L2 loss function, while we utilize the L1 loss function, which has been demonstrated to be more powerful for performance and convergence.
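The two losses differ in how they weight individual pixel errors; a minimal per-pixel sketch (generic definitions, not this paper's training code):

```python
def l1_loss(pred, target):
    """Mean absolute error: every residual contributes proportionally."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def l2_loss(pred, target):
    """Mean squared error: large residuals dominate the objective."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# With one large edge error among small ones, L2 is driven almost entirely
# by the outlier (0.8**2 = 0.64 vs 0.02**2 = 0.0004), which tends to
# encourage over-smooth predictions; L1 penalizes all errors evenly.
pred, target = [0.10, 0.12, 0.90], [0.10, 0.10, 0.10]
```

This difference in gradient behavior is one common explanation for the sharper textures reported with L1 training.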
(2) Difference from EDSR and WDSR: First, both EDSR [17] and WDSR [18] apply global and local residual learning, but their global residual path is the addition of low-level and high-level features, which is computationally expensive. In BCLSR, as shown in Section 2.5, we directly introduce a nearest-neighbor interpolation path to upsample the LR image to the size of the HR image, forming the global residual path, which accelerates convergence. Second, there are no dense connections in EDSR [17] or WDSR [18]; BCLSR adopts a densely-connected structure to reuse low-level features, providing richer information for reconstructing high-quality details.
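The densely-connected structure referred to above can be sketched in a few lines. This is a toy channel-counting illustration of dense connectivity in general, not BCLSR's actual block:

```python
def dense_block(x, layers):
    """Each layer receives the concatenation of the block input and the
    outputs of every preceding layer (DenseNet-style connectivity)."""
    features = [x]
    for layer in layers:
        concat = [v for f in features for v in f]  # channel-wise concat
        features.append(layer(concat))
    return [v for f in features for v in f]  # everything flows onward

# Toy layers that just report how many channels they were given: with a
# 2-channel input, successive layers see 2, 3, and 4 channels.
layers = [lambda channels: [len(channels)] for _ in range(3)]
print(dense_block([0, 0], layers))  # -> [0, 0, 2, 3, 4]
```

Because early features are concatenated rather than overwritten, low-level detail remains directly available to the reconstruction stage.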
(3) Difference from RDN: First, RDN [29] is also built upon dense connections and constructs a basic local dense block, while BCLSR utilizes a recursive learning strategy and repeatedly applies the same inference block, which reduces the storage demand and keeps the model concise while increasing its depth. Second, RDN [29] concatenates all feature maps produced by its residual dense blocks and then uses a composite function of 1 × 1 and 3 × 3 convolution layers to fuse this concatenation. However, as demonstrated in Section 3.3, this fusion strategy cannot fully extract global features. BCLSR applies the BiConvLSTM layer to selectively extract complementary information from different-level features, avoiding passing redundant features to the reconstruction layer.
(4) Parameters-to-PSNR comparison: To further compare the processing time and number of parameters of different methods, Figure 14 illustrates the parameters-to-PSNR relationship on the COWC testing sets at a scale factor of ×2 for our model with 4, 8, and 16 recursions (denoted BCLSR_R4, BCLSR_R8, and BCLSR_R16), as well as SRCNN [15], VDSR [16], EDSR [17], WDSR [18], and RDN [29]. The proposed BCLSR benefits from inherent parameter sharing and therefore obtains higher parameter efficiency than the other methods, and the local dense connections reuse local low-level features, strengthening the information flow within each recursion. Besides that, due to the varying scale of objects in remote sensing images, the BiConvLSTM layer extracts complementary information from different-level recursions and provides additional information for reconstructing the HR images. As demonstrated in Figure 14, although the BiConvLSTM layer is relatively time-consuming compared with RDN [29], our method outperforms RDN with fewer parameters and represents a reasonable trade-off between model size and SR performance with modest inference time.
(5) Failure cases: As the spatial resolutions of the GF-2 images are only 4 m and 1 m, too much high-frequency information is lost, especially in areas full of buildings. As shown in Figure 15, the reconstruction of the white road is much better than that of the dense buildings at a scale of ×4. It is a common phenomenon in large-scale-factor reconstruction that severe loss of information makes SR methods fail to recover the fine details and leaves the reconstruction result over-smoothed.