1. Introduction
Remote sensing images are increasingly used in various fields, such as target characteristic analysis [1], detection [2], and classification [3,4]. However, due to the trade-off between spectral and spatial resolution, multi-channel images have a coarse spatial resolution, which limits their further application. Super-resolution (SR) approaches, which directly reconstruct high-resolution (HR) images from low-resolution (LR) images, play a vital role in resolution enhancement and are meaningful for the practical application of remote sensing images. How to design an effective SR model for remote sensing images is the focus of this paper.
Numerous SR models have been proposed in recent years and can be classified into two types: multi-image SR (MISR) models and single-image SR (SISR) models. For the first type, models [5,6,7] fuse multiple remote sensing images to improve the spatial resolution. Dian et al. [5] learned a spectral dictionary from multi-spectral and hyperspectral images to produce sparse representations for enhancing the resolution. In [6], Liu et al. applied a two-stream fusion network to enhance the resolution of multi-spectral images combined with panchromatic images. Huang et al. [7] proposed a compact step-wise fusion strategy that incorporates multi-spectral and panchromatic images into one framework to improve the resolution of hyperspectral images. These models achieve remarkable SR performance for remote sensing images thanks to the additional fused data. However, the reconstructed results of multi-image SR models are sensitive to geometric correction and time variations, limiting their further application.
SISR models concentrate on building powerful models that extract features from a single image. The images reconstructed by SISR [8] can serve many other applications, such as target tracking [9] and disparity map generation [10]. Most existing SISR methods can be roughly categorized into hand-crafted models and end-to-end models [11]. For the former, each step is manually designed and offers good interpretability. Interpolation-based models such as bilinear and bicubic interpolation [12] are one kind of hand-crafted SR model and have been widely applied in remote sensing production. Earlier studies worked on optimizing linear regression models to improve their reconstruction performance. Ma et al. [13] proposed a robust local kernel regression approach to enhance the spatial resolution of multi-angle remote sensing images. In [14], Schulter et al. presented a locally linear model that employs random forests to map LR images to HR images. Timofte et al. [15] summarized seven techniques that are widely applicable to SISR methods. These models have an intuitive structure and can quickly enhance the resolution, but they suffer from severe quality degradation in the reconstructed images. Sparse representation-based models are another type of hand-crafted SR model, which flexibly combine atoms and elements [16] to reconstruct HR images. Peleg and Elad [17] designed a dictionary-pair model that extracts the sparse coefficients of HR and LR images to improve the resolution. In [18], Hou et al. explored a global joint dictionary model to capture both the global and local information of remote sensing images. To increase the representation ability of sparse decomposition, Shao et al. [19] applied a coupled sparse autoencoder to effectively map LR images to HR images. However, these models incur a huge computational expense for the sparsity constraint, which may contribute little to image representation [20,21]. More importantly, sparsity-based SISR models are weak at extracting deep features, which restricts their reconstruction precision.
End-to-end models are composed of various networks [22] whose parameters are automatically updated by forward and backward propagation. These models [18,19,20,21,22,23,24,25,26,27] were designed for natural images and provide references for the SR task on remote sensing images. Patil et al. [23] proposed using a neural network to extract the structural correlation and predict fine details of the reconstructed images. Su et al. [24] combined a Hopfield neural network and contouring to enhance the resolution of remote sensing images. In recent years, convolutional neural networks (CNNs) have been widely used to enhance image resolution. The pioneering study [25] employed a CNN to improve the resolution and achieved better performance than hand-crafted methods. Shi et al. [26] designed an efficient sub-pixel convolutional network (ESPCN), which introduced a pixel-shuffle layer to reduce the computational complexity. In [27], Kim et al. used a residual-learning module and designed a very deep SR model (VDSR) to reconstruct HR images. A deeper model, the residual dense network (RDN) [28], was constructed to make full use of the hierarchical features of the LR images and achieve a better trade-off between efficiency and effectiveness in recovering HR images. These models fully exploit spatial information to improve the resolution, but they ignore the internal relations between different channels. In [29], Zhang et al. designed residual channel attention blocks to weigh the spectral bands and built an SR model named the residual channel attention network (RCAN). Basak et al. [30] optimized an RCAN model and applied it to enhance single-image resolution. In [31], Mei et al. explored the effects of cross-scale spatial information on SR and proposed a cross-scale non-local network (CSNLN), which introduces non-local priors into the framework to extract multi-scale features within an LR image. Xia et al. [32] built an architecture called the efficient non-local contrastive network (ENLCN), which consists of non-local attention and a sparse aggregation module to further strengthen the effect of relevant features. However, these models are not designed for multi-channel remote sensing images and fail to extract nonlinear spectral information.
Inspired by the aforementioned approaches, a great number of SR models for remote sensing images have been proposed. Mei et al. [33] constructed a three-dimensional fully convolutional neural network (3D-FCNN) for multi-spectral and hyperspectral images. This model exploits both the spatially neighboring pixels and the spectral bands, but without sufficient distinction between interesting and uninteresting information. Li et al. [34] proposed a gradient-guided group-attention network to map LR images to HR images; the gradient information was introduced into the reconstruction framework to promote sharp edges and realistic textures, but this strategy causes texture distortion when enhancing the resolution at small scales. In [35], Wang et al. employed a recurrent feedback network to exploit the spatial–spectral information; however, their grouping strategy for spectral channels destroys the structure of the spectral curve. Lei and Shi [36] designed a hybrid-scale self-similarity exploitation network (HSENet), which uses similarities at different scales to enhance remote sensing images. They then designed the transformer-based enhancement network (TransENet) [37], which applies transformers to fuse multi-scale features for image enhancement. Multi-scale self-similarity exploitation provides abundant textures for the reconstructed images, but these models ignore spectral features. Deng et al. [38] designed a multiple-frame splicing strategy to enhance the resolution of hyperspectral images; however, this model focuses on improving distorted images, limiting the stability of the spectral information in the reconstructed images.
Generally, the adjacent spectral bands and spatial pixels in multi- or hyperspectral remote sensing images are correlated [39]. To fully extract the interesting spatial–spectral information, we designed a novel algorithm named the hybrid SR network (HSRN) to map LR multi-channel (channel number ≥ 3) remote sensing images to HR images. Specifically, we designed a hybrid module consisting of three-dimensional (3D) and two-dimensional (2D) convolutional layers to extract the nonlinear information of the spectral and spatial domains. Additionally, to exploit the inherent differences and interdependence across feature maps, we introduced channel (spectral) and spatial attention mechanisms, which improve the discriminative learning ability. We employed a sub-pixel upsampling module (pixel-shuffle layer) to recombine the feature maps and enhance the resolution of the images. Finally, we applied a joint loss function to constrain the model and recover images whose texture and spectra are as close as possible to the label maps. We tested our model on three public datasets and calculated three evaluation metrics to assess the performance of the SR methods. The experimental results show that our model outperforms state-of-the-art models. The main contributions of this article can be summarized as follows.
- (1)
We propose a novel hybrid SR model combining 3D and 2D convolutional networks for multi-channel images. This design encourages our model to capture spatial and spectral information simultaneously and to fully utilize the different responses of the various channels to enhance the spatial resolution (see the sketch after this list).
- (2)
We design an attention structure to strengthen the SR performance for multi-channel images. We apply channel attention to learn the inter-band correlation before the upsampling block and employ spatial attention to refine the spatial texture of the upsampled feature maps.
- (3)
We introduce multi-scale structural similarity (MS-SSIM) into our loss function to constrain the proposed model and acquire rich texture. The MS-SSIM term forces our model to learn the multi-scale structure of the labels and reconstruct high-quality HR images (see the loss sketch after this list).
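To make contributions (1) and (2) more concrete, the following PyTorch sketch shows one way the hybrid 3D/2D convolution, the channel and spatial attention, and the pixel-shuffle upsampling could be arranged. All module names, channel counts, kernel sizes, and layer placements are illustrative assumptions, not the exact HSRN configuration described in Section 3.

```python
# A minimal, self-contained sketch of a hybrid 3D/2D SR network with attention.
# Hyperparameters (8 intermediate 3D feature maps, 64 2D features, etc.) are assumptions.
import torch
import torch.nn as nn


class Hybrid3D2DBlock(nn.Module):
    """3D convolution over (band, height, width), then 2D convolution over fused bands."""
    def __init__(self, bands: int, feats: int = 64):
        super().__init__()
        # The 3D conv treats the spectral dimension as depth: input shape (N, 1, B, H, W).
        self.conv3d = nn.Conv3d(1, 8, kernel_size=3, padding=1)
        # The 2D conv mixes the flattened spectral features: input shape (N, 8*B, H, W).
        self.conv2d = nn.Conv2d(8 * bands, feats, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                               # x: (N, B, H, W)
        y = self.act(self.conv3d(x.unsqueeze(1)))       # (N, 8, B, H, W)
        y = y.flatten(1, 2)                             # (N, 8*B, H, W)
        return self.act(self.conv2d(y))                 # (N, feats, H, W)


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style weighting of feature channels."""
    def __init__(self, feats: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(feats, feats // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(feats // reduction, feats, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))


class SpatialAttention(nn.Module):
    """Single-channel spatial mask computed from pooled feature statistics."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        stats = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(stats))


class HSRNSketch(nn.Module):
    def __init__(self, bands: int = 4, feats: int = 64, scale: int = 2):
        super().__init__()
        self.body = Hybrid3D2DBlock(bands, feats)
        self.ca = ChannelAttention(feats)
        self.up = nn.Sequential(nn.Conv2d(feats, feats * scale ** 2, 3, padding=1),
                                nn.PixelShuffle(scale))
        self.sa = SpatialAttention()
        self.tail = nn.Conv2d(feats, bands, 3, padding=1)

    def forward(self, lr):                   # lr: (N, B, h, w)
        x = self.ca(self.body(lr))           # channel attention before upsampling
        x = self.sa(self.up(x))              # spatial attention after the pixel shuffle
        return self.tail(x)                  # (N, B, h*scale, w*scale)


if __name__ == "__main__":
    lr = torch.rand(1, 4, 32, 32)                        # dummy 4-band LR patch
    print(HSRNSketch(bands=4, scale=2)(lr).shape)        # torch.Size([1, 4, 64, 64])
```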
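For contribution (3), a joint loss can pair MS-SSIM with a pixel-wise term. Below is a minimal sketch assuming the third-party pytorch_msssim package and an L1 pixel term; the mixing weight alpha and the data range are illustrative assumptions, not the settings used for HSRN.

```python
# A minimal sketch of a joint L1 + MS-SSIM loss (pip install pytorch-msssim).
import torch
import torch.nn as nn
from pytorch_msssim import MS_SSIM


class JointLoss(nn.Module):
    def __init__(self, channels: int = 4, alpha: float = 0.84):
        super().__init__()
        self.l1 = nn.L1Loss()
        # data_range=1.0 assumes intensities scaled to [0, 1]; the default 5-scale
        # MS-SSIM needs training patches larger than roughly 160 x 160 pixels.
        self.ms_ssim = MS_SSIM(data_range=1.0, channel=channels)
        self.alpha = alpha

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        # MS-SSIM is a similarity (1 = identical), so it contributes as 1 - MS-SSIM.
        return self.alpha * (1.0 - self.ms_ssim(sr, hr)) + (1.0 - self.alpha) * self.l1(sr, hr)
```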
The organization of this article is as follows. Section 2 presents the related work on SR models. Section 3 describes the details of our proposed algorithm for multi-channel remote sensing images. The experimental results on the public datasets and their analysis are presented in Section 4. Conclusions are drawn in Section 5.