1. Introduction
With the rapid development of imaging sensors, remote sensing images are gaining great attention [1]. Among these sensors, hyperspectral imaging sensors usually collect hundreds of bands ranging from the visible to the infrared wavelengths at a step of less than 10 nm, forming a three-dimensional (3D) data cube [2,3]. The rich and fine spectral information in hyperspectral images (HSIs) enables their wide application in various fields, including medical diagnosis [4], military rescue [5], change detection [6,7], object detection [8], agriculture monitoring [9], etc. However, as a result of the high spectral resolution, the energy reaching a single band is limited, resulting in the poor spatial resolution of HSIs. This trade-off between spectral and spatial information is one of the fundamental issues in HSI processing [10]. HSI super-resolution (SR) aims to enhance the spatial information of the input HSI without modifying the equipment; it plays a significant role in accurate classification and has become a hot topic in remote sensing [11].
Numerous studies have addressed SR methods for HSIs. According to the number of input images, the existing HSI SR methods can be roughly classified into fusion-based and single-image super-resolution (SISR) approaches [12]. The fusion-based methods combine multi-modal images captured by different sensors, aiming to obtain a final image that inherits the strengths of each sensor [13]. The SISR methods mainly focus on improving the spatial information by importing priors [14].
The typical inputs of the fusion-based methods are combinations of the HSI and a multispectral image (MSI) [15], the HSI and an RGB image [16,17], or the HSI and a panchromatic (PAN) image [18]. In all these combinations, the HSI provides the rich spectral information, and the other image provides the rich spatial information. Owing to the excellent performance of deep learning in nonlinear mapping, many neural networks have been designed for the fusion task [19]. For the fusion of a high-resolution (HR) PAN image and a low-resolution (LR) HSI, many networks focus on injecting the spatial information of the PAN image into the HSI to obtain the desired HR HSI [20]. The Laplacian pyramid decomposition technique has been merged into a convolutional neural network, decomposing the image into different levels and achieving a more efficient performance [21]. By injecting the residual between the HSI and the PAN image into the structure of the HSI, which serves as an additional constraint during the super-resolving process, an appealing performance has been achieved [22]. Compared with the HR PAN image, the HR MSI conveys more information, making the fusion between the LR HSI and the HR MSI much more complicated [23]. Both input images are usually formulated as spatially and spectrally down-sampled versions of the desired HR HSI. Inspired by the physical imaging models, the fusion can be formulated as a differentiable problem. A model-guided unfolding network has been proposed to optimize this problem by end-to-end training [24]. An MSI/HSI fusion net has been designed to represent the desired HSI by its complete basis set and to build an interpretable network by unfolding the proximal gradient algorithm [25]. In addition, due to the intrinsically high correlation within HSIs, they can be formulated as three-dimensional tensors, and many tensor-based methods have consequently been proposed [26,27]. By importing the tensor–tensor product into the sparse representation process, the relationship between the input images and the desired image is better formulated, achieving an acceptable performance. Meanwhile, by imposing the nuclear norm regularization on the mode-2 unfolding of the third tensor ring core, the spectral low-rankness of the desired HR HSI is exploited [28]. With more information available about the target scene, the reconstruction performance of the fusion-based methods usually surpasses that of the SISR methods. However, in reality, it is often difficult to access two fully registered multi-modal images, which limits the practicality of this type of method [29].
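The formulation of the two observed images as down-sampled versions of the desired HR HSI can be sketched as follows. This is a generic linear degradation model used for illustration only, not the exact operators of any cited method; all sizes and the spectral response matrix are hypothetical, and block averaging stands in for the blur and down-sampling operators.

```python
import numpy as np

# Sketch of the linear degradation model assumed by many fusion methods:
# the LR HSI Y is a spatially blurred + down-sampled version of the target
# X, and the HR MSI Z is a spectrally down-sampled version of X obtained
# via the sensor's spectral response matrix R (hypothetical values here).
rng = np.random.default_rng(0)
H, W, B, scale, msi_bands = 16, 16, 31, 4, 3

X = rng.random((H, W, B))             # desired HR HSI (unknown in practice)
R = rng.random((msi_bands, B))        # hypothetical spectral response
R /= R.sum(axis=1, keepdims=True)     # each MSI band averages the HSI bands

# LR HSI: block averaging as a stand-in for blur + down-sampling -> (4, 4, 31)
Y = X.reshape(H // scale, scale, W // scale, scale, B).mean(axis=(1, 3))
# HR MSI: mode-3 (spectral) product of X with R -> (16, 16, 3)
Z = np.tensordot(X, R, axes=([2], [1]))
```

Fusion methods then seek the X that is simultaneously consistent with both observations Y and Z.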
The SISR methods can be further divided into two sub-groups: sub-pixel mapping methods and reconstruction-based methods. Considering the coarse resolution of HSIs, pixels in the spatial domain tend to be mixtures of different endmembers. The sub-pixel mapping methods aim to determine the spatial location of each class within a mixed pixel [30]. The two main modules are endmember extraction and fractional abundance estimation [22]. In the endmember extraction process, the simple but effective linear spectral mixture model is frequently employed to formulate the mixing mechanism inside the pixels [31]. In the abundance estimation process, the two most frequent constraints on the abundances are the sum-to-one constraint and the non-negativity constraint. By formulating a hybrid library that includes both labeled and unlabeled endmembers, the abundances of the unlabeled endmembers are used to optimize those of the labeled endmembers, achieving a more accurate sub-pixel mapping result [32]. By incorporating the spectral information of the original HSI and the concept of spatial dependence, a sub-pixel mapping result can be generated that is independent of the intermediate abundance map. For this type of method, however, the noise generated by the unmixing operation is inevitable and propagates into the mapping operation [33], which negatively influences the SR process.
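The abundance estimation step under the two classical constraints can be sketched as below: a minimal projected-gradient least-squares solver for the linear spectral mixture model, with the simplex projection enforcing both non-negativity and sum-to-one. The endmembers and the mixed pixel are synthetic, and the function names are our own illustration, not a published algorithm's interface.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto {a : a >= 0, sum(a) = 1} (sort-based algorithm)
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def estimate_abundances(pixel, E, iters=500):
    # Projected gradient descent on ||E a - pixel||^2, projecting onto the
    # simplex so the abundances stay non-negative and sum to one.
    p = E.shape[1]
    a = np.full(p, 1.0 / p)
    step = 1.0 / np.linalg.norm(E.T @ E, 2)   # 1/Lipschitz step size
    for _ in range(iters):
        a = project_simplex(a - step * (E.T @ (E @ a - pixel)))
    return a

# Toy example: 4 bands, 2 synthetic endmembers, noise-free mixed pixel
E = np.array([[1.0, 0.0],
              [0.8, 0.2],
              [0.2, 0.8],
              [0.0, 1.0]])
y = E @ np.array([0.3, 0.7])
a = estimate_abundances(y, E)   # recovers approximately [0.3, 0.7]
```

In a real pipeline the noise in `y` and imperfect endmembers make the recovered abundances inexact, which is exactly the error that then propagates into the mapping stage.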
The reconstruction-based SISR methods mainly focus on the spatial characteristics of HSIs while giving some consideration to the spectral information. The most direct way is to design a 3D fully convolutional neural network (3D-CNN) to learn the mapping between the input HSI and the desired HSI [34]. Considering the spectral preservation ability of the network, a spectral difference convolutional neural network (SDCNN) has been designed [12]. In addition, a deep intrafusion network (IFN) has been proposed to fully utilize the spatial–spectral information [35]. Networks with both 2D and 3D units have also been designed to share spatial information in the reconstruction process [36]. By importing the gradient details into the reconstruction process, a gradient-guided residual dense network (G-RDN) has been proposed [37]. All these HSI SR methods treat the original HSI as a whole, super-resolving the texture and the structure simultaneously in a mixture. Observing that the structure and the texture exhibit different sensitivities to the spatial degradation, a residual structure-texture dense network (RSTDN) with two branches has also been proposed [38]. Another line of work combines 2D and 3D convolutions to better utilize single-band and adjacent-band information, fostering information complementarity and simplifying the network structure [39]; it alternately employs 2D and 3D units and includes a split adjacent spatial and spectral convolution (SAEC) design to analyze spectral and spatial information in parallel [40]. However, the two branches of the RSTDN directly send the structure and the texture into the residual dense blocks for feature extraction, without considering the probable error propagation and the texture-weakening phenomenon at a large degradation extent. Meanwhile, HSIs can also be fused with LiDAR data to promote the classification accuracy [41,42]. In addition, super-resolution can be further applied in the spectral domain to recover rich spectral information from RGB images [43,44]. Images captured by different sensors can also be fused to achieve a finer application accuracy [45,46].
In this paper, a gated content-oriented residual dense network (GCoRDN) is designed for HSI super-resolution. The network is motivated by three important observations. First, based on the observation that the structure and the texture exhibit different sensitivities to the spatial degradation, a content-oriented network with two branches is designed, in which the original HSI is integrated with the structure and the texture, respectively, to further utilize the input HSI. This integration avoids possible information loss during the extraction process. Second, different bands play distinct roles in the reconstruction process, and an attention mechanism is employed to account for this variation; furthermore, to maintain consistency between the structure and the texture, a weight-sharing strategy is integrated into the attention mechanism. Third, by analyzing the results super-resolved by different methods, a gating mechanism is applied in post-processing to further enhance the performance. Experimental results and data analysis on both ground-based and airborne HSIs have demonstrated the effectiveness of the proposed method. The main contributions of this paper are highlighted as follows:
(1) By integrating the input HSI with the structure and the texture, and super-resolving them separately via two branches, the proposed GCoRDN not only super-resolves the different contents in the scene individually but also avoids the possible errors introduced in the content extraction process;
(2) By incorporating a weight-sharing attention mechanism between the two branches, bands with the same indices in the structure cube and the texture cube are equally weighted. This strategy not only respects the different roles these bands play in the reconstruction process but also ensures the consistency between the structure and the texture, achieving a better reconstruction performance;
(3) Data analysis of the reconstruction results achieved by different methods has demonstrated that the classical bicubic method tends to perform well in homogeneous regions with little information. Consequently, a gating mechanism is applied to the regions with fewer details, which further enhances the reconstruction performance;
(4) The proposed GCoRDN is evaluated on three datasets with different degradation degrees, including both ground-based HSIs and an airborne HSI, at the scaling factors of 2, 3 and 4. Experimental results and data analysis have demonstrated the superiority of the proposed method.
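The weight-sharing idea in contribution (2) can be illustrated with a minimal sketch. This is our own simplification for exposition, not the authors' exact RDSAB module: a real block would pass the pooled descriptor through a small learned network, whereas here a plain sigmoid stands in for it.

```python
import numpy as np

def shared_band_attention(struct_feat, texture_feat):
    """Weight-sharing channel (band) attention sketch. Both branches'
    features have shape (bands, H, W); a single band-weight vector is
    computed from the pooled descriptors of BOTH branches and applied to
    each, so bands with the same index in the structure and the texture
    cubes receive the same weight."""
    s_desc = struct_feat.mean(axis=(1, 2))    # per-band global average pooling
    t_desc = texture_feat.mean(axis=(1, 2))
    # One shared descriptor -> one sigmoid-gated weight vector in (0, 1)
    w = 1.0 / (1.0 + np.exp(-(s_desc + t_desc) / 2.0))
    return (struct_feat * w[:, None, None],
            texture_feat * w[:, None, None], w)

rng = np.random.default_rng(0)
s = rng.standard_normal((5, 8, 8))    # hypothetical structure features
t = rng.standard_normal((5, 8, 8))    # hypothetical texture features
s_out, t_out, w = shared_band_attention(s, t)
```

Because a single `w` re-weights both cubes, band k is never amplified in one branch while being suppressed in the other, which is the consistency property the contribution describes.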
The rest of this paper is organized as follows. Section 2 describes the proposed method. Section 3 presents the experimental setup and the ablation study, and Section 4 contains a discussion of the experimental results. Finally, the conclusions are drawn in Section 5.
3. Results
To thoroughly analyze the proposed method, three datasets covering both indoor scenes and real scenarios have been exploited in the experiments. Both subjective visual comparisons and objective measurement metrics have been applied in the evaluation. Details about the experimental setup and data analysis are discussed in the following subsections.
3.1. Datasets
CAVE dataset: The CAVE dataset [53] was collected by a cooled CCD camera in a laboratory environment. There are 32 HSIs in the dataset, covering a wide variety of real-world objects including flowers, food and drink, paints, etc. Each HSI contains 512 × 512 pixels with 31 bands in the spectral range from 400 to 700 nm, at a spectral resolution of 10 nm per band.
Harvard dataset: The Harvard dataset [54] was collected by the Nuance FX camera (CRI Inc.), covering both indoor and outdoor environments in fifty natural scenes under daylight illumination. All the HSIs in this dataset have 31 spectral bands at a step of 10 nm from 400 to 700 nm. The spatial size of each HSI is 1040 × 1392 pixels.
Pavia center: The Pavia center scene [55] was collected by the Reflective Optics System Imaging Spectrometer (ROSIS) over Pavia, northern Italy. The scene is composed of 1096 × 715 pixels with a geometric resolution of 1.3 m per pixel. After removing the invalid information in the original image, 102 spectral reflectance bands remain. Different from the CAVE and Harvard datasets, which are ground-based remote sensing HSIs, the Pavia center HSI is an airborne HSI that suffers from more environmental disturbance during the acquisition process.
In short, all the HSIs in the CAVE dataset were captured in an indoor environment and contain comparatively fine details. The HSIs in the Harvard dataset were captured in outdoor scenarios with a large shooting distance, leading to rough spatial information. The Pavia center is an airborne HSI whose spatial information is also comparatively poor.
3.2. Experimental Setup
In order to conduct a comprehensive evaluation of the super-resolving performance, four universal metrics are used: the Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index Measure (SSIM), the Spectral Angle Mapper (SAM) and the Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS). The optimal value for both the SAM and the ERGAS is 0, whereas higher PSNR and SSIM values indicate a better quality.
PSNR is an indicator that measures the degree of image distortion; a higher PSNR value indicates a better image quality. It is calculated as follows:

\[ \mathrm{PSNR} = 10\log_{10}\left(\frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}}\right), \]

where \(\mathrm{MAX}_I\) denotes the maximum grayscale value, and the MSE is calculated as

\[ \mathrm{MSE} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left[I(i,j)-K(i,j)\right]^{2}, \]

where \(m\) and \(n\) represent the dimensions (width and height) of the image, and \(I(i,j)\) and \(K(i,j)\) represent the grayscale values of the original image and the reference image at pixel position \((i,j)\), respectively.
SSIM is an indicator used to evaluate the similarity between two images or video frames. It is formulated as follows:

\[ \mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^{2} + \mu_y^{2} + C_1)(\sigma_x^{2} + \sigma_y^{2} + C_2)}, \]

where \(x\) and \(y\) denote two images, \(\mu_x\) and \(\mu_y\) represent the average values of images \(x\) and \(y\), respectively, while \(\sigma_x\) and \(\sigma_y\) represent the standard deviations of images \(x\) and \(y\). \(\sigma_{xy}\) denotes the covariance between images \(x\) and \(y\), and \(C_1\) and \(C_2\) are constants used to avoid division by zero in the denominator.
SAM is mainly used to compare and classify different spectral regions in multispectral or hyperspectral images; it evaluates the similarity between spectra by calculating the angle between them. It is formulated as follows:

\[ \mathrm{SAM}(x,y) = \arccos\left(\frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}\right), \]

where \(x\) and \(y\) denote two spectral vectors, \(x_i\) and \(y_i\) represent the elements of these vectors, and \(n\) represents the length of the spectral vectors.
ERGAS measures the global image quality, with lower ERGAS values corresponding to a higher spectral quality of the pansharpened images. It is given by the following equation:

\[ \mathrm{ERGAS} = 100\,\frac{h}{l}\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{\mathrm{RMSE}_i}{\mu_i}\right)^{2}}, \]

where \(h/l\) is the ratio between the pixel sizes of the high- and low-resolution images, \(N\) represents the total number of bands in the image, \(\mathrm{RMSE}_i\) denotes the root mean squared error of the \(i\)-th band, and \(\mu_i\) represents the mean value of the \(i\)-th reference band.
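The PSNR, SAM and ERGAS definitions above translate directly into code; SSIM is omitted here because its practical windowed form is longer. This is an illustrative NumPy sketch with our own function names, assuming image cubes of shape (H, W, bands) with values in [0, 1].

```python
import numpy as np

def psnr(ref, rec, max_val=1.0):
    # PSNR = 10 * log10(MAX^2 / MSE); higher is better
    mse = np.mean((ref - rec) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def sam(ref, rec, eps=1e-12):
    # Mean spectral angle (radians) over all pixels; 0 is optimal
    num = np.sum(ref * rec, axis=-1)
    den = np.linalg.norm(ref, axis=-1) * np.linalg.norm(rec, axis=-1) + eps
    return np.mean(np.arccos(np.clip(num / den, -1.0, 1.0)))

def ergas(ref, rec, ratio=2):
    # `ratio` is the scaling factor (l/h), so h/l = 1/ratio; 0 is optimal
    band_rmse = np.sqrt(np.mean((ref - rec) ** 2, axis=(0, 1)))
    band_mean = np.mean(ref, axis=(0, 1))
    return 100.0 / ratio * np.sqrt(np.mean((band_rmse / band_mean) ** 2))
```

For example, a reconstruction that is uniformly off by 0.1 on a zero-valued reference gives a PSNR of 20 dB with `max_val=1.0`.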
3.3. Training Details
During the training process, the images in each dataset are randomly divided into training, validation and testing sets at a ratio of 6:1:1 or 10:1:1. For the CAVE dataset, which contains 32 images, 24 images are randomly selected for training and 4 for validation; the remaining 4 images are used for testing. The Harvard dataset contains more hyperspectral images with a larger spatial size, which imposes a heavy burden on the training process. Therefore, 30 randomly selected images compose the training set, while the validation and testing sets each consist of three HSIs. It is noted that, to speed up the training process, the HSIs in both the validation and test sets are cropped into cubes of size 512 × 512 × 31. For the Pavia center dataset, the whole scene is cropped into three nonoverlapping regions: the top region with a size of 728 × 712 × 102 is used for training, the bottom-left region with a size of 368 × 356 × 102 is used for validation, and the remaining area is used for testing.
The input LR image is acquired by down-sampling the ground truth image via the classical bicubic interpolation operation, which is a conventional manipulation in HSI SR. During the training process, the LR HSIs are cropped into 32 × 32 patches that act as the input. Experiments in ref. [56] have demonstrated that optimizing the \(\ell_1\) norm requires less computational complexity and achieves a performance improvement. Accordingly, the difference between the reconstructed HSI and the desired HR HSI in the proposed GCoRDN is also measured by the \(\ell_1\) norm, which acts as the loss function. To be specific, the parameters \(\beta_1\) and \(\beta_2\) of the ADAM optimizer are set as 0.9 and 0.999, respectively. The initial learning rate is reduced for the real scenario, and the training epoch is fixed as 200. The weight decay and the decay interval are empirically set as the default value and 35 epochs, respectively. The training is implemented on the PyTorch framework on a platform with an Intel Core i9-10850K 3.6 GHz CPU, 16 GB memory and an NVIDIA RTX 3070 GPU.
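The data preparation and loss described above can be sketched as follows. This is an illustration with our own helper names, not the authors' training code; block averaging stands in for the bicubic kernel here (a real pipeline would use, e.g., `torch.nn.functional.interpolate` with `mode="bicubic"`).

```python
import numpy as np

def downsample(hr, scale):
    # Stand-in for bicubic down-sampling: block averaging by `scale`
    H, W, B = hr.shape
    return hr[:H - H % scale, :W - W % scale].reshape(
        H // scale, scale, W // scale, scale, B).mean(axis=(1, 3))

def random_patch(lr, size=32, rng=None):
    # Crop a random size x size LR training patch
    if rng is None:
        rng = np.random.default_rng(0)
    y = rng.integers(0, lr.shape[0] - size + 1)
    x = rng.integers(0, lr.shape[1] - size + 1)
    return lr[y:y + size, x:x + size]

def l1_loss(pred, target):
    # The l1-norm training loss: mean absolute difference
    return np.mean(np.abs(pred - target))

hr = np.ones((64, 64, 3))        # toy ground-truth cube
lr = downsample(hr, 2)           # (32, 32, 3) LR input
patch = random_patch(lr)         # 32 x 32 training patch
```

The l1 loss is then evaluated between the network output and the corresponding HR patch at each training step.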
In addition, the size of all the convolution kernels is 3 × 3, except for those in the global and local fusion module, whose size is 1 × 1. All the compared methods are implemented based on the code released by their authors. To make a fair comparison, each method has been retrained until convergence according to the experimental setup described in the corresponding paper.
3.4. Parameter Sensitivity
To analyze the correlation between the number of RDSABs and the performance, the parameter N ranges from 4 to 10 in steps of 2, and the corresponding reconstruction performance is exhibited in Table 1. Table 1 reports the average reconstruction performance over the four test images in the CAVE dataset at the scaling factor of 2. Both the SSIM and the SAM are comparatively stable to the variation of N, while the optimal PSNR is achieved when N is 8. Accordingly, we empirically set N as 8 in the experiments.
Meanwhile, experiments have also been conducted to validate the effectiveness of the gating mechanism. The HSI named 'flowers' in the CAVE dataset has been employed for this validation, and a comparison is made between the HSIs reconstructed by the SRCNN and by the bicubic method. As shown in Figure 4a, 'LR_flowers' denotes the LR HSI down-scaled by a ratio of 0.5. Two rectangular regions, marked as 'local_1' and 'local_2', are manually selected; they denote areas with poor and rich spatial information, respectively. The three columns in Figure 4b visually display the ground truth, the bicubic reconstructed area and the SRCNN reconstructed area. It is noticed that for the 'local_1' area with a homogeneous background, the classical bicubic method easily achieves an acceptable performance. As to the 'local_2' area with more texture, both the SRCNN and the bicubic reconstructions exhibit blur between the neighboring petals. However, the information recovered by the bicubic method is far less than that recovered by the SRCNN, leading to its poor performance on complex textures, such as the top-left region of 'local_2'.
To further validate this observation, the PSNR, SSIM and SAM have been employed for a quantitative evaluation, and the results are displayed in Table 2. The data in the first row of Table 2 validate that the bicubic method outperforms the SRCNN in all the measurements when super-resolving the 'local_1' area, demonstrating the effectiveness of the classical bicubic method for homogeneous areas. On the other hand, the data in the last row of Table 2 prove the effectiveness of the SRCNN for areas with rich information. It is therefore rational to incorporate the gating mechanism into the proposed GCoRDN to further improve the super-resolving performance.
Meanwhile, as mentioned in Section 2.4, an all-zero tensor whose size is determined by the parameter k is manually formulated to evaluate the information richness. To find the optimal size, k ranges over 16, 32 and 64. The corresponding reconstruction performance of the GCoRDN on the CAVE dataset at the scaling factor of 2 is shown in Figure 5. The data in Figure 5 show that both the PSNR and the SSIM obtain their optimal values when k is 32, while the SAM is comparatively stable to the variation of k. Consequently, to keep the proposed method practical, k is fixed as 32 in the GM module for all the experimental data.
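The gating idea, keeping the bicubic output in homogeneous blocks and the network output elsewhere, can be sketched as below. This is our own simplification of the post-processing step: local variance stands in for the information-richness measure, and the threshold `tau` is a hypothetical value, not the paper's.

```python
import numpy as np

def gated_merge(net_sr, bicubic_sr, k=32, tau=1e-3):
    """Gating post-process sketch: split the result into k x k blocks,
    measure each block's information richness by its local variance on the
    bicubic result, and keep the bicubic output for homogeneous
    (low-variance) blocks and the network output for the rest."""
    out = net_sr.copy()
    H, W = net_sr.shape[:2]
    for y in range(0, H - H % k, k):
        for x in range(0, W - W % k, k):
            block = bicubic_sr[y:y + k, x:x + k]
            if block.var() < tau:          # homogeneous -> gate to bicubic
                out[y:y + k, x:x + k] = block
    return out

# Toy check: one flat block and one textured block (synthetic data)
net = np.ones((64, 64))
bic = np.zeros((64, 64))
bic[:32, 32:] = np.arange(32 * 32).reshape(32, 32) / 100.0  # high variance
merged = gated_merge(net, bic, k=32, tau=1e-3)
```

With this construction the flat blocks of `merged` come from `bic`, while the textured block keeps the network output, mirroring the behavior observed in Table 2.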
3.5. Ablation Study
To verify the effectiveness of the key modules, this section tests variant versions of the proposed model by removing each component on the CAVE dataset. Table 3 shows the ablation study for the key modules. Specifically, to validate the effectiveness of the COC module, the concatenation operation between S, T and the LR image was excluded; the comparison of the PSNR and SAM values between the first column and the last column shows that the COC module effectively avoids the information loss in the structure extraction process. The ablation experiment on the RDSAB module replaces the SA strategy with the RDB module, and the comparison with the last column shows that the module improves the spatial quality of the image. We also conduct an experiment without weight sharing, in which the texture feature weights and the structure feature weights are generated separately in the attention mechanism of the RDSAB; Table 3 demonstrates the effectiveness of adopting the weight-sharing strategy. Compared with the last column, the fourth column verifies that the GM module, as a type of post-processing, can further improve the network performance.
One can observe that the proposed combination produces a better performance than any other module combination, especially achieving the highest PSNR, which measures the reconstruction ability of the methods. However, the SAM metric performs better without the RDSAB module and weight sharing, indicating that these modules may introduce noise or spectral uncertainty during the spatial reconstruction process. At the same time, using these modules significantly improves the PSNR and ERGAS metrics. The improvement in PSNR indicates that the spatial information is enhanced by these modules, while the better ERGAS values show that the spatial enhancement suppresses the spectral uncertainty they introduce. Overall, these modules extract and retain more useful feature information, thereby enhancing the overall image quality and the detail reconstruction ability.