1. Introduction
Hyperspectral images (HSIs) contain rich spectral information, which is beneficial for discriminating between different materials in a scene. Owing to this discriminative ability, HSI has been applied in many fields, including target detection [1], mineral exploitation [2], and land cover classification [3]. Earth observation applications often need HSI with high spatial resolution. However, the spatial resolution of HSI is often limited because of the trade-off between spatial and spectral resolution (e.g., Hyperion HSI has a spatial resolution of 30 m). Compared with HSIs, multispectral images (MSIs) have wider bandwidths and often higher spatial resolution (e.g., ASTER MSI has a resolution of 15 m). Fusing a low resolution (LR) HSI with a high resolution (HR) MSI is therefore an important technique for enhancing the spatial resolution of HSI [4,5].
Several HSI-MSI fusion algorithms have been proposed in the last decades [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]. An HR HSI can be reconstructed by combining the endmembers of the LR HSI with the abundances of the HR MSI. Following this principle, several unmixing-based fusion methods have been proposed. For example, in [6], HSI and MSI were alternately unmixed by applying nonnegative matrix factorization in a coupled way, and the HR HSI was reconstructed from the endmembers and the HR abundances under a linear mixture model. This method was also used to fuse Hyperion HSI with ASTER MSI to produce HSI with 15 m resolution [7]. Similarly, by exploiting a sparsity prior on the endmembers, a fusion method based on sparse matrix factorization was proposed in [8]. It was also used to fuse MODIS HSI with Landsat 7 ETM+ MSI, enhancing the resolution of the MODIS HSI by a factor of eight [9]. HR HSI can also be reconstructed with a dictionary. In [10,11], a spatial dictionary was learned from the HR MSI, and the HR HSI was then reconstructed via joint sparse coding. In [12,13], a spectral dictionary was learned from the LR HSI and then used to reconstruct the HR HSI based on the abundance map of the MSI. The HSI-MSI fusion problem can also be solved in a variational framework [14,15,16,17]. Wei et al. [14] proposed a variational model for HSI-MSI fusion in which a sparsity prior on the HSI was exploited as a regularizer. Besides the sparsity regularizer, a vector-total-variation regularizer was used in [15], and a low-rank constraint together with a spectral embedding regularizer was designed for fusion in [16]. In [17], a maximum a posteriori fusion method was proposed by exploiting the joint statistics of endmembers under a stochastic mixing model. A new concept, “hypersharpening”, was proposed in [18] and applied to WorldView-3 data in [19]; it aims at fusing the LR short-wavelength infrared bands with the HR visible and near infrared bands of the same sensor.
Most of the above fusion methods suffer from three major drawbacks. Firstly, they are based on hand-crafted features such as dictionaries, which can be regarded as low-level features with limited representative ability. Secondly, they rely on prior assumptions, such as the linear spectral mixture assumption in [6,7,8,9] and the sparsity prior in [10,11,12,13,14]; quality degeneration may occur if these assumptions do not fit the problem. Finally, optimization problems are often involved in the testing stage, making the HSI reconstruction time-consuming. Recently, deep learning has attracted research interest due to its ability to automatically learn high-level features and its high non-linearity [21,22,23,24,25,26,27,28], which holds great potential for modeling the complex nonlinear relationship between LR and HR HSIs in both the spatial and spectral domains. Compared with hand-crafted features, the features extracted by deep learning are hierarchical: both low-level and high-level features can be extracted, which makes them more comprehensive and robust for reconstructing HR HSI. In addition, deep learning is data-driven and does not rely on any assumption or prior knowledge. After off-line training, only feed-forward computation is needed in the testing stage, which makes the HSI reconstruction fast. Therefore, the performance is expected to improve if deep learning is applied to the spatial enhancement of HSI.
Among the typical deep learning models, convolutional neural networks (CNNs) are the most widely used for single image enhancement, and several CNN-based image super-resolution methods have been proposed [29,30,31,32,33]. The success of CNN in image super-resolution can be summarized in three points. Firstly, CNN is built upon 2-D convolution, which naturally exploits the spatial correlation of images. Secondly, a CNN with a deep architecture has large capacity and flexibility for representing the mapping between LR and HR images [34]. Thirdly, compared with other deep learning models, such as stacked auto-encoders (SAE) [35], the weight sharing and local connection scheme of CNN results in fewer connections, making it less prone to over-fitting [36].
Inspired by the success of CNN in single image enhancement, in this study we propose a deep CNN with a two-branch architecture for the fusion of HSI and MSI. In order to exploit the spectral correlation of the HSI and fuse the MSI, the spectrum of the LR HSI and the corresponding spatial neighborhood in the HR MSI are used as the input pair of the network. We extract features from the LR HSI spectrum and the corresponding MSI neighborhood with the two CNN branches. In order to fully fuse the information extracted from the HSI and MSI, the extracted features of the two branches are concatenated and then fed to fully connected (FC) layers. The final output of the FC layers is the spectrum of the expected HR HSI. A minimal sketch of this architecture is given below.
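The following PyTorch sketch illustrates the two-branch idea: a 1-D convolutional branch over the LR spectrum, a 2-D convolutional branch over the MSI neighborhood, and FC layers over the concatenated features. The branch depths, kernel sizes, feature widths, and patch size here are illustrative assumptions, not the configuration reported in Table 2.

```python
# Hypothetical sketch of the two-branch fusion network (Two-CNN-Fu).
# All layer widths and kernel sizes below are assumed for illustration.
import torch
import torch.nn as nn

class TwoCNNFu(nn.Module):
    def __init__(self, n_hsi_bands=162, n_msi_bands=6, patch=16):
        super().__init__()
        # HSI branch: 1-D convolutions over the LR spectrum
        self.hsi_branch = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=9, padding=4), nn.ReLU(),
        )
        # MSI branch: 2-D convolutions over the HR spatial neighborhood
        self.msi_branch = nn.Sequential(
            nn.Conv2d(n_msi_bands, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        fused_dim = 32 * n_hsi_bands + 32 * patch * patch
        # FC layers fuse the concatenated features; the output is the HR spectrum
        self.fc = nn.Sequential(
            nn.Linear(fused_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_hsi_bands),
        )

    def forward(self, hsi_spectrum, msi_patch):
        # hsi_spectrum: (B, 1, n_hsi_bands); msi_patch: (B, n_msi_bands, patch, patch)
        f_hsi = self.hsi_branch(hsi_spectrum).flatten(1)
        f_msi = self.msi_branch(msi_patch).flatten(1)
        return self.fc(torch.cat([f_hsi, f_msi], dim=1))
```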
This work makes three main contributions:
We propose learning the mapping between LR and HR HSIs via deep learning, which has high learning capacity and is suitable for modeling the complex relationship between LR and HR HSIs.
We design a CNN with two branches that extract features from the HSI and the MSI. This network can exploit the spectral correlation of the HSI and fuse the information in the MSI.
Instead of reconstructing the HSI in a band-by-band fashion, all of the bands are reconstructed jointly, which is beneficial for reducing spectral distortion.
The rest of this paper is organized as follows. In Section 2, some basics of deep learning-based image super-resolution are presented. In Section 3, we give the proposed HSI-MSI fusion method based on deep learning, including the architecture of the network and the training scheme. Experimental results on simulated and real HSIs are presented in Section 4. Discussions of the experimental results are given in Section 5, and we draw conclusions in Section 6.
2. Background of CNN-Based Image Super-Resolution
CNN has been successfully applied to the spatial enhancement of single images [29,30,31,32,33]. In [29], Dong et al. proposed a super-resolution CNN network (SRCNN). As shown in Figure 1, the CNN architecture for super-resolution is composed of several convolutional layers. The input of the network is the LR image, which is first up-scaled to the same size as its HR version.
The activity of the $i$-th feature map in the $l$-th convolutional layer can be expressed as [36]

$$\mathbf{F}_i^l = f\Big( \sum_j \mathbf{F}_j^{l-1} \ast \mathbf{W}_{i,j}^l + b_i^l \Big),$$

where $\mathbf{F}_j^{l-1}$ is the $j$-th feature map in the $(l-1)$-th layer that is connected to $\mathbf{F}_i^l$ in the $l$-th convolutional layer, and $m_{l-1}$, $n_{l-1}$ are the numbers of rows and columns of $\mathbf{F}_j^{l-1}$. $\mathbf{W}_{i,j}^l$ is the convolutional kernel for $\mathbf{F}_j^{l-1}$ associated with the $i$-th output feature map $\mathbf{F}_i^l$, and $k_l \times k_l$ is the size of the kernel. $b_i^l$ is a bias term, and $\ast$ denotes the convolution operator. The size of $\mathbf{F}_i^l$ is $(m_{l-1} - k_l + 1) \times (n_{l-1} - k_l + 1)$. $f(\cdot)$ is a nonlinear activation function, such as the rectified linear unit (ReLU) function $f(x) = \max(0, x)$ [36]. The output of the network is the expected HR image. In the training stage, the mapping function between the up-scaled LR and HR images is learned and represented by the CNN network. In the testing stage, the HR image is reconstructed from its LR counterpart with the learned mapping function.
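As a small numerical illustration of the layer activity above, the following numpy sketch computes one convolutional layer with valid convolution and ReLU. The feature-map counts and kernel size are arbitrary examples; as in most deep learning frameworks, cross-correlation is used in place of flipped convolution.

```python
# Minimal numpy illustration of one convolutional layer (valid mode + ReLU).
import numpy as np
from scipy.signal import correlate2d

def conv_layer(feature_maps, kernels, biases):
    """feature_maps: list of (m, n) arrays from layer l-1.
    kernels[i][j]: (k, k) kernel linking input map j to output map i.
    biases[i]: scalar bias of output map i."""
    outputs = []
    for W_i, b_i in zip(kernels, biases):
        # Sum the valid correlations over all connected input maps, add bias
        acc = sum(correlate2d(F_j, W_ij, mode="valid")
                  for F_j, W_ij in zip(feature_maps, W_i)) + b_i
        outputs.append(np.maximum(acc, 0.0))  # ReLU activation f(x) = max(0, x)
    return outputs

# Example: 3 input maps of 32x32, 8 output maps, 5x5 kernels -> 8 maps of 28x28
rng = np.random.default_rng(0)
maps = [rng.standard_normal((32, 32)) for _ in range(3)]
kernels = [[rng.standard_normal((5, 5)) * 0.01 for _ in range(3)] for _ in range(8)]
out = conv_layer(maps, kernels, rng.standard_normal(8) * 0.01)
print(out[0].shape)  # (28, 28), i.e., (32 - 5 + 1) x (32 - 5 + 1)
```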
Inspired by this idea, other CNN-based super-resolution methods have also been proposed. For example, a faster SRCNN (FSRCNN) was proposed in [30] by adopting a deconvolution layer and small kernel sizes. Kim et al. [31] pointed out that increasing the depth of a CNN helps to improve super-resolution performance; a very deep CNN for super-resolution (VDSR) was proposed and trained with a residual learning strategy in [31]. In [32], the authors proposed an end-to-end deep and shallow network (EEDS) composed of a shallow CNN and a deep CNN, which restore the principal component and the high-frequency component of the image, respectively. In order to reduce the difficulty of enhancing the resolution by a large factor, the authors of [33] proposed a gradual up-sampling network (GUN) composed of several CNN modules, each of which enhances the resolution by a small factor.
Beyond single image super-resolution, CNN has also shown potential in HSI super-resolution. The mapping between LR and HR HSIs can be learned by different deep learning models, such as 3D-CNN [37]. A deep residual CNN network (DRCNN) with a spectral regularizer was proposed for HSI super-resolution in [38]. In [39], a spectral difference CNN (SDCNN) was proposed, which learns the mapping of the spectral difference between LR and HR HSIs. CNN has also been applied to pan-sharpening. In [40], the HR panchromatic image was stacked with the up-scaled LR MSI to form an input cube, and a pan-sharpening CNN network (PNN) was used to learn the mapping between the input cube and the HR MSI. A deep residual PNN (DRPNN) model was proposed in [41] to boost PNN by using residual learning. In [42], in order to preserve image structures, the mapping was learned by a residual network called PanNet in the high-pass filtering domain rather than the image domain. Multi-scale information can also be exploited in learning the mapping: in [43], Yuan et al. proposed a multi-scale and multi-depth CNN (MSDCNN) for pan-sharpening, in which each layer is constituted by filters of different sizes to capture multi-scale features.
The above CNN-based super-resolution methods are summarized in Table 1. Despite the success of CNN in super-resolution and pan-sharpening, two issues remain when applying deep learning to HSI super-resolution. On the one hand, the CNN models for single image super-resolution mainly deal with the spatial domain, while for HSI, the deep learning model should also exploit the spectral correlation and jointly reconstruct the different bands. On the other hand, auxiliary data (e.g., MSI) of the same scene as the HSI are often available and can provide complementary information for HSI super-resolution; how to fuse HSI with such auxiliary data in a deep learning framework still lacks study.
4. Experiment Results
In this section, the performance of the proposed fusion algorithm (denoted as Two-CNN-Fu) is evaluated on several simulated and real HSI datasets. We first evaluate the proposed method on simulated data by comparing it with other state-of-the-art fusion methods. In order to demonstrate the applicability of the proposed method, we also apply it to real spaceborne HSI-MSI fusion. Because the original HR HSI is not available in that case, and there is thus no reference HSI for assessment, we use our previously proposed no-reference HSI quality assessment method [45] to evaluate the fusion performance; the land-cover classification accuracy of the fused HSI is also used to evaluate the fusion performance.
4.1. Experiment Setting
Two datasets are used in the experiments. The first dataset was collected by the Airborne Visible Infrared Imaging Spectrometer (AVIRIS) sensor [46] and consists of four images captured over the Indian Pines, Moffett Field, Cuprite, and Lunar Lake sites, with dimensions of 753 × 1923, 614 × 2207, 781 × 6955, and 614 × 1087 pixels, respectively. The spatial resolution is 20 m. The data were acquired in the range of 400~2500 nm with 224 bands; after discarding the water absorption bands and noisy bands, 162 bands remain. The second dataset is Environmental Mapping and Analysis Program (EnMAP) data, which were acquired by the HyMap sensor over the Berlin district in August 2009 [47,48]. The size of these data is 817 × 220 pixels with a spatial resolution of 30 m, and there are 244 spectral bands in the range of 420~2450 nm.
MSIs with higher spatial resolution than that of current and future earth observation hyperspectral sensors are available. However, because earth observation HSIs with a higher-resolution reference (needed for evaluation) are not available, simulations are often used. The above HSI datasets are regarded as the HR HSI and reference images, and both the LR HSI and the HR MSI are simulated from them. The LR HSI is generated from the reference image via spatial Gaussian down-sampling. The HR MSI is obtained by spectrally degrading the reference image, using the spectral response functions of the Landsat-7 multispectral imaging sensor as filters. The simulated MSI has six spectral bands, covering the spectral regions of 450~520 nm, 520~600 nm, 630~690 nm, 770~900 nm, 1550~1750 nm, and 2090~2350 nm, respectively. We crop two sub-images of 256 × 256 pixels (Indian Pines and Moffett Field) from the AVIRIS dataset and one sub-image of 256 × 160 pixels (Berlin) from the EnMAP dataset as testing data. Fifty thousand samples are extracted for training each Two-CNN-Fu model, and there is no overlap between the testing region and the training region. The network is trained on the down-sampled data; the original HR HSI does not appear in training and is only used as an assessment reference. A sketch of this simulation procedure is given below.
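The following sketch illustrates the simulation protocol under stated assumptions: a Gaussian blur followed by decimation for the LR HSI, and projection onto a spectral response matrix for the HR MSI. The blur width and the response matrix `srf` below are placeholders, not the exact Landsat-7 responses used in the paper.

```python
# Hypothetical simulation of LR HSI (spatial degradation) and HR MSI (spectral
# degradation) from a reference image; parameters are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_lr_hsi(ref_hsi, factor=2, sigma=1.0):
    """ref_hsi: (rows, cols, bands) reference image -> spatially degraded HSI."""
    blurred = gaussian_filter(ref_hsi, sigma=(sigma, sigma, 0))  # blur space only
    return blurred[::factor, ::factor, :]        # decimate by the scale factor

def simulate_hr_msi(ref_hsi, srf):
    """Project each pixel spectrum onto the multispectral response functions."""
    rows, cols, bands = ref_hsi.shape
    return (ref_hsi.reshape(-1, bands) @ srf.T).reshape(rows, cols, -1)

ref = np.random.rand(256, 256, 162)              # stand-in reference HSI
srf = np.random.rand(6, 162)                     # stand-in 6-band response matrix
srf /= srf.sum(axis=1, keepdims=True)            # normalize each band's response
lr_hsi = simulate_lr_hsi(ref, factor=2)          # (128, 128, 162)
hr_msi = simulate_hr_msi(ref, srf)               # (256, 256, 6)
```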
The network parameters of our deep learning model are given in Table 2. The deep learning network needs to be initialized before training. All of the convolutional kernels and the weight matrices of the FC layers are initialized from a Gaussian distribution with a mean of 0 and a standard deviation of 0.01, and the biases are initialized to 0. The parameters involved in the standard stochastic gradient descent method are the learning rate, momentum, and batch size [44]. The learning rate is fixed at 0.0001, the momentum is set to 0.9, and the batch size is set to 128. The number of training epochs is set to 200. A sketch of this configuration is given below.
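A minimal sketch of this initialization and optimizer setup is shown below, reusing the hypothetical `TwoCNNFu` module from Section 1. Only the values stated in the text (std 0.01, learning rate 0.0001, momentum 0.9, batch size 128, 200 epochs) come from the paper; the loss and data loader are assumptions.

```python
# Hypothetical training configuration for Two-CNN-Fu.
import torch
import torch.nn as nn

model = TwoCNNFu()

def init_weights(m):
    if isinstance(m, (nn.Conv1d, nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)  # Gaussian init, std 0.01
        nn.init.zeros_(m.bias)                         # biases start at 0

model.apply(init_weights)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.MSELoss()   # assumed reconstruction loss (cf. Equation (4))

# Training skeleton: `loader` yields (spectrum, patch, target) batches of 128
# for epoch in range(200):
#     for spectrum, patch, target in loader:
#         optimizer.zero_grad()
#         loss = criterion(model(spectrum, patch), target)
#         loss.backward()
#         optimizer.step()
```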
4.2. Comparison with State-of-the-Art Methods
In this section, we compare our method with other state-of-the-art fusion methods: the coupled nonnegative matrix factorization (CNMF) method [6], the sparse spatial-spectral representation method (SSR) [12], and the Bayesian sparse representation method (BayesSR) [13]. The Matlab codes of these methods were released by their original authors. The parameter settings of the compared methods first follow the suggestions of the original authors, and we then tune them empirically to achieve the best performance. The number of endmembers is a key parameter of the CNMF method; it is set to 30 in the experiment. The parameters of the SSR method include the number of dictionary atoms, the number of atoms in each iteration, and the spatial patch size, which are set to 300, 20, and 8 × 8, respectively. The parameters of the BayesSR method consist of the number of sparse coding inference rounds in the Gibbs sampling process and the number of iterations of dictionary learning, which are set to 32 and 50,000, respectively. Note that all of the compared methods fuse LR HSI with HR MSI. Although there are deep learning-based HSI super-resolution methods, such as 3D-CNN [37], they exploit only the LR HSI; for fairness, they are therefore not used for comparison.
The fusion performance is evaluated by the peak signal-to-noise ratio (PSNR, dB), structural similarity index measurement (SSIM) [49], feature similarity index measurement (FSIM) [50], and spectral angle mean (SAM). We calculate PSNR, SSIM, and FSIM on each band and then compute the mean values over the bands; a sketch of these computations is given below. The indices on the three testing data are given in Table 3 and Table 4, with the best values highlighted in bold.
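The band-wise PSNR averaging and the SAM computation can be summarized as follows; the data ranges and array shapes are assumptions for illustration.

```python
# Hypothetical sketch of the evaluation protocol: mean band-wise PSNR and SAM.
import numpy as np

def mean_band_psnr(ref, rec, peak=1.0):
    """ref, rec: (rows, cols, bands) arrays scaled to [0, peak]."""
    psnrs = []
    for b in range(ref.shape[2]):
        mse = np.mean((ref[:, :, b] - rec[:, :, b]) ** 2)
        psnrs.append(10 * np.log10(peak ** 2 / mse))
    return float(np.mean(psnrs))   # mean PSNR over all bands, in dB

def sam(ref, rec, eps=1e-12):
    """Mean spectral angle (degrees) between reference and reconstructed pixels."""
    r = ref.reshape(-1, ref.shape[2])
    x = rec.reshape(-1, rec.shape[2])
    cos = np.sum(r * x, axis=1) / (np.linalg.norm(r, axis=1) *
                                   np.linalg.norm(x, axis=1) + eps)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())
```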
Our proposed Two-CNN-Fu method shows competitive performance on the three testing datasets. In Table 3, the PSNR, SSIM, and FSIM of our results are higher than those of the compared methods, which means that our fusion results are closer to the original HR HSI, with fewer errors. The SSR method is based on spatial-spectral sparse representation: a spectral dictionary is first learned under the sparsity prior and then combined with the abundances of the MSI to reconstruct the HR HSI. In the CNMF method, by contrast, the endmembers of the LR HSI and the abundances of the MSI are alternately estimated in a coupled way; the estimated endmembers and abundances are therefore more accurate, so CNMF achieves better performance than SSR. The BayesSR method learns the dictionary in a non-parametric Bayesian sparse coding framework and often performs better than the parametric SSR method. The best performance on the three testing datasets is achieved by Two-CNN-Fu, which extracts hierarchical features that are more comprehensive and robust than the hand-crafted features in [6,12,13]. This performance demonstrates the effectiveness and potential of deep learning for the HSI-MSI fusion task. In order to verify robustness at a larger resolution ratio between the LR HSI and the HR MSI, we also simulate the LR HSI with a down-sampling factor of four and then fuse it with the MSI; as shown in Table 4, Two-CNN-Fu again performs better than the other methods. The PSNR curves over the spectral bands are presented in Figure 5, where the PSNR values of Two-CNN-Fu are higher than those of the compared methods in most bands.
It is worth noting that the result of our Two-CNN-Fu method has the lowest spectral distortion among the compared methods in most cases, as shown in Table 3 and Table 4. Our deep learning network directly learns the mapping between the spectra of the LR and HR HSIs, and the objective function in Equation (4) for training the network minimizes the error of the reconstructed HR HSI spectra. In addition, instead of reconstructing the HSI band by band, our deep learning model jointly reconstructs all bands. These two characteristics are beneficial for reducing spectral distortion.
Parts of the reconstructed HSIs are presented in Figure 6, Figure 7 and Figure 8. In order to visually evaluate the quality of the different fusion results, we also give pixel-wise root mean square error (RMSE) maps, which reflect the errors of the reconstructed pixels over all bands; a sketch of this computation is given below. It is clear that the fusion result of our Two-CNN-Fu method has fewer errors than those of the compared methods. The compared methods rely on hand-crafted features such as dictionaries, and their RMSE maps exhibit material-related patterns, which may be caused by errors introduced in dictionary learning or endmember extraction. Our Two-CNN-Fu method reconstructs the HR HSI with a mapping function between the LR and HR HSIs that is trained by minimizing the error of the reconstructed HR HSI, so its fusion result has fewer errors.
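The pixel-wise RMSE map reduces the error cube over the spectral dimension, one value per spatial position; array shapes are assumed.

```python
# Minimal sketch of the pixel-wise RMSE map over all spectral bands.
import numpy as np

def rmse_map(ref, rec):
    """ref, rec: (rows, cols, bands) -> (rows, cols) map of per-pixel RMSE."""
    return np.sqrt(np.mean((ref - rec) ** 2, axis=2))
```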
4.3. Applications on Real Data Fusion
In order to investigate the applicability of the proposed method, we apply it to real spaceborne HSI-MSI data fusion. The HSI data were collected by the Hyperion sensor carried on the Earth Observing-1 (EO-1) satellite, launched in November 2000. The MSI data were captured by the Sentinel-2A satellite, launched in June 2015. The spatial resolution of Hyperion HSI is 30 m, with 242 spectral bands in the spectral range of 400~2500 nm. The Hyperion HSI suffers from noise; after removing the noisy bands and water absorption bands, 83 bands remain. The Sentinel-2A satellite provides MSIs with 13 bands, from which we select the four bands with 10 m spatial resolution for the fusion. The central wavelengths of these four bands are 490 nm, 560 nm, 665 nm, and 842 nm, and their bandwidths are 65 nm, 35 nm, 30 nm, and 115 nm, respectively.
The Hyperion HSI and the Sentinel-2A MSI in this experiment were taken over Lafayette, LA, USA in October and November 2015, respectively [51]. We crop sub-images of 341 × 365 and 1023 × 1095 pixels as the study area from the overlapping region of the Hyperion and Sentinel data, as shown in Figure 9. The remainder of the overlapping region is used for training the Two-CNN-Fu network.
In this experiment, our goal is to fuse the 30 m HSI with the 10 m MSI to generate a 10 m HSI, so a Two-CNN-Fu network that enhances the HSI by a factor of three must be trained. In the training stage, we first down-sample the 30 m Hyperion HSI and the 10 m Sentinel-2A MSI by a factor of three, to 90 m and 30 m, respectively. We then train a Two-CNN-Fu network that fuses the 90 m HSI with the 30 m MSI and reconstructs the original 30 m HSI. This network enhances HSI by a factor of three, and we assume that it can be transferred to the fusion of the 30 m HSI with the 10 m MSI: by applying the trained network to the 30 m HSI and the 10 m MSI, an HSI with 10 m resolution can be reconstructed. A sketch of this scale-transfer scheme is given below. The network parameters are set according to Table 2, except that the number of convolutional layers in the HSI branch is one, which is the maximum possible in this case because we only use 83 bands of the Hyperion data.
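The scale-transfer scheme can be outlined as follows. Random arrays stand in for the real Hyperion and Sentinel-2A images, and `simulate_lr_hsi` is the hypothetical helper defined in the simulation sketch above; the training and inference stages are summarized in comments rather than implemented.

```python
# Hypothetical outline of the scale-transfer training scheme: train at a
# coarser scale, where the original 30 m HSI serves as the known target,
# then apply the same network one scale up.
import numpy as np

hsi_30m = np.random.rand(341, 365, 83)       # stand-in Hyperion HSI (30 m)
msi_10m = np.random.rand(1023, 1095, 4)      # stand-in Sentinel-2A MSI (10 m)

# Stage 1: degrade both inputs by the target factor of three
hsi_90m = simulate_lr_hsi(hsi_30m, factor=3)     # 90 m HSI, (114, 122, 83)
msi_30m = simulate_lr_hsi(msi_10m, factor=3)     # 30 m MSI, (341, 365, 4)

# Stage 2: train Two-CNN-Fu so that fusing (90 m HSI, 30 m MSI) reconstructs
# the known 30 m HSI -- training pairs come entirely from degraded data.
# Stage 3: apply the trained network to (30 m HSI, 10 m MSI) to obtain a
# 10 m HSI, assuming the learned 3x mapping transfers across scales.
```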
The fusion results of the different methods on the study area are presented in Figure 10; the size of each fusion result is 1023 × 1095. In order to highlight the details of the fusion results, two small areas are enlarged in Figure 11 and Figure 12. There is visible noise in the results of SSR and CNMF, as shown in Figure 11b,d, and in Figure 12 some details in the results of SSR and CNMF are blurred, as indicated by the dashed box. The results of BayesSR and Two-CNN-Fu are sharper and cleaner, and our Two-CNN-Fu method produces the HR HSI with higher spectral fidelity: compared with the original LR images, the spectral distortion of BayesSR is heavier than that of our Two-CNN-Fu results, and the color of the BayesSR results appears darker than the original LR image. Spectral distortion in the fusion would affect the accuracy of applications such as classification.
It is noted that the Hyperion HSI and the Sentinel-2A MSI in this experiment were not captured at the same time; the temporal difference is about one month. Some endmembers may have changed during this month, which may be one of the factors leading to the spectral distortion. Even though nearly all of the fusion results in Figure 11 and Figure 12 suffer from spectral distortion, our Two-CNN-Fu method generates results with less spectral distortion, which demonstrates the robustness of the proposed method.
In order to assess the performance quantitatively, we evaluate the fusion results using the no-reference HSI quality assessment method in [45], which gives a quality score for each reconstructed HSI. In this no-reference assessment method, some pristine HSIs are needed as training data to learn the benchmark quality-sensitive features; we use the original LR Hyperion data, after discarding the noisy bands, as this training data. The quality score measures the distance between the reconstructed HSI and the pristine benchmark, so a lower score means better quality. The quality scores of the different fusion results are given in Table 5, with the best index highlighted in bold. The score of our fusion result is lower than the others, indicating that our method compares favorably with the other methods; the larger scores of the compared methods may be caused by spectral distortion or noise in their fusion results.
Land-cover classification is one of the important applications of HSI. In this experiment, we test the effect of the different fusion methods on land-cover classification. Land-cover information is provided by Open Street Map (OSM) layers [52]; according to the OSM data, there are 12 land-cover classes in the study area. We select parts of the pixels from each class as ground truth, as shown in Figure 13 and Table 6. Two classifiers, the Support Vector Machine (SVM) [53] and Canonical Correlation Forests (CCF) [54], are used in the experiment due to their stability and good performance. The SVM classifier is implemented with the LIBSVM toolbox [55], with the radial basis function as its kernel; the regularization parameters of the SVM are determined by five-fold cross-validation. The parameter involved in the CCF classifier is the number of trees, which we set to 200 in the experiment. Fifty samples of each class are randomly chosen for training the classifiers, and the remainder of the ground truth is used for testing. We repeat the classification experiment 10 times and report the mean value and standard deviation of the overall accuracy in Table 7, with the best indices highlighted in bold. A sketch of this protocol is given below.
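The following sketch reproduces the evaluation protocol with scikit-learn's `SVC` standing in for LIBSVM (the same RBF-kernel SVM); the cross-validation grid is an assumption, since the search range is not specified here, and CCF is omitted for lack of a standard Python implementation.

```python
# Hypothetical sketch of the classification protocol: 50 training samples per
# class, 5-fold cross-validated RBF SVM, overall accuracy on the remainder.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def run_trial(X, y, n_train_per_class=50, rng=None):
    rng = rng or np.random.default_rng()
    train_idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), n_train_per_class, replace=False)
        for c in np.unique(y)])
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    # Five-fold cross-validation over the regularization parameters (assumed grid)
    grid = GridSearchCV(SVC(kernel="rbf"),
                        {"C": 10.0 ** np.arange(-2, 5),
                         "gamma": 10.0 ** np.arange(-4, 2)}, cv=5)
    grid.fit(X[train_idx], y[train_idx])
    return grid.score(X[test_idx], y[test_idx])   # overall accuracy

# Repeat 10 times; report mean and standard deviation of overall accuracy
# accs = [run_trial(spectra, labels) for _ in range(10)]
# print(np.mean(accs), np.std(accs))
```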
As seen in Table 7, our fusion result leads to competitive classification accuracy with both the SVM and CCF classifiers, and the classification results show a similar trend for the two classifiers. The classification accuracy of our fusion result is higher than that of the other three fusion methods. As observed in Figure 11 and Figure 12, our fusion method produces less spectral distortion and noise than the other methods, which may explain the higher classification accuracy. The classification map of the Two-CNN-Fu fusion result is given in Figure 14; most of the land-covers are classified correctly, and even details such as roads and residential areas are classified well with the fusion-enhanced image. The misclassification of some land-covers, such as forests and gardens, may be caused by the spectral similarity between these two classes. It is worth noting that a fusion method cannot be assessed absolutely by classification accuracy, because our labeled ground truth is only a subset of the study area and the classification performance also depends on the classifier. The aim of this classification experiment is to demonstrate that the proposed fusion method has the potential to be applied in real spaceborne HSI-MSI fusion, and that the reconstructed HR HSI can deliver competitive classification performance.