An Encoder–Decoder with a Residual Network for Fusing Hyperspectral and Panchromatic Remote Sensing Images

Abstract: For many urban studies, it is necessary to obtain remote sensing images with both high spectral and high spatial resolution by fusing hyperspectral and panchromatic remote sensing images. In this article, we propose a deep learning model of an encoder–decoder with a residual network (EDRN) for remote sensing image fusion. First, we combined the hyperspectral and panchromatic remote sensing images to circumvent the independence of the hyperspectral and panchromatic image features. Second, we established an encoder–decoder network for extracting representative encoded and decoded deep features. Finally, we established residual networks between the encoder network and the decoder network to enhance the extracted deep features. We evaluated the proposed method on six groups of real-world hyperspectral and panchromatic image datasets, and the experimental results confirmed the superior performance of the proposed method versus six other methods.


Introduction
Remote sensing image fusion refers to the complementary operation and processing of multisource remote sensing image data in space, time and the spectrum according to certain rules and algorithms. It yields more accurate and richer information than any single image, and generates synthetic image data with new spatial, spectral and temporal characteristics [1,2]. In the remote sensing community, hyperspectral and panchromatic images are two important image types. Hyperspectral remote sensing images usually have high spectral resolution and can provide rich spectral information. However, because the energy acquired by the remote sensing image sensor is limited, the spatial resolution has to be kept low in order to maintain a high spectral resolution. This means that the spatial details of ground objects cannot be reflected in hyperspectral remote sensing images [3]. Panchromatic remote sensing images usually have high spatial resolution and can provide many spatial details of the ground objects. However, the spectral resolution of panchromatic images is usually low, so they cannot provide enough spectral information [4]. Fusing hyperspectral and panchromatic images can therefore produce images with both high spectral and high spatial resolution. This kind of image fusion compensates for the deficiencies of the individual source images, and the fused images can be used in a variety of applications such as ground object classification [5], spectral decomposition [6], and urban target detection [7], among others.
Remote sensing image fusion can be performed at the pixel level, feature level or decision level [8]. In pixel-level fusion, the pixels of all the image data are fused directly through various algebraic operations, and the feature information of ground objects is then extracted after processing and analysis. Pixel-level fusion requires multiple sensors to be placed on the same platform to achieve both accurate spatial registration of the sensors and strict correspondence between the pixels. Pixel-level image fusion methods are generally based on the space domain and the transform domain. Feature-level fusion consists of image spatial registration, feature extraction, feature fusion and description of the attributes according to the fusion results. When multiple image sensors report similar features at the same location, the likelihood that the feature actually occurs increases, and the accuracy of the measured features improves. Feature-level image fusion is very important for target recognition and identity authentication. Decision-level fusion is the highest level of fusion. First, it carries out spatial registration of the images; feature extraction and description of the attributes of the image information are then carried out by using a large-scale database and an expert decision system to simulate the process of human analysis, reasoning, recognition and decision-making. Finally, it fuses the feature information and attributes. This method mainly aims to fuse multisource information and has strong fault tolerance. The difference between decision-level and feature-level fusion is that feature-level fusion extracts features from remote sensing images and directly fuses them into new features through various algorithms, while decision-level fusion extracts features, recognizes ground objects, and then combines the ground object information into new ground objects.
Remote sensing image fusion methods can be divided into the following categories: multiresolution analysis (MRA)-based methods; component substitution (CS)-based methods; matrix decomposition-based methods; Bayesian-based methods; and deep learning (DL)-based methods. MRA-based fusion methods first obtain information on spatial details by multiscale decomposition of the panchromatic image, and then fuse this information into the multispectral or hyperspectral image. MRA fusion methods mainly include the undecimated wavelet transform (UWT) [9], the decimated wavelet transform (DWT) [10], the curvelet transform [11], the Laplacian pyramid [12], and the contourlet transform [13]. These methods extract spatial details from panchromatic images through a spatial filter, then inject the extracted spatial details into hyperspectral images.
CS-based fusion methods replace components of the multispectral or hyperspectral images with the panchromatic image. CS fusion methods include the intensity-hue-saturation (IHS) method [14][15][16], principal component analysis (PCA) [17][18][19], and the Gram-Schmidt (GS) method, among others. They rely on projecting the hyperspectral image into another spectral space to separate the spatial and spectral information, so that the transformed hyperspectral data can be fused by replacing the spatial component with the panchromatic image. The stronger the correlation between the panchromatic image and the replaced component, the smaller the spectral loss in the fused image. Therefore, before the substitution, the panchromatic image is usually histogram-matched to the component it replaces. The fused image is then obtained through the inverse spectral transformation.
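As a rough illustration of component substitution, the sketch below follows the generalized IHS idea: the intensity component is taken as the band mean, and the difference between the matched panchromatic image and this intensity is added to every band. It is a minimal sketch under stated assumptions, not the implementation used in [14][15][16]: the function name `gihs_fuse`, the equal band weights, and the use of mean and standard deviation matching as a stand-in for full histogram matching are all assumptions made here.

```python
import numpy as np

def gihs_fuse(hs_up, pan):
    """Component-substitution sketch (assumed GIHS-style, equal band weights).

    hs_up: (bands, H, W) hyperspectral image already up-sampled to the PAN grid.
    pan:   (H, W) panchromatic image.
    """
    # Intensity component: plain average over the spectral bands (assumption).
    intensity = hs_up.mean(axis=0)
    # Moment matching of PAN to the intensity component, used here as a
    # simple stand-in for histogram matching.
    pan_matched = (pan - pan.mean()) / (pan.std() + 1e-12) * intensity.std() + intensity.mean()
    # Substituting the intensity by PAN is equivalent to adding the
    # difference (PAN - intensity) to every band.
    detail = pan_matched - intensity
    return hs_up + detail[None, :, :]
```

Because the same spatial detail is added to every band, differences between bands are left unchanged, which is why the spectral loss depends on how well the panchromatic image correlates with the replaced component.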
Matrix decomposition-based image fusion methods assume that a hyperspectral image can be decomposed into the product of a matrix of spectral primitives and a coefficient matrix. The spectral primitives are abstract representations of the spectral information, and include sparse representations [20][21][22] and low-rank representations [23,24]. In the sparse representation, the spectral primitives form a complete dictionary, and each spectrum is assumed to be a linear combination of a few dictionary items. The dictionary is usually learned from the low spatial resolution hyperspectral image by sparse dictionary learning methods, such as K-SVD [25], online dictionary learning [26], and non-negative dictionary learning. Next, sparse priors are used to regularize the coefficients, which are usually estimated with sparse coding algorithms. The low-rank representation holds that spectral features can be represented by low-dimensional subspaces, and that the matrix composed of spectral primitives is a low-rank matrix. The low-rank spectral primitives are usually obtained by vertex component analysis (VCA), simplex identification via split augmented Lagrangian (SISAL), principal component analysis, or truncated singular value decomposition (TSVD). Both sparse representation and low-rank representation methods aim to model the similarity and redundancy between spectral bands, so both can maintain the spectral characteristics well. However, the low-rank representation can greatly reduce the dimension of the spectral pattern and has lower computational complexity than the sparse representation.
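The low-rank idea can be made concrete with a truncated SVD. The sketch below factorizes a hyperspectral cube into a small spectral basis and a coefficient matrix; it is illustrative only, and the function name `low_rank_factorize`, the band-first array layout and the choice of rank are assumptions made here rather than details from the cited methods.

```python
import numpy as np

def low_rank_factorize(hsi, rank):
    """Truncated-SVD sketch of the 'spectral primitives x coefficients' model.

    hsi:  (bands, H, W) hyperspectral cube (layout assumed for illustration).
    rank: dimension of the spectral subspace to keep.
    Returns (basis, coeffs) with basis: (bands, rank), coeffs: (rank, H*W).
    """
    bands, h, w = hsi.shape
    spectra = hsi.reshape(bands, h * w)          # one column per pixel spectrum
    u, _, _ = np.linalg.svd(spectra, full_matrices=False)
    basis = u[:, :rank]                          # low-rank spectral primitives
    coeffs = basis.T @ spectra                   # projection onto the subspace
    return basis, coeffs
```

The product `basis @ coeffs` reproduces the cube exactly when the spectra really lie in a subspace of the chosen rank, and approximates it otherwise.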
The Bayesian method relies on the posterior distribution of the target image given the hyperspectral and panchromatic images. This posterior distribution is obtained by Bayesian inference and contains two factors: (a) a likelihood function, which is the probability density of the observed multispectral or hyperspectral and panchromatic images given the target image, and (b) the prior probability density of the target image, through which desired characteristics of the target image, such as piecewise smoothness, can be promoted. Selecting appropriate prior information can resolve the ill-posed inverse problem that arises in the fusion process [27]. Hyperspectral images and images with a high spatial resolution can both be described in the framework of Bayesian inference, and the posterior distribution of the Bayesian fusion model gives an intuitive interpretation of the fusion process. Since fusion problems are often ill-conditioned, Bayesian methods provide a convenient way to regularize the problem by defining an appropriate prior distribution for the scenarios of interest. Following this strategy, many scholars have designed different Bayesian estimation methods for fusing images with high spatial resolution and hyperspectral images [28,29].
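The two factors of the posterior can be written compactly. The block below is a generic statement of Bayes' rule for fusion and of the resulting maximum a posteriori estimate, not the specific model of [27], [28] or [29]; the symbols $X$ (target image), $Y_H$ (observed hyperspectral image) and $Y_P$ (observed panchromatic image) are notation assumed here for illustration.

```latex
p(X \mid Y_H, Y_P) \;\propto\; \underbrace{p(Y_H, Y_P \mid X)}_{\text{likelihood}} \; \underbrace{p(X)}_{\text{prior}},
\qquad
\hat{X}_{\mathrm{MAP}} = \arg\max_{X} \bigl[ \log p(Y_H, Y_P \mid X) + \log p(X) \bigr]
```

In this form the prior term plays the role of the regularizer mentioned in the text: it selects one solution among the many target images that explain the observations equally well.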
DL has made achievements in many fields, such as natural language processing [30], computer vision [31,32], speech recognition [33], and search engines [34]. DL-based fusion is considered a new trend, in which a network model is trained to describe the mapping among the hyperspectral images, panchromatic images and target fusion images [35]. Current DL fusion methods include the Deep Residual Pansharpening Neural Network (DRPNN) [36], the Pansharpening Neural Network (PNN) [37], and the Multiscale and Multidepth Convolutional Neural Network (MSDCNN) [38]. The DIP-HyperKite method proposed in [39] defines the spatial-domain constraint as the L1 distance between the predicted PAN image and the actual PAN image, proposes a learnable spectral response function (SRF), and also proposes a novel over-complete network, called HyperKite, which focuses on learning high-level features by constraining the receptive field from increasing in the deep layers. The RCNN method proposed in [40] utilizes the network to map the differential information between the high spatial resolution panchromatic (HR-PAN) image and the low spatial resolution multispectral (LR-MS) image. Moreover, RCNN makes full use of the LR-MS image and utilizes the gradient information of the up-sampled LR-MS image (Up-LR-MS) as auxiliary data to assist the network. Furthermore, an attention module and residual blocks are incorporated in its network structure. The MARB-Net proposed in [41] assigns multiple weights to each feature using multiple attention mechanism models. It then deeply mines and integrates the features using a residual network, and finally performs contextual semantic integration on the deep fusion features using a Bi-LSTM network. These methods can usually obtain good spectral fidelity, but the spatial enhancement in their fusion results is inadequate. Current DL-based image fusion methods usually lack richly formalized and diversified deep features, and they usually lack deep feature enhancement. Meanwhile, these methods usually regard spatial and spectral features as individual units.
In the late 1980s, the invention of the back propagation algorithm for artificial neural networks brought hope to machine learning, setting off a boom in machine learning based on statistical models. In 2006, Geoffrey Hinton and Ruslan Salakhutdinov published an article in Science, a leading academic journal, that started a wave of deep learning in academia and industry. Since then, deep learning has been gaining momentum in academia. Today, Google, Microsoft, Baidu, and other well-known high-tech companies with large data resources are competing to invest in deep learning technology.
In this study, we proposed a deep learning model of an encoder-decoder with a residual network (EDRN) for fusing hyperspectral and panchromatic images (Supplementary Materials). The advantage of an end-to-end neural network is that the model maps the original input to the final output as directly as possible, reducing manual preprocessing and follow-up processing. This gives the model more room for automatic adjustment according to the data and increases the overall fit of the model. The proposed method can be divided into three parts: (1) hyperspectral and panchromatic image combination; (2) establishment of the encoder-decoder network; and (3) residual enhancement of the encoded and decoded deep features. To overcome the independence of the hyperspectral and panchromatic image features adopted in the majority of fusion methods, we first combined the hyperspectral and panchromatic images, so that the image features could interact. This integration provides a concise way for hyperspectral and panchromatic images to exchange spectral-spatial information. Second, we established an encoder-decoder network for extracting the representative encoded and decoded deep features.
Our model extracts richly formalized encoded and decoded deep features with different feature sizes. These more diversified deep features allow image fusion to draw on more effective and hierarchically variable feature levels. Finally, to address the lack of deep feature enhancement in current fusion methods, we established residual networks between the encoder network and the decoder network to enhance the extracted encoded and decoded deep features. Our model thus obtains residual-enhanced encoded and decoded deep features to attain an enhanced image fusion result. This completes the establishment of the proposed method. The proposed DL-based image fusion method is able to enhance the extracted deep features and, at the same time, to combine hyperspectral and panchromatic images.
In this paper, we propose a novel encoder-decoder with residual network fusion model. The main contributions and novelties of the proposed method are as follows: (1) Spatial-spectral information interaction. We first up-sampled the hyperspectral image to the size of the panchromatic image, then concatenated the panchromatic and up-sampled hyperspectral images for information interaction. (2) One-to-one encoder and decoder construction. During construction, the earlier encoded layers were matched with the later decoded layers, yielding a one-to-one encoder-decoder network for extracting encoded and decoded deep features. (3) Encoded and decoded deep feature enhancement. We utilized a convolutional residual network from the encoded layers to the corresponding decoded layers: the encoded deep features were passed through a convolution and then added to the corresponding decoded deep features to enhance the deep features as a whole.
The rest of this article is organized as follows: Section 2 provides a detailed description of the proposed method. Section 3 presents the experimental results. Section 4 presents a discussion. Section 5 summarizes the conclusions.

Proposed Method
In this article, we propose an end-to-end DL model for fusing hyperspectral and panchromatic images. "End-to-end" means that the input data of the model are the original raw data, and the output data are the final result. Classical machine learning first derives features from the raw data in a preprocessing step, then uses those features in a specific application. The results of the application depend to a certain degree on the quality of the features, so earlier machine learning methods spent most of their effort on feature design; machine learning at that time might more appropriately have been called feature engineering. Later, people found that it was better to use neural networks and let the network learn how to obtain features by itself. This led to the rise of representation learning, which is more flexible for data fitting. As networks became deeper, multilayer representation learning raised the accuracy of the algorithms to a new level and led to networks in which multilevel feature extraction and the recognizer are trained and used for prediction jointly. An end-to-end neural network reduces manual preprocessing and follow-up processing and maps the original input to the final output as directly as possible. It gives the model more room for automatic adjustment according to the data and increases the overall fit of the model. Because features can be learned automatically, feature extraction is integrated into the learning algorithm without human intervention. For the end-to-end DL model proposed in this research, the inputs are the hyperspectral and panchromatic images, while the output is the fusion result.
Current DL-based remote sensing image fusion methods usually regard spatial and spectral features as individual units. At the input end of the proposed end-to-end deep learning model, we instead treat the hyperspectral and panchromatic images as a single entity, combining the spectral information in the hyperspectral image with the spatial information in the panchromatic image. This spectral-spatial combination allows feature interaction between the spectral and spatial information in the later stages of the DL model. In this operation, the hyperspectral image is up-sampled to the spatial size of the panchromatic image. Integrating the up-sampled hyperspectral and panchromatic images in this way provides spatial-spectral information interaction for image fusion and gives a concise way of combining the two images. After up-sampling the hyperspectral image, we concatenated the panchromatic image with the up-sampled hyperspectral image according to Equation (1):
$$X = C(P, \tilde{H}) \qquad (1)$$
where $P$ is the panchromatic image, $\tilde{H}$ is the up-sampled hyperspectral image, and $C(*)$ represents the combination of the up-sampled hyperspectral and panchromatic images.
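The combination step can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the text does not specify the interpolation used for up-sampling or the axis along which the images are joined, so nearest-neighbour up-sampling, channel-wise concatenation, the band-first layout and the function names are all choices made here for illustration.

```python
import numpy as np

def upsample_nearest(hsi, ratio):
    """Nearest-neighbour up-sampling of a (bands, h, w) cube (assumed method)."""
    return hsi.repeat(ratio, axis=1).repeat(ratio, axis=2)

def combine(hsi, pan):
    """Stack the up-sampled hyperspectral bands and the PAN band channel-wise.

    hsi: (bands, h, w) low-resolution hyperspectral patch.
    pan: (H, W) panchromatic patch, with H = ratio * h and W = ratio * w.
    Returns an array of shape (bands + 1, H, W).
    """
    ratio = pan.shape[0] // hsi.shape[1]
    hsi_up = upsample_nearest(hsi, ratio)
    return np.concatenate([hsi_up, pan[None, :, :]], axis=0)
```

For a 20 × 20 hyperspectral patch with 68 bands and a 240 × 240 panchromatic patch, the combined input would have shape (69, 240, 240) under these assumptions.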
Next, we took the combined image as the input data for the deep learning model. After combining the panchromatic and up-sampled hyperspectral images, we used an encoder-decoder network for elementary deep feature representation of the spectral-spatial interactive input data. An encoder-decoder is an artificial neural network used in supervised learning. The encoder is the first half of an encoder-decoder; its function is to map the input data to the middle layer to produce a hidden representation. The deep representative features extracted in this part are called encoded deep features. The decoder is the second half of an encoder-decoder; its function is to reconstruct the output data from the hidden representation in the middle layer. The deep representative features extracted in this part are called decoded deep features. There are several categories of encoder-decoder: (1) ordinary encoder-decoders, which are neural networks with three layers (i.e., a neural network with one hidden layer); (2) multilayer encoder-decoders, which extend ordinary encoder-decoders to multiple hidden layers; and (3) convolutional encoder-decoders, which replace the fully connected network with a convolutional network. We utilized the convolutional encoder-decoder in our proposed DL model. A convolutional neural network (CNN), which is a type of deep learning method, was adopted to establish a deep network fusion model for the intelligent fusion of hyperspectral and panchromatic images in this study.
The CNN is one of the representative deep learning algorithms and is a feedforward neural network containing convolution computation [42]. In a convolution layer of a CNN, each neuron is connected only to some of the neurons in the adjacent layer, and a layer usually contains multiple feature planes. Neurons in the same feature plane share the same weights, i.e., the convolution kernel. Subsampling, also known as pooling, usually takes two forms: average subsampling and maximum subsampling. Subsampling can be regarded as a special convolution process. The output of the convolution is passed through an activation function to form the feature map of the layer.
Figure 1 illustrates a schematic diagram of the encoder-decoder network. The encoder-decoder network adopted in our proposed DL model comprises three operations: convolution, pooling and up-sampling. In Figure 1, HSI refers to the hyperspectral image, PAN represents the panchromatic image, conv represents the convolution operation, pool represents the max-pooling operation and up represents up-sampling. Current DL-based image fusion methods usually lack richly formalized and diversified deep features, so we constructed an encoder-decoder network to extract such features. As shown in Figure 1, the encoder-decoder network has two parts: the left part is the encoder network and the right part is the decoder network. The encoder network includes a series of convolution layers and pooling layers. The feature sizes of the convolution layers diminish gradually because of the pooling operation. The decoder network consists of a series of up-sampling and convolution layers. The feature sizes of the convolution layers increase gradually because of up-sampling. In the encoder network, we convolved and pooled the combined hyperspectral and panchromatic image data layer by layer to form the encoded deep features, which completed the establishment of the encoder network. Equation (2) shows the convolution operation of the encoder network:
$$E_{l+1}(i,j) = \left[ W^{e}_{l+1} \otimes \bar{E}_{l} \right](i,j) + b^{e}_{l+1} \qquad (2)$$
where $\otimes$ is the convolution operation, $W^{e}_{l+1}$ is the convolution kernel, $b^{e}_{l+1}$ is the deviation (bias) value, and $\bar{E}_{l}$ and $E_{l+1}$ represent the input and output of the $(l+1)$th-level convolution in the encoder network, respectively. Equation (3) shows the pooling operation in the encoder network:
$$\bar{E}_{l+1}(i,j) = \mathrm{pool}(E_{l+1})(i,j) = \left[ \frac{1}{k^{2}} \sum_{x=1}^{k} \sum_{y=1}^{k} E_{l+1}(s_0 i + x,\; s_0 j + y)^{p} \right]^{1/p} \qquad (3)$$
where pool(*) represents the pooling function, $E_{l+1}$ and $\bar{E}_{l+1}$ refer to the input and output of the $(l+1)$th-level pooling in the encoder network, $k$ is the size of the pooling window, the step length is $s_0$, and pixel $(i, j)$ has the same meaning as in the convolution layer. When $p = 1$, pooling takes the mean value in the pooled area, i.e., average pooling; when $p \to \infty$, pooling takes the maximum value in the region, i.e., max-pooling. At the same time, in the decoder network, we applied up-sampling and convolution layer by layer, starting from the middle layer between the encoder and decoder networks, to form the decoded deep features, which completed the establishment of the decoder network. Equation (4) shows the convolution operation of the decoder network:
$$D_{l+1}(i,j) = \left[ W^{d}_{l+1} \otimes D_{l} \right](i,j) + b^{d}_{l+1} \qquad (4)$$
where $D_{l}$ and $D_{l+1}$ represent the input and output of the $(l+1)$th layer's convolution in the decoder network, and $W^{d}_{l+1}$ and $b^{d}_{l+1}$ are the corresponding convolution kernel and bias.
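The role of the exponent in the pooling rule of Equation (3) can be checked numerically. The sketch below implements a normalized Lp pooling over non-overlapping windows; it is illustrative only, and the function name `lp_pool`, the normalization by the window area and the choice of stride equal to the window size are assumptions made here.

```python
import numpy as np

def lp_pool(x, k=2, p=1.0):
    """Normalized Lp pooling over non-overlapping k x k windows (sketch).

    x: (H, W) non-negative feature map; rows/columns that do not fill a
       complete window are dropped.
    p: exponent; p = 1 gives average pooling, p = inf gives max-pooling.
    """
    h, w = x.shape
    blocks = x[: h - h % k, : w - w % k].reshape(h // k, k, w // k, k)
    if np.isinf(p):
        return blocks.max(axis=(1, 3))           # limiting case p -> infinity
    return np.mean(np.abs(blocks) ** p, axis=(1, 3)) ** (1.0 / p)
```

With a 2 × 2 window, as used in the experiments, each pooling step halves the height and width of the feature map; increasing `p` moves the result smoothly from the window mean towards the window maximum.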
The encoder and decoder networks have the same number of levels, and each layer of the encoder network corresponds one to one with a layer of the decoder network. In the encoder network, we used the pooling operation between layers to obtain encoded convolution feature blocks with different sizes and dimensions for each layer. In the decoder network, we used up-sampling between layers to obtain decoded convolution feature blocks with the same size as the corresponding encoded convolution feature blocks. In this way, we established encoder and decoder networks whose corresponding feature blocks have the same size and dimension. To overcome the lack of richly formalized and diversified deep features in current DL-based image fusion methods, our model extracts richly formalized encoded and decoded deep features with different feature sizes. These more diversified deep features allow image fusion to draw on more effective and hierarchically variable feature levels.
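The one-to-one size correspondence can be illustrated with simple bookkeeping. The sketch below lists the spatial size after each encoder level (halved by pooling) and each decoder level (doubled by up-sampling); the function name, the pooling factor of two and the example of four levels on a 240-pixel patch are assumptions taken from, or made to match, the experimental setup rather than a definitive description of the network.

```python
def encoder_decoder_sizes(input_size, levels, factor=2):
    """Spatial sizes along an encoder-decoder with `levels` pooling steps.

    Returns (enc, dec): enc[i] is the size after i pooling steps and
    dec[j] is the size after j up-sampling steps from the middle layer.
    """
    enc = [input_size]
    for _ in range(levels):
        enc.append(enc[-1] // factor)            # pooling shrinks the map
    dec = [enc[-1]]
    for _ in range(levels):
        dec.append(dec[-1] * factor)             # up-sampling restores it
    return enc, dec

enc, dec = encoder_decoder_sizes(240, levels=4)
# Encoder level i and decoder level (levels - i) share the same size.
pairs = [(enc[i], dec[4 - i]) for i in range(5)]
```

Under these assumptions the sizes run 240, 120, 60, 30, 15 in the encoder and back up in the decoder, so every encoded block has a decoded partner of identical spatial size.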
Current DL-based image fusion methods usually lack deep feature enhancement, so we utilized a residual network to enhance the extracted encoded and decoded deep features. Figure 2 is a schematic diagram of residual enhancement for the encoder network and the decoder network. In Figure 2, the plus sign represents the residual block. This study used a residual network structure to adjust and enhance the encoded and decoded deep features. For the residual network, we denote the desired underlying mapping function as $H(*)$ and let the stacked nonlinear layers fit another mapping, as shown in Equation (5):
$$F(y) = H(*) - x \qquad (5)$$
where $F(y)$ is the residual part, $x$ is the mapping part, $y$ is the residual variable and $x$ is the mapping variable. The original mapping was recast into Equation (6):
$$H(*) = x + F(y) \qquad (6)$$
The formulation $x + F(y)$ can be realized by feedforward neural networks with shortcut connections. The shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers. Identity shortcut connections add neither extra parameters nor computational complexity [43]. For the convolutional residual network with Equation (6), there is a convolutional operation in the residual part $F(y)$ and an identity mapping for the mapping part $x$. We adopted a convolutional residual network in the proposed EDRN method. The residual structure was established at each level of convolution in the decoder network, joining it to the corresponding convolution in the encoder. When establishing the encoder and decoder networks, we formulated the corresponding one-to-one encoded and decoded convolutions. To establish the residual enhancement structure between the encoder and decoder networks, we added each encoded convolution layer to the corresponding decoded convolution layer with the same convolution feature size, as shown in Figure 2.
Before the addition operation, a convolution operation was applied to enhance the residual network. The residual network consists of a series of residual blocks, each comprising a mapping part and a residual part. Equation (7) shows the operation of the residual enhancement network structure:
$$R_{i} = D_{i} + F(E_{L-i}) \qquad (7)$$
where $D_{i}$ is the result of the convolution at the $i$th level in the decoder network, which is the mapping part for identity mapping in the residual block; $F(E_{L-i})$ is the residual part in the residual block according to Equation (6), computed from the corresponding encoded features $E_{L-i}$; $R_{i}$ is the residual-enhanced result of the convolution at the $i$th level in the decoder network; and $L$ is the total number of convolution layers in the decoder network. For convolution of the residual network, Equation (8) represents $F(E_{L-i})$ as a convolution operation:
$$F(E_{L-i}) = W_{i} \otimes E_{L-i} + b_{i} \qquad (8)$$
where $W_{i}$ is the convolution weight of the $i$th residual part and $b_{i}$ is the bias of the $i$th residual part. Substituting Equation (8) into Equation (7) yields Equation (9):
$$R_{i} = D_{i} + W_{i} \otimes E_{L-i} + b_{i} \qquad (9)$$
To overcome the lack of deep feature enhancement in current DL-based image fusion methods, our model obtains residual-enhanced encoded and decoded deep features to attain an enhanced image fusion result.
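The residual enhancement step, in which encoded features pass through a convolution and are then added to the decoded features of the same size, can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' PyTorch implementation: the function names, the zero padding, the 3 × 3 kernel and the use of cross-correlation (the convention of most deep learning frameworks) are choices made here for illustration.

```python
import numpy as np

def conv2d_same(x, w, b):
    """'Same'-size 2-D convolution (cross-correlation form, zero padding).

    x: (C_in, H, W) input features, w: (C_out, C_in, k, k) kernel with odd k,
    b: (C_out,) bias. Returns (C_out, H, W).
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for dy in range(k):
        for dx in range(k):
            window = xp[:, dy:dy + h, dx:dx + wd]          # shifted input
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], window)
    return out + b[:, None, None]

def residual_enhance(decoded, encoded, w, b):
    """Decoded features (identity path) plus convolved encoded features."""
    return decoded + conv2d_same(encoded, w, b)
```

If the kernel and bias are zero the decoded features pass through unchanged, which reflects the identity mapping of the shortcut; training the kernel lets the encoder contribute a learned correction at each level.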
For the encoder and decoder networks, we set a specific spatial size and number of channels for each encoded and decoded layer. We constructed the final data cube from the final layer of the decoder network. The spatial size of this final layer was the same as that of the panchromatic image, and we set its number of channels to the number of bands of the hyperspectral image. The spatial and spectral resolution of the final decoder layer was therefore the same as that of the ground truth image. Constructing the final data cube in this way allowed it to be used in the loss function computed against the ground truth image.

Results
This section includes a series of experiments used to verify and evaluate the fusion performance of the proposed EDRN method. The components of this section are as follows.
(1) Description of the hyperspectral and panchromatic datasets used for verifying and investigating the performance of the proposed EDRN method.
(2) A comparison of the experimental hyperspectral and panchromatic datasets for the proposed EDRN method versus other fusion methods.

Description of the Experimental Datasets
We utilized six groups of real-world hyperspectral and panchromatic datasets in our experiments to verify and investigate the performance of the proposed EDRN method. The datasets had different characteristics in terms of image coverage and the distribution and clutter of ground land cover.
We obtained the groups of datasets using the ZY-1E hyperspectral and panchromatic remote sensors. The hyperspectral sensor contains 90 spectral channels with a 30 m spatial resolution. After removing the bands with low signal-to-noise ratio, poor quality and water absorption (22-29, 49-58 and 88-90), there were 68 bands remaining. The panchromatic remote sensor provides relatively unambiguous optical imagery of the ground land cover with a 2.5 m spatial resolution. The size of all panchromatic datasets was 3600 × 3600 pixels and that of the hyperspectral datasets was 300 × 300 pixels.
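The image sizes above follow directly from the ratio of the two spatial resolutions, as the short check below shows. It is a sketch for illustration only; the helper name `scale_ratio` is an assumption made here.

```python
def scale_ratio(coarse_res_m, fine_res_m):
    """Ratio between a coarse and a fine ground sampling distance."""
    return coarse_res_m / fine_res_m

ratio = scale_ratio(30.0, 2.5)          # hyperspectral vs. panchromatic
pan_pixels = 3600
hs_pixels = pan_pixels / ratio          # pixels covering the same ground extent
```

A ratio of 12 between the 30 m and 2.5 m resolutions means that 3600 panchromatic pixels cover the same ground extent as 300 hyperspectral pixels.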
We collected the first hyperspectral and panchromatic dataset in the Baiyangdian region in Hebei, China, which includes ground land cover of shadows, croplands, roads and buildings. The panchromatic, hyperspectral and ground truth images in this region are shown in Figure 3. We collected the second dataset from the Chaohu region in the middle reaches of the Yangtze River, China, which includes ground land cover of croplands, mountains, roads and water. The panchromatic, hyperspectral and ground truth images of the Chaohu region are shown in Figure 4. We collected the third dataset from the Dianchi region in Kunming, China, which includes ground land cover of jungles, rivers, water and mountains. The panchromatic, hyperspectral and ground truth images of this region are shown in Figure 5. The RGB images of the hyperspectral data and ground truth data are composed of three bands for illustration; we chose bands 11, 6 and 2 as the red, green and blue channels. For all three datasets, the ground truth images had the same spatial resolution as the respective panchromatic images and a size of 3600 × 3600 pixels. Meanwhile, the ground truth images had the same spectral resolution as the respective hyperspectral images. All the ground truth images had 68 spectral bands (after removing the bands with low signal-to-noise ratio, poor quality and water absorption). We obtained all the ground truths of the three datasets using unmanned aerial vehicles carrying remote sensing image sensors. The sensor had the same retrievable spectral resolution and wavelength range as the hyperspectral images of the three datasets in our experiments. Meanwhile, because the unmanned aerial vehicle flew at low altitude, the images obtained from this sensor had very high spatial resolution, the same as that of the panchromatic images of the three datasets in our experiments.
In addition, we utilized three other datasets in our experiments: (1) The Pavia Center scene was captured by the ROSIS camera. The original HSI consists of 115 spectral bands spanning from 430 to 960 nm and has 1096 × 1096 pixels with a spatial resolution of 1.3 m. Thirteen noisy bands were discarded, resulting in an HSI with 102 spectral bands spanning from 430 to 860 nm. In addition, a rectangular area of 1096 × 381 pixels with no information at the center of the original HSI was discarded, and the resulting "two-part" image with a size of 1096 × 715 × 102 was used for the experiments. We used only the top-left corner of the HSI, with a size of 960 × 640 × 102, and partitioned it into 24 cubic patches of size 160 × 160 × 102 with no overlap. To generate the panchromatic (PAN) images and low spatial resolution hyperspectral images (LR-HSI) corresponding to each high spatial resolution hyperspectral image (HR-HSI), we utilized Wald's protocol [44].
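The patch partition and the simulation of inputs from a reference cube can be sketched as follows. This is a hedged sketch, not the exact procedure of [44]: the function names are hypothetical, and block averaging for the low-resolution cube and band averaging for the panchromatic image are simplifying assumptions made here, since the text does not state which degradation filter or spectral response was used.

```python
import numpy as np

def partition(cube, patch):
    """Split an (H, W, bands) cube into non-overlapping patch x patch tiles."""
    h, w, _ = cube.shape
    return [cube[i:i + patch, j:j + patch, :]
            for i in range(0, h - patch + 1, patch)
            for j in range(0, w - patch + 1, patch)]

def simulate_inputs(hr_hsi, ratio):
    """Simulate PAN and LR-HSI from a reference HR-HSI patch (assumed scheme).

    PAN:    average over all bands (assumed flat spectral response).
    LR-HSI: mean over ratio x ratio blocks (assumed box-filter degradation).
    """
    h, w, bands = hr_hsi.shape
    pan = hr_hsi.mean(axis=2)
    lr = hr_hsi.reshape(h // ratio, ratio, w // ratio, ratio, bands).mean(axis=(1, 3))
    return pan, lr
```

For the sizes quoted above, a 960 × 640 region tiled with 160 × 160 patches gives 6 × 4 = 24 patches, and each reference patch then serves as the ground truth for the fusion of its simulated PAN and LR-HSI pair.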

Experimental Setup
Experiments on the datasets described above were conducted to evaluate the performance of the proposed EDRN method. Seven image fusion methods were chosen for the experimental comparison. They were implemented in MATLAB R2016 software and included the Coupled Nonnegative Matrix Factorization (CNMF) [24], Modulation Transfer Function-Generalized Laplacian Pyramid (MTF_GLP) [45], General Intensity-Hue-Saturation (GIHS) [46], A Trous Wavelet transform-based Pan-sharpening (AWLP) [47], High Pass Filtering (HPF) [48], Smoothing Filter-based Intensity Modulation (SFIM) [48] and DIP-HyperKite [39] methods. Among the possible pairs of filters and injection coefficients, the HPF method uses the simplest scheme: a box mask with additive injection. The box mask is a mask with uniform weights that implements averaging. The HPF method extrapolates edge information from a high-resolution band to the lower spatial resolution bands. A small high-pass spatial filter is applied to the higher spatial resolution data, which is the PAN image in this case. The output of this filter contains the high-frequency component, which relates mostly to spatial information, while most of the spectral information is removed. The SFIM method, on the other hand, employs the high-pass modulation (HPM) injection scheme and is based on a simplified model of solar radiation and land surface reflection. By using the ratio between a higher resolution image and its low-pass filtered (smoothed) version, spatial details can be modulated into a co-registered lower resolution multispectral image without altering its spectral properties and contrast.
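The difference between the additive injection of HPF and the ratio-based modulation of SFIM can be sketched on a one-dimensional toy signal. This is a simplified illustration, not the MATLAB implementations used in the comparison: the function names, the 3-sample box width, the edge handling and the small `eps` guard are all assumptions made for the example.

```python
# 1-D toy versions of the two injection schemes described above.
# Assumptions: 3-sample box mask, shrinking window at the edges,
# and an eps guard against division by zero.

def box_filter(signal, width=3):
    """Low-pass filter with uniform weights (box mask)."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def hpf_fuse(band_up, pan, width=3):
    """Additive injection: add the high-pass detail of PAN to the band."""
    low = box_filter(pan, width)
    return [m + (p - l) for m, p, l in zip(band_up, pan, low)]


def sfim_fuse(band_up, pan, width=3, eps=1e-12):
    """Multiplicative injection: scale the band by PAN / low-pass(PAN)."""
    low = box_filter(pan, width)
    return [m * p / (l + eps) for m, p, l in zip(band_up, pan, low)]


pan = [10.0, 10.0, 20.0, 20.0]   # a step edge seen at high resolution
band_up = [5.0, 5.0, 5.0, 5.0]   # up-sampled band with no spatial detail
print(hpf_fuse(band_up, pan))    # edge detail added to the band
print(sfim_fuse(band_up, pan))   # band rescaled around the edge
```

Where the PAN signal is flat, both schemes leave the up-sampled band unchanged; they differ only in how the detail around an edge is transferred.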
We implemented the proposed EDRN method with four convolution and pooling layers in the encoder and decoder on all three datasets, and set the residual network from each earlier encoder layer to the corresponding later decoder layer. We also set the number of channels of the final decoder layer to the number of hyperspectral bands to construct the final output. All the convolution kernels in both the encoder and the decoder had a size of 3 × 3. The pooling kernels in the encoder had a size of 2 × 2; thus, in the encoder, the output of one convolution layer followed by its pooling layer shrank to a quarter of the input data size. Each up-sampling layer in the decoder up-sampled its input by a factor of two. The input data was a 240 × 240 image patch for the panchromatic image and a 20 × 20 image patch for the hyperspectral image. The total number of experimental image patches was 78,400. We split them into 50,176 patches for training, 12,544 for validation and 15,680 for testing. At initialization, we set the weights of each convolutional layer with Xavier normal initialization and the bias of each convolutional layer to zero; the batch size was 32 in our experiments. We trained our model on the PyTorch platform with an NVIDIA GeForce RTX 3080 Ti graphics card. The input of the network was the concatenation of the panchromatic image and the up-sampled hyperspectral image. The loss function of our network was the root mean square error (RMSE), and the optimizer was Adam.
Figure 9 illustrates the qualitative details of the proposed EDRN model described above.
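The sizes implied by this setup can be checked with a few lines of arithmetic. The sketch below only traces spatial sizes and counts; it is not a network implementation, and the function name `encoder_decoder_sizes` is hypothetical. The band count of 68 applies to the Baiyangdian, Chaohu and Dianchi datasets, and pairing encoder and decoder stages by equal spatial size is one reading of the residual connections described above.

```python
# Bookkeeping for the setup described above (no learning involved).

def encoder_decoder_sizes(patch=240, levels=4):
    """Spatial side length after each encoder and decoder stage."""
    enc = [patch]
    for _ in range(levels):
        enc.append(enc[-1] // 2)   # 2 x 2 pooling halves each side
    dec = [enc[-1]]
    for _ in range(levels):
        dec.append(dec[-1] * 2)    # each up-sampling layer doubles each side
    return enc, dec


enc, dec = encoder_decoder_sizes()
print(enc)  # [240, 120, 60, 30, 15]
print(dec)  # [15, 30, 60, 120, 240]

# Encoder stage k and decoder stage (levels - k) share a spatial size,
# which is what lets a residual connection join them (assumed pairing).
pairs = [(enc[k], dec[len(dec) - 1 - k]) for k in range(len(enc))]

bands = 68                    # hyperspectral bands in the first three datasets
in_channels = 1 + bands       # PAN channel concatenated with up-sampled bands
upsample_factor = 240 // 20   # hyperspectral patch 20 -> 240 before concatenation

total, train, val, test = 78400, 50176, 12544, 15680
print(train / total, val / total, test / total)  # 0.64 0.16 0.2
```

Each pooling step halves both sides, so the number of pixels per feature map drops to a quarter, which matches the statement about the encoder output size.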

Experimental Results and Analysis
Figure 10 illustrates the RGB images of the ground truth, the compared methods and the proposed EDRN method for the Baiyangdian dataset. For buildings, the AWLP, GIHS, MTF_GLP and proposed EDRN methods achieved better fusion performance than the CNMF method. For roads, the AWLP, MTF_GLP and proposed EDRN methods achieved better fusion performance than the CNMF and GIHS methods. For croplands, the AWLP and proposed EDRN methods achieved better performance than the CNMF, GIHS and MTF_GLP methods. For shadows, the GIHS and proposed EDRN methods achieved better fusion performance than the AWLP, CNMF and MTF_GLP methods. The GIHS method did not provide reasonable spectral restoration with respect to the ground truth image, so its fusion result had poor RGB contrast. The HPF and SFIM methods achieved good restoration for buildings and roads but poor results for croplands and shadows. The DIP-HyperKite method achieved good restoration for buildings, roads and shadows but poor results for croplands. To sum up, the proposed EDRN method achieved the best visual effects among the fusion methods.

Figure 11 shows the RGB images of the ground truth, the compared methods and the proposed EDRN method for the Chaohu region. For water, the fusion results of the AWLP, CNMF and proposed EDRN methods were closer to the ground truth image than those of the GIHS and MTF_GLP methods. For mountains and croplands, the proposed EDRN method produced better restoration than the AWLP and CNMF methods. For all classes of land cover, the GIHS and MTF_GLP methods performed worse than the other compared methods and the proposed EDRN method. For roads, the AWLP, CNMF and proposed EDRN methods had better fusion performance than the GIHS and MTF_GLP methods. The fusion results of the GIHS method did not achieve reasonable spectral restoration with respect to the ground truth image and had poor RGB contrast. For roads, croplands and mountains, the MTF_GLP method had poor fusion performance. The HPF and SFIM methods had poor performance for shadows, roads and croplands but good performance for mountains. The DIP-HyperKite method had good fusion performance for roads, croplands and mountains but poor performance for shadows. In conclusion, for all classes of land cover, the proposed EDRN method had better performance and better sharpness than the other fusion methods.

Figure 12 illustrates the RGB images of the ground truth, the compared methods and the proposed EDRN method for Dianchi. The AWLP, GIHS and MTF_GLP methods had poor restoration performance for water, while the CNMF and proposed EDRN methods had good fusion performance. For the water class, the fusion results of the GIHS method had poor RGB contrast and did not show good spectral performance with respect to the ground truth image. The CNMF and MTF_GLP methods had poor fusion performance. For jungles and mountains, the AWLP, GIHS and proposed EDRN methods had better restoration performance than the CNMF and MTF_GLP methods. The HPF and SFIM methods had good performance for mountains and water but poor performance for jungles. The DIP-HyperKite method had good performance for water and jungles but poor performance for mountains. In brief, the proposed EDRN method achieved better results for all classes of land cover than all the other fusion methods.

Figure 13 illustrates the RGB images of the ground truth, the compared methods and the proposed EDRN method for Pavia Center. The AWLP method achieved good performance for buildings, water and vegetation but poor performance for roads. The CNMF method achieved good performance for water and vegetation but poor performance for buildings and roads. The GIHS method achieved good performance for water, buildings and roads but poor performance for vegetation. The MTF_GLP method achieved good performance for roads, water and vegetation but poor performance for buildings. The HPF method achieved good performance for water but poor performance for roads, buildings and vegetation. The SFIM method achieved good performance for buildings and roads but poor performance for water and vegetation. The DIP-HyperKite method achieved good performance for buildings, water and vegetation but poor performance for roads. The proposed EDRN method achieved good performance for all the land covers.

Figure 14 illustrates the RGB images of the ground truth, the compared methods and the proposed EDRN method for Botswana. The AWLP method achieved good performance for mountains, water and vegetation but poor performance for jungles. The CNMF method achieved good performance for water and vegetation but poor performance for mountains and jungles. The GIHS method achieved good performance for water, mountains and jungles but poor performance for vegetation. The MTF_GLP method achieved good performance for jungles, water and vegetation but poor performance for mountains. The HPF method achieved good performance for water but poor performance for jungles, mountains and vegetation. The SFIM method achieved good performance for mountains and jungles but poor performance for water and vegetation. The DIP-HyperKite method achieved good performance for mountains, water and vegetation but poor performance for jungles. The proposed EDRN method achieved good performance for all the land covers.

Figure 15 illustrates the RGB images of the ground truth, the compared methods and the proposed EDRN method for Chikusei. The AWLP method achieved good performance for buildings, water and croplands but poor performance for bare lands. The CNMF method achieved good performance for water and croplands but poor performance for buildings and bare lands. The GIHS method achieved good performance for water, buildings and bare lands but poor performance for croplands. The MTF_GLP method achieved good performance for bare lands, water and croplands but poor performance for buildings. The HPF method achieved good performance for water but poor performance for bare lands, buildings and croplands. The SFIM method achieved poor performance for all the land covers. The DIP-HyperKite method achieved good performance for buildings, water and croplands but poor performance for bare lands. The proposed EDRN method achieved good performance for all the land covers.

Discussion
We utilized a series of image fusion performance indices to verify the spectral and spatial performance of the fusion results in our experiments. We adopted eight performance indices in this study, namely the structural similarity index (SSIM), the peak signal-to-noise ratio (PSNR), spectral curve comparison, the spatial correlation coefficient (SCC), the spectral angle mapper (SAM), the root mean squared error (RMSE), the relative dimensionless global error in synthesis (ERGAS) and the Q metric [48,49].

We computed the SSIM index as in Equation (10), where $F$ is the fused image, $G$ is the ground truth, $i = 1, 2, 3, \ldots, N$ with $N$ the number of bands of the hyperspectral image, $F_i$ is the $i$th band of the fused image, $G_i$ is the $i$th band of the ground truth, $\mu_{F_i}$ is the average value of $F_i$, $\mu_{G_i}$ is the average value of $G_i$, $\sigma_{F_i G_i}$ is the covariance between $F_i$ and $G_i$, $\sigma_{F_i}^2$ is the variance of $F_i$ and $\sigma_{G_i}^2$ is the variance of $G_i$. The SSIM index is a band-based performance index. We computed the PSNR index as in Equation (11), where $M$ is the number of pixels of the fused image, $j = 1, 2, 3, \ldots, M$, $F_{i,j}$ is the $j$th value in $F_i$ and $G_{i,j}$ is the $j$th value in $G_i$. The PSNR index is also a band-based performance index. We computed the SCC index as in Equation (12); the SCC index is also a band-based performance index. We computed the SAM index as in Equation (13), where $f_j$ is the $j$th pixel in the fused image, $g_j$ is the $j$th pixel in the ground truth and $\|\cdot\|$ is the norm. The SAM index is a pixel-based performance index. We computed the RMSE index as in Equation (14); the RMSE index is a pixel-based performance index. We computed the ERGAS index as in Equation (15), where $d$ is the spatial downsampling factor. The ERGAS index is a band-based performance index. We computed the Q metric index as in Equation (16); the Q metric index is a pixel-based performance index.

Among the performance indices, the SCC and SSIM are spatial quality metrics, and the SAM and the spectral curve comparison are spectral quality metrics. The RMSE, PSNR, ERGAS and Q metric are comprehensive spatial-spectral quality metrics. We utilized the SAM, spectral curve comparison, RMSE, PSNR, ERGAS and Q metric to test the spectral information enhancement in high-resolution images. We also utilized the SCC, SSIM, RMSE, PSNR, ERGAS and Q metric to test the spatial information enhancement in hyperspectral images.

The spectral curve comparison compares the spectral curve of a pixel in the result of a fusion method with the spectral curve of the corresponding pixel in the original hyperspectral image. For the Baiyangdian, Chaohu and Dianchi datasets, we compared the spectral curve of the (360, 360) pixel in the fusion results of all the compared and proposed methods with the spectral curve of the (30, 30) pixel in the corresponding hyperspectral image. For the Pavia Center dataset, we compared the spectral curve of the (120, 120) pixel in the fusion results with the spectral curve of the (30, 30) pixel in the corresponding hyperspectral image. For the Botswana dataset, we compared the spectral curve of the (90, 90) pixel in the fusion results with the spectral curve of the (30, 30) pixel in the corresponding hyperspectral image. For the Chikusei dataset, we compared the spectral curve of the (120, 120) pixel in the fusion results with the spectral curve of the (30, 30) pixel in the corresponding hyperspectral image.

It is important to emphasize that larger values are better for PSNR, SSIM and the Q metric, and smaller values are better for RMSE, SAM and ERGAS. For SCC, a method performs better when its SCC values are larger than those of the other fusion methods in most bands. For the spectral curve comparison, a method performs better when its spectral curve is closer to the original spectral curve in the hyperspectral image.
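For orientation, the widely used textbook forms of five of these indices are given below. This is a sketch under stated assumptions rather than a restatement of Equations (10)-(16): the notation is chosen for this example ($F$ and $G$ for the fused and ground truth images, $N$ bands, $M$ pixels, $d$ for the spatial downsampling factor), the stabilizing constants $C_1$ and $C_2$ of the standard SSIM are an assumption, and the exact variants used in the experiments may differ. The SCC and the Q metric have several variants in the literature and are omitted.

```latex
\begin{align}
\mathrm{SSIM}(F,G) &= \frac{1}{N}\sum_{i=1}^{N}
  \frac{\left(2\mu_{F_i}\mu_{G_i}+C_1\right)\left(2\sigma_{F_iG_i}+C_2\right)}
       {\left(\mu_{F_i}^2+\mu_{G_i}^2+C_1\right)\left(\sigma_{F_i}^2+\sigma_{G_i}^2+C_2\right)} \\
\mathrm{PSNR}(F,G) &= \frac{1}{N}\sum_{i=1}^{N}
  10\log_{10}\frac{\max(G_i)^2}{\frac{1}{M}\sum_{j=1}^{M}\left(F_{i,j}-G_{i,j}\right)^2} \\
\mathrm{SAM}(F,G) &= \frac{1}{M}\sum_{j=1}^{M}
  \arccos\frac{\langle f_j, g_j\rangle}{\|f_j\|\,\|g_j\|} \\
\mathrm{RMSE}(F,G) &= \sqrt{\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\left(F_{i,j}-G_{i,j}\right)^2} \\
\mathrm{ERGAS}(F,G) &= \frac{100}{d}\sqrt{\frac{1}{N}\sum_{i=1}^{N}
  \left(\frac{\mathrm{RMSE}_i}{\mu_{G_i}}\right)^2},
  \qquad \mathrm{RMSE}_i = \sqrt{\frac{1}{M}\sum_{j=1}^{M}\left(F_{i,j}-G_{i,j}\right)^2}
\end{align}
```

In these forms, SSIM, PSNR and ERGAS average a per-band quantity, whereas SAM averages a per-pixel angle between spectra, which is consistent with the band-based and pixel-based labels used for those indices.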
Figure 16 shows the quality of the compared and proposed fusion methods for the Baiyangdian dataset in terms of SCC and spectral curve comparison, while Table 1 lists the quality of the compared and proposed fusion methods for the Baiyangdian dataset in terms of RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric. In Table 1, the best performance for each index is shown in bold font. The AWLP, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved much better performance than the CNMF and GIHS methods, while the proposed EDRN method achieved the best RMSE performance, with an RMSE lower than 100. The AWLP, CNMF, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved much better performance than the GIHS method, while the EDRN method achieved the best performance for the SAM index. The GIHS and proposed EDRN methods achieved better performance than the other methods, while the proposed EDRN method achieved the best SCC performance in most of the spectral bands. The proposed EDRN method also achieved the best performance for the spectral curve comparison. The AWLP, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved better performance than CNMF and GIHS, while the proposed EDRN method achieved the best performance for the PSNR index. The GIHS, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved better performance than AWLP and CNMF, while the proposed EDRN method achieved the best performance for the SSIM index. The AWLP, MTF_GLP, HPF, DIP-HyperKite and proposed EDRN methods achieved better performance than the CNMF, GIHS and SFIM methods, while the proposed EDRN method achieved the best performance for the ERGAS index. The MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved better performance than the AWLP, CNMF, GIHS, HPF and SFIM methods, while the proposed EDRN method had the best performance for the Q metric index. Hence, the proposed EDRN method achieved the best performance for all eight evaluation indices.

Figure 17 shows the quality of the compared and proposed methods for the Chaohu dataset in terms of SCC and spectral curve comparison, while Table 2 lists the quality of the compared and proposed fusion methods for the Chaohu dataset in terms of RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric. In Table 2, the best performance for each index is shown in bold font. The AWLP, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved much better performance than the CNMF and GIHS methods, while the proposed EDRN method achieved the best RMSE performance. The AWLP, CNMF, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved much better performance than the GIHS method, while the EDRN method achieved the best performance for the SAM index. The proposed EDRN method achieved better performance than all the compared methods in most of the spectral bands for the SCC index. The proposed EDRN method also achieved the best performance for the spectral curve comparison. The AWLP, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved better performance than CNMF and GIHS, and the proposed EDRN method achieved the best performance for the PSNR index. The GIHS, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved better performance than AWLP and CNMF, and the proposed EDRN method achieved the best performance for the SSIM index. The AWLP, CNMF, MTF_GLP, HPF, DIP-HyperKite and proposed EDRN methods achieved better performance than the GIHS and SFIM methods, while the proposed EDRN method had the best performance for the ERGAS index. The GIHS, MTF_GLP, HPF, SFIM, DIP-HyperKite and proposed EDRN methods achieved better performance than the AWLP and CNMF methods, while the proposed EDRN method achieved the best performance for the Q metric index. Hence, the proposed EDRN method achieved the best performance for all eight evaluation indices.

Figure 18 shows the quality of the compared and proposed methods for the Dianchi dataset in terms of SCC and spectral curve comparison. Table 3 lists the quality of the compared and proposed fusion methods for the Dianchi dataset in terms of RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric. In Table 3, the best performance for each index is shown in bold font. The AWLP, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved much better performance than the CNMF and GIHS methods, and the proposed EDRN method achieved the best RMSE performance. The AWLP, CNMF, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved much better performance than the GIHS method, and the EDRN method achieved the best performance for the SAM index. The proposed EDRN method achieved better performance than all the other methods in most of the spectral bands for the SCC index. The AWLP and proposed EDRN methods achieved better performance than the CNMF, GIHS and MTF_GLP methods for the spectral curve comparison, with the proposed EDRN method achieving the best performance. The AWLP, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved better performance than CNMF and GIHS, and the proposed EDRN method achieved the best performance for the PSNR index. The GIHS, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved better performance than AWLP and CNMF, and the proposed EDRN method achieved the best performance for the SSIM index. The MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved better performance than the AWLP, CNMF, GIHS, HPF and SFIM methods, while the proposed EDRN method achieved the best performance for the ERGAS index. The GIHS, MTF_GLP, HPF, SFIM, DIP-HyperKite and proposed EDRN methods achieved better performance than the AWLP and CNMF methods, while the proposed EDRN method achieved the best performance for the Q metric index.

In Tables 4-6, the best performance for each index is shown in bold font. The AWLP, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved much better performance than the CNMF and GIHS methods, while the proposed EDRN method achieved the best RMSE performance. The AWLP, CNMF, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved much better performance than the GIHS method, while the EDRN method achieved the best performance for the SAM index. The proposed EDRN method achieved better performance than all the compared methods in most of the spectral bands for the SCC index. The proposed EDRN method also achieved the best performance for the spectral curve comparison. The AWLP, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved better performance than CNMF and GIHS, and the proposed EDRN method achieved the best performance for the PSNR index. The GIHS, MTF_GLP, DIP-HyperKite and proposed EDRN methods achieved better performance than AWLP and CNMF, and the proposed EDRN method achieved the best performance for the SSIM index. The AWLP, CNMF, MTF_GLP, HPF, DIP-HyperKite and proposed EDRN methods achieved better performance than the GIHS and SFIM methods, while the proposed EDRN method had the best performance for the ERGAS index. The GIHS, MTF_GLP, HPF, SFIM, DIP-HyperKite and proposed EDRN methods achieved better performance than the AWLP and CNMF methods, while the proposed EDRN method achieved the best performance for the Q metric index. Hence, the proposed EDRN method achieved the best performance for all eight evaluation indices. Across the quality evaluation on the six datasets, the proposed EDRN method achieved better evaluation performance than all the other fusion methods.
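To make the direction of each index concrete, the sketch below computes RMSE, PSNR, SAM and ERGAS on a tiny example using only the Python standard library. It follows the common textbook definitions, which may differ in detail from the variants used for the tables, and it assumes non-zero spectra and non-zero band means. Images are stored as `cube[band][pixel]` with the spatial grid flattened; all function names are hypothetical.

```python
import math

# Textbook-style metric sketches; cubes are cube[band][pixel] (flattened grid).
# Assumes non-zero pixel spectra and non-zero band means in the ground truth.

def rmse(fused, truth):
    """Root mean squared error over all bands and pixels (lower is better)."""
    sq = [(f - g) ** 2 for fb, gb in zip(fused, truth) for f, g in zip(fb, gb)]
    return math.sqrt(sum(sq) / len(sq))


def psnr(fused, truth):
    """Mean per-band PSNR in dB, peak = band maximum (higher is better)."""
    values = []
    for fb, gb in zip(fused, truth):
        mse = sum((f - g) ** 2 for f, g in zip(fb, gb)) / len(gb)
        values.append(math.inf if mse == 0
                      else 10 * math.log10(max(gb) ** 2 / mse))
    return sum(values) / len(values)


def sam_degrees(fused, truth):
    """Mean angle between pixel spectra, in degrees (lower is better)."""
    angles = []
    for j in range(len(truth[0])):
        f = [band[j] for band in fused]
        g = [band[j] for band in truth]
        dot = sum(a * b for a, b in zip(f, g))
        norms = math.sqrt(sum(a * a for a in f)) * math.sqrt(sum(b * b for b in g))
        cos = max(-1.0, min(1.0, dot / norms))  # guard against rounding
        angles.append(math.degrees(math.acos(cos)))
    return sum(angles) / len(angles)


def ergas(fused, truth, ratio):
    """Relative dimensionless global error in synthesis (lower is better)."""
    terms = []
    for fb, gb in zip(fused, truth):
        band_rmse = math.sqrt(sum((f - g) ** 2 for f, g in zip(fb, gb)) / len(gb))
        terms.append((band_rmse / (sum(gb) / len(gb))) ** 2)
    return 100.0 / ratio * math.sqrt(sum(terms) / len(terms))


truth = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]    # 2 bands, 4 pixels
offset = [[v + 1.0 for v in band] for band in truth]     # constant bias
scaled = [[1.1 * v for v in band] for band in truth]     # brightness change
print(rmse(offset, truth), sam_degrees(scaled, truth))
print(psnr(offset, truth), ergas(offset, truth, ratio=12))
```

The scaled cube illustrates why SAM is treated as a spectral index: a uniform brightness change leaves the angle between spectra at essentially zero even though the RMSE is non-zero.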
We measured the time cost of taking the entire hyperspectral image as the up-sampling input and of taking hyperspectral image patches as the up-sampling input. The time costs are shown in Table 7.
We set the size of the hyperspectral image patches to 20 × 20 and the size of the hyperspectral image to 300 × 300. We then segmented the hyperspectral image into a total of 255 image patches. As shown in Table 7, taking the entire hyperspectral image as the up-sampling input cost a little more time, at roughly 37 s. Taking one image patch as the up-sampling input cost roughly 0.11 s. The total time cost of taking all image patches as the up-sampling input is the time cost for one image patch multiplied by 255. Taking image patches as the up-sampling input therefore takes less time than taking the entire hyperspectral image as the up-sampling input.

We also ran a series of experiments to compare how up-sampling with the nearest, bilinear and bicubic modes affects the downstream task. The fused results of up-sampling with the nearest, bilinear and bicubic modes on the Baiyangdian, Chaohu and Dianchi datasets are shown in Figures 22-24. As shown in Figures 22-24, the fusion results with nearest up-sampling have a very fuzzy visual effect, the results with bilinear up-sampling are slightly clearer than those with nearest up-sampling, and the results with bicubic up-sampling have the best visual effect. As shown in Figure 22a, the fusion result with nearest up-sampling was fuzzy for the land covers of croplands, roads, buildings and shadows in the Baiyangdian region dataset. As shown in Figure 22b, the fusion result with bilinear up-sampling was fuzzy for the land covers of croplands, buildings and shadows but slightly clearer for roads in the Baiyangdian region dataset. As shown in Figure 22c, the fusion result with bicubic up-sampling was clear for all of the land covers in the Baiyangdian region dataset. As shown in Figure 23a, the fusion result with nearest up-sampling was fuzzy for the land covers of croplands, mountains, roads and water in the Chaohu region dataset. As shown in Figure 23b, the fusion result with bilinear up-sampling was fuzzy for the land covers of croplands, mountains and roads but slightly clearer for water in the Chaohu region dataset. As shown in Figure 23c, the fusion result with bicubic up-sampling was clear for all land covers in the Chaohu region dataset. As shown in Figure 24a, the fusion result with nearest up-sampling was fuzzy for the land covers of rivers, water and mountains in the Dianchi region dataset. As shown in Figure 24b, the fusion result with bilinear up-sampling was fuzzy for the land cover of mountains but slightly clearer for jungles, rivers and water in the Dianchi region dataset. As shown in Figure 24c, the fusion result with bicubic up-sampling was clear for all of the land covers in the Dianchi region dataset. It can be concluded that up-sampling with the nearest and bilinear modes affected the downstream task through a comparatively fuzzy up-sampling result. Up-sampling with the bicubic mode reached the sweet spot for the up-sampling step. If the image was up-sampled beyond this point, the performance of the downstream task began to diminish.

Conclusions
In this research, we established a deep network fusion model for fusing hyperspectral and panchromatic images. The method first matches the panchromatic images and up-sampled hyperspectral images to combine the original spectral and spatial information. Second, the method establishes an encoder-decoder network structure to extract representative deep features from the combined spectral-spatial input. Finally, the method establishes residual networks between each encoder layer and its corresponding decoder layer, one to one. The latter two operations adjust the deep features learned by the network to make them more representative. We compared the proposed method with the AWLP, CNMF, GIHS, MTF_GLP, HPF, SFIM and DIP-HyperKite methods. The experimental results suggest that the proposed method can achieve competitive spatial quality compared with the existing methods and recover most of the spectral information that the corresponding sensor would observe at the highest spatial resolution. In our model, we treated the input as the concatenation of the panchromatic and up-sampled hyperspectral images. This restricts the deep features to those extracted from data in which the spatial and spectral features are already combined. In the future, we will focus on deep features extracted separately from the panchromatic and hyperspectral images to gain deeper insight into effective spatial and spectral deep features.

Figure 1. Schematic diagram of the encoder-decoder network.

Figure 2. Schematic diagram of residual enhancement for the encoder network and the decoder network.

Figure 9. Qualitative details of the proposed EDRN model.

Figure 16. Quality of the compared and proposed fusion methods on the Baiyangdian dataset: (a) SCC; (b) spectral curve comparison.

Figure 17. Quality of the compared and proposed fusion methods on the Chaohu dataset: (a) SCC; (b) spectral curve comparison.

Figure 18. Quality of the compared and proposed fusion methods for the Dianchi dataset: (a) SCC; (b) spectral curve comparison.

Figure 19. Quality of the compared and proposed fusion methods for the Pavia Center dataset: (a) SCC; (b) spectral curve comparison.

Table 1. Quality of the compared and proposed fusion methods on the Baiyangdian dataset for RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric.

Table 2. Quality of the compared and proposed fusion methods on the Chaohu dataset for RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric.

Table 3. Quality of the compared and proposed fusion methods for the Dianchi dataset for RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric.

Table 4. Quality of the compared and proposed fusion methods for the Pavia Center dataset for RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric.

Table 5. Quality of the compared and proposed fusion methods for the Botswana dataset for RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric.

Table 6. Quality of the compared and proposed fusion methods for the Chikusei dataset for RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric.

Table 7. Time cost (seconds) of taking different inputs for up-sampling.