1. Introduction
Hyperspectral remote sensing is a multi-dimensional information acquisition technology that combines imaging and spectral technology, simultaneously obtaining two-dimensional spatial and one-dimensional spectral information of targets. Each pixel of a hyperspectral image (HSI) has its own spectrum with high spectral resolution, which reflects the physical nature of the captured object. Therefore, hyperspectral imagers have been developed for environment classification [1,2,3,4], target detection [5,6,7,8], feature extraction and dimensionality reduction [9,10,11,12], spectral unmixing [13,14,15], and so on. However, in a hyperspectral imaging system there is a trade-off between spatial and spectral resolution due to limited sensor size and imaging performance. The spatial resolution of HSIs is therefore lower than that of panchromatic images or multispectral images (MSIs), and this low spatial resolution severely limits the performance of HSIs in applications. To enhance the spatial resolution of an HSI, fusion-based methods have been proposed that merge the HSI with a relatively high-resolution (HR) MSI. Existing fusion methods can be categorized into three types: extensions of pan-sharpening methods [16,17,18,19], Bayesian approaches [20,21,22,23], and spectral-unmixing-based methods [24,25,26,27,28,29,30,31,32,33,34,35].
In the first category, pan-sharpening image fusion algorithms are extended to fuse a low-resolution (LR) HSI and an HR-MSI. For example, Gomez et al. [16] first extended a wavelet-based pan-sharpening algorithm to fuse HSI with MSI. Zhang et al. [17] introduced a 3D wavelet transform for HSI-MSI fusion. Chen et al. [18] divided the HSI into several regions and fused the HSI and MSI in each region using a pan-sharpening method. Aiazzi et al. [19] proposed a component substitution fusion method that takes the spectral response function (SRF) as part of the model.
In the second category, Eismann et al. [20] proposed a Bayesian fusion method based on a stochastic mixing model of the underlying spectral content to achieve resolution enhancement. Wei et al. [21] proposed a variational fusion method that incorporates a sparse regularization using trained dictionaries and optimizes the problem through the split augmented Lagrangian shrinkage algorithm. Simões et al. [22] formulated the fusion problem as the minimization of a convex objective containing two quadratic terms and an edge-preserving term. Akhtar et al. [23] proposed a nonparametric Bayesian sparse coding strategy, which first infers the probability distributions of the material spectra and then computes the sparse codes of the high-resolution image.
Methods in the third category usually assume that the HSI is composed of a series of pure spectra (called endmembers) with corresponding proportion (abundance) maps. Matrix decomposition [24,25,26] and tensor factorization algorithms [27] have therefore been used to decompose both the LR-HSI and the HR-MSI into endmembers and abundance maps to generate the HR-HSI. For example, Kawakami et al. [24] introduced a matrix factorization algorithm to estimate the endmember basis of the HSI and fuse it with an RGB image. In Refs. [25,26], coupled non-negative matrix factorization (CNMF) was used to estimate endmembers and abundances for HSI-MSI fusion. Dian et al. [27] proposed a non-local sparse tensor decomposition approach that casts the fusion problem as the estimation of dictionaries in three modes and the corresponding core tensors.
In recent years, deep learning methods have been successfully applied in the field of computer vision. Since deep learning methods have a great ability to extract embedded features and represent complex nonlinear mappings, they have been widely used for various remote sensing image processing tasks, including HSI super-resolution. Deep-learning-based HSI fusion methods can be divided into pan-sharpening [28] and HSI-MSI fusion [29,30,31,32,33,34,35]. For example, Dian et al. [28] proposed a deep HSI sharpening method that uses priors learnt via CNN-based residual learning. Recently, unified image fusion frameworks such as U2Fusion [36] and SwinFusion [37] have been proposed for various fusion problems, including multi-modal and multi-exposure tasks; these frameworks might be modified and utilized for pan-sharpening. Related work on HSI-MSI fusion is detailed in Section 2.
In this paper, a novel unsupervised multi-attention GAN is proposed to solve the HSI-MSI fusion problem with unknown spectral response function (SRF) and point spread function (PSF). Based on linear unmixing theory, two autoencoders and one constraint network are jointly coupled in the proposed generator network to reconstruct the HR-HSI. The model offers an end-to-end unsupervised learning strategy, driven by a joint loss function, to obtain the desired outcome. The main contributions of this study can be summarized as follows.
An unsupervised GAN, which contains one generator network and two discriminator networks, is developed for HSI-MSI fusion based on the degradation model and the spectral unmixing model. The experiments conducted on four data sets demonstrate that the proposed method outperforms state-of-the-art methods.
In the generator network, two streams of autoencoders are jointly connected through a degradation-generation (DG) block to perform spectral unmixing and image fusion. The endmembers in the DG block are held in the parameters of a single convolution layer shared by the two autoencoder networks. In addition, to increase the consistency of these networks, a learnt PSF layer acts as a bridge connecting the low- and high-resolution abundances.
Our encoder network adopts an attention module called the coordinate multi-attention net (CMAN) to extract deeper features from the input data, which consists of a pyramid coordinate channel attention module and a non-local spatial attention module. The channel attention module is factorized into two parallel feature encoding strings to alleviate the loss of positional information among spectral channels.
This article is organized as follows. Section 2 briefly reviews deep-learning-based HSI-MSI fusion methods and some attention modules. Section 3 describes the degradation relationships among the HR-HSI, LR-HSI, and HR-MSI based on the linear spectral mixing model. Section 4 details the proposed generative adversarial network (GAN) framework, including the network architectures of the generator and discriminators, the structure of the attention module, and the loss functions. Section 5 presents the ablation and comparison experiments. Finally, conclusions are drawn in Section 6.
3. Problem Formulation
The HSI–MSI fusion problem is to estimate the HR-HSI datacube $\mathcal{Z} \in \mathbb{R}^{M \times N \times L}$, which has both high spectral and high spatial resolution, where $M$ and $N$ are the spatial dimensions and $L$ is the number of spectral bands. Similarly, the LR-HSI is denoted as $\mathcal{X} \in \mathbb{R}^{m \times n \times L}$, where $m$ and $n$ are the width and height of $\mathcal{X}$. The MSI datacube with high spatial resolution is denoted as $\mathcal{Y} \in \mathbb{R}^{M \times N \times l}$, where $l$ is the number of spectral bands in $\mathcal{Y}$, and $l = 3$ when an RGB image is employed as the MSI data. To simplify the mathematical derivation, we unfold these 3-D datacubes into the 2-D matrices $\mathbf{Z} \in \mathbb{R}^{L \times MN}$, $\mathbf{X} \in \mathbb{R}^{L \times mn}$, and $\mathbf{Y} \in \mathbb{R}^{l \times MN}$, respectively.
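To make the notation concrete, the following NumPy sketch (with hypothetical helper names) unfolds a datacube into its matrix form and folds it back:

```python
import numpy as np

# Hypothetical helpers illustrating the unfolding used in this section:
# a datacube of shape (M, N, L) becomes a matrix of shape (L, M*N).
def unfold(cube):
    M, N, L = cube.shape
    return cube.reshape(M * N, L).T           # (L, M*N)

def fold(mat, M, N):
    return mat.T.reshape(M, N, mat.shape[0])  # back to (M, N, L)
```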
The relationships among $\mathbf{Z}$, $\mathbf{X}$, and $\mathbf{Y}$ are illustrated in Figure 1. According to the linear mixing model (LMM), each pixel of the HSI is assumed to be a linear combination of a set of pure spectral bases called endmembers, and the coefficient of each endmember is called its abundance. The HR-HSI $\mathbf{Z}$ can be described as

$$\mathbf{Z} = \mathbf{E}\mathbf{A}, \tag{1}$$

where $p$ is the number of endmembers, the abundance matrix $\mathbf{A} \in \mathbb{R}^{p \times MN}$ consists of columns representing the mixing coefficients $a_{ij}$ of the $i$th endmember at the $j$th pixel, and the endmember matrix $\mathbf{E} \in \mathbb{R}^{L \times p}$ is made up of $p$ endmembers with $L$ spectral bands.
The LR-HSI $\mathbf{X}$ can also be expressed as a linear combination of the same endmembers $\mathbf{E}$ of $\mathbf{Z}$:

$$\mathbf{X} = \mathbf{E}\mathbf{A}_l, \tag{2}$$

where the matrix $\mathbf{A}_l \in \mathbb{R}^{p \times mn}$ consists of the abundance coefficients at low spatial resolution.
Similarly, the HR-MSI data $\mathbf{Y}$ is given by

$$\mathbf{Y} = \mathbf{E}_m\mathbf{A}, \tag{3}$$

where the matrix $\mathbf{E}_m \in \mathbb{R}^{l \times p}$ is made up of $p$ endmembers with $l$ spectral bands.
The abundance coefficients should satisfy the sum-to-one and nonnegativity constraints given by the respective equations

$$\sum_{i=1}^{p} a_{ij} = 1, \quad \forall j, \tag{4}$$

$$a_{ij} \ge 0, \quad \forall i, j. \tag{5}$$
The spectral bases of the endmembers should also satisfy the nonnegativity property, given by

$$e_{ij} \ge 0, \tag{6}$$

where $e_{ij}$ is the element representing the $i$th band of the $j$th endmember.
The LR-HSI $\mathbf{X}$ can be considered as a spatially degraded version of the HR-HSI $\mathbf{Z}$:

$$\mathbf{X} = \mathbf{Z}\mathbf{S}, \tag{7}$$

where the matrix $\mathbf{S} \in \mathbb{R}^{MN \times mn}$ is the degradation matrix representing the spatial blurring and downsampling operation on $\mathbf{Z}$. Meanwhile, the HR-MSI $\mathbf{Y}$ can be regarded as a spectrally degraded version of $\mathbf{Z}$:

$$\mathbf{Y} = \mathbf{R}\mathbf{Z}, \tag{8}$$

where the spectral degradation matrix $\mathbf{R} \in \mathbb{R}^{l \times L}$ is determined by the SRF, which describes the spectral degradation mapping from HSI to MSI. Comparing Equations (1) and (7), it is obvious that the LR-HSI $\mathbf{X}$ preserves the fine spectral information, which is highly consistent with the target spectral endmember matrix $\mathbf{E}$. Meanwhile, Equations (1) and (8) illustrate that the HR-MSI provides detailed spatial contextual information, which is highly correlated with the high-spatial-resolution abundance matrix $\mathbf{A}$. The key to the HSI–MSI fusion problem is to estimate $\mathbf{E}$ and $\mathbf{A}$ from $\mathbf{X}$ and $\mathbf{Y}$, respectively, and then reconstruct $\mathbf{Z} = \mathbf{E}\mathbf{A}$.
Furthermore, the ideal LR-MSI $\mathbf{Y}_l \in \mathbb{R}^{l \times mn}$ can be expressed either as a spectrally degraded version of $\mathbf{X}$ or as a spatially degraded version of $\mathbf{Y}$:

$$\mathbf{Y}_l = \mathbf{R}\mathbf{X} = \mathbf{Y}\mathbf{S}. \tag{9}$$

This relation is added to the model as a consistency constraint of the network.
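The degradation relations in Equations (7)-(9) can be checked numerically; the following sketch uses random stand-ins for all matrices:

```python
import numpy as np

# Minimal sketch of the degradation model in Equations (7)-(9), with
# random stand-ins for Z (HR-HSI), R (SRF matrix), and S (spatial
# blur + downsampling matrix). All shapes here are illustrative.
L, l, M, N, m, n = 100, 3, 64, 64, 16, 16
Z = np.random.rand(L, M * N)      # HR-HSI, unfolded
R = np.random.rand(l, L)          # spectral degradation (SRF)
S = np.random.rand(M * N, m * n)  # spatial degradation (PSF + downsample)

X = Z @ S                         # Eq. (7): LR-HSI
Y = R @ Z                         # Eq. (8): HR-MSI
# Eq. (9): both degradation paths yield the same ideal LR-MSI
assert np.allclose(R @ X, Y @ S)
```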
4. Proposed Method
In this paper, we propose a GAN that consists of one generator network (G-Net) and two discriminator networks (D-Net1 and D-Net2), based on the models described in Section 3. The whole architecture of the adversarial training is shown in Figure 2. The LR-HSI $\mathcal{X}$ and HR-MSI $\mathcal{Y}$ are fed into and processed by separate network streams as 3-D data without unfolding.
The generator network employs two streams of autoencoder networks to perform spectral unmixing and data reconstruction. The discriminator networks extract multi-dimensional features from the inputs and outputs of the generator to obtain the corresponding authenticity probabilities. A joint loss function incorporating multiple constraints across the entire network is also presented.
4.1. Generator Network
As shown in Figure 3, the G-net is composed of two main autoencoder networks (AENet1 and AENet2), which are correlated with each other by sharing endmembers. The desired HR-HSI $\hat{\mathbf{Z}}$ is embedded in one layer of the decoder of AENet2 as a hidden variable.
AENet1 is designed to learn the LR-HSI identity mapping $\mathbf{X} \rightarrow \hat{\mathbf{X}}$. The endmembers $\mathbf{E}$ and abundances $\mathbf{A}_l$ are extracted from the input LR-HSI $\mathbf{X}$ by AENet1. The encoder module is designed to learn a nonlinear mapping $f_{e1}(\cdot)$ that transforms the input $\mathbf{X}$ to its abundances $\mathbf{A}_l$:

$$\mathbf{A}_l = f_{e1}(\mathbf{X}). \tag{10}$$
The overall structure of the encoder is shown in Figure 3. It consists of a 3 × 3 convolution layer followed by a ReLU layer, three cascaded residual blocks (ResBlocks) and CMAN blocks, and a 1 × 1 convolution layer. The detailed description of CMAN is given in Section 4.3.
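A minimal PyTorch sketch of this encoder layout is given below; the ResBlock internals, layer widths, and the sigmoid output activation are our assumptions, and the CMAN module (described in Section 4.3) is left as a stub:

```python
import torch
import torch.nn as nn

# Sketch of the encoder: 3x3 conv + ReLU, three cascaded ResBlock + CMAN
# stages, and a 1x1 conv mapping features to p abundance maps.
class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class Encoder(nn.Module):
    def __init__(self, in_bands, p, ch=64, cman=nn.Identity):
        super().__init__()
        stages = []
        for _ in range(3):
            stages += [ResBlock(ch), cman()]  # CMAN stub, see Sec. 4.3
        self.net = nn.Sequential(
            nn.Conv2d(in_bands, ch, 3, padding=1), nn.ReLU(inplace=True),
            *stages,
            nn.Conv2d(ch, p, 1),
            nn.Sigmoid())  # assumption: keeps abundances in (0, 1);
                           # sum-to-one and sparsity are encouraged by
                           # loss terms (Sec. 4.4)
    def forward(self, x):
        return self.net(x)

A_l = Encoder(in_bands=100, p=12)(torch.rand(1, 100, 16, 16))
```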
The decoder $f_{d1}(\cdot)$ reconstructs the data $\hat{\mathbf{X}}$ from $\mathbf{A}_l$, and its function is noted as

$$\hat{\mathbf{X}} = f_{d1}(\mathbf{A}_l) = \mathbf{E}\mathbf{A}_l. \tag{11}$$
Meanwhile, AENet2 is designed to learn the HR-MSI identity mapping $\mathbf{Y} \rightarrow \hat{\mathbf{Y}}$. The encoder structure of AENet2 is the same as that of AENet1; it transforms $\mathbf{Y}$ to the HR abundance matrix $\mathbf{A}$:

$$\mathbf{A} = f_{e2}(\mathbf{Y}). \tag{12}$$
The decoder $f_{d2}(\cdot)$ of AENet2 is different from that of AENet1, and its function is given as

$$\hat{\mathbf{Y}} = f_{d2}(\mathbf{A}). \tag{13}$$

The decoder $f_{d2}(\cdot)$ consists of two parts: a convolution layer that contains the parameters of the endmember matrix $\mathbf{E}$ shared with AENet1, and a spectral degradation module that adaptively learns the spectral response function $\mathbf{R}$. The decoder generates the desired HR-HSI $\hat{\mathbf{Z}} = \mathbf{E}\mathbf{A}$, while $\mathbf{R}$ transforms $\hat{\mathbf{Z}}$ to the HR-MSI $\hat{\mathbf{Y}}$:

$$\hat{\mathbf{Y}} = \mathbf{R}\hat{\mathbf{Z}} = \mathbf{R}\mathbf{E}\mathbf{A}. \tag{14}$$
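Storing the endmember matrix as the weight of a single 1 × 1 convolution is what allows the two decoders to share it; a minimal sketch of this idea (shapes are illustrative):

```python
import torch
import torch.nn as nn

# The endmember matrix E lives in the weight of one 1x1 convolution, so
# the same parameters are reused by both streams (Eqs. (11) and (14)).
p, L_bands = 12, 100
endmember_layer = nn.Conv2d(p, L_bands, kernel_size=1, bias=False)

abundance_hr = torch.rand(1, p, 64, 64)  # A from the AENet2 encoder
abundance_lr = torch.rand(1, p, 16, 16)  # A_l from the AENet1 encoder

Z_hat = endmember_layer(abundance_hr)    # HR-HSI, Z_hat = E @ A
X_hat = endmember_layer(abundance_lr)    # LR-HSI, X_hat = E @ A_l
```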
The function $\mathbf{R}(\cdot)$ represents the spectral downsampling from HSI to MSI, and it can be defined as

$$y_i = \frac{\int_{\lambda_{i,1}}^{\lambda_{i,2}} r_i(\lambda)\, z(\lambda)\, d\lambda}{\int_{\lambda_{i,1}}^{\lambda_{i,2}} r_i(\lambda)\, d\lambda}, \tag{15}$$

where $y_i$ is the spectral radiance of the $i$th band of the MSI data, $[\lambda_{i,1}, \lambda_{i,2}]$ is the wavelength range of the $i$th band, $r_i(\lambda)$ is the spectral response of the MSI sensor, and $z(\lambda)$ is the spectral radiance of the HSI data. To implement the SRF in the neural network, a convolution layer and a normalization layer are employed to adaptively learn the numerator and denominator of Equation (15), respectively.
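One possible reading of this learnable SRF layer is a 1 × 1 spectral convolution whose rows are normalized to sum to one; the sketch below reflects that assumption, not necessarily the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a learnable SRF layer approximating Eq. (15): each MSI band
# is a normalized nonnegative combination of HSI bands.
class LearnableSRF(nn.Module):
    def __init__(self, hsi_bands=100, msi_bands=3):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(msi_bands, hsi_bands))
    def forward(self, z):                            # z: (B, L, H, W)
        w = F.relu(self.weight)                      # nonnegative response
        w = w / (w.sum(dim=1, keepdim=True) + 1e-8)  # normalize rows
        return torch.einsum('ml,blhw->bmhw', w, z)   # spectral mixing

y_hat = LearnableSRF()(torch.rand(2, 100, 64, 64))   # -> (2, 3, 64, 64)
```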
Furthermore, as shown in Figure 3, AENet1 and AENet2 are not only connected by sharing the endmembers $\mathbf{E}$, but also through a DG block. As given by the hyperspectral linear unmixing model in Equations (1) and (2), $\mathbf{Z}$ and $\mathbf{X}$ are composed of the same endmember matrix $\mathbf{E}$. Meanwhile, a low-resolution abundance $\mathbf{A}'_l$ can be generated by applying a convolution layer that performs the spatial degradation $\mathbf{S}$, so that $\mathbf{A}'_l = \mathbf{A}\mathbf{S}$. Therefore, in the DG block, we can acquire another LR-HSI $\hat{\mathbf{X}}'$ from $\mathbf{A}$ and $\mathbf{E}$ by using the same decoding function as AENet1:

$$\hat{\mathbf{X}}' = \mathbf{E}\mathbf{A}\mathbf{S}. \tag{16}$$

The generated $\hat{\mathbf{X}}'$ is another approximation of the input LR-HSI $\mathbf{X}$.
In addition, the spectral degradation module is shared to generate an LR-MSI as $\mathbf{R}\hat{\mathbf{X}}$. Meanwhile, the spatial degradation module is shared to acquire another version of the LR-MSI as $\hat{\mathbf{Y}}\mathbf{S}$. According to Equation (9), they should be approximately the same; therefore, the LR-MSI consistency constraint is formed as

$$\mathbf{R}\hat{\mathbf{X}} \approx \hat{\mathbf{Y}}\mathbf{S}. \tag{17}$$
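The learnt PSF of the DG block can be sketched as a strided convolution applied to the abundance maps; the kernel size and scale factor below are assumptions, and a faithful PSF would share one kernel across all channels rather than use per-channel kernels as this simplification does:

```python
import torch
import torch.nn as nn

# Sketch of a learnable PSF layer standing in for S: a depthwise strided
# convolution performing blur + downsampling on each abundance channel.
class LearnablePSF(nn.Module):
    def __init__(self, channels, scale=4, ksize=8):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, ksize, stride=scale,
                              padding=(ksize - scale) // 2,
                              groups=channels, bias=False)
    def forward(self, a):            # a: (B, p, H, W) HR abundances
        return self.conv(a)          # -> (B, p, H/scale, W/scale)

A = torch.rand(1, 12, 64, 64)
A_lr = LearnablePSF(channels=12)(A)  # low-resolution abundances A @ S
```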
4.2. Discriminator Network
For the autoencoder networks, $\ell_1$ and $\ell_2$ norms are usually used to define loss functions; both evaluate the similarity of data at the pixel level. However, such a pixel-level evaluation standard cannot take advantage of the semantic information and spatial features of images. Therefore, D-nets are adopted to further strengthen the semantic and spatial feature similarity of the data.
As shown in Figure 4, two classification D-nets are employed to distinguish the authenticity of the LR-HSI datacubes and the HR-MSI pairs, respectively. Each D-net is composed of three cascaded convolution layers, normalization layers, and ReLU layers. The D-nets are expected to correctly classify the input and output data of the G-net, while the G-net is expected to generate output data that deceive the D-nets. According to the definition of the objective function of a GAN, the loss functions of the two D-nets are defined as

$$\mathcal{L}_{D1} = -\mathbb{E}\left[\log D_1(\mathbf{X})\right] - \mathbb{E}\left[\log\left(1 - D_1(G_1(\mathbf{X}))\right)\right], \tag{18}$$

$$\mathcal{L}_{D2} = -\mathbb{E}\left[\log D_2(\mathbf{Y})\right] - \mathbb{E}\left[\log\left(1 - D_2(G_2(\mathbf{Y}))\right)\right], \tag{19}$$

where $G_1(\cdot)$ represents the operation of AENet1, $G_2(\cdot)$ that of AENet2, and $D(\cdot)$ the operation of the discriminator. In order to stabilize the training process, the negative log likelihood (NLL) loss in the above formulas is replaced by the mean square error (MSE); therefore, the loss functions in this research are given as

$$\mathcal{L}_{D1} = \mathbb{E}\left[\left(D_1(\mathbf{X}) - 1\right)^2\right] + \mathbb{E}\left[D_1(G_1(\mathbf{X}))^2\right], \tag{20}$$

$$\mathcal{L}_{D2} = \mathbb{E}\left[\left(D_2(\mathbf{Y}) - 1\right)^2\right] + \mathbb{E}\left[D_2(G_2(\mathbf{Y}))^2\right]. \tag{21}$$
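The least-squares replacement of the NLL terms can be sketched as follows (tensor names are ours):

```python
import torch
import torch.nn as nn

# Sketch of the MSE adversarial losses replacing the NLL terms above.
# d_real/d_fake are discriminator outputs on real data and on G-net
# reconstructions, respectively.
mse = nn.MSELoss()

def d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # discriminator: push real outputs toward 1, fake outputs toward 0
    return mse(d_real, torch.ones_like(d_real)) + \
           mse(d_fake, torch.zeros_like(d_fake))

def g_adv_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # generator: make the discriminator score reconstructions as real
    return mse(d_fake, torch.ones_like(d_fake))
```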
4.3. Coordinate Multi-Attention Net (CMAN)
Recently, various attention modules have been proposed to capture the channel and spatial information of high-dimensional data, such as CBAM [36], DANet [37], and EPSANet [38]. As shown in Figure 5, we propose a multi-attention module called CMAN, which consists of a pyramid coordinate channel attention (CCA) module and a global spatial attention (GSA) module. It computes attention maps along the spectral channel and global spatial dimensions, and then multiplies these attention maps with the input for adaptive feature refinement, obtaining deep spatial and spectral features of the input data.
4.3.1. Coordinate Channel Attention Module
In this research, we propose the CCA mechanism to acquire spectral channel weights embedded with positional information. A pyramid structure is adopted to extract feature information at different scales and increase the pixel-level receptive field. To alleviate positional information loss, we factorize channel attention into two parallel feature encoding strings, which perform average pooling and standard deviation pooling along the H (horizontal) coordinate and the V (vertical) coordinate separately. The CCA module can thus effectively integrate spatial coordinate information into the generated attention maps. Given an arbitrary input $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ are the spatial dimensions and $C$ is the channel dimension, the conventional average pooling and standard deviation pooling for channel $c$ can be formulated as

$$z_{avg}^{c} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i, j), \qquad z_{std}^{c} = \sqrt{\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(x_c(i, j) - z_{avg}^{c}\right)^2}. \tag{22}$$

In the proposed attention module, we use two spatial extents of pooling kernels to encode each channel along the horizontal and vertical coordinates, respectively. The average pooling and standard deviation pooling at a fixed horizontal position $h$ can be formulated as

$$z_{avg}^{c}(h) = \frac{1}{W}\sum_{j=1}^{W} x_c(h, j), \qquad z_{std}^{c}(h) = \sqrt{\frac{1}{W}\sum_{j=1}^{W}\left(x_c(h, j) - z_{avg}^{c}(h)\right)^2}. \tag{23}$$

Similarly, the average pooling and standard deviation pooling at a given vertical position $w$ can be written as

$$z_{avg}^{c}(w) = \frac{1}{H}\sum_{i=1}^{H} x_c(i, w), \qquad z_{std}^{c}(w) = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(x_c(i, w) - z_{avg}^{c}(w)\right)^2}. \tag{24}$$
The two strings can capture long-range dependencies along one spatial direction and preserve precise positional information along the other spatial direction. This allows the module to aggregate features along the two spatial directions, respectively, and generate a pair of direction-aware feature maps.
Given the aggregated feature maps, we concatenate them and then send them to a shared convolutional transformation function $F$:

$$\mathbf{f} = \delta\left(F\left(\left[\mathbf{z}^{h}, \mathbf{z}^{w}\right]\right)\right), \tag{25}$$

where $[\cdot,\cdot]$ denotes the concatenation operation along the spatial dimension and $\delta$ is a non-linear activation function. Then, $\mathbf{f}$ is divided into two distinct tensors $\mathbf{f}^{h}$ and $\mathbf{f}^{w}$ along the spatial dimension. Another two convolutional transformations $F_h$ and $F_w$ are utilized to separately transform $\mathbf{f}^{h}$ and $\mathbf{f}^{w}$ into tensors with the same channel number as the input $\mathbf{x}$:

$$\mathbf{g}^{h} = \sigma\left(F_h(\mathbf{f}^{h})\right), \qquad \mathbf{g}^{w} = \sigma\left(F_w(\mathbf{f}^{w})\right), \tag{26}$$

where $\sigma$ is the sigmoid function. Then, the output for each channel can be written as

$$y_c(i, j) = x_c(i, j) \times g_c^{h}(i) \times g_c^{w}(j). \tag{27}$$
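A compact sketch of this coordinate channel attention, omitting the pyramid structure and with an assumed reduction ratio, could look like:

```python
import torch
import torch.nn as nn

# Sketch of coordinate channel attention: direction-wise average and std
# pooling, a shared transform, a split, and per-direction sigmoid gates.
class CoordChannelAttention(nn.Module):
    def __init__(self, ch, reduction=8):
        super().__init__()
        mid = max(ch // reduction, 4)
        self.shared = nn.Sequential(
            nn.Conv2d(2 * ch, mid, 1), nn.ReLU(inplace=True))
        self.to_h = nn.Conv2d(mid, ch, 1)
        self.to_w = nn.Conv2d(mid, ch, 1)
    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        # pool along W -> per-row stats; pool along H -> per-column stats
        h_feat = torch.cat([x.mean(3, keepdim=True),
                            x.std(3, keepdim=True)], dim=1)  # (B,2C,H,1)
        w_feat = torch.cat([x.mean(2, keepdim=True),
                            x.std(2, keepdim=True)], dim=1)  # (B,2C,1,W)
        # concatenate along the spatial dim, run the shared transform
        f = self.shared(torch.cat([h_feat, w_feat.transpose(2, 3)], dim=2))
        f_h, f_w = f.split([H, W], dim=2)
        g_h = torch.sigmoid(self.to_h(f_h))                  # (B,C,H,1)
        g_w = torch.sigmoid(self.to_w(f_w.transpose(2, 3)))  # (B,C,1,W)
        return x * g_h * g_w                                 # Eq. (27)
```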
4.3.2. Global Spatial Attention Module
We adopt a non-local attention module to model the global spatial context and capture the internal dependencies of features. The input feature $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$ is convolved to generate two new feature maps $\mathbf{B}$ and $\mathbf{C}$, where $\{\mathbf{B}, \mathbf{C}\} \in \mathbb{R}^{C' \times H \times W}$. Then we reshape $\mathbf{B}$ and $\mathbf{C}$ to $\mathbb{R}^{C' \times N}$, where $N = H \times W$ is the number of spatial pixels. The transpose of feature map $\mathbf{B}$ is multiplied with the feature map $\mathbf{C}$, and a softmax layer is applied to calculate the global spatial attention map $\mathbf{S} \in \mathbb{R}^{N \times N}$:

$$s_{ji} = \frac{\exp\left(\mathbf{B}_i \cdot \mathbf{C}_j\right)}{\sum_{i=1}^{N}\exp\left(\mathbf{B}_i \cdot \mathbf{C}_j\right)}, \tag{28}$$

where $\mathbf{B}_i$ is the $i$th column of $\mathbf{B}$ and $\mathbf{C}_j$ is the $j$th column of $\mathbf{C}$.
Meanwhile, we feed the feature $\mathbf{x}$ into a convolution layer to generate a third feature map $\mathbf{D} \in \mathbb{R}^{C \times H \times W}$ and reshape it to $\mathbb{R}^{C \times N}$; then we perform a matrix multiplication between $\mathbf{D}$ and the transpose of $\mathbf{S}$ and reshape the result to $\mathbb{R}^{C \times H \times W}$ to obtain the global spatial attention weights.
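A minimal sketch of this non-local spatial attention, following the DANet-style position attention it resembles (the $C/8$ reduction and zero-initialized residual scale are common conventions, not taken from the paper):

```python
import torch
import torch.nn as nn

# Sketch of global spatial attention: query/key/value 1x1 convolutions,
# an N x N attention map over spatial positions, and a scaled residual.
class GlobalSpatialAttention(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.query = nn.Conv2d(ch, ch // 8, 1)   # B
        self.key = nn.Conv2d(ch, ch // 8, 1)     # C
        self.value = nn.Conv2d(ch, ch, 1)        # D
        self.gamma = nn.Parameter(torch.zeros(1))
    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        n = h * w
        q = self.query(x).view(b, -1, n).transpose(1, 2)  # (B, N, C')
        k = self.key(x).view(b, -1, n)                    # (B, C', N)
        v = self.value(x).view(b, c, n)                   # (B, C, N)
        attn = torch.softmax(q @ k, dim=-1)               # (B, N, N)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x               # residual connection
```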
4.4. Joint Loss Function
We adopt the $\ell_1$ norm to construct the loss function of the G-net. The G-net loss includes sub-loss functions for four generation constraints: (1) the generation constraint of AENet1, (2) the generation constraint of the DG block, (3) the generation constraint of AENet2, and (4) the generation constraint of the LR-MSI. The corresponding loss function is given as

$$\mathcal{L}_{G} = \left\|\mathbf{X} - \hat{\mathbf{X}}\right\|_1 + \left\|\mathbf{X} - \hat{\mathbf{X}}'\right\|_1 + \left\|\mathbf{Y} - \hat{\mathbf{Y}}\right\|_1 + \left\|\mathbf{R}\hat{\mathbf{X}} - \hat{\mathbf{Y}}\mathbf{S}\right\|_1. \tag{29}$$

The sum-to-one constraint on the abundances is satisfied by the following loss function:

$$\mathcal{L}_{sum} = \left\|\mathbf{1} - \sum_{j=1}^{p}\mathbf{A}_{j}\right\|_1, \tag{30}$$

where $j$ indicates the $j$th endmember and $\mathbf{A}_{j}$ is the $j$th row of the abundance matrix $\mathbf{A}$.
Based on the spectral mixing model, each pixel of the HSI is composed of a small number of pure spectral bases; the abundance matrices should therefore be sparse. To guarantee the sparsity of the abundances, the Kullback-Leibler (KL) divergence is used to ensure that most of the elements in the abundance matrices are close to a small number:

$$\mathcal{L}_{sparse} = \sum_{i=1}^{p}\sum_{j=1}^{s}\left(\eta \log\frac{\eta}{a_{ij}} + (1 - \eta)\log\frac{1 - \eta}{1 - a_{ij}}\right), \tag{31}$$

where $s$ is the number of pixels, $p$ is the number of endmembers, $\eta$ is a sparsity parameter (0.001 in our network), and $a_{ij}$ is an element of the abundance matrix. This loss function constrains all the generated abundances mentioned above.
Ultimately, the fusion problem is solved by constructing a deep-learning GAN framework that optimizes the following objective function:

$$\min_{G}\max_{D_1, D_2}\; \mathcal{L}_{G} + \lambda_1\mathcal{L}_{sum} + \lambda_2\mathcal{L}_{sparse} + \lambda_3\left(\mathcal{L}_{D1} + \mathcal{L}_{D2}\right), \tag{32}$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are trade-off weights balancing the regularization and adversarial terms.
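Putting the generator-side terms together, a sketch of the joint loss might look like the following, where the trade-off weights are illustrative and only η = 0.001 is taken from the text:

```python
import torch

# Sketch of the joint generator loss: four L1 reconstruction terms plus
# sum-to-one and KL-sparsity regularizers over all generated abundances.
def l1(a, b):
    return (a - b).abs().mean()

def sum_to_one_loss(A):              # A: (B, p, H, W) abundance maps
    return (1.0 - A.sum(dim=1)).abs().mean()

def kl_sparsity_loss(A, eta=0.001, eps=1e-8):
    a = A.clamp(eps, 1 - eps)        # keep log arguments valid
    return (eta * torch.log(eta / a)
            + (1 - eta) * torch.log((1 - eta) / (1 - a))).sum()

def generator_loss(X, X_hat, X_hat_dg, Y, Y_hat, lrmsi_a, lrmsi_b,
                   abundances, lam_sum=0.1, lam_sparse=1e-4):
    recon = (l1(X, X_hat) + l1(X, X_hat_dg)
             + l1(Y, Y_hat) + l1(lrmsi_a, lrmsi_b))
    reg = (lam_sum * sum(sum_to_one_loss(A) for A in abundances)
           + lam_sparse * sum(kl_sparsity_loss(A) for A in abundances))
    return recon + reg
```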