An Unmixing-Based Multi-Attention GAN for Unsupervised Hyperspectral and Multispectral Image Fusion

Hyperspectral images (HSI) frequently have inadequate spatial resolution, which hinders many of their applications. High-resolution multispectral images (MSI) have been fused with HSI to reconstruct images with both high spatial and high spectral resolution. In this paper, we propose a generative adversarial network (GAN)-based unsupervised HSI-MSI fusion network. In the generator, two coupled autoencoder nets decompose the HSI and MSI into endmembers and abundances to fuse a high-resolution HSI through the linear mixing model. The two autoencoder nets are connected by a degradation-generation (DG) block, which further improves the reconstruction accuracy. Additionally, a coordinate multi-attention net (CMAN) is designed to extract more detailed features from the input. Driven by a joint loss function, the proposed method is straightforward and easy to execute in an end-to-end training manner. The experimental results demonstrate that the proposed strategy outperforms state-of-the-art methods.

In the first category, pan-sharpening image fusion algorithms are extended to fusing low-resolution (LR) HSI and HR-MSI. For example, Gomez et al. [16] first extended a wavelet-based pan-sharpening algorithm to fuse HSI with MSI. Zhang et al. [17] introduced a 3D wavelet transform for HSI-MSI fusion. Chen et al. [18] divided the HSI into several regions and fused the HSI and MSI in each region using a pan-sharpening method. Aiazzi et al. [19] proposed a component substitution fusion method, which took the spectral response function (SRF) as part of the model.
In the second category, Eismann et al. [20] proposed a Bayesian fusion method based on a stochastic mixing model of the underlying spectral content to achieve resolution enhancement. Wei et al. [21] proposed a variational fusion method that incorporates a sparse regularization using trained dictionaries and optimizes the problem through the split augmented Lagrangian shrinkage algorithm. Simões et al. [22] formulated the fusion problem as the minimization of a convex objective containing two quadratic terms and an edge-preserving term. Akhtar et al. [23] proposed a nonparametric Bayesian sparse coding strategy, which first infers the probability distributions of the material spectra and then computes the sparse codes of the high-resolution image.
Methods in the third category usually assume that the HSI is composed of a series of pure spectra (named endmembers) with corresponding proportion (named abundance) maps. Therefore, matrix decomposition [24][25][26] and tensor factorization algorithms [27] have been used to decompose both the LR-HSI and the HR-MSI into endmembers and abundance maps to generate the HR-HSI. For example, Kawakami et al. [24] introduced a matrix factorization algorithm to estimate the endmember basis of the HSI and fuse it with an RGB image. In Refs. [25,26], coupled non-negative matrix factorization (CNMF) was used to estimate endmembers and abundances for HSI-MSI fusion. Dian et al. [27] proposed a non-local sparse tensor decomposition approach that transforms the fusion problem into the estimation of dictionaries in three modes and the corresponding core tensors.
In recent years, deep learning methods have been successfully applied in the field of computer vision. Since deep learning methods have a great ability to extract embedded features and represent complex nonlinear mappings, they have been widely used for various remote sensing image processing tasks, including HSI super-resolution. Deep-learning-based HSI fusion methods can be divided into pan-sharpening [28] and HSI-MSI fusion [29][30][31][32][33][34][35]. For example, Dian et al. [28] proposed a deep HSI sharpening method which used priors learnt via CNN-based residual learning. Recently, some unified image fusion frameworks such as U2Fusion [36] and SwinFusion [37] have been proposed for various fusion problems, including multi-modal and multi-exposure tasks. These frameworks might be modified and utilized for pan-sharpening. Related work on HSI-MSI fusion is detailed in Section 2.
In this paper, a novel unsupervised multi-attention GAN is proposed to solve the HSI-MSI fusion problem with unknown spectral response function (SRF) and point spread function (PSF). Based on the linear unmixing theory, two autoencoders and one constraint network are jointly coupled in the proposed generator net to reconstruct the HR-HSI. The model offers an end-to-end unsupervised learning strategy, driven by a joint loss function, to obtain the desired outcome. The main contributions of this study can be summarized as follows.

1. An unsupervised GAN, which contains one generator network and two discriminator networks, is developed for HSI-MSI fusion based on the degradation model and the spectral unmixing model. Experiments conducted on four data sets demonstrate that the proposed method outperforms state-of-the-art methods.

2. In the generator net, two streams of autoencoders are jointly connected through a degradation-generation (DG) block to perform spectral unmixing and image fusion. The endmembers of the DG block are made up of the parameters of one convolution layer that are shared by the two autoencoder networks. Also, in order to increase the consistency of these networks, a learnt PSF layer acts as a bridge connecting the low- and high-resolution abundances.

3. Our encoder network adopts an attention module called the coordinate multi-attention net (CMAN) to extract deeper features from the input data, which consists of a pyramid coordinate channel attention module and a non-local spatial attention module. The channel attention module is factorized into two parallel feature encoding strings to alleviate the loss of positional information among spectral channels.
This article is organized as follows. Section 2 briefly reviews the deep-learning-based HSI-MSI fusion methods and some attention modules. Section 3 describes the degradation relationships between the HR-HSI, LR-HSI, and HR-MSI based on the linear spectral mixing model. Section 4 details the proposed generative adversarial network (GAN) framework, including the network architectures of the generator and discriminators, the structure of the attention module, and the loss functions. Section 5 includes the ablation experiments and comparison experiments. Finally, conclusions are drawn in Section 6.

Deep Learning (DL) HSI-MSI Fusion Methods
DL HSI-MSI fusion methods can be divided into two types: one based on degradation models [29][30][31][32] and another based on the spectral mixing model [33][34][35]. In the first category, fusion networks are constructed to reconstruct the desired HR-HSI by using observation models to depict the spatial degradation relationship between the HR-HSI and LR-HSI, as well as the spectral degradation relationship between the HR-HSI and HR-MSI. For example, Han et al. [29] presented a multi-scale spatial and spectral fusion network for HSI-RGB fusion. Yang et al. [30] proposed a fusion network to extract features from the LR-HSI and HR-MSI, and a spatial attention network to recover the high-frequency details. Xiao et al. [31] proposed a physics-based GAN, which used the degradation model to generate spatially and spectrally degraded images for the discriminators; the GAN used a multiscale residual channel attention fusion module and a residual spatial attention fusion module for fusion. Liu et al. [32] constructed an unsupervised multi-attention-guided network, which includes a multi-attention encoding network for extracting semantic features of the MSI and a multiscale feature guided network as a regularizer.
In the second category, the networks perform spectral unmixing on the LR-HSI and HR-MSI based on the linear mixing model to extract spectral bases and high-resolution spatial information for HR-HSI fusion. Qu et al. [33] presented an unsupervised encoder-decoder architecture which used a sparse Dirichlet constraint. Zheng et al. [34] proposed an unsupervised coupled network which consists of autoencoders to extract spectral information from the LR-HSI and spatial-contextual information from the HR-MSI. Yao et al. [35] proposed a coupled convolutional autoencoder network which embedded a cross-attention module to transfer spectral and spatial information between the two branches; a closed-loop spatial-spectral consistency regularization was employed in the network to reach a better local optimum.
Inspired by the above works, an unsupervised GAN is developed by incorporating the degradation models with the spectral mixing model, in order to associate the HR-HSI with both the LR-HSI and the HR-MSI. The proposed network is able to learn the spatial and spectral degradations across the LR-HSI and HR-MSI in an adaptive manner.

Attention Mechanisms
Recently, attention mechanisms have been deployed to boost the performance of various deep learning networks in computer vision tasks. Hu [38] designed the squeeze-and-excitation (SE) block to model interdependencies between channels, which brings a notable improvement in the performance of CNNs on classification tasks. Sanghyun [39] presented a convolutional block attention module (CBAM) which sequentially exploits the inter-channel and inter-spatial relationships of features, and demonstrated its performance in various applications, i.e., image classification, visualization, and object detection. Fu [40] proposed a dual attention network (DANet) for scene segmentation by introducing a position attention module and a channel attention module to capture global dependencies in the spatial and channel dimensions. Zhang [41] proposed an efficient pyramid squeeze attention network (EPSANet) to extract multi-scale spatial information and cross-dimension channel information, and verified its effectiveness on computer vision tasks such as image classification and object detection.
In this work, in order to more effectively extract spatial-spectral information from the HSI and MSI for the fusion task, a multi-attention module that consists of a pyramid channel attention and a global spatial attention is presented.

Problem Formulation
The HSI-MSI fusion problem is to estimate the HR-HSI datacube, which has both high spectral and high spatial resolution and is denoted as Y ∈ R^{M×N×L}, where M and N are the spatial dimensions and L is the number of spectral bands. Similarly, an LR-HSI is denoted as X_s ∈ R^{m×n×L}, where m and n are the width and height of X_s. An MSI datacube with high spatial resolution is denoted as X_m ∈ R^{M×N×l}, where l is the number of spectral bands in X_m, and l = 3 when an RGB image is employed as the MSI data. To simplify the mathematical derivation, we unfold these 3-D datacubes into 2-D matrices Y ∈ R^{MN×L}, X_s ∈ R^{mn×L}, and X_m ∈ R^{MN×l}. The relationships among X_s, X_m, and Y are illustrated in Figure 1. According to the linear mixing model (LMM), each pixel of the HSI is assumed to be a linear combination of a set of pure spectral bases called endmembers; the coefficient of each endmember is called its abundance. The HR-HSI Y can be described as

Y = AE, (1)

where p is the number of endmembers, the abundance matrix A ∈ R^{MN×p} contains the mixing coefficient a_ij of the j-th endmember at the i-th pixel, and the endmember matrix E ∈ R^{p×L} is made up of p endmembers with L spectral bands. The LR-HSI X_s can also be expressed as a linear combination of the same endmembers E of Y,

X_s = A_s E, (2)

where the matrix A_s ∈ R^{mn×p} consists of the low-spatial-resolution coefficients a^s_ij. Similarly, the HR-MSI data X_m is given by

X_m = A E_m, (3)

where the matrix E_m ∈ R^{p×l} is made up of p endmembers with l spectral bands.
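The linear mixing relations above can be sketched numerically. The shapes and the sum-to-one constraint follow the paper; the random data and toy sizes are illustrative only:

```python
import numpy as np

# Minimal sketch of the linear mixing model (LMM): each pixel is a
# combination of p endmember spectra. Shapes follow the paper:
# A in R^{MN x p} (abundances), E in R^{p x L} (endmembers), Y = A @ E.
rng = np.random.default_rng(0)
MN, p, L = 6, 3, 5                       # toy sizes: pixels, endmembers, bands

E = rng.random((p, L))                   # p endmember spectra over L bands
A = rng.random((MN, p))
A /= A.sum(axis=1, keepdims=True)        # enforce the sum-to-one constraint

Y = A @ E                                # unfolded HR-HSI, shape (MN, L)
assert np.allclose(A.sum(axis=1), 1.0)   # abundances sum to one per pixel
assert Y.shape == (MN, L)
```

The LR-HSI and HR-MSI follow the same pattern with A_s in place of A, or E_m in place of E.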
The abundance coefficients should satisfy the sum-to-one and nonnegativity constraints,

Σ_{j=1}^{p} a_ij = 1, a_ij ≥ 0, ∀i, j. (4), (5)

The spectral bases of the endmembers should also satisfy the nonnegativity property,

0 ≤ e_kj ≤ 1, ∀k, j, (6)

where e_kj is the element representing the k-th band of the j-th endmember.
The LR-HSI X_s can be considered as a spatially degraded version of the HR-HSI Y,

X_s = SY, (7)

where the matrix S ∈ R^{mn×MN} is the degradation matrix representing the spatial blurring and downsampling operations on Y. Meanwhile, the HR-MSI X_m can be written as a spectrally degraded version of Y,

X_m = YR, (8)

where the spectral degradation matrix R ∈ R^{L×l} is determined by the SRF, which describes the spectral degradation mapping from HSI to MSI. Comparing Equations (1) and (7), it is clear that the LR-HSI X_s preserves the fine spectral information, which is highly consistent with the target spectral endmember matrix E. Meanwhile, Equations (1) and (8) illustrate that the HR-MSI provides detailed spatial contextual information, which is highly correlated with the high-spatial-resolution abundance matrix A. The key point of the HSI-MSI fusion problem is to estimate E and A from X_s and X_m, respectively, for reconstructing Y. Furthermore, the ideal LR-MSI Z ∈ R^{mn×l} can be expressed either as a spectrally degraded version of X_s or as a spatially degraded version of X_m,

Z = X_s R = S X_m. (9)

This is added to the model as a consistency constraint of the network.
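The two degradation paths and the LR-MSI consistency of Equation (9) can be checked on toy matrices; S and R below are random stand-ins for the true PSF and SRF operators:

```python
import numpy as np

# Hedged sketch of the degradation model: X_s = S @ Y (spatial blur and
# downsampling via S in R^{mn x MN}) and X_m = Y @ R (band averaging via
# R in R^{L x l}). The LR-MSI Z can be reached along either path, which
# is exactly the consistency constraint: (S @ Y) @ R == S @ (Y @ R).
rng = np.random.default_rng(1)
MN, mn, L, l = 16, 4, 8, 3

Y = rng.random((MN, L))                                       # unfolded HR-HSI
S = rng.random((mn, MN)); S /= S.sum(axis=1, keepdims=True)   # spatial degradation
R = rng.random((L, l));   R /= R.sum(axis=0, keepdims=True)   # spectral degradation

X_s = S @ Y                               # LR-HSI
X_m = Y @ R                               # HR-MSI
Z1, Z2 = X_s @ R, S @ X_m                 # two routes to the LR-MSI
assert np.allclose(Z1, Z2)                # consistency holds for linear operators
```

For the real network, S and R are unknown and are learned adaptively, so the constraint is enforced only approximately.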

Proposed Method
In this paper, we propose a GAN that consists of one generator network (G-Net) and two discriminator networks (D-Net1 and D-Net2), based on the models described in Section 3. The whole architecture of the adversarial training is shown in Figure 2. The LR-HSI X_s and HR-MSI X_m are fed and processed in separate network streams as 3D data without unfolding.
The generator network employs two autoencoder streams to perform spectral unmixing and data reconstruction. The discriminator nets are employed to extract multi-dimensional features of the inputs and outputs of the generator network to obtain the corresponding authenticity probabilities. A joint loss function incorporating multiple constraints of the entire network is also presented.

Generator Network
As shown in Figure 3, the G-net is composed of two main autoencoder networks (AENet1 and AENet2), which are correlated with each other by sharing endmembers. The desired HR-HSI Y is embedded in one layer of the decoder of AENet2 as a hidden variable.
AENet1 is designed to learn the LR-HSI identity function G_1(X_s) = X̂_s^a. The endmembers E and abundances A_s are extracted from the input LR-HSI X_s by AENet1. The encoder module learns a nonlinear mapping f_en(·) which transforms the input X_s to its abundances Â_s,

Â_s = f_en(X_s). (10)

The overall structure of the encoder is shown in Figure 3. It consists of a 3 × 3 convolution layer followed by a ReLU layer, three cascaded residual blocks (ResBlock) and CMAN blocks, and a 1 × 1 convolution layer. A detailed description of CMAN is given in Section 4.3.
The decoder f_de(·) reconstructs the data X̂_s^a from Â_s,

X̂_s^a = f_de(Â_s). (11)

Meanwhile, AENet2 is designed to learn the HR-MSI identity function G_2(X_m) = X̂_m. The encoder structure of AENet2 is the same as that of AENet1; it transforms X_m to the HR abundance matrix A,

A = h_en(X_m). (12)

The decoder h_de(·) of AENet2 is different from that of AENet1, and its function is given as

X̂_m = h_de(A). (13)

The decoder h_de(·) consists of two parts: a convolution layer f_de(·), which contains the parameters of the endmember matrix E shared with AENet1, and a spectral degradation module which adaptively learns the spectral response function SRF(·). The decoder f_de(·) generates the desired HR-HSI Ŷ = f_de(A), while SRF(·) transforms Ŷ to the HR-MSI X̂_m,

X̂_m = SRF(f_de(A)) = SRF(Ŷ). (14)

The function SRF(·) represents the spectral downsampling from HSI to MSI, and it can be defined as

φ_i = ∫_{λ_i1}^{λ_i2} ρ(λ) ε(λ) dλ / ∫_{λ_i1}^{λ_i2} ρ(λ) dλ, (15)

where φ_i is the spectral radiance of the i-th band of the MSI data, [λ_i1, λ_i2] is the wavelength range of the i-th band, ρ is the spectral response of the MSI sensor, and ε is the spectral radiance of the HSI data. To implement the SRF in the neural network, a convolution layer and a normalization layer are employed to adaptively learn the numerator and denominator of Equation (15), respectively.
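A learnable SRF layer of this kind can be sketched in PyTorch as a 1 × 1 convolution whose nonnegative weights are normalized per output band, mirroring the numerator/denominator structure of Equation (15). The class name and details below are ours, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the paper's exact layer): the SRF is a
# normalized, nonnegative weighted average over HSI bands, learnable as a
# 1x1 convolution with per-band weight normalization.
class LearnableSRF(nn.Module):
    def __init__(self, hsi_bands: int, msi_bands: int):
        super().__init__()
        self.conv = nn.Conv2d(hsi_bands, msi_bands, kernel_size=1, bias=False)

    def forward(self, hsi: torch.Tensor) -> torch.Tensor:
        w = self.conv.weight.clamp(min=0)            # nonnegative response
        w = w / (w.sum(dim=1, keepdim=True) + 1e-8)  # normalize per MSI band
        return nn.functional.conv2d(hsi, w)

srf = LearnableSRF(hsi_bands=31, msi_bands=3)
msi = srf(torch.rand(1, 31, 16, 16))                 # HSI cube -> 3-band MSI
assert msi.shape == (1, 3, 16, 16)
```

Because the weights are normalized in the forward pass, each output band is a convex combination of HSI bands regardless of the raw parameter values.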
Furthermore, as shown in Figure 3, AENet1 and AENet2 are connected not only by sharing the endmembers E but also through a DG block. According to the hyperspectral linear unmixing model in Equations (1) and (2), Y and X_s are composed of the same endmember matrix E. Meanwhile, a low-resolution abundance A_s^b can be generated by applying a convolution layer that performs the spatial degradation d(·),

A_s^b = d(A). (16)

Therefore, in the DG block, we can acquire another LR-HSI X̂_s^b from E and A by using the same decoding function as AENet1,

X̂_s^b = f_de(A_s^b) = f_de(d(A)). (17)

The generated X̂_s^b is another approximation of the input LR-HSI X_s. In addition, the spectral degradation module is shared to generate the LR-MSI Z_1 = SRF(X_s), and the spatial degradation module is shared to acquire another version of the LR-MSI, Z_2 = d(X_m). According to Equation (9), they should be approximately the same, so the LR-MSI consistency constraint is formed as

Z_1 = SRF(X_s) ≈ d(X_m) = Z_2. (18)

Discriminator Network
For autoencoder nets, the ℓ2 and ℓ1 norms are usually used to define the loss functions, both of which evaluate the similarity of the data point by point. However, such a pixel-level evaluation standard cannot take advantage of the semantic information and spatial features of images. Therefore, D-nets are adopted to further strengthen the semantic and spatial feature similarity of the data.
As shown in Figure 4, two classification D-nets are employed to distinguish the authenticity of the LR-HSI pairs and the HR-MSI pairs, respectively. Each D-net is composed of three cascaded convolution layers, normalization layers, and ReLU layers. Both D-nets are expected to correctly classify the input and output data of the G-net, while the G-net is expected to generate output data that deceive the D-nets. Following the definition of the objective function of a GAN, the loss functions of the two D-nets are defined as

L_D1 = −log D_1(X_s) − log(1 − D_1(G_1(X_s))),
L_D2 = −log D_2(X_m) − log(1 − D_2(G_2(X_m))), (19)

where G_1(·) and G_2(·) represent the operations of AENet1 and AENet2, and D_1(·) and D_2(·) are the operations of the corresponding discriminators. In order to stabilize the training process, the negative log-likelihood (NLL) loss in the above formulas is replaced by the mean square error (MSE); therefore, the loss functions used in this research are

L_D1 = (D_1(X_s) − 1)² + D_1(G_1(X_s))²,
L_D2 = (D_2(X_m) − 1)² + D_2(G_2(X_m))². (20)
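This MSE substitution is the least-squares GAN form; a small sketch with illustrative score arrays (function names are ours):

```python
import numpy as np

# Sketch of replacing the NLL GAN loss with MSE (least-squares GAN): the
# discriminator pushes real scores toward 1 and fake scores toward 0,
# while the generator pushes fake scores toward 1.
def d_loss(real_scores, fake_scores):
    return np.mean((real_scores - 1.0) ** 2) + np.mean(fake_scores ** 2)

def g_loss(fake_scores):
    return np.mean((fake_scores - 1.0) ** 2)

real = np.array([0.9, 0.8])
fake = np.array([0.2, 0.1])
assert d_loss(real, fake) < d_loss(fake, real)   # D prefers real high, fake low
assert g_loss(np.array([1.0])) == 0.0            # G is satisfied once D is fooled
```

Unlike the log loss, these quadratic terms give non-saturating gradients even when the discriminator is confident, which is the stabilization effect the paper relies on.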

Coordinate Multi-Attention Net (CMAN)
Recently, various attention modules have been proposed to capture the channel and spatial information of high-dimensional data, such as CBAM [39], DANet [40], and EPSANet [41]. As shown in Figure 5, we propose a multi-attention module called CMAN, which consists of a pyramid coordinate channel attention (CCA) module and a global spatial attention (GSA) module. It extrapolates attention maps along the spectral channel and global spatial dimensions, and then multiplies the attention maps with the input for adaptive feature refinement to obtain deep spatial and spectral features of the input data.

Coordinate Channel Attention Module
In this research, we propose the CCA mechanism to acquire spectral channel weights embedded with positional information. A pyramid structure is adopted to extract feature information of different sizes and increase the pixel-level receptive field. In order to alleviate the loss of positional information, we factorize the channel attention into two parallel feature encoding strings, which perform average pooling and standard deviation pooling along the H (horizontal) coordinate and the V (vertical) coordinate separately. The CCA module can thus effectively integrate spatial coordinate information into the generated attention maps. Given an arbitrary input U ∈ R^{H×W×C}, where H and W are the spatial dimensions and C is the channel dimension, the conventional average pooling and standard deviation pooling over channel c can be formulated as

μ_c = (1/HW) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j),  σ_c = sqrt( (1/HW) Σ_{i=1}^{H} Σ_{j=1}^{W} (u_c(i, j) − μ_c)² ).

In the proposed attention module, we instead use two spatial extents of pooling kernels to encode each channel along the horizontal coordinate and the vertical coordinate, respectively. Thus, the average pooling and standard deviation pooling at a fixed horizontal position h can be formulated as

μ_c^h = (1/W) Σ_{j=1}^{W} u_c(h, j),  σ_c^h = sqrt( (1/W) Σ_{j=1}^{W} (u_c(h, j) − μ_c^h)² ).

Similarly, the average pooling and standard deviation pooling at a given vertical position w can be written as

μ_c^w = (1/H) Σ_{i=1}^{H} u_c(i, w),  σ_c^w = sqrt( (1/H) Σ_{i=1}^{H} (u_c(i, w) − μ_c^w)² ).

The two strings capture long-range dependencies along one spatial direction while preserving precise positional information along the other. This allows the module to aggregate features along the two spatial directions, respectively, and generate a pair of direction-aware feature maps.
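The directional pooling above can be sketched with plain array reductions; the sizes and data are toy values:

```python
import numpy as np

# Hedged sketch of the directional pooling in the CCA module: instead of
# a single globally pooled value per channel, features are pooled along
# the H and W axes separately, so positional information survives in the
# non-pooled direction.
rng = np.random.default_rng(2)
H, W, C = 4, 5, 3
U = rng.random((H, W, C))

avg_h = U.mean(axis=1)                 # pool over W -> one vector per row:    (H, C)
std_h = U.std(axis=1)
avg_w = U.mean(axis=0)                 # pool over H -> one vector per column: (W, C)
std_w = U.std(axis=0)

assert avg_h.shape == (H, C) and avg_w.shape == (W, C)
# Global average pooling is recovered by averaging either directional map.
assert np.allclose(avg_h.mean(axis=0), U.mean(axis=(0, 1)))
```

The four maps (avg/std along each direction) are the "pair of direction-aware feature maps" that the module then concatenates and transforms.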
Given the aggregated feature maps, we concatenate them and then send them to a shared convolutional transformation function F,

Γ = δ(F([z^h, z^w])),

where [·, ·] denotes the concatenation operation along the spatial dimension, z^h and z^w are the direction-aware feature maps, and δ is a non-linear activation function. Γ is then split into two tensors Γ^h and Γ^w along the spatial dimension. Two further convolutional transformations F_h(·) and F_w(·) are utilized to separately transform Γ^h and Γ^w into tensors with the same channel number as the input U,

g^h = σ(F_h(Γ^h)),  g^w = σ(F_w(Γ^w)),

where σ is the sigmoid function. The output for each channel c can then be written as

u'_c(i, j) = u_c(i, j) × g_c^h(i) × g_c^w(j).

Global Spatial Attention Module
We adopt a non-local attention module to model the global spatial context and capture the internal dependencies of features. The input feature U ∈ R^{H×W×C} is convolved to generate two new feature maps B and C, where {B, C} ∈ R^{H×W×C}. We then reshape B and C to V_1 ∈ R^{N×C} and V_2 ∈ R^{N×C}, where N = H × W is the number of spatial pixels. The feature map V_1 is multiplied by the transpose of V_2, and a softmax layer is applied to calculate the global spatial attention map T ∈ R^{N×N},

T_ij = exp(V_1i · V_2j) / Σ_{j=1}^{N} exp(V_1i · V_2j),

where V_1i is the i-th row of V_1 and V_2j is the j-th row of V_2. Meanwhile, we feed the feature U into a convolution layer to generate a new feature map D ∈ R^{H×W×C} and reshape it to V_3 ∈ R^{N×C}; we then perform a matrix multiplication between the transpose of T and V_3 and reshape the result to S ∈ R^{H×W×C} to obtain the global spatial attention weights.
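A compact numerical sketch of this non-local attention step, with random matrices standing in for the learned projections B, C, and D:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Sketch of global spatial (non-local) attention: pairwise pixel
# affinities form an N x N map T, which reweights a third projection of
# the input. V1, V2, V3 stand in for the reshaped convolution outputs.
rng = np.random.default_rng(3)
H, W, C = 3, 4, 6
N = H * W
V1 = rng.random((N, C))
V2 = rng.random((N, C))
V3 = rng.random((N, C))

T = softmax(V1 @ V2.T, axis=1)         # attention map, each row sums to one
S = (T @ V3).reshape(H, W, C)          # attended features back to H x W x C

assert np.allclose(T.sum(axis=1), 1.0)
assert S.shape == (H, W, C)
```

The O(N²) attention map is why this module is called "global": every output pixel aggregates information from every other pixel.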

Joint Loss Function
We adopt the ℓ1 norm to construct the loss function of the G-net, which includes sub-loss functions for the four generated constraints: (1) the reconstruction of AENet1, L_g1 = ||X_s − X̂_s^a||_1; (2) the reconstruction of the DG block, L_g2 = ||X_s − X̂_s^b||_1; (3) the reconstruction of AENet2, L_g3 = ||X_m − X̂_m||_1; and (4) the LR-MSI consistency, L_g4 = ||Z_1 − Z_2||_1. The sum-to-one property of the abundances is enforced by the loss

L_sum = ||1 − Σ_j A_j||_1,

where j indicates the j-th endmember and A_j is the j-th row of the abundance matrix A.
Based on the spectral mixing model, each pixel of the HSI is composed of a small number of pure spectral bases; therefore, the abundance matrices should be sparse. To guarantee the sparsity of the abundances, the Kullback-Leibler (KL) divergence is used to push most of the elements of the abundance matrices toward a small value,

L_KL = Σ_{i=1}^{s} Σ_{j=1}^{p} [ β log(β / a_ij) + (1 − β) log((1 − β) / (1 − a_ij)) ],

where s is the number of pixels, p is the number of endmembers, β is a sparsity parameter (0.001 in our network), and a_ij is an element of the abundance matrix. This loss function constrains all the generated abundances mentioned above. Ultimately, the fusion problem is solved by constructing a deep-learning GAN framework which optimizes the joint objective of Equation (32), combining the generation constraints L_g1-L_g4, the abundance constraints, and the adversarial losses of the two D-nets.
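The KL term has the usual sparse-autoencoder form; a sketch (the function name is ours) showing that abundances near the target β are penalized far less than dense ones:

```python
import numpy as np

# Sketch of the KL-divergence sparsity penalty on abundances: each entry
# is pushed toward a small target value beta, following the standard
# sparse-autoencoder form of the KL term.
def kl_sparsity(A, beta=0.001, eps=1e-8):
    A = np.clip(A, eps, 1 - eps)         # keep the logs finite
    return np.sum(beta * np.log(beta / A)
                  + (1 - beta) * np.log((1 - beta) / (1 - A)))

sparse = np.full((4, 3), 0.001)          # entries at the target -> near-zero penalty
dense = np.full((4, 3), 0.5)             # uniformly mixed abundances -> large penalty
assert kl_sparsity(sparse) < kl_sparsity(dense)
```

Because the penalty grows smoothly as entries move away from β, it encourages most abundance coefficients to stay near zero without forbidding a few large ones.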

Experiments and Analysis
To demonstrate the effectiveness and performance of the proposed GAN architecture on HSI-MSI fusion, we perform ablation analyses of the proposed network and compare it with other fusion methods.

Data Sets
The following experiments are conducted on four widely used HSI data sets: Pavia University, Indian Pines, Washington DC, and University of Houston. The Pavia University data were acquired by the ROSIS-3 optical airborne sensor in 2003. This image consists of 610 × 340 pixels with a ground sampling distance (GSD) of 1.3 m and a spectral range of 430-840 nm in 115 bands. The University of Houston data were used in the 2018 IEEE GRSS Data Fusion Contest and consist of 601 × 2384 pixels with a 1 m GSD. The data cover the spectral range of 380-1050 nm with 48 bands. The Indian Pines data were acquired by AVIRIS in 1992. This image consists of 145 × 145 pixels with a 20 m GSD, and the spectral range is 400-2500 nm covering 224 bands. The Washington DC data were acquired by the HYDICE sensor in 1995. This image consists of 1280 × 307 pixels with a GSD of 2.5 m. The spectral range is 400-2500 nm, consisting of 210 bands.
In the experiments, we selected and cropped these hyperspectral data sets, which are adopted as the original HR-HSI data. The LR-HSI is synthesized by spatially downsampling the original HSI data using Gaussian filters. For all data sets, the scaling ratio was set to 4. To synthesize the HR-MSI, the SRF characteristics of Landsat 8 were used. According to the spectral ranges of the HSI data sets, the blue-green-red band SRFs of Landsat 8 were used to synthesize the RGB images for the Pavia University and University of Houston data sets, and the blue-to-SWIR2 SRFs of Landsat 8 were used to form the 4-band MSIs for the Indian Pines and Washington DC data sets. Table 1 summarizes the parameters of the data sets used in the following experiments.
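This simulation protocol can be sketched as follows; a box blur stands in for the paper's Gaussian filter, and the SRF matrix is random rather than the Landsat 8 response:

```python
import numpy as np

# Hedged sketch of the data synthesis: the LR-HSI is the HR-HSI blurred
# and downsampled (scale 4), and the HR-MSI is the HR-HSI spectrally
# averaged through an SRF matrix. Sizes are toy values.
rng = np.random.default_rng(4)
H, W, L, l, ratio = 16, 16, 8, 3, 4
hr_hsi = rng.random((H, W, L))

# spatial degradation: average each ratio x ratio block (blur + downsample)
lr_hsi = hr_hsi.reshape(H // ratio, ratio, W // ratio, ratio, L).mean(axis=(1, 3))

# spectral degradation: each SRF column weights the HSI bands of one MSI band
srf = rng.random((L, l))
srf /= srf.sum(axis=0, keepdims=True)
hr_msi = hr_hsi @ srf

assert lr_hsi.shape == (H // ratio, W // ratio, L)
assert hr_msi.shape == (H, W, l)
```

Replacing the box blur with a Gaussian kernel and the random SRF with the sensor's published response curves reproduces the paper's setup.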

Model Training
The proposed network is implemented in the PyTorch framework. The model is trained using an Adam optimizer with the default parameters β_1 = 0.9, β_2 = 0.999, and ε = 10^−8. The learning rate is initialized to 5 × 10^−4, and a linear-decay drop-step schedule is applied to adjust it during training. The batch size is set to 1, and the input images are randomly cropped to form mini-batches that are sent to the model for training in turn.

Performance Metrics
Six objective metrics are adopted to compare the fusion result Ŷ with the ground truth Y: the root mean square error (RMSE), mean relative absolute error (MRAE), peak signal-to-noise ratio (PSNR), average structural similarity (aSSIM), spectral angle mapper (SAM), and erreur relative globale adimensionnelle de synthèse (ERGAS). The RMSE is defined as

RMSE = sqrt( (1/(KN)) Σ_{j=1}^{K} Σ_{i=1}^{N} (Y_j(i) − Ŷ_j(i))² ),

where j is the j-th band, i is the spatial location of a pixel, K is the number of bands, and N is the number of spatial pixels. The MRAE is given as

MRAE = (1/(KN)) Σ_{j=1}^{K} Σ_{i=1}^{N} |Y_j(i) − Ŷ_j(i)| / Y_j(i).

The PSNR is given as

PSNR = 10 log10( MAX² / MSE ),

where MAX is the maximum pixel value and MSE is the mean square error between Y and Ŷ. For HSI data, we employ the average of the channel-wise SSIMs to quantitatively evaluate the spatial consistency,

aSSIM = (1/K) Σ_{j=1}^{K} SSIM(Y_j, Ŷ_j),

where the SSIM of each band depends on constants C_1 and C_2, the standard deviations σ_Y and σ_Ŷ of the images Y and Ŷ, and their covariance σ_{Y,Ŷ}. The spectral angle distance (SAD) describes the similarity between a restored spectrum ŷ and the ideal spectrum y of a single pixel,

SAD(y, ŷ) = arccos( ⟨y, ŷ⟩ / (||y||_2 ||ŷ||_2) ).

The SAM is the average of the SADs over all the pixels in the scene,

SAM = (1/N) Σ_{i=1}^{N} SAD(y_i, ŷ_i).

The ERGAS is given as

ERGAS = 100 (h/l) sqrt( (1/K) Σ_{j=1}^{K} (RMSE_j / μ_j)² ),

where h/l is the ratio of high resolution to low resolution, RMSE_j is the RMSE of the j-th band, and μ_j is the mean of the j-th band of Y.

Ablation Experiments
To examine the necessity of various aspects of the method, multiple ablation studies on the proposed technique were conducted.

Generation Constraints
As described in Section 4.4, the definition of the loss function L_3 is closely correlated with the four data reconstruction modules of the G-net. In this section, we remove one sub-loss function at a time to demonstrate the effectiveness of the corresponding module.
Case 1: removing the generation constraint of AENet1, L_g1.
Case 2: removing the generation constraint of the DG block, L_g2.
Case 3: removing the generation constraint of AENet2, L_g3.
Case 4: removing the generation constraint of the LR-MSI, L_g4.
Case 5: using the complete generation constraint of the G-net, with the loss function given by Equation (32).
The results of all the cases on the four data sets are illustrated in Figure 6. It can be seen that the performance drops when any one constraint is removed. Furthermore, in Case 2, the removal of the DG block causes a drastic performance drop, which indicates that the DG branch strongly affects the overall fusion performance. Moreover, Case 4 shows the advantage of the learnable spatial and spectral degradation modules in improving the fusion result.

Attention Mechanism
To investigate the effectiveness of the proposed multi-attention module CMAN, an ablation analysis was conducted by removing the CMAN module and by replacing CMAN with other attention mechanisms. The multi-attention mechanisms included are the following: (1) CBAM [39]: a multi-attention module that combines both channel and spatial attention mechanisms.
(2) DANet [40]: a multi-attention module that introduces a self-attention mechanism in both the channel and spatial attention branches.
(3) EPSANet [41]: a pyramid squeeze attention module that extracts multi-scale spatial information and cross-dimension channel information.
In this section, we choose one RGB data set (Pavia University) and one MSI data set (Indian Pines) to demonstrate the comparisons on RGB-HSI and MSI-HSI fusion, respectively. Tables 2 and 3 summarize the quantitative results on the Pavia University and Indian Pines data sets with and without attention mechanisms. It is obvious that the proposed CMAN performs better than the other attention modules. The results of the CBAM module are even worse than those of the non-attention network, which means that not all attention mechanisms are suitable for the proposed GAN fusion framework.

In order to enforce the nonnegativity constraint on the abundance A, a nonnegative constraint function is applied to the output of the last convolution layer of both encoder nets. In addition, the weights of the convolution layer containing the endmembers E, the spatial degradation layer, and the spectral degradation layer should also meet the nonnegativity constraints. Since the weights of these layers may be updated to negative values after backpropagation, nonnegative constraint functions are also applied to these layers after the weights are updated. Both the softmax function and the clamp function can enforce nonnegativity.
The clamp function used in the proposed model is

clamp(a_ij) = 0 if a_ij < 0; a_ij if 0 ≤ a_ij ≤ 1; 1 if a_ij > 1,

where a_ij is an element of the abundance matrix. Compared with the softmax function, the gradient of the clamp function updates faster in the range [0, 1]. The two functions were tested in the network separately, and the convergence behavior over the training epochs is shown in Figure 7. It can be observed that the clamp function leads to a better reconstruction accuracy in fewer training epochs than the softmax function does. Therefore, the clamp function is adopted in the proposed network.
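The clamp projection is a one-liner; a sketch showing its pass-through behavior inside [0, 1], which is why its gradient is the identity there:

```python
import numpy as np

# Sketch of the clamp used to keep abundances in [0, 1]: values inside the
# interval pass through unchanged (identity gradient), while softmax would
# couple and squash all entries together.
def clamp01(a):
    return np.clip(a, 0.0, 1.0)

a = np.array([-0.2, 0.3, 1.4])
out = clamp01(a)
assert np.all(out >= 0.0) and np.all(out <= 1.0)
assert out[1] == 0.3                    # interior values are untouched
```

In PyTorch the same projection is `tensor.clamp(0.0, 1.0)`, applied both to activations and, after each update, to the constrained layer weights.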

Ablation Study of GAN
The discriminators of the GAN are designed to make the output of the autoencoders closer to the input in terms of feature and semantic information. In order to show the effectiveness of the adversarial training of the GAN framework, the discriminator networks with the corresponding loss functions L_1 and L_2 are removed to obtain a non-GAN network for HSI-MSI fusion. Meanwhile, we also test the GAN framework with only D-Net1 or only D-Net2. Figure 8 shows the convergence behaviors with and without the different discriminator nets on the Pavia University data set. The results demonstrate that the GAN frameworks outperform the non-GAN network. In addition, we chose the Pavia University and Indian Pines data sets to compare the performance of the GAN architecture on RGB-HSI and MSI-HSI fusion, respectively. As shown in Table 4, the proposed GAN achieves much better fusion results in all metrics.

Comparison Experiments
In this section, we make comprehensive comparisons to verify the reliability and validity of the proposed method. Four state-of-the-art deep-learning HSI-MSI fusion methods are used for comparison: (1) CUCA [35] consists of a two-stream convolutional autoencoder with a cross-attention module.
(2) HYCO [34] is an unsupervised coupled network consisting of autoencoders that extract spectral information from the LR-HSI and spatial-contextual information from the HR-MSI.
(3) UMAG [32] is an unsupervised multi-attention-guided network with a multiscale feature guided network as a regularizer.
(4) PGAN [31] is a physics-based GAN with a multiscale channel attention module and a spatial attention fusion module.
Since it is hard to visually discern the differences among false-color images of the fused results, we use heatmaps of the RMSE, MRAE, and SAD to visually demonstrate the performance of the fusion methods. The RMSE and MRAE heatmaps show the pixel-wise error of the reconstructed image cube, while the SAD heatmap represents the spectral consistency of each pixel in the fused HSI. We also use the PSNR, aSSIM, SAM, and ERGAS to quantitatively compare the methods. The PSNR and aSSIM are measures of spatial quality; the SAM evaluates the overall spectral consistency of the reconstructed HSI; and the ERGAS is a global statistical measure used to evaluate the dimensionless global error of the fused data.

Pavia University
We first conducted HSI-RGB fusion on the Pavia University and Houston University datasets. The performance of each fusion method on the Pavia University dataset is visualized in Figure 9. From the visual perspective, the proposed method generates results with far fewer spatial errors and spectral distortions than the other four methods. Among those four, PGAN looks better on the RMSE heatmap but worse on the MRAE and SAD heatmaps. According to the quantitative metrics summarized in Table 5, the proposed method produces the best results on all indicators, CUCA performs second best, and PGAN performs worse than the other methods.

Houston University
The comparison on the Houston University dataset is shown in Figure 10 and Table 6; our proposed method again achieves the best results. HYCO performs second best both visually and on the quantitative indicators, while CUCA performs worst on this dataset.

Indian Pines
We then conducted HSI-MSI fusion on the Indian Pines and Washington DC datasets. On the Indian Pines dataset, Figure 11 shows that CUCA, HYCO, UMAG and the proposed method are visually similar. In terms of the quantitative indicators given in Table 7, our method is superior to the other four methods, and HYCO is slightly better than the remaining three. The differences among the fusion results are small, likely because the distribution of ground objects in the Indian Pines dataset is relatively simple.

Washington DC
Figure 12 shows the comparison on the Washington DC dataset. Visually, the four comparison algorithms perform relatively poorly on this dataset. The quantitative indicators are summarized in Table 8. Our method is significantly better than the other four methods in both visual effects and evaluation metrics. PGAN is second best on the RMSE heatmap and the PSNR indicator, while CUCA performs second best on the remaining quantitative indicators.
In conclusion, the proposed method achieves the best performance on all four datasets. The other methods may perform well on a specific dataset but fail on the others, which further demonstrates the consistent superiority of the proposed method.

Conclusions
In this article, we proposed a novel unsupervised GAN to address the HSI-MSI fusion problem with arbitrary PSFs and SRFs. The GAN consists of one generator network and two discriminator networks that employ the spatial and spectral degradation models. To extract spectral information from the LR-HSI and spatial-contextual information from the MSI, the generator employs two streams of autoencoders. In parallel, the DG block reconstructs another HSI for subsequent discrimination. Through the attention module CMAN designed into the encoder nets, we also weight the importance of the extracted features. The discriminator nets extract multi-dimensional features from the inputs and outputs of the generator network to evaluate their authenticity. Driven by the joint loss function, the proposed method provides a simple end-to-end training approach. Comparison experiments on four open datasets demonstrate that the proposed method performs better overall.

Figure 1. Illustration of the relationships among the HR-MSI, the LR-HSI and the desired HR-HSI based on the linear mixing model.

Figure 2. Schematic framework of the AE-based GAN.

Figure 3. Architecture of the G-net with two coupled autoencoder networks.

Figure 6. Performance of generation constraint modules of the G-net over different datasets.

Figure 7. Convergence curves of PSNR with two different constrained functions.

Figure 8. Convergence behaviors without/with different discriminator nets on the Pavia University dataset.

Figure 9. Visual comparison on the Pavia University dataset.

Figure 10. Visual comparison on the Houston University dataset.

Figure 11. Visual comparison on the Indian Pines dataset.

Figure 12. Visual comparison on the Washington DC dataset.

Table 1. Original HSI datasets used in the experiments.

Table 2. Comparisons of different attention modules (Pavia University).

Table 3. Comparisons of different attention modules (Indian Pines).

Table 4. Ablation experiments on the adversarial network.

Table 5. Objective evaluation metrics on the Pavia University dataset.

Table 6. Objective evaluation metrics on the Houston University dataset.

Table 7. Objective evaluation metrics on the Indian Pines dataset.

Table 8. Objective evaluation metrics on the Washington DC dataset.