Attention-Enhanced Generative Adversarial Network for Hyperspectral Imagery Spatial Super-Resolution

Abstract: Hyperspectral imagery (HSI) with high spectral resolution contributes to better material discrimination, while its spatial resolution, limited by sensor technology, prevents the accurate distinction and analysis of targets. Although generative adversarial network (GAN)-based HSI super-resolution methods have achieved remarkable progress, they still treat vital and unessential features equally in feature expression and suffer from training instability. To address these issues, an attention-enhanced generative adversarial network (AEGAN) for HSI spatial super-resolution is proposed, which elaborately designs an enhanced spatial attention module (ESAM) and a refined spectral attention module (RSAM) in the attention-enhanced generator. Specifically, the devised ESAM, equipped with residual spatial attention blocks (RSABs), makes the generator focus on the spatial parts of HSI that are difficult to produce and recover, while the RSAM with spectral attention refines spectral interdependencies and guarantees spectral consistency at the respective pixel positions. Additionally, a special U-Net discriminator with spectral normalization is enclosed to pay more attention to the detailed information of HSI and to stabilize training. To produce more realistic and detailed super-resolved HSIs, an attention-enhanced generative loss is constructed to train and constrain the AEGAN model and to investigate the high correlation of spatial context and spectral information in HSI. Moreover, to better simulate complicated and authentic degradation, pseudo-real data are also generated with a high-order degradation model to train the overall network. Experiments on three benchmark HSI datasets illustrate the superior performance of the proposed AEGAN method in HSI spatial super-resolution over the compared methods.


Introduction
Hyperspectral remote sensing imagery provides rich spectral information, with tens to hundreds of continuous and narrow electromagnetic spectra of ground-object pixels, together with the spatial structure characteristics of an imaging scene [1]. Owing to the physical constraints of sensors and the exceedingly high acquisition costs in practice, hyperspectral imagery (HSI) with both high spatial resolution and high spectral resolution cannot be achieved concurrently in remote sensing, which causes difficulties for objective and accurate land cover identification and analysis in HSIs. Hyperspectral image super-resolution (HSI-SR) improves spatial-spectral resolution and recovers HSIs from their corresponding low-resolution (LR) observations, making it an economical, efficient, and promising signal post-processing technology in remote sensing. Super-resolved HSIs can be widely applied in many fields of computer vision and remote sensing, including object detection [2,3], target recognition [4,5], and land cover classification [6-8].
As a result, HSI-SR has attracted increasing attention in recent years, and considerable progress has been made. An extensive body of HSI-SR methods [9-11] are trained

The main contributions of this work are summarized as follows:
• An attention-enhanced generative adversarial network (AEGAN) is proposed for hyperspectral imagery spatial super-resolution. The designed AEGAN model excavates and enhances the deeper hierarchical spatial contextual features of HSIs via the enhanced spatial attention module (ESAM) with residual spatial attention blocks (RSABs) in the attention-enhanced generator. Additionally, a refined spectral attention module (RSAM) with spectral attention is established to explore and refine the interdependencies between neighboring spectral bands.
• To stabilize training and enhance the discriminative ability, a special U-Net discriminator with spectral normalization is enclosed in the proposed AEGAN model. This design directs the attention-enhanced generator to lay emphasis on more valuable information and estimates the discriminative probability that the realistic HR HSI is relatively more similar than the fake produced image using pseudo-real data.
• An attention-enhanced generative loss, containing the pixel-wise-based spatial loss, perceptual loss, adversarial loss, attention loss, and spectral-angle-mapper (SAM)-based loss, is devised to train the proposed AEGAN model and investigate the high correlation of spatial context and spectral information in HSI, producing more realistic and detailed super-resolved HSIs.
• The pseudo-real data generation module with a high-order degradation model is utilized to simulate LR HSIs, which are fed into the proposed AEGAN to verify its performance under complicated and authentic degradation. The experimental results illustrate the effectiveness and superiority of the proposed method relative to several existing state-of-the-art methods.
The remainder of this article is organized as follows. Section 2 reviews traditional and deep learning-based super-resolution methods. Section 3 elaborates the newly proposed AEGAN framework for HSI spatial SR in detail. Section 4 presents extensive experimental evaluation results and the corresponding analysis. Finally, Section 5 draws conclusions.

Related Work
Traditional super-resolution methods are sensitive to various errors (model error, noise error, image registration error, etc.) and are mostly based on exemplars or dictionaries. Therefore, they are easily constrained by the size of datasets or dictionaries, which further limits their practical applications. With the booming advancement of deep neural networks and graphics processing units, the above-mentioned deficiencies can be greatly mitigated by exploiting deep learning techniques in HSI SR.

Traditional Super-Resolution Methods
Early traditional super-resolution methods mainly comprise interpolation-based methods [13], reconstruction-based methods [19-21], and learning-based methods [22-24]. The interpolation-based approach takes the selection of operator functions as the key, uses the pixel values at spatially adjacent positions of the HSI as the numerical calculation object, and then quickly achieves high-resolution HSI reconstruction by inserting the pixel values estimated by the operator. The most classical interpolation approaches are nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation. Interpolation-based methods only consider adjacent pixels in the SR reconstruction of HSI and fail to make use of the abundant spatial and spectral details contained in the entire HSI data cube, normally leading to blurred edges and artifacts. Therefore, they are seldom employed in modern practical applications.
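To make the interpolation idea above concrete, the following is a minimal NumPy sketch of bilinear upsampling of a single spectral band; it is an illustrative baseline, not part of the proposed method:

```python
import numpy as np

def bilinear_upsample(img, scale):
    """Upsample a (H, W) band by an integer `scale` using bilinear interpolation."""
    h, w = img.shape
    H, W = h * scale, w * scale
    # Map output pixel centers back into the input grid
    ys = (np.arange(H) + 0.5) / scale - 0.5
    xs = (np.arange(W) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy = np.clip(ys - y0, 0, 1)[:, None]
    wx = np.clip(xs - x0, 0, 1)[None, :]
    # Gather the four neighbors and blend them
    tl = img[np.ix_(y0, x0)]
    tr = img[np.ix_(y0, x0 + 1)]
    bl = img[np.ix_(y0 + 1, x0)]
    br = img[np.ix_(y0 + 1, x0 + 1)]
    top = tl * (1 - wx) + tr * wx
    bot = bl * (1 - wx) + br * wx
    return top * (1 - wy) + bot * wy
```

Applied band by band to an HSI cube, such an operator only mixes adjacent pixels, which is exactly why it blurs edges and ignores the cube's spectral structure.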
Reconstruction-based methods generate HR HSI based on an image degradation model and prior information; they exhibit excellent performance for images with low complexity, while for images with abundant texture structure their performance is limited. Representative methods include iterative back-projection (IBP) [19], projection onto convex sets (POCS) [20], and maximum a posteriori (MAP) [21]. Reconstruction-based methods require multiple images of the same scene with small offsets and usually have good stability and adaptability. However, unresolved issues remain, such as laborious optimization, long solution times, and high computational complexity.
Learning-based SR methods adopt machine learning algorithms to study the complicated mapping relationship between LR HSI and HR HSI in the training stage and then leverage the acquired mapping function to obtain the target HR HSI from the input LR HSI in the testing phase. Freeman et al. [22] employed belief propagation on a Markov network to learn the parameters and synthesize super-resolution images. Chang et al. [23] used locally linear embedding to reconstruct super-resolution images through a linear combination of neighbors. Timofte et al. [24] developed an anchored neighborhood regression approach in which LR/HR images are represented with LR/HR dictionaries and corresponding coefficients. These traditional learning methods take advantage of the prior information of images and generally require fewer LR images to obtain a satisfactory reconstruction result. However, their performance depends heavily on the selection of training samples.

Deep Learning-Based Super-Resolution Methods
In recent years, deep learning-based super-resolution methods [25-27] have attracted extensive attention due to their remarkable performance. The groundbreaking super-resolution convolutional neural network (SRCNN) proposed by Dong et al. [26] utilized only three convolutional layers to learn a mapping function from LR images to HR images, giving rise to significant SR performance improvement over traditional approaches. Thereafter, FSRCNN [27], ESPCN [28], VDSR [29], and EDSR [30] were successively presented, increasing the network depth or the number of output features of each layer for effective feature detail excavation. SRResNet [31], SRDenseNet [32], and RDN [33], with various residual connections between shallow and deep layers, were also proposed to investigate feature reuse and mitigate the gradient vanishing or exploding problems in deep network training. Then, a compact channel-wise attention, called squeeze-and-excitation (SE) [17], was explored to emphasize informative features and suppress unimportant ones by exploiting the interdependencies of different feature channels. Woo et al. [18] presented an efficient convolutional block attention module, which exploits both spatial and channel-wise attention using convolutional layers to concentrate on informative areas in each feature map. However, when applied to HSI SR, these methods lack the capacity to accurately capture the long-range interrelationships and hierarchical characteristics of HSI signatures.
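As context for the channel-attention idea referenced above, a minimal NumPy sketch of an SE-style block follows; the weight matrices `w1` and `w2` are illustrative stand-ins for the learned squeeze-and-excitation FC layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_attention(feat, w1, w2):
    """Squeeze-and-excitation on (C, H, W) features.
    w1: (C//r, C) reduction weights, w2: (C, C//r) expansion weights."""
    squeeze = feat.mean(axis=(1, 2))                    # global average pool -> (C,)
    excite = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0))  # FC-ReLU-FC-sigmoid -> (C,)
    return feat * excite[:, None, None]                 # channel-wise rescaling
```

The per-channel gate in [0, 1] is what lets the block amplify informative feature channels and suppress uninformative ones.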
Therefore, more deep learning-based SR methods have been specifically designed for HSI [34-39]. The convolutional neural network (CNN) was first introduced into the HSI SR task by Yuan et al. [34], who transferred the mapping from an RGB image to HSI to the mapping from LR HSI to HR HSI with transfer learning. To take advantage of the spectral correlation of HSIs, Hu et al. [35] presented an HSI SR approach based on spectral difference learning and spatial error correction, in which the mapping of spectral differences between LR HSI and HR HSI is learned with a CNN. An efficient 3D-FCNN structure was put forward by Mei et al. [36] to quickly learn an end-to-end mapping relationship between LR HSI and HR HSI, in which three-dimensional (3D) convolution is adopted to excavate and represent deep spatial-spectral features. Then, spatial-spectral joint SR using a CNN [40] was presented to concurrently improve the spatial and spectral resolution. Yang et al. [41] presented a new multi-scale wavelet 3D-CNN for HSI SR to preserve details by predicting a sequence of wavelet coefficients of the potential HR HSI. Thereafter, Jiang et al. [42] established a group convolution and progressive upsampling network architecture to learn the spatial-spectral prior of HSIs, which can effectively improve the spatial resolution of HSI and generate excellent SR results. Zhao et al. [43] proposed a recursive, dense CNN with a spatial constraint strategy to boost the spatial resolution of HSIs, in which recursion learning, dense connection, and spatial constraint are combined. Yang et al. [25] built a new hybrid local and nonlocal 3D attentive CNN to investigate spatial-spectral-channel attention features and long-range interdependency by embedding local attention and nonlocal attention jointly into a residual 3D CNN. Wang et al. [44] developed a dual-channel network framework containing a 2D CNN and a 3D CNN, which collectively makes use of the information of a single band and its neighboring bands in HSIs and exhibits superior performance.
A series of novel GAN-based deep network models for HSI SR have also been proposed and proven effective for image quality improvement. Li et al. [45] presented a 3D-GAN-based HSI SR method to effectively mine spectral and spatial characteristics from HSIs. Huang et al. [46] integrated GAN and residual learning to learn effective features, attaining high metric values and spectral fidelity. Jiang et al. [47] designed a GAN model containing spectral and spatial feature-extraction blocks with residual connections in the generator to extract spatial-spectral features. Wang et al. [9] constructed a spatial feature enhancement network and a spectral refinement network to capture spatial context information and refine the correlation of spectral bands. Li et al. [10] proposed an adversarial learning method with a band attention mechanism to explore spectral relationships and keep valuable texture details, so as to further restrain spectral disorder and texture blurring. However, existing GAN-based deep learning frameworks for HSI SR frequently suffer from training difficulties, as well as a lack of further exploration of spatial and spectral contextual information, leading to spatial-spectral distortion.

Proposed Method
LR HSIs can be degraded from HR HSIs with different degradation models. To mimic the authentic degradation procedure and acquire HR HSIs, an attention-enhanced generative adversarial network (AEGAN) is proposed for hyperspectral imagery spatial super-resolution (as depicted in Figure 1), which consists of pseudo-real data generation and an attention-enhanced generative adversarial network. The pseudo-real data generation part contains a variety of complex degradation models, e.g., blurring, downsampling, and noise, yielding more sufficient LR training samples acquired from HR HSIs. The attention-enhanced generative adversarial network is made up of an attention-enhanced generator architecture with an ESAM and an RSAM, and a special U-Net discriminator with spectral normalization. It lays emphasis on spatial and spectral contextual features, effectively excavates spatial-spectral features, and estimates the probability that the realistic HR image is relatively more realistic than the generated image using pseudo-real data.

Model Formulation
The objective of HSI SR is to learn an end-to-end mapping so that a super-resolved HR HSI can be estimated from an input LR HSI. Denote the input LR HSI as I_LR ∈ R^{h×w×L}, in which w, h, and L represent the width, height, and number of bands, respectively. Its corresponding HR HSI is indicated as I_HR ∈ R^{H×W×L}, where W = s × w and H = s × h, with spatial scale factor s. Generally, the super-resolved HSI I_SR can be obtained from the learned mapping function G_Θ parametrized by Θ:

I_SR = G_Θ(I_LR) = G_Θ(D(I_HR)),

in which I_SR ∈ R^{H×W×L}, similar to I_HR. The parameter Θ represents the overall network weights and biases, and D denotes the complex degradation procedure.

Pseudo-Real Data Generation
Since a large amount of high-frequency information is lost in the LR HSI, it can be degraded from HR HSI using different degradation models. Traditional degradation models [48,49] comprising blurring, downsampling, and noise addition are frequently utilized to generate the input LR images. Motivated by the particularity of HSI and Real-ESRGAN [50], a high-order degradation model containing diverse degeneration estimation is adopted to simulate an authentic and complex degradation process in actual datasets and to obtain LR HSIs I_LR from the HR HSI I_HR. Here, the degeneration model in the spatial domain is represented as blurring (i.e., a convolution operation) followed by resizing (downsampling) and additive Gaussian-Poisson noise. When applied twice, the high-order degradation model can be mathematically expressed as

I_LR = D_2(D_1(I_HR)), with D_i(I) = ((I * k_i) ↓_{s_i}) + n_i, i = 1, 2,

where D_1 and D_2 represent the first-order and second-order degradation processes, respectively. * is the convolution operation, k_1 and k_2 denote the blur kernels, ↓_{s_i} stands for the downsampling operation (the two stages jointly realize the spatial scale factor s), and n_1 and n_2 represent the Gaussian-Poisson noise, respectively.

Blurring: Gaussian blurring is the most commonly employed blur in image degradation. To better mimic the authentic blur degradation of the HS imager, sufficient Gaussian kernels are considered to expand the degradation blur space and cover more diversified blur kernel shapes. The employed blur kernels include generalized Gaussian blur kernels [51] with a shape parameter in the range (0.5, 4) and a plateau-shaped distribution with a shape parameter in the range (1, 2), from the HR space and LR space. The probability of all the blur kernels is 0.15 and the kernel size is set to 15 × 15. In addition, the standard deviation of the Gaussian blur kernel is arranged within (0.2, 3) for the first application and (0.2, 1.5) for the second application.
Resizing: In this work, resizing mainly denotes the downscaling of HR HSI with bilinear or bicubic downsampling methods to acquire adequate training samples (LR HSIs). Furthermore, considering the pixel misalignment in nearest neighbor interpolation and to maintain the HSI resolution in a reasonable range, bilinear and bicubic downsampling methods are randomly selected.
Additive noise: Noise is inevitable in practical hyperspectral imaging scenarios, where Gaussian noise and Poisson noise are the most common. Following Ref. [50], additive Gaussian and Poisson noise caused by camera sensors are jointly employed for degradation. Specifically, the probability for each type of noise is set to 0.5. The standard deviation of the Gaussian noise is assigned within (1, 30) for the first-order degradation and (1, 25) for the second-order degradation. The scale of the Poisson noise is arranged within (0.05, 3) and (0.05, 2.5) for the first-order and second-order degradation, respectively. Additionally, since overshoot artifacts are sometimes produced in practical degradations, an idealized 2D sinc filter (available online at https://dsp.stackexchange.com/questions/58301/2-dcircularly-symmetric-low-pass-filter, accessed on 6 May 2022) with a probability of 0.1 is also employed to simulate real over-sharpened artifacts.
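The blur-resize-noise chain applied twice can be sketched as follows. This is a simplified NumPy stand-in: kernel shapes, noise levels, and the strided-slice "resize" are illustrative placeholders for the randomized generalized-Gaussian kernels, bilinear/bicubic resizing, and sampled noise ranges described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(size=15, sigma=1.0):
    """Isotropic Gaussian blur kernel, normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    k = np.exp(-ax**2 / (2 * sigma**2))
    k2 = np.outer(k, k)
    return k2 / k2.sum()

def blur(img, kernel):
    """'Same'-size convolution via reflect padding (slow loop, fine for a sketch)."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="reflect")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

def degrade_once(img, sigma_blur, scale, noise_sigma):
    """One order of degradation: blur, downsample, Gaussian + Poisson noise."""
    x = blur(img, gaussian_kernel(sigma=sigma_blur))
    x = x[::scale, ::scale]                            # naive downsampling stand-in
    x = x + rng.normal(0.0, noise_sigma, x.shape)      # additive Gaussian noise
    x = rng.poisson(np.clip(x, 0, None) * 255) / 255.0 # Poisson (shot) noise
    return x

def high_order_degrade(hr, scale=2):
    """Second-order degradation: D2(D1(HR)); the scale is applied in D1 here."""
    x = degrade_once(hr, sigma_blur=1.5, scale=scale, noise_sigma=10 / 255)
    x = degrade_once(x, sigma_blur=0.8, scale=1, noise_sigma=5 / 255)
    return x
```

In the paper's pipeline each stage additionally samples its kernel family, resize mode, and noise parameters from the stated ranges, and a sinc filter is occasionally mixed in.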
The pseudo-real data generated by the high-order degradation of the original HR HSI will serve as training samples for the proposed network. Some examples of the produced LR HSIs are displayed in Figure 2, indicating the diversity of unknown degradation models in HSI SR training sample construction.

Attention-Enhanced GAN
Currently, adversarial learning [14] is quite popular for image-generation tasks and has been utilized for image SR. In this work, a new attention-enhanced generative adversarial network (AEGAN) is specifically designed to improve the quality of HSI, as shown in Figure 3. In the proposed AEGAN, most of the computation is concentrated in a smaller resolution space, so that the consumption of GPU memory and other computing resources can be greatly reduced. Therefore, an inverse operation of pixel-shuffle [28], called pixel unshuffle, is first applied to the generated pseudo-real data to increase its channel size and reduce its spatial size:

F_0 = H_pix_uns(I_LR),

where I_LR and F_0 are the pseudo-real LR HSIs generated from the original HR HSIs and the initial features obtained by the pixel unshuffle operation H_pix_uns, respectively. The designed AEGAN consists of an attention-enhanced generator architecture with an ESAM and an RSAM, and a particular U-Net discriminator with spectral normalization. They are constructed to effectively capture more valuable and representative spatial-spectral characteristics and to estimate the probability of an input HSI being real or fake relative to the authentic HSI.
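The pixel unshuffle operation is a fixed rearrangement, not a learned layer; a NumPy sketch of the standard definition:

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Inverse pixel-shuffle: rearrange (C, H, W) -> (C*r*r, H//r, W//r).

    Each r x r spatial block is folded into the channel dimension, so the
    tensor holds the same values at 1/r the spatial resolution."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "spatial size must be divisible by r"
    x = x.reshape(c, h // r, r, w // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)
```

Because no information is discarded, the subsequent convolutions operate on an r-times-smaller grid at the cost of r² more channels, which is the memory/compute trade-off exploited above.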

Attention-Enhanced Generator Architecture
With respect to HSI SR, the purpose of the generator architecture is to recover or generate the corresponding HR HSI with more authentic texture information derived from the LR HSI. As depicted in Figure 3, according to the spatial-spectral characteristics of HSI, the attention-enhanced generator is designed with an ESAM and an RSAM to extract spatial texture details and explore spectral dependencies. The initial shallow features are extracted from the generated pseudo-real data through two convolutional layers with kernel sizes of 9 × 9 and 1 × 1, each followed by the leaky rectified linear unit (LReLU) activation function [52]:

F_1 = δ(H^2_CONV(δ(H^1_CONV(F_0)))),

in which F_1 represents the extracted shallow features, δ stands for the LReLU activation function, and H^1_CONV and H^2_CONV indicate the shallow feature extraction functions of the first and second convolutional layers, respectively.
Next, the extracted shallow feature F_1 is transmitted into the ESAM to further excavate and enhance more meaningful and informative characteristics. The designed ESAM with a dense connection aggregates the spatial attention mechanism into residual spatial attention blocks (RSABs) that can perform advanced feature extraction, and it comprises n cascading RSABs with an identical layout. The architecture of the RSABs is displayed in Figure 4. An RSAB consists of CONV-ReLU-CONV with a kernel size of 3 × 3, a global pooling layer, a sigmoid layer, and a 1 × 1 transition convolutional layer. Two RSABs are cascaded to explore the local and global spatial characteristics of HSIs. Accordingly, the output enhanced spatial attention features F_ESAM can be formulated as follows:

F_n = H_RSAB(F_{n−1}), F_ESAM = σ(H_GP(F_n)) ⊗ F_n,

where F_{n−1} and F_n are the input of the n-th RSAB and the corresponding output of enhanced spatial attention features, respectively. H_RSAB(·) and H_GP(·) indicate the mapping functions of the RSAB and global pooling, σ(·) represents the sigmoid activation function that maps the enhanced spatial features into the range of [0, 1], and ⊗ denotes element-wise multiplication. Further, the spatial attention map is obtained (see Figure 5a), which emphasizes more informative spatial characteristics of HSI and suppresses useless features. With the acquired enhanced spatial attention features, a 3 × 3 convolutional layer and a sub-pixel convolution [28] layer are utilized to augment the spatial resolution of the input HSI to the desired size:

F_up = H_pixelsh(F_ESAM),

in which H_pixelsh denotes the mapping function of the upsampling layer and F_up represents the feature maps after upsampling. Analogously, to make full use of the spectral dependencies of HSI and reduce spectral distortion, an RSAM with spectral attention is also constructed to explore the spectral interrelationships and refine the holistic characteristic information of HSI.
It is made up of CONV-ReLU-CONV with a kernel size of 1 × 1, a global pooling layer, a sigmoid layer, and a 1 × 1 transition convolutional layer (as depicted in Figure 5b). The generated feature map using pseudo-real data, F_RSAM, can be represented as follows:

F_RSAM = σ(H_GP(F_up)) ⊗ F_up.
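The gating mechanics of the two attention modules can be illustrated with a toy NumPy sketch. This is one plausible reading of the block diagrams, with 1 × 1 weight matrices (`w1`, `w2`, `w`) standing in for the convolutional layers and the transition convolution omitted; it is not the exact layer stack of the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rsab(feat, w1, w2):
    """One residual spatial attention block on (C, H, W) features.
    w1/w2 (C x C) stand in for the CONV-ReLU-CONV body (3 x 3 in the paper)."""
    c, h, w = feat.shape
    body = (w2 @ np.maximum(w1 @ feat.reshape(c, -1), 0)).reshape(c, h, w)
    attn = sigmoid(body.mean(axis=0))        # (H, W) spatial attention map in [0, 1]
    return feat + body * attn[None, :, :]    # spatially gated residual output

def rsam(feat, w):
    """Refined spectral attention: global spatial pooling, a transition
    mapping w (C x C), and sigmoid gating applied per spectral band."""
    descriptor = feat.mean(axis=(1, 2))      # global pooling -> (C,)
    weights = sigmoid(w @ descriptor)        # per-band attention in [0, 1]
    return feat * weights[:, None, None]
```

The key contrast: the RSAB gate varies over spatial positions (emphasizing hard-to-recover regions), while the RSAM gate varies over bands (enforcing spectral consistency).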

U-Net Discriminator with Spectral Normalization
In a general GAN, the discriminator aims to model the data distribution and learn the difference between the authentic image and the generated one, outputting 0 or 1 to penalize the generator. It plays an indispensable role in the proposed method and directly affects the SR performance. Motivated by the work in [53], a U-Net structure with skip connections is employed as the discriminator of the proposed network (the bottom part of Figure 3) to handle issues related to the spatial SR of HSI. The U-Net discriminator can not only furnish detailed pixel-by-pixel feedback to the attention-enhanced generator but also output the genuineness value of each pixel in the HSI. Considering the possible training instability caused by the U-Net architecture and diverse degenerations, spectral normalization regularization [54] is introduced into the U-Net discriminator to stabilize training and mitigate over-sharpening and artifacts.
In this paper, following [50,55], the U-Net discriminator with spectral normalization is composed of an encoder and a decoder, which are concatenated via skip connections. The encoder employs one 3 × 3 convolutional layer with a stride of 1 and three convolutional layers with a kernel size of 4 × 4, each followed by spectral normalization and an LReLU activation with a negative-slope coefficient α = 0.2. Notably, for the last two max-pooling layers, a stride of 2 is employed for downsampling to decrease the spatial size of the HSI feature map and enlarge the receptive fields. To improve the representative capacity of the proposed network, the channel number is also increased accordingly. Similar to the conventional discriminator in [16], the relativistic discriminator estimates the holistic probability of an input HSI being real or fake relative to the authentic HSI:
D_Ra(I_HR, G(I_LR)) = σ(T(I_HR) − E[T(G(I_LR))]),

in which I_HR represents the ground-truth HR HSI, G(I_LR) stands for the generated SR HSI, and E[·] denotes taking the average over all real or fake HSIs. D_Ra(·) and T(·) denote the function of the relativistic discriminator and the output of the non-transformed discriminator, respectively.

The decoder increases the spatial resolution of the acquired feature map with the upsampling operator and propagates the context information to layers with a higher resolution. It contains three 3 × 3 convolutional layers and two upsampling layers. Each convolutional layer is regularized with spectral normalization. To facilitate the information flow between low-level and high-level features and promote the discrimination ability, the output of the encoder is concatenated with the input of the decoder. Finally, two 3 × 3 convolutional layers followed by LReLU and a 3 × 3 convolutional layer with only one kernel followed by a sigmoid activation function are employed to produce a binary mask (i.e., classification score) M(i, j, k) ∈ R^{H×W×L}, which measures the pixel-wise difference between the authentic and forged pixels of an input image. Accordingly, the loss function of the decoder, L_{D_dec}^U, can be represented as

L_{D_dec}^U = −Σ_{i,j,k} [log M_r(i, j, k) + log(1 − M_f(i, j, k))],

in which M_r(i, j, k) and M_f(i, j, k) denote the score maps of the ground-truth HR HSI and the generated SR HSI, respectively. That is, M_r(i, j, k) = 1 denotes a ground-truth pixel at (i, j), M_r(i, j, k) = 0 indicates a forged pixel at (i, j), and M_f(i, j, k) is the opposite. Consequently, the total loss function for the U-Net discriminator, L_D^U, combines the encoder's relativistic adversarial loss with the decoder loss L_{D_dec}^U. The size of the final output feature of the U-Net discriminator is the same as that of the super-resolved HSI, indicating the similarity between corresponding pixels in the generated HSI and the realistic one. The closer the pixel similarity value is to 1, the closer the generated HSI is to the realistic HSI, and vice versa.
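The two discriminator-side ingredients can be sketched in NumPy. This is a simplified stand-in: spectral normalization is shown on a bare matrix (the network applies it to convolutional weights), and the raw scores `t_real`/`t_fake` abstract the non-transformed discriminator outputs T(·):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spectral_normalize(w, n_iter=100):
    """Divide a weight matrix by its largest singular value, estimated with
    power iteration -- the standard spectral normalization trick that bounds
    the layer's Lipschitz constant and stabilizes GAN training."""
    u = np.ones(w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = w @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ w @ v            # top singular value estimate
    return w / sigma

def relativistic_scores(t_real, t_fake):
    """Relativistic discriminator: probability that the real HSI is more
    realistic than the average fake, and vice versa."""
    d_real = sigmoid(t_real - t_fake.mean())   # D_Ra(real, fake)
    d_fake = sigmoid(t_fake - t_real.mean())   # D_Ra(fake, real)
    return d_real, d_fake
```

In training, the spectrally normalized weights are what every encoder/decoder convolution uses, while the relativistic scores feed the adversarial losses of both networks.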

Attention-Enhanced Generative Loss
For HSI spatial SR, the selection of the loss function is of considerable importance in optimizing the reconstruction performance of the GAN framework [56]. In this work, in order to achieve more realistic and detailed super-resolved HSIs, an attention-enhanced generative loss function is devised to collectively train the proposed AEGAN and investigate the high correlation of spatial context and spectral information in HSI. It is composed of the pixel-wise-based spatial loss term L_spa, the perceptual loss term L_per, the adversarial loss term L_G, the attention loss term L_atten, and the spectral-angle-mapper (SAM)-based loss term L_SAM, and can be expressed as follows:

L_enh = L_spa + λ L_per + µ L_G + η L_atten + ϕ L_SAM,

where λ, µ, η, and ϕ denote the trade-off parameters that balance the different loss terms. These hyperparameters determine the contribution of each loss term to the attention-enhanced generative loss.

Pixel-wise-based spatial loss term. For image SR, the mean squared error (MSE) is often employed as the loss of the neural network. Although an MSE-based loss function can attain higher peak signal-to-noise ratio (PSNR) and structural similarity index measurement (SSIM) values, there is still a large difference between the distribution of the super-resolved image and that of the actual one, such as edge smoothing and missing high-frequency details. To make the restored HSI as close as possible to the actual HR HSI, the least absolute deviation (ℓ1-norm) is adopted as the pixel-wise spatial loss term to constrain the content of the recovered HSI and guarantee the restoration accuracy of each pixel. Thereby, the least-absolute-deviation-based pixel-wise spatial loss term L_spa can be expressed as

L_spa = (1/(H × W × L)) ‖I_HR − G(I_LR)‖_1,

where H = s × h, W = s × w, and L indicate the height, width, and spectral band number of I_HR, respectively. G(I_LR) refers to the reconstructed HSI generated from pseudo-real data, and ‖·‖_1 denotes the ℓ1-norm.

Perceptual-based loss term.
Taking the particularity of HSI into consideration, the perceptual loss term is designed to make the reconstructed HSI perceptually approximate the actual HR HSI according to high-level characteristics obtained from a pre-trained deep network. Similar to [15,16], the recovered HSI and the ground-truth HR HSI are together used as the input of a pre-trained VGG19 to mine the features of the VGG19-54 layer. The perceptual-based loss term L_per can be defined as

L_per = ‖H_VGG(I_HR) − H_VGG(G(I_LR))‖_1,

in which H_VGG(·) and G(·) denote the function of the VGG network and the generator, respectively.

Adversarial-based loss term. The adversarial loss term represents the difference between the actual HR HSI and the produced super-resolved HSI; it drives the self-optimization of the generator with the parameters returned by the discriminator and further facilitates more authentic HSI recovery. In this paper, the more powerful U-Net discriminator is applied to direct the attention-enhanced generator to lay emphasis on more valuable information, improving the restored textures and achieving better visual quality. Therefore, according to the loss function of the discriminator, the corresponding adversarial loss for the attention-enhanced generator can be symmetrically expressed as

L_G = −E[log(1 − D_Ra(I_HR, G(I_LR)))] − E[log D_Ra(G(I_LR), I_HR)].

Notably, the higher the adversarial loss value is, the worse the reconstructed HSI is.

Attention-based loss term. Although the adversarial loss of the attention-enhanced generator can deceive the discriminator by continuously updating the weights in the direction of the generated sample distribution, it cannot guarantee accurate reconstruction of the spectral information at the respective pixel positions. Inspired by [53], the classification score obtained by the U-Net discriminator is employed as the weighted attention loss of the attention-enhanced generator.
This attention-based loss term is capable of evaluating the real or fake degree of each pixel position of HSI, leading to an attention-enhanced generator more focused on the parts of the generated HSI that are difficult to produce and recover. The attention-based loss term is defined using M_f(i, j, k), the classification score of the generated HSI derived by the U-Net discriminator, as a per-pixel weight.

The SAM-based loss term. In order to constrain the spectral structure and reduce spectral distortion, the SAM loss is employed to predict the spectral similarity between the recovered spectrum and the authentic spectrum:

L_SAM = (1/(H × W)) Σ_{i,j} arccos( ⟨z_{i,j}, ẑ_{i,j}⟩ / (‖z_{i,j}‖_2 ‖ẑ_{i,j}‖_2) ),

where z_{i,j} and ẑ_{i,j} stand for the spectral vectors of the authentic HSI and the recovered image at the same spatial position (i, j), respectively. A smaller SAM value indicates that the two spectra are more similar. The combination of the above loss terms with different weighting hyperparameters, as shown in Equation (12), constrains the devised AEGAN model to produce more visually realistic results with fewer artifacts. The training procedure of the AEGAN model is summarized in Algorithm 1.
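As an illustration, the pixel-wise ℓ1 spatial term and the SAM spectral term can be sketched in NumPy (a simplified stand-in for the actual training code, operating on (H, W, L) cubes):

```python
import numpy as np

def spatial_l1_loss(hr, sr):
    """Pixel-wise least-absolute-deviation loss over an (H, W, L) cube."""
    return np.abs(hr - sr).mean()

def sam_loss(hr, sr, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel spectra of two cubes."""
    dot = (hr * sr).sum(axis=-1)
    denom = np.linalg.norm(hr, axis=-1) * np.linalg.norm(sr, axis=-1) + eps
    return np.arccos(np.clip(dot / denom, -1.0, 1.0)).mean()
```

Note the complementarity: the ℓ1 term penalizes per-pixel magnitude errors, while the SAM term is scale-invariant per pixel and only penalizes changes in spectral shape, which is why both are needed.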

Algorithm 1 Training procedure of our proposed method
Initialization: all network parameters are initialized [57]
Sample {I_LR, I_HR}_{n=1}^N from pseudo-real I_LR and authentic I_HR
m, p, r are the batch size, iteration number, and learning rate, respectively
for p = 1, 2, ..., P do
    for n in N do
        n = n + m
        I_SR ⇐ G(I_LR)
        I_SR, M_SR ⇐ D(I_SR)
        Calculate the loss function L_enh according to Equation (12)
        Update the parameters of the attention-enhanced generator: Θ_G ⇐ OptimizerG(L_enh, r)
        I_HR, M_HR ⇐ D(I_HR)
        Calculate the loss function L_D^U according to Equation (11)
        Update the parameters of the U-Net discriminator: Θ_D ⇐ OptimizerD(L_D^U, r)
    end for
end for
Save the attention-enhanced generative adversarial network.
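The alternating schedule of Algorithm 1 can be expressed structurally as follows. The callables are placeholders for the real networks, losses, and optimizers, which are elided in this sketch:

```python
def train_aegan(batches, generator, g_step, d_step, epochs=1):
    """Structure-only sketch of Algorithm 1: for each batch, generate I_SR,
    update the attention-enhanced generator on L_enh, then update the
    U-Net discriminator on L_D. `g_step`/`d_step` encapsulate loss
    computation and one optimizer step, returning the loss value."""
    history = []
    for _ in range(epochs):
        for i_lr, i_hr in batches:
            i_sr = generator(i_lr)        # I_SR <= G(I_LR)
            g_loss = g_step(i_sr, i_hr)   # Theta_G <= OptimizerG(L_enh, r)
            d_loss = d_step(i_sr, i_hr)   # Theta_D <= OptimizerD(L_D, r)
            history.append((g_loss, d_loss))
    return history
```

The per-batch ordering (generator step first, then discriminator step on the same batch) mirrors the pseudocode above.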

Experiments
In this section, numerous experiments are performed to evaluate the performance of the proposed AEGAN method. First, the experimental hyperspectral remote sensing datasets, implementation details, and quantitative evaluation criteria are introduced. Then, several relevant ablation studies, as well as comparative experiments with existing state-of-the-art methods, are conducted on both pseudo-real datasets and benchmark datasets.

Experimental Datasets and Implementation Details
In the experiments, three publicly available real hyperspectral remote sensing datasets (available online at http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes, accessed on 9 January 2022), namely the Pavia University, Pavia Center, and Cuprite datasets, are employed as original HSIs for validation. The Pavia University dataset covers 103 spectral bands from 430 nm to 860 nm after discarding the noisy and bad bands, containing 610 × 340 pixels in each spectral band with a geometric resolution of 1.3 m. The Pavia Center dataset collects 102 spectral bands after discarding bad bands, comprising a total of 1096 × 1096 pixels in each spectral band. In the Pavia Center scene, however, information on some areas is not available, so only 1096 × 715 valid pixels remain in each spectral band. The Cuprite dataset captures 202 valid spectral bands from 370 nm to 2480 nm after the removal of corrupted bands, and each band consists of 512 × 614 pixels.
The input LR samples I_LR (the pseudo-real data) are generated from the original HSI samples by the high-order degradation model with scaling factors of 2 and 4. The original HSI serves as the reference authentic image I_HR. For the three employed datasets, to verify the effectiveness of the proposed AEGAN approach, patches of size 150 × 150 × L pixels with the richest texture details are cropped as testing images, and the remaining parts are utilized as training samples. To cope with the issue of inadequate training images, the training samples are augmented by flipping horizontally and rotating by 90°, 180°, and 270°. Therefore, the size of the LR HSI samples can be 36 × 36 × L or 72 × 72 × L, and the corresponding output HR HSIs are 144 × 144 × L.
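The eight-fold augmentation described above (the original patch, its three rotations, and the horizontal flips of all four) can be sketched per band as follows; the helper names are illustrative:

```python
def augment(patch):
    """Eight-fold augmentation of one spectral band (a 2-D list of pixel
    values): three clockwise rotations plus horizontal flips of all four
    orientations, as described for the training samples."""
    def rot90(m):
        # Clockwise 90-degree rotation of a 2-D list.
        return [list(row) for row in zip(*m[::-1])]

    def hflip(m):
        # Horizontal flip (reverse each row).
        return [row[::-1] for row in m]

    out = [patch]
    for _ in range(3):
        out.append(rot90(out[-1]))      # 90, 180, 270 degrees
    out += [hflip(m) for m in out]      # flipped counterparts
    return out
```

For an HSI cube, the same geometric transform would be applied to every spectral band so that spatial augmentation leaves the spectra at each pixel intact.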
The adaptive moment estimation (Adam) optimizer [58] with default exponential decay rates β1 = 0.9 and β2 = 0.999 is adopted for network training. All network training and testing are implemented on four NVIDIA GeForce GTX 1080Ti GPUs using the PyTorch framework (available online at https://pytorch.org, accessed on 9 January 2022). Due to the constraint of GPU memory, following [55], the batch size is set to 16 and the initial learning rate to 0.0001, while the learning process is terminated after 2500 epochs. The coefficients of the attention-enhanced generative loss function are empirically set to λ = 0.005, µ = 0.02, η = 0.02, and ϕ = 0.01, respectively. Moreover, during training, to retain the spatial size of the feature map after convolution, zero-padding is applied in all convolutional layers.
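The role of the zero-padding mentioned above can be checked with the standard convolution output-size formula; for an odd kernel size k with stride 1 and padding (k − 1)/2, the spatial size is preserved:

```python
def conv_out_size(n, k, stride=1, pad=0):
    """Spatial size of a convolution output for input size n, kernel size k.
    With stride 1 and 'same' zero-padding pad = (k - 1) // 2 (odd k),
    the output size equals the input size."""
    return (n + 2 * pad - k) // stride + 1
```

For example, a 3 × 3 convolution with padding 1 keeps a 36 × 36 (or 72 × 72) LR feature map at its original spatial size.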

Evaluation Metrics
To comprehensively evaluate the performance of the developed HSI SR approach, several commonly used evaluation metrics are adopted, including the mean peak signal-to-noise ratio (MPSNR), the mean structural similarity index (MSSIM), the erreur relative globale adimensionnelle de synthèse (ERGAS), and the spectral angle mapper (SAM). MPSNR describes the similarity match based on the mean-square error, and MSSIM represents the structural consistency between the recovered HSI and the ground-truth one. In this work, both MPSNR and MSSIM are measured as mean values over all spectral bands. ERGAS is a global image quality indicator measuring the band-wise normalized root MSE between the super-resolved image and the authentic one. For spectral fidelity, SAM investigates the spectral recovery quality by estimating the average angle between the spectral vectors of the recovered HSI and the actual HSI. MPSNR close to +∞, MSSIM close to 1, and SAM and ERGAS close to 0 denote a better reconstructed HR HSI.
Given a super-resolved HR HSI ŷ and an authentic HSI y, the above-mentioned evaluation metrics can be respectively formulated with the following notation: MAX_k indicates the maximum intensity of the k-th band of the HSI; ρ_y, ρ_ŷ and ξ_y, ξ_ŷ denote the average values and variances of y and ŷ, respectively; ξ_yŷ stands for the covariance between y and ŷ; c_1 and c_2 are two constants to improve stability, which are set to 0.01 and 0.03, respectively; ⟨·, ·⟩ stands for the dot product of two spectral vectors; ‖·‖_2 denotes the ℓ2-norm; and s represents the scaling factor.
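As a concrete illustration of the band-averaged definition of MPSNR described above, the following is a minimal pure-Python sketch (the function name and data layout are illustrative, not the authors' evaluation code):

```python
import math

def mpsnr(ref, rec, max_k=None):
    """Mean PSNR over spectral bands.

    ref, rec: lists of bands, each band a 2-D list of pixel values.
    max_k:    peak intensity per band; defaults to the per-band maximum
              of the reference (an assumption for this sketch).
    """
    psnrs = []
    for band_r, band_s in zip(ref, rec):
        flat_r = [v for row in band_r for v in row]
        flat_s = [v for row in band_s for v in row]
        mse = sum((a - b) ** 2 for a, b in zip(flat_r, flat_s)) / len(flat_r)
        peak = max(flat_r) if max_k is None else max_k
        # PSNR of one band; identical bands give +infinity.
        psnrs.append(10 * math.log10(peak ** 2 / mse) if mse > 0 else float("inf"))
    return sum(psnrs) / len(psnrs)
```

SAM would be computed analogously from the angle between per-pixel spectral vectors, and MSSIM as the band-wise mean of the SSIM index with the stabilizing constants c_1 and c_2.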

Ablation Studies
In this work, ablation studies are conducted on the Pavia Center, Pavia University, and Cuprite datasets to illustrate the effectiveness of several specific designs in the proposed approach, including the ESAM and RSAM in the attention-enhanced generator structure, the U-Net discriminator, the pseudo-real data generation, the pre-training and fine-tuning strategy, and the attention-enhanced generative loss function. The effectiveness of each component is validated by removing it from the proposed method while keeping the other components unchanged. Specifically, the baseline is set as employing bicubic-downsampled HR HSI as the LR HSI and replacing the ESAM, RSAM, and U-Net discriminator with ordinary convolutional layers.
(1) Ablation study on ESAM or RSAM: To demonstrate the positive effect of ESAM and RSAM in the proposed method, ESAM or RSAM is removed from the attention-enhanced generator structure, denoted as w/o ESAM and w/o RSAM, respectively. It can be observed from Table 1 that both of them contribute to the performance improvement of the proposed approach. In particular, without ESAM or RSAM in the attention-enhanced generator, the super-resolution performance severely deteriorates, i.e., MPSNR and MSSIM decrease by 0.3413 dB and 0.0345, 0.1812 dB and 0.0164, and 0.9498 dB and 0.0553 on the three HSI datasets, respectively. Moreover, the visual results of w/o ESAM and w/o RSAM are exhibited in Figure 6, along with the reference ground truth. The result of w/o ESAM appears a little blurry, while the result of w/o RSAM seems too sharp, leading to a loss of detailed texture. The experimental results illustrate that ESAM is effective at improving and enhancing the feature representation ability for both low-frequency and high-frequency information, while RSAM is dedicated to finer texture detail reconstruction.
(2) Ablation study on the U-Net discriminator: The U-Net discriminator component (w/o U-Net) is substituted with a simple convolutional layer. As shown in Table 1, the performance of the proposed AEGAN approach is slightly better than that without the U-Net discriminator.
(3) Ablation study on pseudo-real data generation: The pseudo-real data generation employs a high-order degradation model to better mimic the complicated authentic degradation procedure. To verify its effectiveness, LR inputs obtained by bicubic interpolation are employed instead of the generated pseudo-real data (the corresponding method is denoted as w/o pseudo-real). From the results presented in Table 1, it can be observed that, with pseudo-real data generation, MPSNR and MSSIM are improved by 0.1397 dB and 0.0214, 0.1205 dB and 0.0095, and 0.0220 dB and 0.0217 on the three datasets, respectively.
(4) Ablation study on spatial attention and spectral attention: The core structures of ESAM and RSAM are the spatial attention block and spectral attention block in the residual connection of AEGAN. The influence of the spatial attention and spectral attention of the proposed AEGAN model is also explored by removing the spatial attention in ESAM (denoted as w/o SpaA) and the spectral attention in RSAM (denoted as w/o SpeA), respectively. The corresponding experimental results on the Pavia Centre dataset at scale factor 4 are reported in Table 2.
(5) Ablation study on pre-training and fine-tuning: To demonstrate the effect of the pre-trained and fine-tuned network models, the pre-training and fine-tuning strategies are removed from the proposed network, respectively (denoted as w/o pre-training and w/o fine-tuning). The experimental results are tabulated in Table 3. It can be easily observed that the proposed AEGAN approach with the pre-training and fine-tuning strategies attains superior results, avoiding deterioration in both the spatial and spectral evaluations. Concretely, with the fine-tuning and pre-training strategies, the performance of the proposed AEGAN method is boosted by 0.7585 dB and 0.1874 dB in MPSNR and 0.0581 and 0.0294 in MSSIM, respectively. This reveals that the pre-training and fine-tuning strategies are beneficial for HSI SR and can remarkably improve the performance of the proposed approach.
(6) Ablation study on the attention-enhanced generative loss function: To explore the effectiveness of different combinations of loss terms on the proposed network, ablation studies are carried out on the pixel-wise spatial-based loss term (L_spa), the perceptual-based loss term (L_per), the adversarial-based loss term (L_G), the attention-based loss term (L_atten), and the SAM-based loss term (L_SAM).
For HSI, the pixel-wise spatial-based loss term and the SAM-based loss term can guarantee the fidelity and consistency of spatial-spectral structure information; therefore, the combination of L_spa and L_SAM is regarded as the base in this paper. The quantitative results using different combinations of loss terms on the Pavia Centre dataset at scale factor 4 are evaluated in Table 4. Clearly, combining the base with the adversarial-based loss term promotes the super-resolution performance, increasing MPSNR and MSSIM by 0.1457 dB and 0.0052, respectively. The addition of the perceptual loss term slightly improves the reconstruction performance of the network, boosting MPSNR by 0.0322 dB and making the reconstructed HSI perceptually closer to the actual HR HSI. As shown in Table 4, coupled with the attention-based loss term, the proposed method can acquire a high-quality HR HSI estimation. This can be attributed to the accurate reconstruction of the spectral information for each pixel and the finer expression of high-frequency spatial texture details.
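The weighted combination of the five loss terms can be sketched as a simple linear blend; note that the assignment of λ, µ, η, and ϕ to the individual terms below is an assumption based only on the stated hyperparameter values, not the exact form of Equation (12):

```python
def enhanced_loss(l_spa, l_per, l_g, l_atten, l_sam,
                  lam=0.005, mu=0.02, eta=0.02, phi=0.01):
    """Hypothetical weighted combination of the five loss terms, in the
    spirit of the attention-enhanced generative loss (Equation (12)).
    The term-to-coefficient mapping is an illustrative assumption."""
    return l_spa + lam * l_per + mu * l_g + eta * l_atten + phi * l_sam
```

With the small coefficients reported in the implementation details, the pixel-wise spatial term dominates the objective while the remaining terms act as regularizers on perception, realism, attention, and spectral consistency.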
Tables 5 and 6 list the evaluation metrics of different SR approaches on the three benchmark hyperspectral datasets with spatial upsampling factors of 2 and 4, respectively. It can be observed that the proposed AEGAN method outperforms all the other compared spatial SR approaches, producing SR results with the highest MPSNR and MSSIM and the lowest SAM and ERGAS. This demonstrates that the proposed AEGAN approach can well reconstruct the spatial texture and structural detail information of HSI.

To better distinguish the perceptual differences between the super-resolved HSIs, parts of the super-resolved HSIs generated by different methods are depicted in Figures 7 and 8. It can be clearly observed that the proposed approach is capable of producing visually better HR HSI with finer textures, fewer blurring artifacts, and less distortion. Compared with the ground truth, the results generated by Bicubic, SRCNN, and 3DFCN suffer from severe blurring artifacts. SSJSR, ERCSR, and HLNACNN produce results lacking detailed information in some locations and introduce undesired noise. The results of ESRGAN are over-sharpened and exhibit obvious distortion. By contrast, the proposed AEGAN approach can not only preserve the central structural information but also mitigate this distortion.

In addition, the reconstructed spectral curves of different SR approaches are depicted in Figure 9. One pixel position is randomly selected from each dataset for spectral distortion analysis and discussion. It can be easily observed that all of the reconstructed spectral curves are consistent with the shape of the ground truth. In a few cases, the SRCNN and ESRGAN approaches show a certain degree of deviation, i.e., small spectral distortion, while the proposed AEGAN approach is the closest to the ground truth, indicating its excellent performance in spectral information preservation.
We utilize the original codes of the compared methods to calculate the parameters and complexity. Table 7 comprehensively shows the parameters, FLOPs, and inference time of the different SR methods. It can be seen that our proposed method has the smallest computational burden.

Conclusions
In this paper, a new attention-enhanced generative adversarial network (AEGAN) for HSI spatial SR is proposed. The designed AEGAN contains an attention-enhanced generator architecture with an ESAM and an RSAM to effectively focus on and capture the more valuable and representative spatial-spectral characteristics of HSI. A special U-Net discriminator with spectral normalization is enclosed to stabilize the training and estimate the discriminative probability that the actual HR HSI is more realistic than the generated fake image. Meanwhile, an attention-enhanced generative loss function is utilized to train the proposed model and investigate the high correlation of spatial context and spectral information, for the purpose of producing more realistic and detailed HSIs. Furthermore, to better simulate the authentic degradation procedure, a high-order degradation model consisting of diverse degeneration estimations is also employed to produce the pseudo-real data for training. The experimental results on three benchmark HSI datasets illustrate its effectiveness and superiority in comparison with several existing state-of-the-art methods.
Although our proposed method exhibits advantages in hyperspectral image spatial super-resolution, it is still limited by the small number of data samples available for verifying network robustness. In future work, we plan to leverage the strengths of transformer models and integrate them into our network architecture to enhance its robustness and performance through extensive dataset training.