SSANet: An Adaptive Spectral–Spatial Attention Autoencoder Network for Hyperspectral Unmixing

Abstract: Convolutional neural-network-based autoencoders, which can integrate the spatial correlation between pixels well, have been broadly used for hyperspectral unmixing and have obtained excellent performance. Nevertheless, these methods are hindered in their performance by the fact that they treat all spectral bands and spatial information equally in the unmixing procedure. In this article, we propose an adaptive spectral–spatial attention autoencoder network, called SSANet, to solve the mixed pixel problem of the hyperspectral image. First, we design an adaptive spectral–spatial attention module, which refines spectral–spatial features by sequentially superimposing the spectral attention module and spatial attention module. The spectral attention module is built to select useful spectral bands, and the spatial attention module is designed to filter spatial information. Second, SSANet exploits the geometric properties of endmembers in the hyperspectral image while considering abundance sparsity. We significantly improve the endmember and abundance results by introducing minimum volume and sparsity regularization terms into the loss function. We evaluate the proposed SSANet on one synthetic dataset and four real hyperspectral scenes, i.e., Samson, Jasper Ridge, Houston, and Urban.


Introduction
Hyperspectral image (HSI) analysis has attracted a large amount of attention in the domain of remote sensing because of the rich content information contained in HSI [1,2]. Despite this, because of the inadequate spatial resolution of satellite sensors, atmospheric mixed effects, and complex ground targets, a pixel in an HSI typically includes multiple spectral features. Such pixels are known as "mixed pixels". The presence of a large quantity of mixed pixels causes serious issues for further research on HSI [3][4][5]. Hyperspectral unmixing (HU) aims to separate the mixed pixels into a set of pure spectral signatures (endmembers) and relative mixing coefficients (abundances) [6][7][8].
Recently, with its impressive learning ability and data fitting capability, deep learning (DL) has undergone rapid development in the HU domain [9,10]. The autoencoder (AE), which is a typical representation of unsupervised DL, has been extensively applied to HU tasks. The AE framework is mainly divided into two parts: the encoder, which aims to automatically learn the low-dimensional embeddings (i.e., abundances) of input pixels, and the decoder, which aims to reconstruct input pixels with the associated basis (i.e., endmembers) [11,12]. Moreover, to achieve satisfying unmixing performance, numerous refinements have been made to the existing AE-based unmixing framework. For example, Qu and Qi [13] developed a sparse denoising AE unmixing network that introduces denoising constraints and sparsity constraints to the encoder and decoder, respectively. Zhao et al. [14] presented an AE network that uses two constraints to optimize the spectral unmixing task. Min et al. [12] designed a joint metric AE framework, which uses the Wasserstein distance and feature matching as constraints in the objective function. Jin et al. [15] designed a two-stream AE architecture, which introduces a stream to solve the problem of lacking effective guidance for the endmembers. A deep matrix factorization model was developed in [16], which constructs a multilayer nonlinear structure and employs a self-supervised constraint. Ozkan et al. [17] proposed a two-staged AE architecture that combines spectral angle distance (SAD) with multiple regularizers as the final objective. Su et al. adopted stacked AEs to handle outliers and noise, and employed a variational AE to pose the proper constraint on abundances. An end-to-end unmixing framework was proposed in [18,19], which combines the benefits of learning-based and model-based approaches. 
However, these methods, which receive one mixed pixel at a time during training, only use the spectral information in an HSI, thereby ignoring the spatial correlation between neighboring pixels.
Importantly, an HSI contains both rich spectral feature information and a degree of spatial information [6]. Incorporating spatial correlation in the unmixing process has been confirmed to significantly improve unmixing performance [20,21]. Therefore, many researchers have introduced convolutional neural networks (CNNs) into the traditional AE structure to compensate for the absence of spatial features. For instance, Hong et al. [22] proposed a self-supervised spatial-spectral unmixing method, which incorporates an extra sub-network to guide the endmember information to obtain good unmixing results. Gao et al. [23] developed a cycle-consistency unmixing architecture and designed a self-perception loss to refine the detailed information. Rasti et al. [24] proposed a minimum simplex CNN unmixing approach that incorporates the spatial contextual structure and exploits the geometric properties of endmembers. Ayed et al. [25] presented an approach that uses extended morphological profiles, which combines the spatial correlation between pixels. In [26], a Bayesian fully convolutional framework was developed, which considers the noise, endmembers, and spatial information. Most recently, a perceptual loss-constrained adversarial AE was designed in [27], which takes into account factors such as reconstruction errors and spatial information. Hadi et al. [28] presented a hybrid 3-D and 2-D architecture to leverage the spectral and spatial features. A dual branch AE framework was constructed in [29] to incorporate spatial-contextual information.
Although the above CNN-based AEs achieve satisfactory unmixing results, how to adaptively adjust the weights of the spectral and spatial features that influence unmixing performance remains a challenge. Humans can distribute their finite resources to the parts that are most significant, informative, or salient. Inspired by visual attention mechanisms, we propose a spectral-spatial attention AE network for HU and introduce a spectral-spatial attention module (SSAM) to strengthen useful information and suppress unnecessary information. Additionally, the absence of abundance sparsity and endmember geometric information also limits unmixing performance. Thus, we combine a minimum volume constraint and a sparsity constraint in the loss function. Specifically, the primary contributions of our proposed SSANet are as follows:

1.
We design an unsupervised unmixing network, which is based on a combination of a learnable SSAM and convolutional AE. The SSAM plays two roles. First, the spectral attention module (SEAM) adaptively learns the weights of spectral bands in input data to enhance the representation of spectral information. Second, the spatial attention module (SAAM) adaptively yields the attention weight assigned to each adjacent pixel to derive useful spatial information.

2.
We combine the prior knowledge that two regularizers (minimum volume regularization and sparsity regularization) are applied to endmembers and abundances, respectively. Additionally, to acquire high-quality endmember spectra, we design a new minimum volume constraint.

3.
We apply the proposed unmixing network to one synthetic dataset and four real hyperspectral scenes-i.e., Samson, Jasper Ridge, Houston, and Urban-and compare it with several classical and advanced approaches. Furthermore, we investigate the performance gain of SSANet with ablation experiments, involving the objective functions and network modules.
The remainder of this paper is structured as follows: In Section 2, we briefly describe the theoretical background of the AE-based unmixing approach. In Section 3, we explain the SSANet method in detail. In Section 4, we evaluate SSANet using synthetic and real datasets. In Section 5, we summarize the study.

AE-Based Unmixing Model
In the linear mixing model (LMM) [30], the observed spectral reflectance can be given by

Y = EA + N, (1)

where Y = {y_i | i = 1, 2, ..., P} ∈ R^(B×P) denotes the observed HSI with B bands and P pixels, and y_i denotes the ith pixel. N ∈ R^(B×P) denotes an additive noise matrix. E = {e_k | k = 1, 2, ..., R} ∈ R^(B×R) denotes the endmember matrix with R endmember signatures and needs to satisfy the nonnegative constraint. A = {a_i | i = 1, 2, ..., P} ∈ R^(R×P) is the corresponding abundance matrix, where a_i denotes the abundance percentage of the ith pixel and should be subject to the abundance nonnegative constraint (ANC) and abundance sum-to-one constraint (ASC), that is,

a_i ≥ 0, 1^T a_i = 1. (2)

The fundamental workflow of classic AE unmixing is shown in Figure 1 and is mainly divided into two parts.
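The LMM above can be sketched numerically. The endmember matrix, abundances, and dimensions below are random stand-ins for illustration, not data from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
B, R, P = 50, 4, 100                      # bands, endmembers, pixels

E = rng.random((B, R))                    # nonnegative endmember matrix
A = rng.random((R, P))
A /= A.sum(axis=0, keepdims=True)         # enforce ASC: each column sums to one
N = 0.01 * rng.standard_normal((B, P))    # additive noise

Y = E @ A + N                             # observed HSI, shape (B, P)

assert np.all(A >= 0)                     # ANC holds by construction
assert np.allclose(A.sum(axis=0), 1.0)    # ASC holds by construction
print(Y.shape)
```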
(1) An encoder En(·) transforms the input data {y_i}_{i=1}^P ∈ R^B into a hidden representation h_i, which can be described as

h_i^(e) = f(W^(e) h_i^(e−1) + b^(e)), with h_i^(0) = y_i, (3)

where W^(e) and b^(e) denote the weight and bias of the eth encoder layer, respectively, and f(·) denotes the nonlinear activation function.
(2) A decoder De(·) reconstructs the data {ŷ_i}_{i=1}^P ∈ R^B using h_i, which is formalized as

ŷ_i = W^(d) h_i, (4)

where W^(d) is a matrix that denotes the weights between the hidden and output layers. Because of the characteristic of Equation (4), the output of En(·) is considered as the predicted abundance vector, that is, â_i ← h_i, and the estimated endmembers are represented by the weights of De(·), that is, Ê ← W^(d). In this framework, the reconstruction loss of the training process is mathematically formulated as

Loss_RE = (1/P) Σ_{i=1}^P ||y_i − ŷ_i||_2^2. (5)
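This encoder/decoder workflow can be sketched with a single dense layer (the paper's actual network is convolutional; layer sizes and the softmax placement here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
B, R, P = 50, 4, 100                       # bands, endmembers, pixels

def softmax(x, axis=0):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

Y = rng.random((B, P))                     # input pixels
W_e = rng.standard_normal((R, B)) * 0.1    # encoder weights
b_e = np.zeros((R, 1))
W_d = rng.random((B, R))                   # decoder weights = estimated endmembers

H = softmax(W_e @ Y + b_e)                 # hidden codes: ANC/ASC by construction
Y_hat = W_d @ H                            # linear decoder reconstructs pixels

A_hat = H                                  # abundance estimate,  â_i <- h_i
E_hat = W_d                                # endmember estimate,  Ê <- W^(d)
loss = np.mean(np.sum((Y - Y_hat) ** 2, axis=0))   # reconstruction loss
print(round(loss, 4))
```

Here the softmax activation makes the hidden representation satisfy the two abundance constraints directly, which is why it can be read off as the abundance estimate.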

Spectral-Spatial Attention Unmixing Network
To leverage the spectral and spatial information in HSI, we first divide the HSI Y into a set of 3-D neighboring patches M = {m_i | i = 1, 2, ..., P} ∈ R^(s×s×B), where s is the width of the patches. In SSANet, each patch m_i in M is fed into the proposed network. In each patch m_i, the central pixel y_i is the target pixel to be unmixed. The framework of SSANet is shown in Figure 2. Its structure consists of three core components: the SSAM, encoder, and decoder. The SSAM, which aims to provide meaningful spectral-spatial priors, helps to solidify feature extraction at later stages. The encoder is designed to extract features and reduce dimensionality. The role of the decoder is to reconstruct the learned features according to the LMM. We provide details on the aforementioned components in Sections 3.1, 3.2, and 3.3, respectively.


Spectral-Spatial Attention Module
The SSAM contains two core modules-that is, the SEAM and SAAM-which are arranged sequentially to perform the selection of spectral bands and spatial features in the HSI, respectively. We describe the SEAM and SAAM in the following.

Spectral Attention Module
The SEAM [31] is introduced into the SSANet, aiming to adaptively learn the weights of spectral bands in the HSI in an end-to-end manner. It generates a spectral weight vector that reflects the significance of different spectral bands. The spectral bands modulated by this vector can significantly improve unmixing performance. The framework of the SEAM is shown in Figure 3.

Given the input m_i ∈ R^(s×s×B), first, global max pooling (GMP) and global average pooling (GAP) are used to acquire spectral feature vectors α_i ∈ R^(1×1×B) and β_i ∈ R^(1×1×B), respectively. Next, the corresponding weight vectors γ_i ∈ R^(1×1×B) and δ_i ∈ R^(1×1×B) can be derived using a multilayer perceptron (MLP) that can extract the weight information of each band. γ_i and δ_i are then summed, and the sigmoid function is applied to obtain the spectral weight coefficients v_i ∈ R^(1×1×B). The spectral attention formulation can be defined as

v_i = σ(MLP(GMP(m_i)) + MLP(GAP(m_i))), (6)

where σ(·) denotes the sigmoid function. Finally, the output of the SEAM, m'_i, is calculated by the following equation:

m'_i = v_i ⊗ m_i, (7)

where ⊗ denotes elementwise (band-wise) multiplication.
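The spectral attention computation described above can be sketched as follows. The shared two-layer MLP and its hidden size (a reduction ratio of 2) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
s, B = 5, 16
m = rng.random((s, s, B))                 # input patch m_i

# shared MLP weights; hidden size B // 2 is an assumed reduction ratio
W1 = rng.standard_normal((B, B // 2)) * 0.1
W2 = rng.standard_normal((B // 2, B)) * 0.1
mlp = lambda x: np.maximum(x @ W1, 0) @ W2
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

alpha = m.max(axis=(0, 1))                # GMP -> spectral feature vector (B,)
beta = m.mean(axis=(0, 1))                # GAP -> spectral feature vector (B,)
v = sigmoid(mlp(alpha) + mlp(beta))       # spectral weight coefficients

m_out = m * v                             # recalibrate each spectral band
print(m_out.shape)
```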

Spatial Attention Module
In this part, we design the SAAM to evaluate the adjacent dependence between pixels. Similar to the SEAM, the SAAM also learns in an end-to-end manner and adaptively selects spatial features from the pixels in the neighborhood. The module generates a spatial weight matrix that expresses the importance of adjacent pixels. The recalibration of spatial features using this matrix leads to an obvious improvement in the unmixing accuracy. The framework of the SAAM is shown in Figure 4.

Specifically, given the input m'_i ∈ R^(s×s×B), in order to facilitate the calculation of the similarity between neighboring pixels and the central pixel, the input m'_i is reshaped into g_i ∈ R^(ss×B) (ss = s × s). The center pixel g_center ∈ R^(1×1×B) is extracted from the center of m'_i; then, g_center is reshaped into g_tag ∈ R^(1×B). Next, both g_i and g_tag are fed into the scoring function ρ(·) to compute the spatial similarity scores between them. The ρ(·) is produced as follows:

h_i = g_i g_tag^T, (8)

where h_i is used to compute the correlation between g_i and g_tag. ρ(·) is implemented by a fully connected layer, parameterized by a weight matrix W ∈ R^(ss×ss). The spatial similarity scores are derived by multiplying h_i with W, and the results are activated by a rectified linear unit (ReLU) function φ(·). Subsequently, a sigmoid function is adopted to compute the spatial weight matrix ω_i ∈ R^(s×s×1). Finally, we perform elementwise multiplication of ω_i with m'_i to implement the recalibration of spatial information:

m''_i = ω_i ⊗ m'_i, (9)

where m''_i represents the output of the SAAM.
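The SAAM steps above can be sketched as follows. The correlation between each neighboring pixel and the central pixel is implemented here as an inner product, which is an assumption consistent with the description:

```python
import numpy as np

rng = np.random.default_rng(3)
s, B = 5, 16
ss = s * s
m = rng.random((s, s, B))                    # patch fed to the SAAM (SEAM output)

g = m.reshape(ss, B)                         # neighboring pixels g_i
g_tag = m[s // 2, s // 2].reshape(1, B)      # central (target) pixel g_tag

h = g @ g_tag.T                              # correlation scores, shape (ss, 1)
W = rng.standard_normal((ss, ss)) * 0.1      # fully connected layer of rho(.)
scores = np.maximum(W @ h, 0)                # ReLU-activated similarity scores
omega = 1.0 / (1.0 + np.exp(-scores))        # sigmoid -> spatial weights
omega = omega.reshape(s, s, 1)               # spatial weight matrix

m_out = m * omega                            # elementwise spatial recalibration
print(m_out.shape)
```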

Encoder
As shown in Figure 2, the encoder consists of four convolutional layers, and the number of convolution kernels diminishes with the depth of the layer, which can be formulated as

x_i^(e) = DO(LR(BN(W_e ∗ x_i^(e−1) + b_e))), e = 1, 2, 3, 4, (10)

where x_i^(0) denotes the SSAM output, W_e and b_e denote the weights and biases, respectively, at the eth level of the encoder, and ∗ denotes the convolution operation. BN(·) represents batch normalization, which is used to enhance the performance and stability of the network and speed up its learning. LR(·) denotes the leaky ReLU (LReLU) function, which aims to promote nonlinearity. DO(·) represents the dropout function, which is currently a key technique for preventing network overfitting. The purpose of the softmax function applied after the final layer is to satisfy the two physical constraints on abundance: the ANC and ASC.

Decoder
The decoder contains a 1 × 1 convolutional layer and uses LReLU as the activation function. It is formulated as

ŷ_i = LR(W ∗ â_i + b), (11)

where W and b denote the weights and biases of the decoder, respectively. It should be noted that, in our experiments, to help the training of the decoder, we used the endmembers extracted using the vertex component analysis (VCA) [32] approach to initialize the weights W.

Objective Functions
The overall loss function of SSANet consists of the following three terms. Numerous AE-based works have adopted the SAD with its scale invariance as the reconstruction loss [33,34]. Therefore, we apply the SAD measurement as the reconstruction loss of SSANet, which is denoted as follows:

Loss_SAD = (1/P) Σ_{i=1}^P arccos( (y_i^T ŷ_i) / (||y_i||_2 ||ŷ_i||_2) ). (12)

The softmax function does not yield sparse abundance maps. Qian et al. [35] demonstrated that using the l_1/2 norm yields more accurate and sparser abundance results than using the l_1 norm. We apply the l_1/2 norm to the abundance vector â_ik, which is formulated as

Loss_sparse = Σ_{i=1}^P Σ_{k=1}^R |â_ik|^(1/2), (13)

where â_ik represents the reference abundance fractional proportion of the kth endmember at the ith pixel in the HSI. The minimum volume regularizer has already been proven to be beneficial for extracting endmembers [36]. Moreover, to make the estimated endmembers close to the observed spectrum, we design a more reasonable minimum volume constraint, denoted by

Loss_Mv = Σ_{k=1}^R ||e_k − ē||_2^2, (14)

where ē = (1/R) Σ_{k=1}^R e_k denotes the centroid vector. A geometrical explanation of this concept is shown in Figure 5. During each iteration, by minimizing Loss_Mv, the endmembers are pulled from the initial values (i.e., the vertices of the initial data simplex) to the vertices of the real data simplex.
To summarize, the overall loss function of SSANet is expressed as

Loss = Loss_SAD + λ Loss_sparse + μ Loss_Mv, (15)

where λ and μ represent the regularization parameters.
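The three loss terms can be sketched directly from their textual descriptions. The names λ and μ for the two regularization weights, and their values below, are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
B, R, P = 50, 4, 100
Y = rng.random((B, P))                          # observed pixels
Y_hat = Y + 0.01 * rng.standard_normal((B, P))  # reconstructed pixels
A_hat = rng.random((R, P))
A_hat /= A_hat.sum(axis=0, keepdims=True)       # estimated abundances
E_hat = rng.random((B, R))                      # estimated endmembers

def sad_loss(y, y_hat):
    """Mean spectral angle distance between columns of y and y_hat."""
    cos = np.sum(y * y_hat, axis=0) / (
        np.linalg.norm(y, axis=0) * np.linalg.norm(y_hat, axis=0))
    return np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))

l_half = np.sum(np.sqrt(np.abs(A_hat)))         # l_1/2 sparsity on abundances

e_bar = E_hat.mean(axis=1, keepdims=True)       # centroid vector of endmembers
mv = np.sum((E_hat - e_bar) ** 2)               # minimum-volume term

lam, mu = 1e-3, 1e-4                            # assumed regularization weights
total = sad_loss(Y, Y_hat) + lam * l_half + mu * mv
print(total)
```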

Synthetic Data
We created simulated data according to the approach adopted by Fang et al. [39]. Its size was 104 × 104 pixels, distributed over 200 spectral bands, with four endmembers. Each pixel in this image was a mixture of the four endmembers. We generated these mixed pixels by multiplying the four endmembers and four abundance maps according to the LMM. First, we created abundance maps decomposed into 8 × 8 homogeneous blocks, each randomly assigned to one of the endmember categories. Then, we degraded the blocks by applying a 9 × 9 spatial low-pass filter. Next, we added zero-mean Gaussian noise with various signal-to-noise ratios (SNRs) to the obtained synthetic dataset. Because the noise variance differs between bands, we assigned different SNR values to different bands and obtained band-related SNR values from the baseline Indian Pines image. We assumed that the obtained SNR vector s was centralized and normalized; then, we could acquire the synthetic SNR n based on the rule n = βs + r, where β is the fluctuation amplitude of the band-related SNR values and r is the center value that defines the total SNR of all bands. To investigate the robustness of our approach to various noise levels, we simulated three datasets with various noise values (SNR = 20, 30, 40 dB) by fixing β = 5 and varying r.
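The band-dependent SNR rule n = βs + r above can be sketched as follows. The baseline SNR profile here is random, standing in for the Indian Pines-derived values:

```python
import numpy as np

rng = np.random.default_rng(5)
B, P = 200, 104 * 104
X = rng.random((B, P))                       # clean synthetic mixtures

s = rng.standard_normal(B)                   # baseline band-related SNR profile
s = (s - s.mean()) / s.std()                 # centralized and normalized

beta, r = 5.0, 30.0                          # fluctuation amplitude, center SNR
n = beta * s + r                             # per-band SNR in dB

# convert per-band SNR to per-band noise power and add Gaussian noise
sig_power = np.mean(X ** 2, axis=1)
noise_power = sig_power / (10.0 ** (n / 10.0))
N = rng.standard_normal((B, P)) * np.sqrt(noise_power)[:, None]
Y = X + N
print(Y.shape)
```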

Samson Data
Samson data have three constituent materials: soil, trees, and water. This dataset was captured by the Samson sensor. The image contains 156 spectral channels ranging from 0.4 to 0.9 µm. Because the original image is large, we selected a subimage of the original data with a size of 95 × 95 pixels.

Jasper Ridge Data
Jasper Ridge data have four main materials: trees, water, soil, and roads. This dataset was obtained by the AVIRIS sensor. The original HSI covers 512 × 614 pixels in size and is spread over 224 spectral channels, covering wavelengths from 0.38 to 2.5 µm. It has a spatial resolution of 20 m/pixel. We selected an area of interest of 100 × 100 pixels and removed bands (1-3, 108-112, 154-166, and 220-224) to alleviate the influences of the atmosphere and water vapor. Finally, the Jasper Ridge dataset had 198 remaining bands.

Houston Data
Houston data have four dominant materials: parking lot 1, running track, healthy grass, and parking lot 2. The data were originally used in the 2013 IEEE GRSS data fusion competition. The original HSI contains 349 × 1905 pixels, distributed over 144 channels ranging from 0.35 to 1.05 µm. Its spatial resolution is 2.5 m/pixel. We selected a subimage containing 170 × 170 pixels. The subimage is centered on Robertson Stadium on the Houston campus.

Urban Data
Urban data have four constituent materials: asphalt, grass, tree, and roof. This dataset, collected by the HYDICE sensor, is characterized by a complex distribution. Its pixel resolution is 307 × 307, and there are 210 spectral bands ranging from 0.4 to 2.5 µm. It has a spatial resolution of 2 m/pixel. After we removed the contaminated bands, 162 bands remained.

Evaluation Metrics
We selected two commonly used evaluation metrics, the root mean square error (RMSE) and the SAD, to assess the proposed method. These two indices are defined as

SAD_k = arccos( (e_k^T ê_k) / (||e_k||_2 ||ê_k||_2) ), (16)

RMSE = sqrt( (1/(P·R)) Σ_{i=1}^P Σ_{k=1}^R (a_ik − â_ik)^2 ), (17)

where e_k and ê_k are the real endmember and extracted endmember, respectively, and a_i and â_i are the real abundance and predicted abundance, respectively. For both evaluation metrics, the lower the value, the better the corresponding unmixing results.
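The two metrics translate directly into code. The RMSE averaging convention below is one common choice; the paper's exact normalization may differ:

```python
import numpy as np

def sad(e, e_hat):
    """Spectral angle between a true and an estimated endmember."""
    cos = e @ e_hat / (np.linalg.norm(e) * np.linalg.norm(e_hat))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def rmse(a, a_hat):
    """Root mean square error between true and predicted abundances."""
    return np.sqrt(np.mean((np.asarray(a) - np.asarray(a_hat)) ** 2))

e = np.array([1.0, 2.0, 3.0])
assert sad(e, 2 * e) < 1e-6          # SAD is invariant to spectral scaling
assert rmse(e, e) == 0.0             # perfect prediction gives zero error
```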

Hyperparameter Settings
In our experiments, we assumed that the number of endmembers R was known in advance, as determined by HySime [46]. In the training phase, we initialized the decoder with the endmembers extracted by VCA. We implemented our proposed SSANet in the environment of PyTorch 1.6 with an i7-8550U CPU. We applied the Adam optimizer to optimize the parameters. The selection of specific parameters for the proposed SSANet is displayed in Table 1. Figure 7 shows the convergence curves of the proposed SSANet during the learning process.


Experiments with Synthetic Data
To study the robustness of SSANet to noise, we added zero-mean Gaussian noise with SNRs of 20, 30, and 40 dB to the synthetic dataset. Figure 8 shows the quantitative analysis results with varying SNR levels. Generally, SSANet achieved better (i.e., lower) SAD and RMSE results than the other methods, at both low and high SNRs. SGSNMF performed well when the noise intensity was relatively low; at high noise levels, its performance deteriorated severely. CNNAEU and CyCU-Net could not obtain the desired performance at various noise levels. The reason is that, despite the introduction of spatial information, CNNAEU and CyCU-Net suffer from noise sensitivity because of insufficient spectral feature representation capability. For MiSiCNet, the image prior aimed to solve the degradation problem; as a result, MiSiCNet achieved relatively good results under low noise conditions. Other methods, such as DAEU and MTAEU, often obtained satisfactory results because of the introduction of abundance sparsity and spectral-spatial priors, respectively. The performance of SSANet did not degrade severely as noise levels increased. The overall performance at various noise levels verified the robustness of SSANet to noise, which mainly resulted from the combination of the attention mechanism and the associated physical properties. The visualization results of the abundances and endmembers for the synthetic data (SNR = 40 dB) are shown in Figures 9 and 10, respectively. The experimental results indicated that our method successfully obtained relatively good results.

Experiments with Samson Data
The quantitative results for Samson are shown in Tables 2 and 3. Notably, our proposed SSANet outperformed the other methods in terms of the mean SAD and mean RMSE. Additionally, compared with the suboptimal results, these two metrics were lowered by 16% and 69%, respectively. Figures 11 and 12 show the abundances and endmembers estimated by all the methods. Figure 11 shows that VCA-FCLS and SGSNMF performed relatively poorly, confusing soil and trees. By contrast, the DL-based methods confused nothing and distinguished each material more accurately, which demonstrates the advantage of the DL methods. However, the abundance results of these methods at the junction of two different materials were not ideal, whereas our method retained rich edge information and appeared much clearer visually. This may be the result of a moderate application of sparsity regularization, in addition to spatial attention. As shown in Figure 12, all methods achieved good performance. However, because SSANet took into account the geometric information of endmembers, in addition to the utilization of spectral attention to enhance the effective spectral bands, the extracted water endmember was greatly superior to that of the competing methods. The superior performance further validated the effectiveness and reliability of SSANet.

Table 2. RMSE (×100) and mean RMSE (×100) of abundances acquired by various unmixing approaches on Samson data. Annotation: bold red text indicates the best results and bold blue text indicates the suboptimal results.

Table 3. SAD (×100) and mean SAD (×100) of endmembers acquired by various unmixing approaches on Samson data. Annotation: bold red text indicates the best results and bold blue text indicates the suboptimal results.
Table 5 shows that although SSANet did not achieve the best results for each material, it ranked first with respect to the mean SAD. Figure 14 also shows that the endmembers obtained by SSANet were close to the GT. In Figure 13, the abundance maps generated by SSANet look much sharper. In the Jasper dataset, roads occupy a small portion of the scene. For material roads, estimating the abundances and endmembers is more challenging than for other materials because of the complex distribution. Numerous methods estimate unsatisfactory abundances and  Tables 4 and 5 show the quantitative results for Jasper Ridge. The visualization results of abundances and endmembers are presented in Figures 13 and 14, respectively. As shown in Table 4, for RMSE of each material, our SSANet lowered by 56%, 51%, 45%, and 57%, respectively, compared with the suboptimal results. Table 5 shows that although SSANet did not achieve the best results for each material, it ranked first with respect to the mean SAD. Figure 14 also shows that the endmembers obtained by SSANet were close to the GT. In Figure 13, the abundance maps generated by SSANet look much sharper. In the Jasper dataset, roads occupy a small portion of the scene. For material roads, estimating the abundances and endmembers is more challenging than for other materials because of the complex distribution. Numerous methods estimate unsatisfactory abundances and fail to completely separate roads, whereas SSANet separated roads more accurately because of the application of the abundance sparsity and the geometric feature of endmembers. Additionally, in both a heavily mixed area (soil) and homogeneous area (water), SSANet obtained superior separation results because of its powerful learning capability that fully integrated useful spectral and spatial information. Table 4. RMSE (×100) and mean RMSE (×100) of abundances acquired by various unmixing approaches on Jasper Ridge data. 
Annotation: bold red text indicates the best results and bold blue text indicates the suboptimal results.  Table 5. SAD (×100) and mean SAD (×100) of endmembers acquired by various unmixing approaches on Jasper Ridge data. Annotation: bold red text indicates the best results and bold blue text indicates the suboptimal results.  Table 5. SAD (×100) and mean SAD (×100) of endmembers acquired by various unmixing approaches on Jasper Ridge data. Annotation: bold red text indicates the best results and bold blue text indicates the suboptimal results.
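The two metrics used throughout these comparisons are standard. As a minimal sketch (our own illustration, not the paper's evaluation code), the SAD between an estimated and a reference endmember spectrum, and the RMSE between estimated and reference abundances, can be computed as:

```python
import numpy as np

def sad(m_est, m_true):
    """Spectral angle distance (radians) between two endmember spectra."""
    cos = np.dot(m_est, m_true) / (np.linalg.norm(m_est) * np.linalg.norm(m_true))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards against rounding

def rmse(a_est, a_true):
    """Root-mean-square error between estimated and reference abundances."""
    return np.sqrt(np.mean((a_est - a_true) ** 2))

# Toy check: identical spectra give SAD ~ 0; a constant 0.1 offset gives RMSE 0.1.
spectrum = np.array([0.2, 0.5, 0.7, 0.4])
print(sad(spectrum, spectrum))          # ~0.0
print(rmse(spectrum, spectrum + 0.1))   # 0.1 (up to floating-point error)
```

Lower is better for both; the tables report these values scaled by 100.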

Experiments with Houston Data
The quantitative analysis results for the Houston dataset are shown in Tables 6 and 7. Figures 15 and 16 show the qualitative results of the acquired abundance maps and endmembers, respectively. Clearly, with respect to both the RMSE and SAD, the results obtained by methods based on spectral-spatial information (MTAEU, MiSiCNet, and SSANet) were better than those obtained by methods that used only spectral information (DAEU and CyCU-Net). These results provide further confirmation that the full utilization of spectral-spatial features is advantageous for enhancing the precision of HU. Although SSANet did not acquire the best SAD results for each endmember, its mean SAD was the optimal result. Moreover, SSANet achieved the best results for all abundances with respect to the RMSE. Importantly, Figure 15 shows that all the other methods performed poorly in distinguishing similar materials (i.e., parking lot1 and parking lot2), whereas our method distinguished spectrally similar materials relatively easily, facilitated by the attention mechanism selecting useful spectral-spatial features and suppressing useless ones. In conclusion, the combined RMSE and SAD evaluation demonstrated the good performance of SSANet in real scenes containing similar substances.

Experiments with Urban Data
Tables 8 and 9 show the quantitative metric comparisons for the Urban dataset. Figures 17 and 18 visualize the results of the abundances and endmembers, respectively. A feature of this dataset is its complex distribution, and mixed pixels are broadly distributed in this scene. It is worth noting that SSANet achieved the best mean and individual RMSE, and its mean RMSE was 11% lower than that of the suboptimal method. Additionally, the individual SAD obtained by SSANet was also competitive. Figure 17 shows that endmember mixing appeared for VCA-FCLS and SGCNMF, which resulted in poor results. CyCU-Net and MiSiCNet also achieved poor qualitative and quantitative performance. Although DAEU, MTAEU, and CNNAEU were able to distinguish each material, there were some errors in the details, which were related to the absence of useful adjacency information and a sparsity prior. SSANet therefore adopted spatial attention, which assigns weights to neighboring pixels, in addition to the sparsity regularizer, to make the abundance maps look smooth and realistic. Figure 18 shows that the proposed SSANet acquired endmember maps visually similar to the GT. However, because the roof endmember accounted for a small percentage of this large-scale scene, there were some gaps in the roof endmember obtained by SSANet; the overall results nevertheless remained competitive. The superior unmixing results confirm the reliability of SSANet in highly mixed scenes.

Table 8. RMSE (×100) and mean RMSE (×100) of abundances acquired by different unmixing approaches on Urban data. Annotation: bold red text indicates the best results and bold blue text indicates the suboptimal results.

Table 9. SAD (×100) and mean SAD (×100) of endmembers acquired by different unmixing approaches on Urban data. Annotation: bold red text indicates the best results and bold blue text indicates the suboptimal results.
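The channel-reweighting idea behind the spectral attention discussed above can be illustrated with a minimal squeeze-and-excitation-style gate. This is a simplified sketch with hypothetical layer sizes and random weights, not the exact SEAM architecture described earlier in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_attention(x, w1, w2):
    """Reweight the spectral bands of a cube x of shape (H, W, B)."""
    squeeze = x.mean(axis=(0, 1))                 # (B,): global average pool per band
    hidden = np.maximum(squeeze @ w1, 0.0)        # bottleneck layer + ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # sigmoid -> per-band weights in (0, 1)
    return x * gate, gate                         # broadcast gate over spatial dims

bands, reduced = 8, 2                             # hypothetical sizes for illustration
x = rng.random((4, 4, bands))                     # toy 4x4 patch with 8 bands
w1 = rng.standard_normal((bands, reduced))
w2 = rng.standard_normal((reduced, bands))
y, gate = spectral_attention(x, w1, w2)
print(y.shape, gate.min() > 0.0, gate.max() < 1.0)  # (4, 4, 8) True True
```

In the trained network, the learned gate amplifies informative bands and suppresses noisy ones; the spatial attention module applies the same idea per pixel location instead of per band.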

Discussion
The qualitative and quantitative analysis of four real hyperspectral scenes showed that our SSANet substantially improved unmixing performance. Because the distribution of real scenes may not fulfill the prior distribution assumption, VCA-FCLS and SGCNMF performed relatively poorly on the real datasets compared with the DL-based methods, which again indicates the advantage of DL methods for the unmixing task. DAEU is an AE framework that does not incorporate spatial information; therefore, its overall performance was not favorable. However, DAEU obtained satisfactory results in abundance estimation because its design exploits abundance sparsity in the form of adaptive thresholds. Additionally, the lack of the ASC led to the poor performance of CyCU-Net in the reconstruction process. MTAEU and CNNAEU used spatial correlation, but their objective functions contained only the SAD reconstruction term and imposed no regularizers on the endmembers and abundances, which led to greater variances in endmember extraction and abundance estimation. MiSiCNet considered spatial information and used the geometric information of endmembers; the utilization of geometric properties allowed it to achieve competitive performance in endmember estimation, but it did not leverage the relevant properties of the abundances, which limited its unmixing performance. Although MTAEU, CNNAEU, and MiSiCNet combined spectral-spatial priors and achieved relatively good unmixing performance, their results were ultimately limited by their inability to select the most useful spectral-spatial features and their failure to consider both the geometric property of the endmembers and the abundance sparsity. To address these problems, our approach uses SSAM to enhance useful information and suppress useless information, in addition to imposing a minimum volume regularizer and a sparsity regularizer on the endmembers and abundances, respectively.
Therefore, our method obtained good unmixing accuracy. In conclusion, the overall experimental performance on four real-world HSIs illustrates the effectiveness and superiority of our method.
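The ASC mentioned above (together with the abundance non-negativity constraint) is commonly enforced in AE-based unmixing by applying a softmax to the encoder output. The following is a minimal sketch of that common practice, not necessarily the exact layer used in SSANet:

```python
import numpy as np

def softmax_abundances(logits):
    """Map unconstrained encoder outputs (N pixels x P endmembers) to
    abundances that are non-negative (ANC) and sum to one per pixel (ASC)."""
    z = logits - logits.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Two toy pixels, three endmembers: rows of the result sum to exactly one.
logits = np.array([[2.0, 1.0, 0.1],
                   [-1.0, 0.0, 3.0]])
a = softmax_abundances(logits)
print(a.sum(axis=1))   # [1. 1.]
print((a >= 0).all())  # True
```

Satisfying both constraints by construction means the network never has to learn them from data, which is one reason their absence hurts reconstruction quality.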

Ablation Study on Objective Functions
We selected the Jasper Ridge scene as an example to evaluate the contribution of the various parts of the objective function. Table 10 shows the results of the quantitative analysis of the ablation study. We observed that using the SAD reconstruction loss alone was sufficient to perform the HU task, but with limited accuracy, whereas incorporating appropriate regularization greatly improved the unmixing performance. The sparsity term exploits an inherent property of real scenes and guarantees the sparsity of the abundance results. Moreover, the minimum simplex volume constraint exploits the geometric information of the HSI and is beneficial for endmember extraction. To summarize, each regularization term contributed to the results, and the optimal performance was obtained by combining all of them.
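The three objective terms compared in this ablation can be sketched as follows. This is a simplified illustration under assumed choices (an L1 sparsity penalty, a log-determinant surrogate for the simplex volume, and hypothetical weights 0.1 and 0.01); the exact terms and weighting in SSANet are given earlier in the paper:

```python
import numpy as np

def sad_loss(x, x_hat):
    """Mean spectral angle between the pixels of X (B x N) and its reconstruction."""
    cos = np.sum(x * x_hat, axis=0) / (
        np.linalg.norm(x, axis=0) * np.linalg.norm(x_hat, axis=0))
    return np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))

def sparsity_loss(a):
    """L1 penalty encouraging sparse abundances."""
    return np.abs(a).mean()

def volume_loss(m):
    """Simplex-volume surrogate on endmembers M (B x P): penalize the spread
    of the endmembers around their mean via a log-determinant of the Gram matrix."""
    centered = m - m.mean(axis=1, keepdims=True)
    gram = centered.T @ centered + 1e-6 * np.eye(m.shape[1])  # keep it invertible
    return np.linalg.slogdet(gram)[1]

rng = np.random.default_rng(1)
m = rng.random((20, 3))               # 20 bands, 3 endmembers
a = rng.dirichlet(np.ones(3), 50).T   # abundances (3 x 50), columns sum to 1
x = m @ a                             # linear mixing model: X = M A
total = sad_loss(x, m @ a) + 0.1 * sparsity_loss(a) + 0.01 * volume_loss(m)
print(np.isfinite(total))  # True
```

Minimizing the volume term pulls the endmember simplex tight around the data cloud, while the sparsity term pushes each pixel toward a few dominant materials; the ablation in Table 10 measures each effect in isolation.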

Ablation Study on Network Modules
In order to test whether both SEAM and SAAM improve the results, this section reports ablation experiments on the Jasper Ridge scene. We compared SSANet with SSANet without SSAM (SSANet-None), SSANet with only SEAM (SSANet-SEAM), and SSANet with only SAAM (SSANet-SAAM). The results are shown in Table 11, from which it can be seen that SSANet with both SEAM and SAAM removed yielded the worst unmixing performance. Introducing either SEAM or SAAM into the proposed AE model produced a clear improvement in the estimation of endmembers and abundances. Consequently, combining SEAM and SAAM was necessary to achieve superior performance.

Processing Time
Table 12 shows the processing time, in seconds, of all the unmixing approaches applied to the Jasper Ridge dataset. We ran all the experiments on a computer with a 3.6 GHz Intel Core i7-7820X CPU and an NVIDIA GeForce RTX 1080 16 GB GPU. We implemented VCA-FCLS and SGCNMF in MATLAB R2016a; DAEU, MTAEU, and CNNAEU on the TensorFlow platform; and CyCU-Net, MiSiCNet, and SSANet on the PyTorch platform. The proposed SSANet is not the quickest, but its time consumption was relatively satisfactory.

Conclusions
In this article, we presented a convolutional AE unmixing network called SSANet, which effectively uses the spectral-spatial information in HSIs. First, we proposed a learnable SSAM, which refines spectral-spatial features by sequentially superimposing the SEAM and SAAM. This module strengthens high-information features and weakens low-information features through weighted feature learning. Second, we exploited the sparsity of abundances and the geometric properties of endmembers by adding a sparsity constraint term and a minimum volume constraint term to the loss function, yielding sparse abundance results and accurate endmembers. We verified the effectiveness and robustness of SSANet experimentally by comparing it with several classical and advanced HU approaches on synthetic and real scenes.