Attention-Based Residual Network with Scattering Transform Features for Hyperspectral Unmixing with Limited Training Samples

Abstract: This paper proposes a framework for hyperspectral unmixing that uses the scattering transform to extract deep features, which are then processed by a neural network. Previous research has shown that combining the scattering transform with a traditional K-nearest neighbors classifier (STFHU) achieves more accurate unmixing results than a convolutional neural network (CNN) applied directly to the hyperspectral images. This paper further explores hyperspectral unmixing in limited training data scenarios, which are likely to occur in practical applications where access to large amounts of labeled training data is not possible. Here, it is proposed to combine the scattering transform with an attention-based residual neural network (ResNet). Experimental results on three hyperspectral image (HSI) datasets demonstrate that this approach provides at least 40% higher unmixing accuracy than the previous STFHU and CNN algorithms when only limited training data, ranging from 5% to 30%, are available. The use of the scattering transform for deriving features within the ResNet unmixing system also leads to more than 25% improvement when unmixing hyperspectral data contaminated by additive noise.


Introduction
Hyperspectral image unmixing aims at extracting information about multiple target endmembers from spectral curves that include hundreds of wavebands, which efficiently removes the limit of low spatial resolution of current hyperspectral satellite sensors and retrieves the information of interest in complex ground object conditions, leading to extensive application prospects [1][2][3][4][5][6]. Unmixing algorithms based on supervised learning are one of the key research directions for hyperspectral image unmixing [7][8][9]. With the rapid development of artificial intelligence, the structure of deep neural networks based on supervised learning has been recently applied to hyperspectral images [10,11]. However, existing deep learning approaches require a large amount of prior training data to achieve accurate results, while it is difficult to obtain the ground truth for the composition of mixtures of hyperspectral remote sensing images [12]. Therefore, a crucial challenge faced currently is to increase the accuracy of hyperspectral unmixing based on supervised learning when utilizing limited training data.
Recently, deep neural networks, which have great advantages in capturing deep structural features, have become the mainstream algorithms in the field of computer vision and image processing, achieving excellent results in various tasks [13][14][15][16][17]. Compared with conventional methods, the deep which achieves high-order features of the hyperspectral data in a cascade manner and has an explicit physical meaning. Additionally, neural networks using an attention mechanism are able to focus on specific parts of the information in the feature space [30,33], which is helpful in learning both spatial and spectral features in HSIs.
In this paper, we construct a novel attention-based residual network with scattering transform features (ARNST) learning architecture for hyperspectral unmixing with limited training samples. The major contributions of this paper include four aspects, as follows:

1. A novel network model is proposed, which is a combination of the scattering transform and a deep neural network, such as the CNN and the ResNet. The scattering transform extracts deep-level features from hyperspectral images and the resulting high-order information is processed through neural networks.
2. Hyperspectral unmixing using ResNet and attention-based ResNet is introduced. The attention mechanism helps the network focus on important features in HSIs during learning.
3. Under the condition of limited training data, the proposed approach, with only a few parameters to be configured, can achieve more accurate results than state-of-the-art methods.
4. When unmixing HSIs corrupted by additive noise, the proposed approach combines the scattering transform with deep learning to reduce the effect of noise. It is shown to be more robust, suffering a smaller reduction in accuracy than the CNN applied directly to the HSIs, which requires retraining with noisy data to achieve satisfactory results.
The rest of this paper is organized as follows. In Section 2, the proposed ARNST method is introduced in detail. In Section 3, we present and discuss the experimental results. Finally, conclusions and future work are described in Section 4.

Methods
The proposed ARNST network architecture, as shown in Figure 1, consists of three parts: (1) scattering transform layers, which obtain scattering transform features from original hyperspectral images; (2) deep neural network feature extraction layers, which extract features based on residual networks and attention modules; (3) fully connected layers, which predict the abundance maps for each pixel.
In our work, the scattering transforms are capable of extracting stable and enhanced high-order features in a process that has explicit physical meanings, so that the distinguishable information of original limited hyperspectral samples is enriched. Based on the scattering transform feature maps, the low-complexity residual network is utilized for training. The combination of features and the lower layers has been performed to capture fine features. Moreover, attention modules including channel attention and spatial attention are also utilized to obtain information of interest, which is to enhance relevant features and restrain irrelevant features, so that the key parameters of feature information are kept. Finally, fully connected layers with a soft-max activation function are used to obtain the abundance maps. Thus, this paper combines the above advantages, proposing a method that constructs a network to further extract feature information, improve the accuracy of trained models, and ensure the stability of the network when the resources for training are limited.

Scattering Transform Module
The scattering transform structure with known filters has shown great potential in many different applications, such as feature extraction, classification, unmixing and recognition. Given the vth pixel spectrum as $r_v \in \mathbb{R}^{1 \times l}$, the spectral mixture model f can be simply described as:
where $a_v = [a_{v1}, \ldots, a_{vk}, \ldots, a_{vn}] \in \mathbb{R}^{1 \times n}$ is the vector of abundance fractions and $a_{vk}$ denotes the abundance of the kth endmember. $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times l}$ is the endmember matrix of n endmembers and l is the number of bands of the hyperspectral data. $\varepsilon_v = [\varepsilon_{v1}, \ldots, \varepsilon_{vk}, \ldots, \varepsilon_{vn}] \in \mathbb{R}^{1 \times n}$ represents the error vector. $a_{vk} \geq 0$ is the abundance non-negativity constraint (ANC), while $\sum_{k=1}^{n} a_{vk} = 1$ is the abundance sum-to-one constraint (ASC). According to the literature [32], the scattering transform expression of the spectral mixture model can be written as:
where $S(r) \in \mathbb{R}^{1 \times 1 \times L}$ is a collection of the scattering transform coefficient outputs from the zero-order to the mth-order and represents the main information at low frequency bands of the input signal. L is the length of S(r), and r is the spectral vector input.
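To make the roles of the abundance vector, the endmember matrix and the two constraints concrete, the following minimal sketch simulates one mixed pixel. It assumes the standard linear mixing form $r_v = a_v X + \varepsilon_v$ for f, which is an assumption rather than a statement of the authors' exact model. The sizes, the random endmembers and the per-band error term are hypothetical choices made only so that the shapes agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n endmembers, l spectral bands.
n_endmembers, n_bands = 4, 198

# Endmember matrix X (n x l); random values stand in for real spectra.
X = rng.uniform(0.0, 1.0, size=(n_endmembers, n_bands))

# Abundance vector drawn on the simplex, so that it satisfies
# non-negativity (a_vk >= 0) and sum-to-one (sum_k a_vk = 1).
a_v = rng.dirichlet(np.ones(n_endmembers))

# Assumed linear mixing plus a small per-band error term.
eps_v = rng.normal(0.0, 0.01, size=n_bands)
r_v = a_v @ X + eps_v          # simulated pixel spectrum, length l

print(r_v.shape, a_v.sum())
```

Supervised unmixing then amounts to recovering `a_v` from `r_v` using a model trained on pixels with known abundances.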
In summary, as shown in Figure 1, S 0 (r), S 1 (r), . . . , S m (r) are calculated using wavelet or Fourier functions in the scattering transform layers, and then the scattering transform feature maps S(r) can be obtained, which will be used as the input of following deep learning neural networks. Translation invariance, local deformation stability, energy conservation and strong antinoise ability are the main advantages of the scattering transform.
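The cascade of wavelet filtering, modulus and low-pass averaging can be illustrated with a simplified first-order sketch. This is not the authors' implementation: the Gaussian filter bank, its centre frequencies and bandwidths, and the absence of subsampling are all assumptions chosen for brevity, and only orders zero and one are computed, whereas the paper uses orders up to m.

```python
import numpy as np

def gaussian_filter_hat(n, centre, sigma):
    """Gaussian bump in the frequency domain, sampled on the FFT grid."""
    freqs = np.fft.fftfreq(n)              # cycles per sample, in [-0.5, 0.5)
    return np.exp(-0.5 * ((freqs - centre) / sigma) ** 2)

def scattering_1d(x, J=3):
    """Zero- and first-order scattering coefficients of a 1D signal.

    Hypothetical filter bank: one Gaussian low-pass filter (phi) and J
    Gaussian band-pass filters (psi_j) centred at dyadic frequencies.
    """
    n = x.size
    x_hat = np.fft.fft(x)
    phi_hat = gaussian_filter_hat(n, 0.0, 0.4 / 2 ** J)

    # Zero order: local average of the signal.
    s0 = np.real(np.fft.ifft(x_hat * phi_hat))
    coeffs = [s0]

    for j in range(J):
        psi_hat = gaussian_filter_hat(n, 0.4 / 2 ** j, 0.1 / 2 ** j)
        # Wavelet modulus: band-pass filtering followed by absolute value.
        u1 = np.abs(np.fft.ifft(x_hat * psi_hat))
        # First order: local average of the modulus.
        s1 = np.real(np.fft.ifft(np.fft.fft(u1) * phi_hat))
        coeffs.append(s1)

    return np.concatenate(coeffs)          # length (J + 1) * n

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=162)        # stand-in for one pixel spectrum
S = scattering_1d(x, J=3)
print(S.shape)
```

With 162 bands and three band-pass filters, this sketch yields 4 × 162 = 648 coefficients. The averaging by the low-pass filter is what gives the coefficients their stability to small shifts and to additive noise.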

Deep Neural Network Feature Extraction Module
Due to limited obtainable labels of the ground truth in practical applications of hyperspectral remote sensing images, it is hard for existing algorithms to achieve ideal results. The scattering transform has been shown to extract deep features in a similar way to that of the CNN but utilizes a fixed transform rather than a learnt transform (kernel) as used in the CNN. Moreover, scattering transforms have few characteristic parameters and high stability, which brings better unmixing results than the CNN under conditions of limited training samples. However, state-of-the-art deep learning algorithms usually require a large amount of labeled data for training to have good results. The single scattering transform features also cannot lead to ideal unmixing results when the training data are extremely limited. For example, when unmixing Urban hyperspectral data using 5% data for training, the summation of the root-mean-square error (RMSE) of CNN is 0.5717, while the summated RMSE of scattering transform is 0.4738, both of which are relatively inaccurate. Therefore, there is a need for deeply exploring the capability of hyperspectral feature extraction when the training data are limited, so that the performance of hyperspectral unmixing can be improved.
Previous research [26,30,34] has shown that the ResNet [35] with hundreds of layers can provide effective features and achieve state-of-the-art performance in hyperspectral image classification. This paper focuses on the combination of the scattering transform and the deep neural network, aiming at sufficiently making use of the advantages of both to extract the feature information of interest, realizing accurate abundance estimation under limited training samples.
(1) Residual network based on scattering transform features
Let the scattering transform coefficients of the vth pixel, $S(r) \in \mathbb{R}^{1 \times 1 \times L}$, be reshaped to three dimensions, i.e., $S \in \mathbb{R}^{H \times W \times C}$, where H, W and C represent the three dimensions. $S$ is the input of the deep neural network.
One of the key parts within the feature extraction layers is the convolution model. The input $S$ or the feature maps $F^S$ are convolved with convolution kernels to obtain the feature maps as follows:
$$F^S_l = W_l * F^S_{l-1} + b_l \quad (3)$$
where $F^S_{l-1}$ and $F^S_l$ represent the input and output of the lth convolution layer, respectively. When l = 0, $F^S_l = S$. Additionally, * refers to the convolution operator. $W_l$ and $b_l$ are the weights and biases of the lth convolution layer.
However, the CNN can suffer from the vanishing gradient problem as the depth of the network increases, particularly in the high-frequency space, where the scattering transform coefficients become smaller and smaller and vanishing gradients therefore become more likely. To mitigate this effect, this paper adopts the residual network so that the network can still be trained efficiently. Defining the residual function to be learnt as $\chi(F^S_{l-1}) = F^S_l - F^S_{l-1}$, we have:
$$F^S_l = F^S_{l-1} + \chi(F^S_{l-1}) \quad (4)$$
(2) Scattering transform attention mechanism module
Due to the limited obtainable ground truth of hyperspectral images, the algorithm needs to pay close attention to key feature information in the network, so that the accuracy of training can be improved. Thus, the attention mechanism is introduced to address this issue. The convolutional block attention module (CBAM) [36] is utilized to recalibrate the scattering transform feature maps in the ResNet model. In the attention blocks, channel-wise attention and spatial-based attention are both utilized to train scattering transform features in a three-dimensional structure. Therefore, this method is called the scattering transform attention mechanism, which is targeted at enhancing useful scattering transform features and suppressing less useful information. The basic structure of the scattering transform attention mechanism is illustrated in Figure 2.
Remote Sens. 2020, 12, 400
$F^S_{l-1}$ will be reshaped to an appropriate dimension $F^{S1} \in \mathbb{R}^{h \times w \times c}$, which is beneficial for the input of the spatial attention network. The 2D spatial attention map is defined as $M_{Sp} \in \mathbb{R}^{h \times w \times 1}$, while the 1D channel attention map is defined as $M_{Ch} \in \mathbb{R}^{1 \times 1 \times c}$. Considering the feature map $F^S_{l-1}$, the CBAM module calculates the attention weights in two independent dimensions (spatial and channel), which are then multiplied with $F^S_{l-1}$ to refine the feature map. The 2D spatial attention map is used to locate the most informative positions in the spatial dimension, while the channel attention map is used to locate the focal point along the channel axis.
According to [36], average pooling and max pooling operations are applied in the spatial module along the channel axis, and these outputs are concatenated to generate a feature descriptor, which is then used to obtain the spatial attention map $M_{Sp}$ using a convolution layer with a filter $f^{7 \times 7}$, whose size is 7 × 7, and the sigmoid activation function $\sigma$.
where $F^{S1}_{avg}$ and $F^{S1}_{max}$ represent the average pooling operator and max pooling operator, respectively. Therefore, the output of the spatial attention module can be described as:
where $\otimes$ denotes element-wise multiplication.
For the channel module, both max-pooling outputs and average-pooling outputs are utilized with a shared network that is composed of the multi-layer perceptron (MLP) with one hidden layer. The two output feature vectors are merged using element-wise summation, and then the sigmoid activation function is utilized to gain the channel attention map:
where $\omega$ denotes the shared MLP weights. After that, the output of the channel attention module can be described as:
Finally, the output of the whole scattering transform attention mechanism is:
Comparing Equation (4) with Equation (9), the proposed attention-based residual network with scattering transform features can be achieved, and the residual function can be expressed as:
In most cases, the input of the scattering transform attention mechanism is not required to be reshaped, which means that $F^{S1} = F^S_{l-1}$, and thus the residual function can be finalized as:
By calculating the ARNST module, we can realize the goal of further extracting high-order features and eliminating inefficient feature information when there are few training samples.
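The attention steps described above can be sketched in NumPy as follows. This is an illustrative reading of the text and of CBAM [36], not the authors' code: the weight shapes, the hidden size of the MLP, the random parameters and the choice to apply spatial attention before channel attention (the order in which the text presents them; CBAM itself applies channel attention first) are all assumptions. The Conv-BN-ReLU stage that precedes attention in each block is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(F, kernel):
    """M_Sp: pool along the channel axis, concatenate, 7x7 filter, sigmoid."""
    desc = np.stack([F.mean(axis=2), F.max(axis=2)], axis=2)   # (h, w, 2)
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(desc, ((pad, pad), (pad, pad), (0, 0)))    # zero padding
    h, w = F.shape[:2]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # Cross-correlation, as used by deep learning convolution layers.
            out[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel)
    return sigmoid(out)[:, :, None]                            # (h, w, 1)

def channel_attention(F, W0, W1):
    """M_Ch: shared one-hidden-layer MLP on avg- and max-pooled descriptors."""
    def mlp(v):
        return np.maximum(v @ W0, 0.0) @ W1                    # ReLU hidden layer
    avg = F.mean(axis=(0, 1))                                  # (c,)
    mx = F.max(axis=(0, 1))                                    # (c,)
    return sigmoid(mlp(avg) + mlp(mx)).reshape(1, 1, -1)       # (1, 1, c)

def attention_residual(F_prev, W0, W1, kernel):
    """F_l = F_{l-1} + chi(F_{l-1}), with chi given by the attention output."""
    F_sp = spatial_attention(F_prev, kernel) * F_prev          # spatial refinement
    F_att = channel_attention(F_sp, W0, W1) * F_sp             # channel refinement
    return F_prev + F_att

rng = np.random.default_rng(0)
F_prev = rng.normal(size=(9, 9, 16))             # hypothetical 9 x 9 x 16 feature map
W0 = rng.normal(scale=0.1, size=(16, 4))         # hypothetical MLP weights
W1 = rng.normal(scale=0.1, size=(4, 16))
kernel = rng.normal(scale=0.1, size=(7, 7, 2))   # hypothetical 7 x 7 filter
F_next = attention_residual(F_prev, W0, W1, kernel)
print(F_next.shape)
```

Both attention maps take values in (0, 1), so they rescale the features rather than replace them, and the identity path in `attention_residual` keeps the original information available to later layers.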
After executing the last feature extraction layer, the final feature maps $F^S_k$ are obtained and passed to the fully connected network. In addition, to make sure that the final output satisfies the abundance non-negativity constraint and the sum-to-one constraint, a soft-max activation function is used in the final output layer. The operation principle is illustrated in Figure 3, in which the input is the feature maps based on scattering transform deep residual network features, and the output is the abundance maps.
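A short sketch shows why a soft-max output layer enforces both abundance constraints by construction. The logits below are hypothetical values standing in for the outputs of the last fully connected layer for one pixel with six endmembers.

```python
import numpy as np

def softmax(z):
    """Numerically stable soft-max along the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical fully connected outputs for one pixel, six endmembers.
logits = np.array([[2.0, -1.0, 0.5, 0.0, 1.0, -3.0]])
abundances = softmax(logits)

print(abundances, abundances.sum(axis=-1))
```

Every entry is strictly positive because it is an exponential divided by a positive sum, and the entries of each row sum to one because of the normalization, so no separate projection step is needed.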

Experimental Results
In this section, we introduce three public datasets used in our experiments. The hyperspectral unmixing performance of the proposed attention-based residual network method with scattering transform features is presented by comparing with other approaches, which are based on the deep network architecture. All the experiments are implemented with NVIDIA Quadro P4200 GPU, Tensorflow-gpu [37], scikit-learn [38] and Keras [39] with Python 3.6.


Description of Hyperspectral Datasets
To evaluate the effectiveness of the proposed method, three widely-used real-world hyperspectral data sets, namely Urban, Jasper Ridge and Samson, are selected. These datasets can be openly accessed online [40,41].
(2) Jasper Ridge: the second dataset has 100 × 100 pixels, while each pixel contains 198 effective spectral bands with the wavelength ranging from 0.38 to 2.5 µm. There are four endmembers latent in this dataset, including "Road", "Dirt", "Water" and "Tree".
(3) Samson: the third dataset contains 95 × 95 pixels and 156 channels covering the wavelengths from 401 to 889 nm. There are three target endmembers in the dataset, including "Rock", "Tree" and "Water".

Experimental Setup
(1) Setup for limited training samples
In our experiment, the training set and cross-validation set are generated from the ground truth data, while the testing set is composed of the remaining ground truth samples. As our work focuses on the problem of limited training samples, small training ratios of ground truth data are considered, ranging from approximately 30% down to 0.5%. In this paper, the samples selected for each ratio are divided into a training set (80% of the selected samples) and a cross-validation set (the remaining 20% of these samples).
The details of the training, cross-validation and testing pixels in the Urban dataset are listed in Table 1. As shown in the table, there is a total of 33,153 pixels corresponding to the "Asphalt road" endmember (last column of Table 1) among all the 307 × 307 = 94,249 pixels of the training dataset. For example, when choosing 5% as the training ratio, 3438 and 860 pixels out of the total of 94,249 pixels are utilized for training and cross-validation, respectively, among which 1378 pixels contain the endmember "Asphalt road", and 3256 pixels include the endmember "Grass". Meanwhile, the other 94,249 − 3438 − 860 = 89,951 pixels are used for testing. The training and testing parameters of the Jasper Ridge and Samson datasets are set according to the same principle. In all comparative experiments, the training data selected are identical.
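The bookkeeping in the example above can be checked with a few lines. The sketch takes the number of selected pixels (3438 + 860) directly from the text rather than recomputing it from the nominal 5% ratio, and the random permutation used to assign pixel indices is a hypothetical choice, since the sampling procedure is not specified here.

```python
import numpy as np

total_pixels = 307 * 307          # Urban scene size
selected = 3438 + 860             # pixels drawn for the 5% setting (from the text)

# 80% of the selected pixels for training, the rest for cross-validation.
n_train = selected * 8 // 10
n_val = selected - n_train
n_test = total_pixels - selected

# Hypothetical index assignment using a fixed random permutation.
rng = np.random.default_rng(0)
order = rng.permutation(total_pixels)
train_idx = order[:n_train]
val_idx = order[n_train:selected]
test_idx = order[selected:]

print(n_train, n_val, n_test)
```

Fixing the permutation seed is one simple way to keep the selected training data identical across all compared methods.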
(2) Performance evaluation approach
To assess the unmixing performance of the proposed method, the root-mean-square error (RMSE) [42] is used to evaluate the difference between ground truth abundance maps and predicted results.
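As an illustration of the metric, the sketch below computes one RMSE per endmember over all test pixels and then sums them, which corresponds to the "summation of the RMSEs" reported in the results. It assumes the usual definition of RMSE; the small abundance arrays are made-up values for demonstration only.

```python
import numpy as np

def rmse_per_endmember(A_true, A_pred):
    """RMSE over pixels for each endmember.

    A_true, A_pred: arrays of shape (num_pixels, num_endmembers).
    """
    return np.sqrt(np.mean((A_true - A_pred) ** 2, axis=0))

# Made-up abundances for three pixels and two endmembers.
A_true = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
A_pred = np.array([[0.8, 0.2], [0.5, 0.5], [0.2, 0.8]])

per_endmember = rmse_per_endmember(A_true, A_pred)
summed_rmse = per_endmember.sum()
print(per_endmember, summed_rmse)
```

Because the RMSEs are summed over endmembers, the total can exceed one for scenes with several endmembers even though each abundance lies in [0, 1].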
The performance of the proposed ARNST is compared with two state-of-the-art comparison methods, namely, the scattering transform framework for hyperspectral unmixing (STFHU) [29] and the CNN [10], which are also based on deep networks and perform better than other supervised methods. Meanwhile, the scattering transform plus CNN (STCNN), ResNet, attention-based ResNet (AResN) and ResNet with scattering transform features (RNST) are used for unmixing for the first time.
For CNN and STCNN, four convolution layers and four sampling sublayers are used in the network. The sizes of the convolution layers are set to be 1 × 5, 1 × 4, 1 × 5 and 1 × 4, while their feature maps are configured to be 3, 6, 12 and 24, respectively [10]. The scattering transform parameters of the Urban dataset are set to be J = 2 and m = 3, while the parameters of the Jasper Ridge and Samson datasets are set as J = 3 and m = 2. The parameters of CNN and scattering transform are selected the same as the previous references, where the effectiveness of parameters has been proven in different experiments. The training phase for CNNs in our paper is 500 epochs, while the training phases for ResNet-based approaches are 100 epochs. For the parameter of the number of epochs, the larger value usually obtains the better accuracy. In fact, the training phases for the proposed approaches with 500 epochs can only achieve slightly higher results than 100 epochs. Therefore, we selected 100 epochs in this paper. This also shows that our proposed method can achieve better results using a smaller number of training epochs.
(3) Computational cost of proposed method
The main structures of the ARNST consist of the scattering transform, the attention mechanism and the ResNet, so the computational cost is considered in three parts.
According to [43], the complexity of scattering coefficients is $O(n \log n)$, and the computational cost of self-attention in each layer is $O(n^2 \cdot d)$ [44], where n is the cardinality of the training set and d is the dimension of each sample. In addition, the complexity of neural networks [45] is usually considered as $O(n^5)$. When combining them, the ResNet dominates the complexity. Therefore, the computational cost of the proposed ARNST is $O(n^5)$.
(4) ARNST network implementation details
Now we consider the Urban data with 162 spectral bands as an example. The architecture details of the proposed ARNST are described in Table 2, where K refers to the size of the convolving kernel. Forming the ARNST framework, the input of the proposed network is the spectrum with the size of 1 × 1 × 162. Firstly, the scattering transform layer with parameters m = 2, j = 3 is employed to extend the input and extract the scattering features with the size of 1 × 1 × 648, which is then reshaped to 9 × 9 × 8. Next, several residual network blocks are used to extract deep features from scattering transform feature maps. Each residual block includes four parts: the module of convolutions with different kernel filters with batch normalization and ReLU (Conv-BN-ReLU), channel attention module, spatial attention module and addition module. The kernel filters of convolutions of residual blocks are set as 3 × 3 in this paper. The number of filters in the residual network layers of Block1, Block2 and Block3 are set to 16, 32 and 64, respectively. It is worth mentioning that "1 × 1 × 16, 81" represents that an attention weight of 81 is obtained by channels, while "9 × 9 × 1, 16" means that an attention weight of 16 is obtained by the spatial attention module. Finally, a flattened layer and fully connected layers transform the previous feature maps into an output feature vector with a size of 6, which is the number of endmembers.
Table 2. Architecture details of the proposed ARNST. Residual Block 1: Conv-BN-ReLU, 9 × 9 × 16; Channel attention, 1 × 1 × 16, 81; Spatial attention (K = 7 × 7), 9 × 9 × 1, 16; Add, 9 × 9 × 16.
As the whole proposed network is carried out in an end-to-end manner, the parameters and effective paths can be learnt automatically. Therefore, the implementation details will be carried out in a similar way for different datasets.
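The size bookkeeping for the Urban example (162 bands, 1 × 1 × 648 scattering features, 9 × 9 × 8 network input) can be verified with a short sketch. The factor of four between 162 and 648 is read off from the quoted sizes, and the row-major ordering of the reshape is an assumption, since the element ordering used by the authors is not stated.

```python
import numpy as np

n_bands = 162                          # Urban spectral bands
n_paths = 648 // n_bands               # 4, implied by the quoted feature length

# Placeholder scattering features of size 1 x 1 x 648.
features = np.arange(n_bands * n_paths, dtype=np.float32).reshape(1, 1, -1)

# Network input S with H = W = 9 and C = 8 (row-major reshape, assumed).
S = features.reshape(9, 9, 8)

# The reshape only rearranges values: flattening recovers the feature vector.
restored = S.reshape(1, 1, -1)
print(features.shape, S.shape)
```

The reshape works because 9 × 9 × 8 = 648, so every scattering coefficient appears exactly once in the three-dimensional input.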
For a fair comparison, the same training and testing samples are utilized for all methods, and all algorithms are executed five times. Averaging the results reduces the error introduced by random selection effects. In addition, the network structures are set to the same width and depth.

Results for the Urban Dataset
To validate the robustness of the algorithms to noise, Gaussian white noise with a power of 20 dB is added to the samples. Figure 4 shows the comparison of the spectral curves of the original and noisy data, with the blue curve being the original and the red being the one with additive noise. The effects of the noise are evident from the random spikes appearing in the spectrum. Results are evaluated for noisy test samples by utilizing the trained model, which is based on the original data.
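One way to generate such noisy test samples is sketched below. It assumes that "20 dB" denotes the signal-to-noise ratio (SNR) of the corrupted spectrum, which is an interpretation rather than something stated explicitly; the function name and the random spectrum are hypothetical.

```python
import numpy as np

def add_awgn(x, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power = snr_db."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise

rng = np.random.default_rng(0)
spectrum = rng.uniform(0.1, 0.9, size=162)   # stand-in for one Urban pixel
noisy = add_awgn(spectrum, snr_db=20.0, rng=rng)

# Empirical SNR of this single realization, in dB.
measured = 10.0 * np.log10(np.mean(spectrum ** 2) / np.mean((noisy - spectrum) ** 2))
print(round(measured, 2))
```

Only the test spectra are corrupted in this protocol; the model itself is trained on the original data, so the experiment measures robustness rather than adaptation to noise.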
Table 3 shows the summations of the RMSEs for different training ratios applied to the original and noisy Urban dataset. For the original data, it is clear that the CNN, STFHU and STCNN methods perform worse than the four ResNet-based approaches. For training ratios of 20%, 10% and 5%, the average difference between the two sets of methods in terms of the summation of the RMSEs is approximately 0.2. When 1% or 0.5% of the data are utilized for training, this difference tends to be larger, ranging from 0.4 to 0.7. To be more specific, the CNN method performs the worst for training ratios of 5% and higher, while the STFHU method results in the worst performance for training ratios less than 5%. The combination of the scattering transform and the CNN (STCNN) results in some improvement for training ratios of less than 5%, but the resulting RMSEs are still much higher than those of the ResNet-based methods.
The separate inclusion of the attention mechanism in ResNet (AResN) and of the scattering transform (RNST) both result in lower errors, while the proposed ARNST, which combines both, presents the best hyperspectral unmixing results among all seven considered algorithms across all training ratios. Moreover, the improvement achieved by ARNST increases as the training ratio decreases, with 0.1715, obtained using 20% of the data for training, being the smallest error on the original data.
Adding noise to the original data allows the robustness of the algorithms to noise to be validated. It is evident that the CNN and STCNN result in summations of RMSE that are larger than one under all considered conditions, with the 1.7159 resulting from the CNN being the largest error in the whole table. The STFHU previously proposed by our team shows accurate results when the training ratio is no less than 5%, which are equivalent to the performance of the ResNet-based approaches. However, when the training ratio is extremely small, the plain scattering transform method cannot unmix hyperspectral images properly either, and its errors increase to levels similar to those of the CNN-based methods. The ResNet leads to large improvements in unmixing accuracy, and its difference in the summation of RMSE relative to the CNN-based approaches ranges from approximately 0.5 to 0.9. The AResN and RNST methods further enhance the results by incorporating the attention mechanism and the scattering transform, respectively. The ARNST, which includes both the attention mechanism and the scattering transform, achieves the best performance under all considered training ratios, with 0.2227 being the smallest summation of RMSE in all conditions of noisy hyperspectral data unmixing. In addition, the advantage of ARNST over the other methods becomes larger when the training ratio is smaller.
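For readers who want to reproduce the metric reported in Table 3, the following sketch computes a summation of RMSEs. It assumes that the RMSE is computed per endmember over all test pixels and then summed across endmembers. The exact definition is given elsewhere in the paper, and the toy abundance values and the "Other" label below are purely illustrative.

```python
import math


def rmse(truth, estimate):
    """Root-mean-square error between two equally long abundance vectors."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))


def summed_rmse(truth_maps, estimated_maps):
    """Sum of per-endmember RMSEs.

    Both arguments map an endmember name to a flat list of abundances,
    one value per pixel.
    """
    return sum(rmse(truth_maps[name], estimated_maps[name]) for name in truth_maps)


# Toy example with two endmembers and four pixels (illustrative values only).
truth = {"Metal": [1.0, 0.0, 0.5, 0.5], "Other": [0.0, 1.0, 0.5, 0.5]}
estimate = {"Metal": [0.9, 0.1, 0.5, 0.5], "Other": [0.1, 0.9, 0.5, 0.5]}
score = summed_rmse(truth, estimate)
```

Under this assumed definition the metric adds up one RMSE per endmember, so values above 1 are possible, as observed for the CNN-based methods on the noisy data.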
Figure 5 shows the ground truth and estimated abundance maps for corresponding endmembers using the proposed method and comparative methods with a 5% training ratio. It can be clearly observed that the ResNet-based methods (images of rows 4 to 7 of Figure 5) always achieve abundance maps that are visually much more similar to the original Urban hyperspectral abundance maps with no noise (images of row 1 of Figure 5) compared with the CNN-based approaches (images of rows 2 to 3 of Figure 5). This corresponds with the RMSE results observed in Table 3.
Figure 6 shows the estimated maps for the endmember "Metal" using the noisy data when the training ratio is 5%. There are large differences between the ground truth results (Figure 6a) and the results from the CNN-based methods (Figure 6b,d). This corresponds with the results of Table 3, where RMSE results for the noisy data are far greater than 1. It can also be seen that abundance maps resulting from the STResN method (Figure 6e) are visually more similar to the original ground truth abundance maps (Figure 6a) than the abundance maps resulting from the ResNet method (Figure 6g). This indicates that the scattering transform is able to suppress noise and thus is helpful for stable feature representation of noisy data.

Results for the Jasper Ridge Dataset
The experimental results for the Jasper Ridge hyperspectral dataset are listed in Table 4. When processing the original data with limited training samples, ResNet, AResN and STResN all lead to similar experimental results that are significantly better than those obtained for the STFHU and CNN approaches. ARNST achieves the best result among all compared algorithms: at a training ratio of 30%, the summation of RMSE improves from 0.1986 for the CNN approach to 0.1087 (a 45.26% increase in accuracy). For training ratios of 20% and 10%, the performance of ARNST improves by 50% and 59%, respectively, which indicates that the improvement resulting from the proposed solution becomes larger as the percentage of data used for training decreases. Moreover, compared to the other approaches, the results of ARNST are also more stable when the training ratio changes.
Considering the noisy data, Figure 8 illustrates the estimated abundance maps of the "Dirt" endmember with a 5% training ratio. The estimated maps obtained by training the original data using CNN (Figure 8b [...]
When the summations of RMSE for unmixing noisy images from the Urban and Jasper Ridge datasets are compared under a training ratio of 20%, it can be observed that the CNN does not adapt well to noise and cannot perform well when unmixing noisy data. Since the scattering transform approaches are robust and stable when representing features, the STFHU and the proposed ARNST are able to reduce the effects of Gaussian noise, and thus achieve satisfactory results. The proposed ARNST obtains the smallest summations of RMSE, which are 0.2227 for Urban and 0.3055 for Jasper Ridge. Thus, the involvement of the scattering transform in ARNST brings robust performance against noise.
When the noisy hyperspectral unmixing performances of the STFHU, AResN and ARNST are compared under different training ratios, the proposed method leads to better performance than STFHU across all conditions, with the average percentage of improvement being more than 25%. Therefore, the ARNST shows better hyperspectral unmixing results than the scattering transform approach after combining the attention mechanism and the ResNet. In addition, the AResN algorithm performs well for the Urban dataset but shows unsatisfactory results for Jasper Ridge. This also indicates that the proposed ARNST can obtain more stable results than the other comparative algorithms.
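The percentage improvements quoted for Table 4 can be checked with a short calculation. The sketch below assumes that "increase in accuracy" means the relative reduction of the summed RMSE with respect to the CNN baseline. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` lowers the error relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# Summed RMSE values quoted in the text for Jasper Ridge at a 30% training ratio.
cnn_error, arnst_error = 0.1986, 0.1087
improvement = relative_reduction(cnn_error, arnst_error)
print(f"Relative reduction of the summed RMSE: {improvement:.2f}%")
```

Under this assumption the result is roughly 45.3%, which agrees with the 45.26% reported for the 30% training ratio up to rounding.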

Results for the Samson Dataset
The Samson dataset is also used to validate the proposed algorithms. The summations of RMSE are listed in Table 5. It can be seen that the ARNST method achieves the best results among all comparative methods, which are 0.0751 using 20% of the data for training and 0.5255 for the 10% case. The CNN-based approaches both lead to large errors, while the ResNet and the attention mechanism bring more accurate unmixing results than the two algorithms based on CNN. The STFHU, which does not use the proposed components, does not perform well either.

Results when Training on Noisy Data
In order to validate the robustness of the algorithms to different input data, this paper also trains the models using noisy training data before testing on the noisy data, leading to the results in Figure 9. All the considered algorithms achieve more accurate unmixing results when trained on the noisy data than when trained on the original data. It is noted that the greatest improvement is obtained for the CNN-based approaches, which indicates that these methods are not as robust to noise as the other methods. The ARNST achieves the smallest summation of RMSE, indicating that the proposed approach possesses the best robustness to noise. It is also noted that the scattering-transform-based approaches (ARNST, RNST and STFHU) achieve results that are most similar to those obtained when testing noisy data using models trained with original (non-noisy) data. This indicates that the scattering transform helps in ensuring that the unmixing system is robust to noise.
From Tables 3-5 and considering all algorithms, the summation of RMSE becomes larger as the number of training samples becomes smaller, which demonstrates that a sufficient number of training samples is required to accurately train the models of these approaches. The proposed ARNST provides the best and most stable performance in all cases. There are two key reasons why the proposed method achieves better results than the other comparative methods. Firstly, the scattering transform possesses a multilayer structure which generates highly descriptive features of the hyperspectral images. In addition, the scattering transform is able to distinguish noise from the original spectral signal, which reduces the effect of endmember variability. Secondly, the attention mechanism in deep learning networks helps to focus on the interesting and important information of the scattering transform features using less training data. This further suppresses the noise information. Therefore, the combination of scattering transform features and an attention-based deep learning network within the proposed ARNST approach can effectively address the problem of hyperspectral image unmixing using neural networks trained on limited ground truth data and with endmember variability.

Conclusions
In this paper, a novel attention-based residual network framework with scattering transform features is proposed for hyperspectral data unmixing with limited training samples. The scattering transform is utilized to extract high-order deep features that benefit the robustness of the trained residual neural network models. Furthermore, including the attention mechanism in the model helps to focus the residual neural network on the feature information of interest and improves performance. The proposed framework makes full use of the advantages of the scattering transform, the residual neural network, and the attention mechanism. The resulting ARNST possesses the capability of suppressing noise and provides significantly improved accuracy of hyperspectral unmixing when extremely limited training data are available, which is helpful in real-world applications where the data are usually corrupted by additive noise and access to large amounts of labeled training data is not practical. Experiments on public datasets have demonstrated that the proposed framework achieves more accurate and stable results than state-of-the-art algorithms in hyperspectral unmixing when noise is present and the training ratio is extremely low. Compared with the CNN and STFHU approaches, the proposed ARNST approach obtains at least 40% and 25% increases in performance for original and noisy hyperspectral datasets, respectively.
Future work includes further improving the spatial-spectral joint 3D network structure to provide more accurate hyperspectral unmixing results under extreme conditions.

Author Contributions: Conceptualization, Y.Z. and C.R.; methodology, Y.Z. and C.R.; software, Y.Z.; validation, Y.Z., J.Z., and C.R.; writing-original draft preparation, Y.Z. and J.Z.; writing-review and editing, C.R. and J.L.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflicts of interest.