Scattering Transform Framework for Unmixing of Hyperspectral Data

Abstract: The scattering transform, which applies multiple convolutions using known filters targeting different scales of time or frequency, has a strong structural similarity to convolutional neural networks (CNNs), does not require training to learn the convolution filters, and has been used for hyperspectral image classification in recent research. This paper investigates the application of a scattering transform framework to hyperspectral unmixing (STFHU). While state-of-the-art research on unmixing hyperspectral data with scattering transforms is limited, the proposed end-to-end method applies pixel-based scattering transforms and preliminary three-dimensional (3D) scattering transforms to remote sensing hyperspectral images to extract feature vectors, which are then used to train a k-nearest neighbor (k-NN) regression model that estimates the abundance maps of endmembers. Experiments compare the performance of the proposed algorithm with a series of existing methods in quantitative terms on both synthetic data and real-world hyperspectral datasets. Results indicate that the proposed approach is more robust to additive noise, which is suppressed by utilizing the rich information in both the high-frequency and low-frequency components represented by the scattering transform. Furthermore, the proposed method achieves higher unmixing accuracy than all comparative approaches using the same amount of training data, while matching the performance of the best-performing CNN method with much less training data.


Introduction
Hyperspectral images (HSIs), covering hundreds of continuous spectral bands, have been widely used in a broad range of applications [1][2][3]. Due to limited spatial resolution, pixels in remote sensing HSIs often consist of mixtures of different classes of land cover (known as endmembers) [4,5]. This mixing phenomenon poses great challenges to HSI processing problems such as segmentation, classification, location estimation, and recognition [6,7]. Therefore, many researchers focus on hyperspectral unmixing, the aim of which is to estimate the endmembers and their abundances [8,9]. The major challenges include the limited availability of training samples, different classes of samples with similar spectral features, and varied spectral features within the same class of samples [10,11].
The linear spectral mixture model and the nonlinear spectral mixture model are the two approaches for addressing these difficulties, and both have been discussed in [12]. Many methods, including statistics-based ones, have been proposed for this task. (2) The proposed method can obtain equivalent performance using fewer training samples than CNN-based approaches. Meanwhile, the parameter setting of the scattering transform framework is less complicated than that of the CNN. (3) The scattering transform features are well suited to suppressing the effects of Gaussian white noise: a model trained on noise-free data can still achieve satisfactory unmixing results when applied to noisy data.
The rest of this paper is organized as follows. The proposed scattering transform framework for hyperspectral unmixing is presented in Section 2. Sections 3 and 4 compare the proposed approach with state-of-the-art algorithms, describing and discussing experimental results on simulated and real-world hyperspectral data, respectively. Conclusions are drawn in Section 5.

Methodology
In this section, the scattering transform framework is proposed for hyperspectral image unmixing. Given the vth pixel spectrum r_v ∈ R^(1×l), the spectral mixture model can simply be described as

r_v = A_v X + ε_v,

where A_v = [a_v1, ..., a_vk, ..., a_vn] ∈ R^(1×n) contains the abundance fractions, and a_vk denotes the abundance of the kth endmember. X = [x_1, x_2, ..., x_n]^T ∈ R^(n×l) is the endmember matrix of n endmembers, and l is the number of bands of the hyperspectral data. ε_v ∈ R^(1×l) represents the error vector. a_vk ≥ 0 is the abundance non-negativity constraint, while Σ_{k=1}^{n} a_vk = 1 is the sum-to-one constraint on the abundances.
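As a quick illustration, the mixture model and its two abundance constraints can be simulated in a few lines of Python; the endmember spectra, dimensions, and noise level below are hypothetical stand-ins, and a Dirichlet draw is used so that both constraints hold by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_end, n_bands = 8, 188                       # hypothetical n and l
X = rng.random((n_end, n_bands))              # endmember matrix X, one spectrum per row

# A Dirichlet sample satisfies a_vk >= 0 and sum_k a_vk = 1 by construction.
A_v = rng.dirichlet(np.ones(n_end))           # abundance fractions of one pixel

eps_v = rng.normal(scale=1e-3, size=n_bands)  # small error vector
r_v = A_v @ X + eps_v                         # mixed pixel spectrum r_v, shape (188,)
```

Each mixed pixel is thus a convex combination of the endmember spectra plus a small perturbation.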
The proposed STFHU framework is illustrated in Figure 1, in which the notations clearly illustrate the structure of the proposed method and are discussed in the following subsections. The spectral vectors are processed by the scattering transform network, which extracts a high-level feature representation of the spectrum. This is achieved by a cascade of transforms applied first to the input spectral vectors, consisting of either a single pixel vector or the neighborhood of the pixel vector, and then sequentially to the outputs of each transform stage. Subsequently, the features are used as the input to the regression model, and the abundances of the spectrum are obtained. The scattering transforms are typically implemented using wavelet functions (known as wavelet scattering transforms), but other transforms are also possible (for example, the Fourier scattering transform).

Pixel-Based Wavelet Scattering Transform
Let ψ_λ be a wavelet family obtained by dilating the mother wavelet ψ by 2^j:

ψ_λ(y) = 2^(−2j) ψ(2^(−j) y), λ = 2^(−j),

and then a wavelet modulus operator |Wr| is defined, which combines the wavelet transform with a modulus operation:

|Wr| = {S(r), U(r)} = {r * φ_J, |r * ψ_λ|},

where the first part, S(r), is called the scattering coefficient and represents the main low-frequency information of the input signal, while the second part, U(r), is the scattering propagator, the nonlinear (modulus) wavelet transform. S(r) is the coefficient output at each order, and U(r) is the input to the transform of the next order, used to recover the high-frequency information. Here, r is the input spectral vector and λ indexes the wavelets in the modulus operator. Moreover, the scaling function is φ_J(y) = 2^(−2J) φ(2^(−J) y); Figure 1 shows the case J = 3. In this way, the scattering transform constructs invariant, stable, and rich signal representations by iterating wavelet decomposition, a modulus operation, and low-pass filtering. The zero-order scattering transform output is

S_0(r) = r * φ_J,

and U_1(r, λ_1) = |r * ψ_{λ_1}| is used as the input to the first-order transform. The first-order scattering output and the corresponding second-order input are

S_1(r, λ_1) = U_1(r, λ_1) * φ_J,
U_2(r, λ_1, λ_2) = |U_1(r, λ_1) * ψ_{λ_2}| = ||r * ψ_{λ_1}| * ψ_{λ_2}|.

Finally, a collection of the scattering coefficient outputs from zero order to mth order can be obtained as

S(r) = {S_0(r), S_1(r), ..., S_m(r)},

which is the scattering transform feature vector for the hyperspectral pixel spectrum r. The main advantages of scattering transforms include translation invariance, stability to local deformations, energy conservation, and strong noise resistance.
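The cascade above can be sketched with plain FFT-based filtering. This is a minimal numpy illustration, not the paper's exact filter bank: the Gaussian low-pass and Gabor-like band-pass profiles and their bandwidths are assumptions, and only frequency-decreasing paths are kept, which for a 156-band spectrum with J = 3, m = 2 gives 1 + 3 + 3 = 7 coefficient vectors and a 1092-dimensional feature.

```python
import numpy as np

def gaussian_lowpass(n, sigma):
    """Frequency response of a Gaussian low-pass filter (stand-in for phi_J)."""
    return np.exp(-0.5 * (np.fft.fftfreq(n) / sigma) ** 2)

def bandpass(n, center, sigma):
    """Frequency response of a Gabor-like band-pass filter (stand-in for psi_lambda)."""
    return np.exp(-0.5 * ((np.fft.fftfreq(n) - center) / sigma) ** 2)

def conv(x, h_hat):
    """Circular convolution of signal x with a filter given in the frequency domain."""
    return np.real(np.fft.ifft(np.fft.fft(x) * h_hat))

def scattering_1d(r, J=3, m=2):
    """Order-0..m scattering coefficients of a 1D spectrum r."""
    n = len(r)
    phi = gaussian_lowpass(n, sigma=2.0 ** (-J))
    psi = {j: bandpass(n, center=2.0 ** (-j), sigma=2.0 ** (-j) / 2)
           for j in range(1, J + 1)}
    coeffs = [conv(r, phi)]                    # S_0(r) = r * phi_J
    layer = [(0, r)]                           # (largest scale index used so far, U)
    for _ in range(m):
        nxt = []
        for j_prev, u in layer:
            for j in range(j_prev + 1, J + 1): # keep frequency-decreasing paths only
                u_next = np.abs(conv(u, psi[j]))   # U = |u * psi_lambda|
                coeffs.append(conv(u_next, phi))   # S = U * phi_J
                nxt.append((j, u_next))
        layer = nxt
    return np.concatenate(coeffs)

# A synthetic 156-band "spectrum": 7 coefficient vectors -> 7 * 156 = 1092 features
r = np.cumsum(np.random.default_rng(0).normal(size=156))
S = scattering_1d(r, J=3, m=2)                 # S.shape == (1092,)
```

Each modulus output U feeds the next order, while every order also emits a low-pass coefficient vector S, exactly mirroring the S/U split described above.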

3D-Based Scattering Transform
As the mixture in one pixel usually includes land cover from around that pixel, spatial information is crucial for hyperspectral image processing and should be taken into account in spectral unmixing.
The spectral vector at location (i, j) in spatial coordinates is given as r_{i,j}, and in order to extract the spectral-spatial information, a 3D filter f(p, q, o) is applied to the hyperspectral data cube. Because the spectral information is the major information for unmixing, o = 0 is assumed first, so that f(p, q) ≜ f(p, q, 0). As shown in Figure 2, the filter f(p, q) has a size of (2P + 1) × (2Q + 1), and the new data cube after filtering is

r'_{i,j} = Σ_{p=−P}^{P} Σ_{q=−Q}^{Q} f(p, q) r_{i+p, j+q},

where f(P, Q), ..., f(−P, −Q) are the coefficients of the filter. In this paper, the average filter is used to construct the new cube. Compared with r_{i,j}, r'_{i,j} includes both spectral and spatial information. Therefore, one of the key factors in the 3D scattering transform framework for hyperspectral unmixing is the design of the filter function f(p, q, o).
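For the average filter used here, the filtering step amounts to a per-band spatial mean over a (2P + 1) × (2Q + 1) window; the following is a minimal numpy sketch, in which the cube size and the replicate edge padding are assumptions.

```python
import numpy as np

def spatial_average(cube, P=1, Q=1):
    """Average-filter each band of an (H, W, L) hypercube over a
    (2P+1) x (2Q+1) spatial window; edges are handled by replicate padding."""
    H, W, _ = cube.shape
    padded = np.pad(cube, ((P, P), (Q, Q), (0, 0)), mode="edge")
    out = np.zeros_like(cube, dtype=float)
    for p in range(2 * P + 1):                 # sum all spatially shifted copies
        for q in range(2 * Q + 1):
            out += padded[p:p + H, q:q + W, :]
    return out / ((2 * P + 1) * (2 * Q + 1))   # uniform filter coefficients

cube = np.random.default_rng(0).random((95, 95, 156))  # e.g. a Samson-sized cube
cube_f = spatial_average(cube, P=1, Q=1)               # r' with a 3 x 3 average filter
```

The filtered spectrum at each (i, j) then replaces the raw spectrum as input to the scattering transform.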
From Equations (2)-(10), the scattering transform coefficients of the filtered spectrum can be retrieved as S(r'_{i,j}). It can be seen that the scattering transform processing framework derives features in a similar way to a CNN processing framework.
The hierarchical example spectra of the scattering transform network are demonstrated in Figure 3 and in the "scattering transform" part of Figure 1. The original spectrum of one pixel of a hyperspectral image with 156 bands is shown in the first row, while the other rows illustrate the scattering transform coefficients of this spectrum. In this instance, the parameters are set to J = 3, m = 2.
The coefficients in the second row, obtained by applying the low-pass filter φ_J, are very smooth and consistent with the original spectrum. In the third and fourth rows, it can be seen that the high-frequency information is separated out by U(r). In the fourth row, the high-pass scattering transform values are very low, so only the three low-pass coefficient vectors carry significant information. The scattering transform feature vector S(r) with 1092 dimensions, which concatenates the seven scattering coefficient vectors of the above rows, is shown in the fifth row. The feature vector thus carries richer and smoother information than the original spectrum, while maintaining a similar spectral envelope.

Regression Model of Scattering Transform Features
According to the scattering transform hyperspectral unmixing structure in Figure 1, the input is the spectral vector of a pixel, and the outputs are the abundances of the endmembers. After the pixel-based or 3D-based scattering transform result S(r) is obtained, the scattering coefficients of each pixel at position (i, j) can be used as scattering transform features, so the mixture model can be rewritten with the feature vector S(r_{i,j}) taking the place of the raw spectrum. A regression model is then used to predict the abundances corresponding to the endmembers of each pixel from these feature vectors. For the regression model, k-nearest neighbor (k-NN) is applied to learn a regressor from the scattering transform features of the training samples: k-NN predicts the abundances of a test spectrum based on how closely its features resemble those of the training spectra. As the endmembers are often taken to be the pure pixels in hyperspectral images, the operating principle of the k-NN regressor is illustrated in Figure 4. The inputs are scattering transform features, and the outputs are abundance maps.
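The paper uses the scikit-learn k-NN regressor; the following is a minimal numpy equivalent with uniform weights, where the feature dimensions and training data are synthetic stand-ins. Note that averaging neighbors whose abundance vectors each sum to one yields predictions that also sum to one.

```python
import numpy as np

def knn_regress(feat_train, abund_train, feat_test, k=5):
    """Predict each test pixel's abundances as the mean abundance vector of
    its k nearest training pixels in scattering-feature space (Euclidean)."""
    d = np.linalg.norm(feat_test[:, None, :] - feat_train[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]       # indices of the k nearest neighbours
    return abund_train[idx].mean(axis=1)     # uniform-weight k-NN regression

rng = np.random.default_rng(0)
feat_train = rng.normal(size=(200, 1092))            # stand-in scattering features
abund_train = rng.dirichlet(np.ones(8), size=200)    # training abundances (sum to 1)
feat_test = rng.normal(size=(10, 1092))
abund_pred = knn_regress(feat_train, abund_train, feat_test)   # shape (10, 8)
```

A design consequence of uniform-weight k-NN is that both abundance constraints are preserved automatically, with no projection step needed.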

Experimental Datasets
Both synthetic and real hyperspectral datasets are used for conducting the experiments to verify advantages and efficiency of the proposed scattering transform framework for hyperspectral unmixing.

Synthetic Hyperspectral Dataset
Eight spectra are randomly selected as endmembers from the United States (U.S.) Geological Survey (USGS) library, which is available at [41]. The endmembers are shown in Figure 5. The abundance matrix of the endmembers is generated using the method introduced in [42]; its size is 250 × 250 × 8, and the number of spectral bands is 188. The generated abundances of the eight endmembers are shown in the first column of Figure 12. In order to verify the robustness of the proposed algorithm, Gaussian white noise is added to the synthetic hyperspectral data. Figure 6 shows the 100th band of the original and noisy synthetic data, and Figure 7 shows their spectra at pixel (100, 100). The Gaussian parameters are zero mean with variances of 0.001 and 0.005, respectively. The noisy synthetic dataset with 0.001 variance is called Noise1, and the one with 0.005 variance is called Noise2. It can be seen that Noise2 is more ambiguous than the original data and Noise1, and the spectrum of Noise2 includes a large amount of disturbance.
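Generating the Noise1 and Noise2 variants amounts to adding zero-mean Gaussian noise of the stated variances; a small sketch follows, in which a reduced cube size is used purely as a stand-in for the 250 × 250 × 188 data.

```python
import numpy as np

rng = np.random.default_rng(0)
cube = rng.random((50, 50, 188))        # reduced stand-in for the synthetic cube

# Zero-mean Gaussian white noise; variance 0.001 gives "Noise1", 0.005 gives "Noise2".
noise1 = cube + rng.normal(0.0, np.sqrt(0.001), size=cube.shape)
noise2 = cube + rng.normal(0.0, np.sqrt(0.005), size=cube.shape)
```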

Real-World Hyperspectral Datasets
The proposed approach is applied to three real-world hyperspectral datasets, which are downloaded from [43], and the detailed explanation is provided in [44].
(1) Urban data
The Urban dataset is a very popular hyperspectral dataset for unmixing studies. The images are of 307 × 307 pixels, and the spatial resolution is 2 m. Each pixel includes 162 effective channels with the wavelength ranging from 400 to 2500 nm. There are six endmembers in the dataset, including "Asphalt Road", "Grass", "Tree", "Roof", "Metal", and "Dirt". Figure 8 shows the endmembers of the Urban dataset, and the Urban dataset and the ground truth of abundances of endmembers are shown in Figure 9.
(2) Jasper Ridge data
Jasper Ridge is one of the most widely used hyperspectral unmixing datasets, with each image of size 100 × 100 pixels. Each pixel is recorded at 198 effective channels with the wavelength ranging from 380 to 2500 nm. There are four endmembers latent in this dataset, including "Road", "Dirt", "Water", and "Tree". The Jasper Ridge data and the ground truth abundance of endmembers are shown in Figure 10.
(3) Samson data
In the Samson dataset, each image is of size 95 × 95 pixels and there are 156 channels covering the wavelengths from 401 to 889 nm. There are three target endmembers in the dataset, including "Rock", "Tree", and "Water". Figure 11 shows the Samson data and the corresponding ground truth abundances.

Experimental Setup
All experiments are run on a notebook with an i7-8750H CPU, an NVIDIA Quadro P4200 GPU, and 32 GB RAM under Ubuntu 18.04 Linux. The main software used includes TensorFlow [45], scikit-learn [46], and Keras [47].
In order to evaluate the performance of the proposed algorithm, three comparative methods are considered: ANN [24], LSU [16], and CNN [29]. Because the proposed STFHU is based on a regression model, the selected comparative methods are also regression-based to ensure effective validation. According to the literature [48], the most widely used methods for abundance estimation in hyperspectral unmixing are least squares methods, which belong to linear spectral unmixing (LSU), and artificial neural network methods, which belong to nonlinear spectral unmixing. CNN is the main state-of-the-art deep learning method for unmixing and shows good performance in abundance estimation. In addition, the similar structure of CNN and STFHU helps to compare their performance in hyperspectral unmixing.
For ANN, the hidden layer size of the MLP is set to 100, and the activation function is the "identity", which is useful for testing the bottleneck of linear functions. Stochastic gradient descent is used as the solver, and the learning rate is set to 0.001. For LSU, the fully constrained least squares variant is selected. For CNN, there are four convolution layers and four sub-sampling layers in the network. The sizes of the convolution kernels are set to 1 × 5, 1 × 4, 1 × 5, and 1 × 4, and their numbers of feature maps to 3, 6, 12, and 24, respectively [29]. Max pooling is used for the pooling layers, and 0.01 is adopted as the learning rate of the CNN. The validation split parameter is set to 0.8. The training phase is 500 epochs for the synthetic data and Urban data, and 200 epochs for the other real-world data, because there is no significant performance improvement after 200 epochs. As for STFHU, the scattering transform parameters are set to J = 3 and m = 2. For the k-NN regressor, the number of neighbors is left at the default k = 5.
In order to quantitatively evaluate the performance of the algorithms, the root mean square error (RMSE) and the root mean square of the abundance angle distance (rms-AAD) are used. The RMSE is expressed as

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (a_i − a_ri)^2 ),

where a_i denotes the ground-truth abundance for the ith pixel, a_ri is the predicted abundance of the ith pixel, and N denotes the total number of pixels. The smaller the RMSE, the better the prediction. The abundance angle distance (AAD) measures the similarity between the ground-truth abundance vector and the predicted one. AAD and rms-AAD are formulated as

AAD_i = arccos( (a_i^T a_ri) / (||a_i|| ||a_ri||) ),
rms-AAD = sqrt( (1/N) Σ_{i=1}^{N} AAD_i^2 ).

To make fair comparisons, all methods use the same samples for training and testing, and the proportions of samples used for training and testing are identical.
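Under these definitions, both metrics take only a few lines of numpy; this sketch assumes abundances arranged as an (N pixels × n endmembers) array.

```python
import numpy as np

def rmse(a_true, a_pred):
    """Root mean square error between ground-truth and predicted abundances."""
    return float(np.sqrt(np.mean((a_true - a_pred) ** 2)))

def rms_aad(a_true, a_pred, eps=1e-12):
    """Root mean square of the abundance angle distance over all pixels."""
    num = np.sum(a_true * a_pred, axis=1)
    den = np.linalg.norm(a_true, axis=1) * np.linalg.norm(a_pred, axis=1) + eps
    aad = np.arccos(np.clip(num / den, -1.0, 1.0))   # angle per pixel, in radians
    return float(np.sqrt(np.mean(aad ** 2)))

a = np.array([[0.2, 0.8], [0.5, 0.5]])
b = np.array([[0.3, 0.7], [0.4, 0.6]])
err, ang = rmse(a, b), rms_aad(a, b)   # lower values indicate better predictions
```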

Noise Data Results
In the simulated experiments, in order to verify the robustness of the different methods in the presence of noise, both the Noise1 and Noise2 data are processed using the same trained model coefficients, which are obtained by training on the original, noise-free data. Figure 12 shows the abundance maps of the Noise2 dataset estimated by the proposed algorithm and the comparative methods, with a color bar indicating the scale of the abundance maps. The training ratio is approximately 50%: 31,000 pixels of the synthetic noise-free dataset are used for learning the parameters of the regressor, and the Noise2 dataset with 62,500 pixels is used for testing. The first column shows the ground truth of the different endmembers, and the other columns show the abundance maps estimated by the respective methods. It can be observed that with ANN and CNN, not all endmembers in the abundance maps can be identified clearly, which indicates the high sensitivity of these two methods to Gaussian noise. LSU achieves acceptable abundance maps for some of the endmembers, but the results are still not satisfactory. In comparison, the proposed STFHU approach obtains unmixing results that are generally closer to the ground truth than the other state-of-the-art algorithms, and all estimated abundance maps of the eight endmembers in the noisy hyperspectral data are satisfactory and stable. This is because the features extracted by the scattering transform capture the Gaussian white noise in different spectral bands through both the low-pass and high-pass filters.
In addition, the RMSE values obtained by comparing the abundance maps estimated by each of the aforementioned algorithms with the ground truth are illustrated in Figure 13. It can be seen that the proposed method yields lower RMSE than ANN and CNN. Although LSU yields the minimum RMSE values, it introduces more impulse noise, as shown in the visual results for LSU in Figure 12.
Moreover, the CNN approach shows large fluctuations in the RMSE results of Figure 13, while the proposed STFHU approach achieves stable results for all endmembers. Table 1 compares the RMSE and rms-AAD results of the STFHU approach on the three types of synthetic hyperspectral data: the original, Noise1, and Noise2. The model is trained on the original data at a training ratio of 50%. All of the synthetic data types achieve stable RMSE values across all eight endmembers (EMs). Moreover, the results in the fourth row of Table 1 correspond to the abundance maps in Figure 12, achieving an average RMSE of 0.0894 and an rms-AAD of 0.4655. To present the results more visibly, the eighth endmember is selected as an instance, and its estimated abundance maps for the three datasets are shown and compared with the ground truth in Figure 14. This further shows that our method can obtain good results on hyperspectral datasets with Gaussian white noise.

Results When Using Different Proportions of Samples Used for Training
In order to investigate the influence of the proportion of training samples on hyperspectral unmixing performance, experiments test various training ratios with all other initial conditions remaining identical. Figure 15 illustrates that the proposed approach achieves accurate abundance map estimation when there is no noise added to the original data. As for the noisy images, the RMSE values remain stable as the training ratio decreases, which indicates that the scattering transform based unmixing approach can train on a small proportion (5%) of samples while obtaining results approximating those based on a larger percentage of samples. Thus, the proposed algorithm performs stably and is robust against noise even when the training ratio is small. To further evaluate the proposed method, performance comparisons with two other approaches, CNN and ANN, are shown in Table 2. Table 2. RMSE results of the abundance map estimation of eight endmembers based on the original data for three algorithms at training ratios of 10% and 5%, separately.

Training Ratio Methods EM-1 EM-2 EM-3 EM-4 EM-5 EM-6 EM-7 EM-8 Avg.

To further compare the proposed algorithm with the state-of-the-art CNN for hyperspectral unmixing, the ground truth and the abundance maps estimated by both algorithms using 5% of the non-noisy data for training are shown in Figure 16. Taking Endmember 2, which presents large RMSE differences, as an example, it can be observed that the proposed scattering transform method preserves details of the ground truth and thus produces closer predictions. The CNN algorithm leads to blurring in these abundance map estimates, which means that its robustness to noise and its performance stability under different training proportions are both unsatisfactory.
Table 3 shows the unmixing results of the proposed pixel-based scattering transform method and the comparative methods at multiple training ratios, including 50%, 10%, and 5%. The results with blue background correspond to the proposed algorithm, while the orange, pink, and green backgrounds represent the RMSE values obtained with CNN, LSU, and ANN, respectively. Comparative abundance map estimates of the Asphalt Road and Dirt endmembers using the proposed STFHU method, ANN, LSU, and CNN are shown in Figure 17. From Table 3 and Figure 17, it can be seen that the proposed algorithm achieves a better average RMSE in predicting the abundance maps, a considerable improvement over the other comparative methods. It is difficult for the LSU approach to accurately complete the unmixing on real-world data. ANN does not obtain ideal results either, with large differences from the ground truth. Figure 18 makes these shortcomings visible: most of the pixels that should be blue in the ground truth appear light green, indicating degraded prediction performance and limited adaptiveness to real-world data.
CNN achieves better RMSE results than LSU and ANN, but different network structures can lead to large variations in the experimental results. The pixel-based CNN obtains an average RMSE of 0.0456 for Urban hyperspectral data unmixing, while the pixel-based scattering transform approach proposed in this paper achieves a more accurate average RMSE of 0.0301. Figure 18 illustrates comparisons of the abundance maps estimated by the proposed STFHU and the CNN method, both pixel-based. It is clear that the CNN estimate of the endmember "Metal" differs greatly from the ground truth, while the method proposed in this paper produces results that match the ground truth more closely. Analyzing Figure 18 in detail, the square-shaped object in the top right corner should be predicted as part of the Roof endmember, but CNN estimates it as a mixture of Asphalt Road and Roof. Overall, the proposed method achieves accurate abundance map estimates for all endmembers at a low training ratio, outperforming the CNN results.

Results of Experiments Based on the Jasper Ridge and Samson Datasets
Tables 4 and 5 list the RMSE results of hyperspectral unmixing using the proposed method and the three comparative approaches on the Jasper Ridge and Samson datasets. Due to the small amount of data, a training ratio of 75% is used. The proposed algorithm achieves better RMSE results in all cases compared with CNN, LSU, and ANN. In particular, it obtains average RMSE results of 0.0215 and 0.0150 on the two datasets, respectively, which are much smaller than the other comparative values. These comparisons further demonstrate the advantages of the proposed approach for hyperspectral unmixing with a small amount of training data.

Discussion
The above experimental results show that the proposed pixel-based STFHU method obtains good performance in HSI unmixing. In this section, we discuss the robustness to noise, the effect of limited training samples, the computational complexity of the proposed method, and the preliminary 3D-based STFHU.
(1) Robustness to noise According to the results in Section 3, the method proposed in this paper is more robust to random noise interference in remote sensing images and more adaptable to the environment, which helps to effectively address the problem of spectral variability caused by the variety of endmembers. Figure 19 illustrates the advantages of the scattering transform approach in extracting reliable features from noisy hyperspectral data. Figure 19a,c plot the original spectral data and the Noise2 spectral data, which has zero mean and a variance of 0.005. Figure 19b,d delineate the curves of the scattering transform features extracted from the original data and Noise2, respectively. Comparing Figure 19b,d, the overall information carried by the noisy image and the original image shows excellent consistency, which means that the transformed noisy spectral data reflect the information of the noise-free component of the mixed signal well. Therefore, the scattering transform features effectively reduce the effects of white noise, so that the accuracy of hyperspectral unmixing can be improved.
(2) Effect of limited training samples When the number of training samples is relatively high, it is easier to achieve excellent unmixing results, while in real-world scenarios, training samples can be difficult to obtain, making it meaningful to compare performance with limited numbers of training samples. Table 2 compares the RMSE values of the abundance map estimates of eight endmembers on non-noisy simulated hyperspectral data for three algorithms at training ratios of 10% and 5%. At both training ratios, the average RMSE of the CNN algorithm is higher than those of the other two methods, indicating its high demand for training data. When 5% of the samples are used for training, the performance of CNN is worse than under the 10% condition. Moreover, compared with the ANN method, the algorithm proposed in this paper presents lower average RMSE and rms-AAD results at both training ratios. When the training ratio decreases from 10% to 5%, the average RMSE of STFHU increases by 20%, from 0.0340 to 0.0408, while the average RMSE of ANN increases by 43.47%, from 0.0421 to 0.0604.
In addition, from the results for the 10% and 5% training ratios in Table 3, the proposed method achieves an average RMSE of 0.0790 at a training ratio of 5%, which is better than the CNN average RMSE of 0.0818 at a training ratio of 10%. It also outperforms the 0.1165 of LSU and the 0.1415 of ANN obtained with 50% of the samples for training. These results demonstrate the unmixing ability of the proposed method when the proportion of training samples is small, showing equivalent or better results than the comparative approaches trained on 50% of the samples.
This proves that when there are limited samples for training, the proposed approach shows more apparent advantages in hyperspectral unmixing than other contrastive methods.
(3) Discussion of computational complexity Building a regression model requires training on a considerable number of samples, but the proposed method only needs to compute the feature coefficients, so the framework can reduce the time cost. In this part, we mainly discuss the computational complexity of STFHU, LSU, ANN, and CNN.
From [49], the scattering coefficients are calculated in O(n log n). In [50], the complexity of k-NN is O(nd + kn), where n is the cardinality of the training set and d is the dimension of each sample. As the proposed STFHU is composed of scattering transforms and k-NN regression, the total computational complexity is O(n log n + nd + kn) ≈ O(n log n).
Based on [51], the computational complexity of least squares regression is O(n^3).
The time complexity of the CNN can be computed as O(Σ_{l=1}^{D} m_l^2 · f_l^2 · C_{l−1} · C_l), where m_l is the side length of the feature map, f_l is the side length of the convolution kernel, C_{l−1} is the number of input channels, C_l is the number of output channels, D is the number of convolution layers, and l denotes the lth convolution layer. According to [52], this complexity can be shown to be O(Σ_{l=1}^{D} m_l^2 · f_l^2 · C_{l−1} · C_l) ≈ O(n^5), which is the same as the complexity of the ANN method.
Therefore, the computational complexity of STFHU, O(n log n), is much smaller than that of least squares regression, O(n^3), and those of CNN and ANN, O(n^5). Hence, the proposed method is significantly more computationally efficient than the comparative methods when using the same amount of training samples.
Furthermore, using the CNN requires modification of the network structure and parameters, such as the convolution layers, pooling layers, learning rate, and epochs, to optimize the performance for different training datasets, resulting in time-consuming training phases. In comparison, the proposed algorithm requires fewer parameters, including J and m, and the implementation of the k-NN regressor utilizing default settings achieves satisfactory performance, whilst different settings do not lead to significant changes in the performance. It means that our STFHU is less dependent on parameter choices. Thus, the method proposed in this paper shows advantages in simplifying the network structure and increasing the efficiency of computation.
(4) Discussion of the preliminary 3D-based STFHU results 3D-based approaches have been validated as useful for HSI processing and are preliminarily taken into account in this paper. Table 6 shows the average RMSE (Avg-RMSE) and rms-AAD results of unmixing using both the pixel-based and 3D-based scattering transform approaches with small amounts of Urban image samples for training. For the same method, the average RMSE and rms-AAD of the unmixing results grow as the training ratio decreases. To address the problems introduced by using a small proportion of training samples, the 3D-based scattering transform is proposed, especially for real-world hyperspectral unmixing. Considering the effects of the environment on a single pixel's spectral curve, the 3D spatial information can enrich the information contained in the curve, making it practical for real-world applications. In Table 6, the average RMSE of the 3D-based scattering transform at a training ratio of 0.5% is 0.1584, while the value for the pixel-based approach is 0.2148, which is even larger than 0.1912, the result of the 3D-based method at a 0.3% training ratio. Likewise, the 3D-based approach at a 0.5% training ratio achieves an rms-AAD equivalent to that of the pixel-based method at a 2% training ratio. These results verify that 3D spatial information provides effective cues at low training ratios.
In future research, this type of information will be further combined with the scattering transform features to develop new algorithms that further improve the performance of hyperspectral unmixing.

Conclusions
In this paper, a novel scattering transform framework is proposed to improve the accuracy of hyperspectral unmixing. The STFHU method possesses a multilayer feature extraction structure, which is similar to the structure of a CNN and can increase the accuracy of describing the desired features. The pixel-based and 3D-based scattering transforms are well suited to hyperspectral unmixing since they not only retain sufficient information in the extracted features but also suppress the interference of noise. In particular, this approach is robust to Gaussian white noise and to small amounts of training samples. These scattering transform techniques are combined with the k-NN regressor to form an end-to-end framework of feature extraction and abundance estimation. Compared with CNN, the proposed solution has a clear structure and few parameters, leading to better viability and computational efficiency. Experimental results based on simulated data and three real-world hyperspectral remote sensing image datasets provide ample evidence of the robustness and adaptiveness of the proposed approach. Under the interference of Gaussian white noise with a variance of 0.005, the scattering transform features help to achieve abundance map predictions closer to the ground truth, and thus better unmixing results than the other comparative approaches. Moreover, based on identical data for training and testing, the proposed algorithm attains the lowest RMSE and rms-AAD among all investigated methods, and it can also accurately complete the unmixing task when using only 5% of the samples for training. In addition, this paper presents preliminary verification that the performance of the 3D-based scattering transform with a small number of samples is better than that of the pixel-based approach.
Although the proposed scattering transform framework can obtain desirable performance for hyperspectral unmixing, there is still a need to further improve the utilization of spatial correlation.
For the preliminary 3D-based framework in this paper, the spatial information is computed first, and then mainly the spectral information is used for the scattering transform. In future research, we plan to investigate a fully joint spectral-spatial scattering transform framework for hyperspectral image unmixing.