Hyperspectral Image Spectral–Spatial Classification Method Based on Deep Adaptive Feature Fusion

Convolutional neural networks (CNNs) have been widely used in hyperspectral image (HSI) classification. Many algorithms focus on the deep extraction of a single kind of feature to improve classification. There have been few studies on the deep extraction of two or more kinds of fusion features and the combination of spatial and spectral features for classification. The authors of this paper propose an HSI spectral–spatial classification method based on deep adaptive feature fusion (SSDF). This method first implements the deep adaptive fusion of two hyperspectral features, and then it performs spectral–spatial classification on the fused features. In SSDF, a U-shaped deep network model with the principal component features as the model input and the edge features as the model label is designed to adaptively fuse two kinds of different features. One comprises the edge features of the HSIs extracted by the guided filter, and the other comprises the principal component features obtained by dimensionality reduction of HSIs using principal component analysis. The fused new features are input into a multi-scale and multi-level feature extraction model for further extraction of deep features, which are then combined with the spectral features extracted by the long short-term memory (LSTM) model for classification. The experimental results on three datasets demonstrated that the performance of the proposed SSDF was superior to several state-of-the-art methods. Additionally, SSDF was found to be able to perform best as the number of training samples decreased sharply, and it could also obtain a high classification accuracy for categories with few samples.


Introduction
A hyperspectral sensor is a spectrometer that simultaneously images a specific area in tens or hundreds of consecutive bands to obtain a hyperspectral image (HSI). Compared with multispectral images, HSIs cover a wide range of bands and have a higher spectral resolution. Because hyperspectral imaging involves many bands, HSIs carry rich spectral information [1], which is conducive to resource exploration [2] and environmental monitoring [3]. However, the high dimensionality of the data gives rise to the curse of dimensionality in HSI processing. In the classification of hyperspectral data, many bands are redundant and contribute little to the classification result, so they degrade both the results and the efficiency of processing. Feature selection and feature extraction were introduced to address this. For example, principal component analysis (PCA) [4,5] and independent component analysis (ICA) [6,7] are typical methods that transform high-dimensional data into low-dimensional data. Among traditional HSI classification methods, the support vector machine (SVM) [8,9], random forest [10], and other methods have been considered efficient algorithms. A further problem is that different spectra in an HSI may belong to the same category and similar spectra may belong to different categories, so it is difficult to obtain a high classification accuracy by considering only spectral information. In recent years, the question of how to make full use of spatial features has attracted attention in the field of HSI classification.
Kang et al. [11] used the first principal component or the first three principal components of an HSI as a gray or color guide image to perform edge-preserving filtering on the probability map obtained by the classifier, and then they selected the class with the largest probability at each pixel to achieve classification. Additionally, adding texture features can increase the classification accuracy of an HSI [12]. In recent years, deep networks have gained widespread attention. A stacked autoencoder (SAE) [13,14], as one of the typical deep learning models, can extract and classify features by encoding and decoding the input vectors. Deep belief networks (DBNs) [15] and convolutional neural networks (CNNs) [16][17][18] have been proposed for spectral-spatial HSI classification. Furthermore, to obtain deep-level features, Zhao et al. [19] used dimensionality reduction methods and 2DCNN models to extract spectral and spatial features. Using the neighborhood block as the input of the network, a 3DCNN [20,21] was used to directly extract spectral and spatial features from an original HSI to make full use of its spectral-spatial features and improve the classification results. Mou et al. [22] proposed using time-series networks such as the RNN (recurrent neural network), LSTM (long short-term memory), and GRU (gated recurrent unit) for HSI classification, but the method only extracted spectral features, which limited its classification accuracy. Xu et al. [23] proposed a multi-scale CNN model. This model first performed PCA on HSIs to extract three principal components as the input of the network, which combined the characteristics of each pooling layer with the spectral characteristics to classify HSIs. Because only three principal components were extracted from the hyperspectral data as input features, most HSI information was lost, so the method could not achieve excellent classification results. Zhong et al. [24] designed a spectral-spatial residual network (SSRN) for HSI classification, where the input data was a three-dimensional cube and the network used spectral and spatial residual blocks to learn discriminative features from the rich spectral and spatial features in the original HSI. However, this method only used residual alternating learning to obtain fusion features, and the feature fusion was not sufficient, so the extraction of spatial features was not good enough. Mu et al. [25] proposed a multi-scale and multi-level spectral-spatial feature fusion network (MSSN), where neighborhood blocks of different scales were used as the input of the network. The spectral features extracted by the 3D convolutional neural network and the spatial features extracted by the 2D convolutional neural network were combined in the form of 3D-2D alternating residual blocks and a self-mapping method. Song et al. [26] designed a deep feature fusion network (DFFN) for HSI classification by introducing residual learning and simultaneously adding the outputs of different levels of the network to further improve the classification results. The fusion, however, was only an addition of the output features at different levels, which was too simple and resulted in insufficient fusion. Guo et al. [27] proposed an efficient deep feature extraction and HSI classification method based on multi-scale spatial features and a cross-domain convolutional neural network (MSCNN) that could make full use of the multi-scale spatial features obtained by the guided filter. The cross-domain convolutional neural network was used to reorder the multi-scale spatial features, which were then input into a simple convolutional neural network model for classification. This method only performed a recombination operation on the edge features, and it did not introduce other features. The network model only extracted features in a simple way and did not make full use of the features of the HSI.
To solve the above problems and to adaptively fuse two different features in a deep network, we propose a spectral-spatial HSI classification method based on deep adaptive feature fusion (SSDF). In this paper, a U-shaped network structure was used to enhance the fusion of deep features. The edge features and the principal component features of the HSIs were fused adaptively to obtain new features. The new features were input into a multi-scale and multi-level feature extraction (MMFE) model, and the output features were then combined with the spectral features for classification.
The contributions of this work are as follows: (1) The authors of this paper propose a U-shaped deep network, consisting of convolutional layers, pooling layers, and deconvolution layers, that can adaptively fuse two different features. The U-shaped network model was constructed to make sure that the two different features and the new feature after fusion have the same size. The labels of the U-shaped network are not the true labels of the image that are used in most of the literature; instead, they are the edge feature maps obtained by guided filtering. The inputs of the U-shaped network are hyperspectral principal component feature maps. The U-shaped network is trained to learn the correlation and complementarity of the two different features, adaptively fusing them and generating new feature maps. The new feature maps alleviate the problem of low classification accuracy caused by using a single kind of feature.
(2) The authors of this paper designed an MMFE model that takes the feature map of each pooling layer, applies a convolution operation to it, and finally inputs the convolved features to the global average pooling layer to extract the main information. The extraction of multi-level and multi-scale features can deeply extract the edge and abstract features of the image, which is beneficial to the final classification. The proposed deep adaptive feature fusion and spectral-spatial classification network uses advanced and different kinds of features as the input of the classification network, which can realize the multi-scale and multi-level fusion of multiple features, thus resulting in higher classification accuracy.

Materials and Methods
Here, we introduce SSDF. Sections 2.1-2.3 introduce the methods for feature extraction and fusion, and Sections 2.4-2.6 introduce the methods for the classification of the extracted features. Assume there is a hyperspectral dataset $X = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{1 \times 1 \times b}$, where $N$ is the number of labeled pixels and $b$ is the number of spectral bands. $Y = \{y_1, y_2, \ldots, y_N\} \in \mathbb{R}^{1 \times 1 \times L}$ represents the corresponding one-hot label vector set, where $L$ is the number of object categories. We partition all available data into three sets, the training, validation, and test sets, which are denoted by $Z_1$, $Z_2$, and $Z_3$, respectively. Their corresponding one-hot label vector sets are $Y_1$, $Y_2$, and $Y_3$. First, the SSDF network uses $Z_1$ and $Y_1$ to update the network parameters. Then, $Z_2$ and $Y_2$ are used to monitor the temporary model generated by the network. Finally, $Z_3$ and $Y_3$ are used to evaluate the performance of the optimal trained model.

Guided Filter for Edge Feature Extraction
Guided filtering [28] is an edge-preserving filter with excellent performance: the output image retains the characteristics of the filtered image while taking on the edge information of the guided image. The method relies on a local linear relationship between the output image of the guided filtering and the guided image. Assuming that the guided image is $I$ and the input image is $s$, the filtered output image $c$ is obtained by a local linear model as follows:

$$c_i = a_k I_i + d_k, \quad \forall i \in \omega_k \tag{1}$$

where $\omega_k$ is a square window centered at pixel $k$ with side length $(2r + 1)$, and $a_k$ and $d_k$ are the coefficients of the linear model to be estimated. To estimate $a_k$ and $d_k$, a cost function is established according to the difference between the input image $s$ and the output image $c$:

$$E(a_k, d_k) = \sum_{i \in \omega_k} \left( (a_k I_i + d_k - s_i)^2 + \varepsilon a_k^2 \right) \tag{2}$$

where $\varepsilon$ is a regularization parameter that prevents $a_k$ from becoming too large. Finally, the ridge regression technique [29] is used for parameter estimation. By minimizing the cost function (Equation (2)), the coefficients $a_k$ and $d_k$ can be solved as follows:

$$a_k = \frac{\frac{1}{|\omega|} \sum_{i \in \omega_k} I_i s_i - \mu_k \bar{s}_k}{\sigma_k^2 + \varepsilon}, \qquad d_k = \bar{s}_k - a_k \mu_k$$

where $\mu_k$ and $\sigma_k^2$ are the mean and the variance of the guided image $I$ in the window, respectively; $|\omega|$ is the total number of pixels in the window; and $\bar{s}_k$ is the mean of the input image $s$ in the window. The guided filtered output image can be calculated after the $a_k$ and $d_k$ coefficients are obtained.
It can be seen from Equation (1) that the output image and the guided image have a linear relationship in the window, that is, $\Delta c_i = a_k \Delta I_i$. Therefore, when the guided image $I$ contains edge information, the output image $c$ retains the edge information at the corresponding position, and $c$ is a feature map carrying the edge features of the HSI.
The authors of this paper used guided filtering technology to extract the edge features of the image, as shown in Figure 1.
Remote Sens. 2021, 13, x FOR PEER REVIEW 4 of 20
In Figure 1, it can be seen that the minimum noise fraction rotation (MNF) [30] is used to denoise the input image first, and then PCA is used to extract the first few principal components of the denoised image as the input images $PC_1$ to $PC_e$ for guided filtering. The guided image is the first independent component feature map $IC_1$ of the HSI extracted by ICA. Taking $PC_1$ to $PC_e$ as input images and using $IC_1$ and three different windows (2, 4, and 6) to perform guided filtering operations, we obtain $3e$ filtering feature vectors at various scales, and we stack all of them to form a multi-scale guided filtering feature set, namely the edge feature image set.
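The guided filtering step and the multi-scale stacking shown in Figure 1 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the window values 2, 4, and 6 are assumed to be radii $r$, the default `eps` is an arbitrary choice, and the final averaging of $a_k$ and $d_k$ over overlapping windows follows the common formulation of the guided filter rather than anything stated in the text.

```python
import numpy as np

def box_mean(img, r):
    """Mean of img over a (2r+1) x (2r+1) window, with edge padding at borders."""
    k = 2 * r + 1
    h, w = img.shape
    padded = np.pad(img, r, mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.zeros((h + 2 * r + 1, w + 2 * r + 1))
    ii[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    window_sum = ii[k:k + h, k:k + w] - ii[:h, k:k + w] - ii[k:k + h, :w] + ii[:h, :w]
    return window_sum / (k * k)

def guided_filter(guide, src, r, eps):
    """Guided filter: guide is I, src is s, the return value is the output c."""
    mean_I = box_mean(guide, r)
    mean_s = box_mean(src, r)
    cov_Is = box_mean(guide * src, r) - mean_I * mean_s
    var_I = box_mean(guide * guide, r) - mean_I ** 2
    a = cov_Is / (var_I + eps)   # slope a_k of the local linear model
    d = mean_s - a * mean_I      # offset d_k
    # Assumption: average a_k and d_k over all windows that cover each pixel.
    return box_mean(a, r) * guide + box_mean(d, r)

def multiscale_edge_features(pcs, guide, radii=(2, 4, 6), eps=1e-3):
    """Filter each of the e input maps at every radius; returns (len(radii) * e, H, W)."""
    return np.stack([guided_filter(guide, pc, r, eps) for r in radii for pc in pcs])
```

With `pcs` holding the $e$ principal component maps and `guide` holding $IC_1$, `multiscale_edge_features(pcs, guide)` returns the $3e$ stacked maps described above.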

Principal Component Feature Extraction
Because the high-dimensional characteristics of HSIs bring problems such as computational complexity and information redundancy, PCA is used to reduce the dimensionality of the spectral information of HSIs and to extract the first $e$ principal component features. The spectral matrix $X_s$ of the HSI is obtained according to the spectral information of the samples as follows:

$$X_s = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

where $n$ denotes the number of all pixels in the HSI, $p$ denotes the length of the spectral information of the samples, $X_s$ represents the spectral matrix of the HSI with $n$ samples, and each row of $X_s$ represents a spectrum sample with length $p$. We calculate the average value $\bar{x}_i$ of the $i$-th dimensional spectral information of the samples by the following formula:

$$\bar{x}_i = \frac{1}{n} \sum_{k=1}^{n} x_{ki}$$

Further calculations yield the covariance matrix $S$ of the spectral matrix $X_s$:

$$S = (s_{ij})_{p \times p}$$

The component at the $i$-th row and the $j$-th column of the covariance matrix $S$ is:

$$s_{ij} = \frac{1}{n - 1} \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$$

where $x_{kj}$ represents the $j$-th dimensional spectral value of the $k$-th sample, $\bar{x}_j$ represents the average value of the $j$-th dimensional spectral values of all the samples, and $1 \le k \le n$.
Then, the covariance matrix $S$ is diagonalized, and its eigenvectors are orthonormalized. The normalized eigenvectors are arranged as columns in descending order of their corresponding eigenvalues to obtain a feature matrix $X_z$. The spectral feature matrix is then $X_a = X_s X_z$, where the first $c$ columns of $X_a$ are the first $c$ principal component features of the HSI.
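The PCA procedure above can be sketched with NumPy as follows. This is a sketch under stated assumptions rather than the authors' code: the function name `pca_features` is hypothetical, the covariance uses the $1/(n-1)$ normalization, and the data are mean-centered before projection, which the text does not state explicitly.

```python
import numpy as np

def pca_features(X_s, c):
    """X_s: (n, p) spectral matrix. Returns the first c principal component
    features, shape (n, c), and all eigenvalues in descending order."""
    centered = X_s - X_s.mean(axis=0)                 # subtract the per-band mean
    S = centered.T @ centered / (X_s.shape[0] - 1)    # covariance matrix, (p, p)
    eigvals, eigvecs = np.linalg.eigh(S)              # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                 # sort from large to small
    X_z = eigvecs[:, order]                           # orthonormal eigenvectors as columns
    X_a = centered @ X_z                              # project (assumes centered data)
    return X_a[:, :c], eigvals[order]
```

For an HSI cube of shape `(H, W, p)`, `X_s` would be `cube.reshape(-1, p)`, and each returned column can be reshaped back to `(H, W)` to give one principal component feature map.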
Thus far, we used guided filtering to obtain the features that contain the main edge information of the HSI, and we adopted PCA to reduce the dimensions to obtain the features that contain the principal components of an HSI.

Adaptive Feature Fusion
During the linear transformation of guided filtering, due to the difference in the radius of the sliding window, a part of the image information is lost and the image information is not fully utilized. In contrast, the principal component features of an HSI are the first few principal components obtained by the PCA dimensionality reduction of the whole image, which can compensate for the information loss caused by the different sliding window radii in guided filtering. Therefore, to make more comprehensive use of HSI information, the authors of this paper adaptively fused these two different HSI features so that the HSI information could be fully utilized. As a deep autonomous learning model, deep learning can make a network adaptively learn the correlation and difference between model inputs and model labels through training, thereby generating fusion features that contain both edge features and principal component features. Thus, the authors of this paper designed a U-shaped deep fusion network model with the principal component features as the model input and the edge features as the model label. The final output of the model comprises the fusion features that contain edge features and principal component features. The U-shaped network structure is shown in Figure 2.
In Figure 2, a, 2a, 4a, and 8a represent the size of the feature map; 1 × 1, 3 × 3, 5 × 5, and 7 × 7 represent the size of the convolution kernel; and 1, 64, and 128 represent the dimensions of the image. The fusion model consists of three parts: a convolution layer, a pooling layer (downsampling layer), and a deconvolution layer. This U-shaped structure, using the redefined input data and label data, enhances the fusion of the two features. As a feature extractor, the convolution layer can convert the input image into multi-scale features, which makes the features more abstract. However, the purpose of the designed network is to produce fused features with the same size as the input features. Therefore, a deconvolution layer is placed after the convolutional layer to generate dense and enlarged feature maps.
Assume that the input data of the U-shaped model is $x$, and the output of the $i$-th convolutional layer is represented as:

$$F_i(x) = \theta_i\left(w_i * F_{i-1}(x) + b_i\right)$$

where $\theta_i(\cdot)$ is the activation function of the $i$-th layer and $w_i$ and $b_i$ represent the filters and bias vectors of the $i$-th layer, respectively. According to the description of the proposed U-shaped architecture, the estimation of the network parameters $\Theta = \{w_i, b_i \mid i \in (1, 2, \ldots, M)\}$ can be obtained by minimizing the loss between the fusion features and the label features, where $M$ is the number of layers of the model. The loss function is expressed by the mean square error as follows:

$$L(\Theta) = \frac{1}{H} \sum_{h=1}^{H} \left\| F(x_h, \Theta) - x_h^{GF} \right\|^2$$

where $F(x_h, \Theta)$ represents the output of the network, $x_h^{GF}$ represents the label feature map, and $H$ represents the number of pixels on each feature map. The authors of this paper made use of multiple feature maps for feature fusion, so after each training run produces a fused feature map, the network parameters are reinitialized and the network is retrained on the next feature map until all feature maps have been processed. All the obtained fusion feature maps are stacked in the spectral dimension to obtain the adaptive fusion features with the same dimensions and sizes as the original two features. These features include the edge and principal component features of the HSI, and the fusion features obtained by the adaptive method are more beneficial to HSI classification.
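The per-map training protocol described above (train on one principal component map against its edge-feature label, reset the parameters, repeat, then stack) can be sketched as follows. The U-shaped network is deliberately replaced here by a trivial stand-in, a per-map affine model fitted by gradient descent on the mean square error, so the sketch shows only the loop structure and the loss, not the authors' architecture. All function names and hyperparameters are hypothetical, and the sketch assumes one edge-feature label map per input map.

```python
import numpy as np

def mse_loss(pred, label):
    """Mean square error over the H pixels of one feature map."""
    return float(np.mean((pred - label) ** 2))

def train_stand_in(pc_map, edge_map, steps=2000, lr=0.2):
    """Stand-in for the U-shaped network: fits w * x + b by gradient descent on the MSE."""
    w, b = 0.0, 0.0                      # parameters are re-initialized for every map
    for _ in range(steps):
        err = w * pc_map + b - edge_map
        w -= lr * 2.0 * np.mean(err * pc_map)
        b -= lr * 2.0 * np.mean(err)
    return w * pc_map + b                # output has the same size as the input map

def fuse_feature_maps(pc_maps, edge_maps):
    """Train on one (input, label) pair at a time, then stack along the spectral axis."""
    fused = [train_stand_in(x, y) for x, y in zip(pc_maps, edge_maps)]
    return np.stack(fused, axis=0)
```

Because every map gets a fresh set of parameters, the fused maps are independent of the order in which the maps are processed.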

Multi-Scale and Multi-Level Feature Extraction
The method shown in Figure 2 merely fuses the two kinds of features adaptively. To achieve better classification results, we needed to design a deep-level feature extraction and classification model. In this paper, an MMFE model was designed for classification. As the number of convolutional layers increases, the spatial size of the feature map decreases sharply, leading to some loss of image information. In a traditional CNN architecture, the fully connected layer is usually directly connected to the output of the last convolutional layer. In this case, the network pays more attention to the deep features and ignores the shallow features. The authors of this paper propose combining shallow convolution features with deep ones in classification. A diagram of the MMFE is given in Figure 3.
To make full use of the features of different levels, the proposed network adds a 2D convolution layer (conv2d) after each pooling layer. The first purpose of this is to extract multi-level features by adding convolution layers at different levels. The second is that the size of the feature map can be changed by using a convolution layer so that the feature maps of different levels have the same size after passing through the convolution layer. Let $C_i = f(w_i x_i + b_i)$ denote the $i$-th feature map obtained by introducing the convolutional layer after the pooling layer, where $f$ is the activation function, $x_i$ is the feature map after the pooling layer, and $w_i$ and $b_i$ are the corresponding weight matrices and bias terms, respectively. The multi-level feature maps output by the convolutions are input to the ADD layer to perform the addition $C_5 = \sum_{i=1}^{4} C_i$, and the combined $C_5$ is then input to the global average pooling (GAP) layer, which reduces it to a one-dimensional tensor; GAP extracts the main information of the feature map and reduces the number of parameters at the same time. The output of GAP finally passes through the fully connected layer (FC). This model extracts the deep features of the new features after fusion, and the obtained features also have multiple levels.
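The ADD and GAP steps can be illustrated with a few lines of NumPy. This is only a sketch: the function name is hypothetical, and it assumes the per-level convolutions have already produced four maps $C_1$ to $C_4$ with identical shape `(H, W, K)`; the convolutions themselves and the fully connected layer are omitted.

```python
import numpy as np

def add_and_global_average_pool(level_maps):
    """level_maps: sequence of C_i arrays, each (H, W, K) with identical shape."""
    combined = np.sum(np.stack(level_maps, axis=0), axis=0)  # ADD layer: C_5 = sum of C_i
    return combined.mean(axis=(0, 1))                        # GAP: one value per channel, (K,)
```

GAP has no trainable parameters, so the vector passed to the FC layer has length `K` regardless of the spatial size of the feature maps.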

Spectral Feature Extraction
The authors of this paper used the LSTM model to extract the spectral features of the original HSI. The core module in the LSTM model is the storage unit, as shown in Figure 4. The storage unit consists of four elements, i.e., the input gate ($i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$), the forget gate ($f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$), the output gate ($o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$), and the cell state ($c_t = f_t \otimes c_{t-1} + i_t \otimes \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$). The input gate determines how much new information is added to the cell state. The output of the output gate is based on the cell state, but it is also a filtered version. The forget gate determines what information is discarded from the cell state. The output of the LSTM is $h_t = o_t \otimes \tanh(c_t)$. In the above formulas, $W_{xi}, W_{hi}, W_{xf}, W_{hf}, W_{xo}, W_{ho}, W_{xc}, W_{hc}$ are the weight matrices and $b_i, b_f, b_o, b_c$ are the bias vectors; $\tanh$ is the hyperbolic tangent; $\sigma(x) = 1/(1 + \exp(-x))$ is the sigmoid function; and $\otimes$ is the element-wise product. $x$ is each band of the HSI input into the LSTM model, and $h$ is the corresponding one-dimensional resultant vector. All the bands of a pixel of the HSI are sequentially input into the LSTM model, and then a one-dimensional vector is output. Thus, the one-dimensional vector is a feature vector with the spectral characteristics of the HSI.
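The gate equations above can be traced through a forward pass over the bands of one pixel. The following NumPy sketch is illustrative only: the parameter names mirror the symbols in the text, while `init_lstm_params`, the hidden size, the random initialization, and the choice of feeding one scalar band value per time step are assumptions, not details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm_params(hidden, rng, scale=0.1):
    """Hypothetical initializer: input weights (hidden,), recurrent weights (hidden, hidden)."""
    params = {}
    for gate in "ifoc":
        params["W_x" + gate] = rng.normal(scale=scale, size=hidden)
        params["W_h" + gate] = rng.normal(scale=scale, size=(hidden, hidden))
        params["b_" + gate] = np.zeros(hidden)
    return params

def lstm_spectral_feature(spectrum, params):
    """Feed the band values of one pixel through the cell; return the final h."""
    hidden = params["b_i"].shape[0]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x_t in spectrum:  # assumption: one scalar band value per time step
        pre = {g: params["W_x" + g] * x_t + params["W_h" + g] @ h + params["b_" + g]
               for g in "ifoc"}
        i_t, f_t, o_t = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
        c = f_t * c + i_t * np.tanh(pre["c"])  # cell state update
        h = o_t * np.tanh(c)                   # LSTM output at this step
    return h
```

The returned vector plays the role of the one-dimensional spectral feature vector; in a trained model the parameters would be learned jointly with the classifier rather than drawn at random.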

SSDF Model
Sections 2.1-2.4 introduced the adaptive feature fusion of edge features and principal component features, as well as the MMFE model for classification. These methods focus on processing the spatial context of pixels without considering the correlation between the pixels and different bands. Though deep feature extraction and fusion are performed on HSIs, most of the band information is ignored during feature preprocessing, and spectral features are not fully utilized. Therefore, we further introduced the LSTM model (Section 2.5) to extract the spectral band information of the HSI, which is combined with the spatial features obtained by the MMFE model to perform spectral-spatial classification and form a complete SSDF method. As shown in Figure 5, SSDF designs a deep adaptive fusion model to fully fuse the principal component features extracted by PCA and the edge features extracted by guided filtering, and the fused features are further extracted by MMFE and then combined with the spectral features extracted by LSTM for the final classification, which improves the classification accuracy. The Pavia University hyperspectral dataset [31] was input into the network of the proposed SSDF as an example, which is shown in Figure 5.
classification, which improves the classification accuracy.The Pavia University hyperspectral dataset [31] was input into the network of the proposed SSDF as an example, which is shown in Figure 5.The network in the upper half of Figure 5 consists of two parts: one is the feature fusion network and the other is the MMFE model.The two networks are independently trained without affecting each other.In the feature fusion network, the first c principal components of the HSI are used as the input of the fusion network, and the hyperspectral edge feature map obtained by the guided filtering is used as the labels of the fusion network.During network training, only one principal component feature map is input at a time, corresponding to the label that is also an edge feature map.The feature map output by the last layer of the network after training is a fusion feature map that combines two hyperspectral features.A total of c feature maps need to be fused, so each time a feature map is re-input, the network parameters are initialized to ensure training consistency.After all the c feature maps are trained, the obtained c feature maps are stacked to get the final fusion feature.Then, the obtained fused features are input into the MMFE network to further extract multi-scale and multi-level features.The MMFE network outputs a onedimensional spatial feature vector ˆa y by using convolution layers and pooling opera- tions alternately, followed by a fully connected layer.
The input of the lower half of Figure 5 is the original HSI. For this example, all the bands of each pixel of the HSI were input into the LSTM model to obtain a one-dimensional feature vector for the pixel. This feature vector was then input into a fully connected layer to further extract integrated features, yielding the spectral feature vector $\hat{y}_b$.
In the SSDF model, the spatial feature vector $\hat{y}_a$, the spectral feature vector $\hat{y}_b$, and the classifier training are integrated into a unified network. To complete the unified spectral–spatial classification using the feature stacking method, the feature vector $\hat{y}_a$ obtained in the MMFE is concatenated with the feature vector $\hat{y}_b$ obtained in the LSTM to form a new feature vector $\hat{y}$, which then passes through a fully connected layer and a SoftMax layer. The loss function of SSDF is defined in Equation (12).
where $y_i$ represents the label, $\hat{y}_i$ represents the corresponding predicted label of the $i$-th training sample, and $m$ represents the size of the training set. As the classification network is trained, all parameters are optimized simultaneously by mini-batch stochastic gradient descent. Finally, the SoftMax layer generates the prediction vector set $\hat{y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N\}$.
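The feature stacking and classification step can be illustrated with a short sketch. It rests on assumptions: Equation (12) is not reproduced in this section, so the mean cross-entropy used below is an assumed, commonly used form rather than the paper's exact loss, and the function names, weights, and feature values are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of scores."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(y_a, y_b, weights, bias):
    """Concatenate spatial and spectral vectors, apply one dense layer, then softmax."""
    y = y_a + y_b                                   # feature stacking (list concatenation)
    scores = [sum(w * v for w, v in zip(row, y)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)


def cross_entropy(probs_batch, labels):
    """Mean negative log-likelihood over m samples (assumed form of the loss)."""
    m = len(labels)
    return -sum(math.log(p[t]) for p, t in zip(probs_batch, labels)) / m


y_a = [0.2, 0.7]          # hypothetical spatial feature vector (MMFE branch)
y_b = [0.5]               # hypothetical spectral feature vector (LSTM branch)
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy dense layer
bias = [0.0, 0.0, 0.0]
probs = classify(y_a, y_b, weights, bias)
loss = cross_entropy([probs], [1])
```

Because both branches feed a single loss, gradients from the classifier reach the MMFE and LSTM parameters together, which is what joint optimization of the unified network means here.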

Experimental Datasets
We used three real hyperspectral datasets [31,32] to test the performance of SSDF, including the Indian Pines, Pavia University, and Salinas scene datasets.
(1) Indian Pines dataset: This was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over a remote sensing test site in northwestern Indiana, USA. This dataset contained 220 bands, and 20 noise bands were removed before the experiments. Each band was of size 145 × 145 and had a wavelength ranging from 0.4 to 2.5 µm. It contained 16 ground-truth classes and a total of 10,249 samples with a spatial resolution of 20 m per pixel. Figure 6a,b shows its false-color image and the corresponding ground-truth map.
(2) Pavia University dataset: This was captured by the Reflective Optics System Imaging Spectrometer (ROSIS) over Pavia University in northern Italy. This dataset contained 115 bands, and 12 noise bands were removed before the experiments. Each band was of size 610 × 340 and had a wavelength ranging from 0.43 to 0.86 µm. It contained nine ground-truth classes and a total of 42,776 samples with a spatial resolution of 1.3 m per pixel. Figure 7a,b shows its false-color image and the corresponding ground-truth map.
(3) Salinas scene dataset: This was collected by the AVIRIS sensor over Salinas Valley, California. This dataset contained 224 bands, and 20 water absorption bands were removed before the experiments. Each band was of size 512 × 217. It contained 16 ground-truth classes and a total of 54,129 samples with a spatial resolution of 3.7 m per pixel. Figure 8a,b shows its false-color image and the corresponding ground-truth map.
Table 1 lists the categories in the three datasets and the number of samples in each.
Figures 6–8: (a) false-color image; (b) ground-truth map.
Table 1 (fragment): Asphalt 6631; Corn 237; Trees 3064; Grass-T 730.
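As a quick check on the figures quoted above, the number of bands retained for each dataset and the share of pixels that carry a ground-truth label can be computed directly. The sketch assumes that the "samples" counted in the text are labeled pixels; the dictionary layout and function name are illustrative only.

```python
# (total bands, bands removed, height, width, labeled samples) as stated in the text
DATASETS = {
    "Indian Pines": (220, 20, 145, 145, 10249),
    "Pavia University": (115, 12, 610, 340, 42776),
    "Salinas": (224, 20, 512, 217, 54129),
}


def summarize(total, removed, height, width, labeled):
    """Return the retained band count and the fraction of pixels that are labeled."""
    retained = total - removed
    labeled_fraction = labeled / (height * width)
    return retained, labeled_fraction


summary = {name: summarize(*spec) for name, spec in DATASETS.items()}
for name, (bands, frac) in summary.items():
    print(f"{name}: {bands} bands retained, {frac:.1%} of pixels labeled")
```

The retained band counts (200, 103, and 204) give the length of the per-pixel spectral sequence fed to the LSTM branch.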

Experimental Setup
We divided all samples of each dataset into training, validation, and test sets. We used the training set to update the parameters, used the validation set to monitor the intermediate models produced during training, and kept the model with the highest validation accuracy. Finally, we used the test set to evaluate the classification performance of the retained model. For the Indian Pines dataset, 10%, 10%, and 80% of the samples were randomly selected from each class as the training, validation, and test sets, respectively. For the Pavia University and Salinas scene datasets, 5%, 5%, and 90% of the samples were randomly selected from each class as the training, validation, and test sets, respectively.
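The per-class random split can be sketched as below. This is an illustration rather than the authors' code: the function name, the rounding rule, and the guarantee of at least one training and one validation sample per class are assumptions, and the labels are toy values.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac, val_frac, seed=0):
    """Split sample indices per class into train/validation/test index lists."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, cls in enumerate(labels):
        by_class[cls].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Rounding rule is an assumption; at least one sample goes to train and to val.
        n_train = max(1, round(train_frac * len(indices)))
        n_val = max(1, round(val_frac * len(indices)))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]      # everything left over is test data
    return train, val, test


labels = [0] * 100 + [1] * 40 + [2] * 20       # toy labels, not real data
train, val, test = stratified_split(labels, 0.10, 0.10)
```

Sampling within each class keeps small categories represented in the training set, which matters for datasets such as Indian Pines where some classes have very few samples.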
Overall accuracy (OA), average accuracy (AA), and the kappa coefficient were adopted as the evaluation indicators [33] to assess the classification performance of each method. After experimental tuning, the number of training epochs for the fusion network was set to 2000, and the learning rate was set to 0.003. The number of training epochs for the MMFE and LSTM networks was set to 400, and the batch size was set to 128. The pooling layers used max pooling. The output layer used the SoftMax activation function, and all other layers used the rectified linear unit (ReLU) activation function. For the Indian Pines, Pavia University, and Salinas scene datasets, the numbers of guided filtering input images were 10, 5, and 5, respectively, and the guided filtering radii were 2, 4, and 6, respectively. Thus, the fused feature dimensions for the three datasets were 30, 15, and 15, respectively. To avoid biased estimates, we ran each experiment five times and reported the average of the five results. All experiments were performed on an NVIDIA 1080Ti graphics card using Python.
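For reference, the three evaluation indicators can be computed from a confusion matrix as in the following sketch. It uses the standard definitions of OA, AA (mean per-class recall), and Cohen's kappa; the function name and the toy matrix are illustrative and do not come from the paper's results.

```python
def classification_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # OA: fraction of all samples that are classified correctly.
    oa = correct / total
    # AA: mean of per-class recalls (assumes every class has at least one sample).
    aa = sum(confusion[i][i] / sum(confusion[i]) for i in range(n_classes)) / n_classes
    # Kappa: observed agreement corrected for the agreement expected by chance.
    col_sums = [sum(row[j] for row in confusion) for j in range(n_classes)]
    expected = sum(sum(confusion[i]) * col_sums[i] for i in range(n_classes)) / total ** 2
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa


confusion = [[45, 5], [10, 40]]    # toy two-class confusion matrix
oa, aa, kappa = classification_metrics(confusion)
```

For this toy matrix, OA and AA are both 0.85 and kappa is 0.7. AA weights every class equally, so it is more sensitive than OA to errors on categories with few samples.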

The Effectiveness of Fusion Features
To verify the validity of the fusion features, we compared the experimental results of the fusion features with those of each single kind of feature on the three datasets. As shown in Table 2, the principal component features (PFs), edge features (EFs), and fusion features (FFs) proposed in this paper were each input into the MMFE and then combined with the spectral features obtained by the LSTM model for classification. The effects of the different features were compared in terms of OA, AA, and the kappa coefficient.

Experimental Setup
We divided all samples of each dataset into tra we used the training set to update the parameters, u generation of the network temporary model, and ke tion rate.Finally, we used the test set to test the class model.For Indian Pines dataset, 10%, 10%, and 80% s each class as the training, validation, and test sets, r and Salinas scene datasets, 5%, 5%, and 90% sample class as the training, validation, and test sets, respect Overall accuracy (OA), average accuracy (AA adopted as the evaluation indicators [33] to evaluate method.In our experiments, after appropriate exp epoch of the fusion network was set to 2000 times, a The training epoch of the MMFE and LSTM networ size was set to 128.The pooling layer used the maxi per, the activation function at the output layer used the activation functions in other locations all used th vation function.For the Indian Pines, Pavia Univer numbers of guided filtering input images were 10, 5 filtering radii of the three datasets were 2, 4, and 6 tasets, we got fused feature dimensions of 30, 15, an mates, we ran the experiments five times and provid average of five values.All experiments were perform card using Python.

The Effectiveness of Fusion Features
To verify the validity of fusion features, we co fusion features with those of single kind of features Table 2, the principal component features (PFs), edg (FFs) proposed in this paper were input into the M with the spectral characteristics obtained by the LSTM of different features were compared in terms of OA,

Experimental Setup
We divided all samples of each dataset into training, validation, and test sets.Then, we used the training set to update the parameters, used the validation set to monitor the generation of the network temporary model, and kept the model with the highest validation rate.Finally, we used the test set to test the classification performance of the reserved model.For Indian Pines dataset, 10%, 10%, and 80% samples were randomly selected from each class as the training, validation, and test sets, respectively.For the Pavia University and Salinas scene datasets, 5%, 5%, and 90% samples were randomly selected from each class as the training, validation, and test sets, respectively.
Overall accuracy (OA), average accuracy (AA), and the kappa coefficient were adopted as the evaluation indicators [33] to evaluate the classification performance of each method.In our experiments, after appropriate experimental adjustments, the training epoch of the fusion network was set to 2000 times, and the learning rate was set to 0.003.The training epoch of the MMFE and LSTM network was set to 400 times, and the batch size was set to 128.The pooling layer used the maximum pooling operation.For this paper, the activation function at the output layer used the SoftMax activation function, and the activation functions in other locations all used the Rectified Linear Units (ReLU) activation function.For the Indian Pines, Pavia University, and Salinas scene datasets, the numbers of guided filtering input images were 10, 5, and 5, respectively, and the guided filtering radii of the three datasets were 2, 4, and 6, respectively.Thus for the three datasets, we got fused feature dimensions of 30, 15, and 15, respectively.To avoid bias estimates, we ran the experiments five times and provided the final results by calculating the average of five values.All experiments were performed on the NVIDIA 1080Ti graphics card using Python.

The Effectiveness of Fusion Features
To verify the validity of fusion features, we compared the experimental results of fusion features with those of single kind of features on the three datasets.As shown in Table 2, the principal component features (PFs), edge features (EFs), and fusion features (FFs) proposed in this paper were input into the MMFE, and they were then combined with the spectral characteristics obtained by the LSTM model for classification.The effects

Experimental Setup
We divided all samples of each dataset into tra we used the training set to update the parameters, u generation of the network temporary model, and ke tion rate.Finally, we used the test set to test the class model.For Indian Pines dataset, 10%, 10%, and 80% s each class as the training, validation, and test sets, r and Salinas scene datasets, 5%, 5%, and 90% sample class as the training, validation, and test sets, respect Overall accuracy (OA), average accuracy (AA adopted as the evaluation indicators [33] to evaluate method.In our experiments, after appropriate exp epoch of the fusion network was set to 2000 times, a The training epoch of the MMFE and LSTM networ size was set to 128.The pooling layer used the maxi per, the activation function at the output layer used the activation functions in other locations all used th vation function.For the Indian Pines, Pavia Univer numbers of guided filtering input images were 10, 5 filtering radii of the three datasets were 2, 4, and 6 tasets, we got fused feature dimensions of 30, 15, an mates, we ran the experiments five times and provid average of five values.All experiments were perform card using Python.

The Effectiveness of Fusion Features
To verify the validity of fusion features, we co fusion features with those of single kind of features Table 2, the principal component features (PFs), edg (FFs) proposed in this paper were input into the M with the spectral characteristics obtained by the LSTM

Experimental Setup
We divided all samples of each dataset into training, validation, and test sets.Then, we used the training set to update the parameters, used the validation set to monitor the generation of the network temporary model, and kept the model with the highest validation rate.Finally, we used the test set to test the classification performance of the reserved model.For Indian Pines dataset, 10%, 10%, and 80% samples were randomly selected from each class as the training, validation, and test sets, respectively.For the Pavia University and Salinas scene datasets, 5%, 5%, and 90% samples were randomly selected from each class as the training, validation, and test sets, respectively.
Overall accuracy (OA), average accuracy (AA), and the kappa coefficient were adopted as the evaluation indicators [33] to evaluate the classification performance of each method.In our experiments, after appropriate experimental adjustments, the training epoch of the fusion network was set to 2000 times, and the learning rate was set to 0.003.The training epoch of the MMFE and LSTM network was set to 400 times, and the batch size was set to 128.The pooling layer used the maximum pooling operation.For this paper, the activation function at the output layer used the SoftMax activation function, and the activation functions in other locations all used the Rectified Linear Units (ReLU) activation function.For the Indian Pines, Pavia University, and Salinas scene datasets, the numbers of guided filtering input images were 10, 5, and 5, respectively, and the guided filtering radii of the three datasets were 2, 4, and 6, respectively.Thus for the three datasets, we got fused feature dimensions of 30, 15, and 15, respectively.To avoid bias estimates, we ran the experiments five times and provided the final results by calculating the average of five values.All experiments were performed on the NVIDIA 1080Ti graphics card using Python.

The Effectiveness of Fusion Features
To verify the validity of fusion features, we compared the experimental results of fusion features with those of single kind of features on the three datasets.As shown in Table 2, the principal component features (PFs), edge features (EFs), and fusion features (FFs) proposed in this paper were input into the MMFE, and they were then combined with the spectral characteristics obtained by the LSTM model for classification.The effects

Experimental Setup
We divided all samples of each dataset into tra we used the training set to update the parameters, u generation of the network temporary model, and ke tion rate.Finally, we used the test set to test the class model.For Indian Pines dataset, 10%, 10%, and 80% s each class as the training, validation, and test sets, r and Salinas scene datasets, 5%, 5%, and 90% sample class as the training, validation, and test sets, respec Overall accuracy (OA), average accuracy (AA adopted as the evaluation indicators [33] to evaluate method.In our experiments, after appropriate exp epoch of the fusion network was set to 2000 times, a The training epoch of the MMFE and LSTM networ size was set to 128.The pooling layer used the maxi per, the activation function at the output layer used the activation functions in other locations all used th vation function.For the Indian Pines, Pavia Univer numbers of guided filtering input images were 10, 5 filtering radii of the three datasets were 2, 4, and 6 tasets, we got fused feature dimensions of 30, 15, an mates, we ran the experiments five times and provid average of five values.All experiments were perfor card using Python.

The Effectiveness of Fusion Features
To verify the validity of fusion features, we co fusion features with those of single kind of features Table 2, the principal component features (PFs), edg (FFs) proposed in this paper were input into the M with the spectral characteristics obtained by the LSTM

Experimental Setup
We divided all samples of each dataset into training, validation, and test sets.Then, we used the training set to update the parameters, used the validation set to monitor the generation of the network temporary model, and kept the model with the highest validation rate.Finally, we used the test set to test the classification performance of the reserved model.For Indian Pines dataset, 10%, 10%, and 80% samples were randomly selected from each class as the training, validation, and test sets, respectively.For the Pavia University and Salinas scene datasets, 5%, 5%, and 90% samples were randomly selected from each class as the training, validation, and test sets, respectively.
Overall accuracy (OA), average accuracy (AA), and the kappa coefficient were adopted as the evaluation indicators [33] to evaluate the classification performance of each method.In our experiments, after appropriate experimental adjustments, the training epoch of the fusion network was set to 2000 times, and the learning rate was set to 0.003.The training epoch of the MMFE and LSTM network was set to 400 times, and the batch size was set to 128.The pooling layer used the maximum pooling operation.For this paper, the activation function at the output layer used the SoftMax activation function, and the activation functions in other locations all used the Rectified Linear Units (ReLU) activation function.For the Indian Pines, Pavia University, and Salinas scene datasets, the numbers of guided filtering input images were 10, 5, and 5, respectively, and the guided filtering radii of the three datasets were 2, 4, and 6, respectively.Thus for the three datasets, we got fused feature dimensions of 30, 15, and 15, respectively.To avoid bias estimates, we ran the experiments five times and provided the final results by calculating the average of five values.All experiments were performed on the NVIDIA 1080Ti graphics card using Python.

The Effectiveness of Fusion Features
To verify the validity of fusion features, we compared the experimental results of fusion features with those of single kind of features on the three datasets.As shown in Table 2, the principal component features (PFs), edge features (EFs), and fusion features (FFs) proposed in this paper were input into the MMFE, and they were then combined with the spectral characteristics obtained by the LSTM model for classification.The effects

Experimental Setup
We divided all samples of each dataset into tra we used the training set to update the parameters, u generation of the network temporary model, and ke tion rate.Finally, we used the test set to test the class model.For Indian Pines dataset, 10%, 10%, and 80% s each class as the training, validation, and test sets, r and Salinas scene datasets, 5%, 5%, and 90% sample class as the training, validation, and test sets, respec Overall accuracy (OA), average accuracy (AA adopted as the evaluation indicators [33] to evaluate method.In our experiments, after appropriate exp epoch of the fusion network was set to 2000 times, a The training epoch of the MMFE and LSTM networ size was set to 128.The pooling layer used the maxi per, the activation function at the output layer used the activation functions in other locations all used th vation function.For the Indian Pines, Pavia Univer numbers of guided filtering input images were 10, 5 filtering radii of the three datasets were 2, 4, and 6 tasets, we got fused feature dimensions of 30, 15, an mates, we ran the experiments five times and provid average of five values.All experiments were perfor card using Python.

The Effectiveness of Fusion Features
To verify the validity of fusion features, we co fusion features with those of single kind of features Table 2, the principal component features (PFs), edg (FFs) proposed in this paper were input into the M with the spectral characteristics obtained by the LSTM

Experimental Setup
We divided all samples of each dataset into training, validation, and test sets.Then, we used the training set to update the parameters, used the validation set to monitor the generation of the network temporary model, and kept the model with the highest validation rate.Finally, we used the test set to test the classification performance of the reserved model.For Indian Pines dataset, 10%, 10%, and 80% samples were randomly selected from each class as the training, validation, and test sets, respectively.For the Pavia University and Salinas scene datasets, 5%, 5%, and 90% samples were randomly selected from each class as the training, validation, and test sets, respectively.
Overall accuracy (OA), average accuracy (AA), and the kappa coefficient were adopted as the evaluation indicators [33] to evaluate the classification performance of each method.In our experiments, after appropriate experimental adjustments, the training epoch of the fusion network was set to 2000 times, and the learning rate was set to 0.003.The training epoch of the MMFE and LSTM network was set to 400 times, and the batch size was set to 128.The pooling layer used the maximum pooling operation.For this paper, the activation function at the output layer used the SoftMax activation function, and the activation functions in other locations all used the Rectified Linear Units (ReLU) activation function.For the Indian Pines, Pavia University, and Salinas scene datasets, the numbers of guided filtering input images were 10, 5, and 5, respectively, and the guided filtering radii of the three datasets were 2, 4, and 6, respectively.Thus for the three datasets, we got fused feature dimensions of 30, 15, and 15, respectively.To avoid bias estimates, we ran the experiments five times and provided the final results by calculating the average of five values.All experiments were performed on the NVIDIA 1080Ti graphics card using Python.

Experimental Setup
We divided all samples of each dataset into training, validation, and test sets. We used the training set to update the network parameters and the validation set to monitor the intermediate models generated during training, and we kept the model with the highest validation accuracy. Finally, we used the test set to evaluate the classification performance of the retained model. For the Indian Pines dataset, 10%, 10%, and 80% of the samples were randomly selected from each class as the training, validation, and test sets, respectively. For the Pavia University and Salinas scene datasets, 5%, 5%, and 90% of the samples were randomly selected from each class as the training, validation, and test sets, respectively.
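The per-class random split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `stratified_split`, the fixed seed, and the rule of rounding up so that every class contributes at least one training sample are assumptions introduced here.

```python
import numpy as np

def stratified_split(labels, train_frac, val_frac, seed=0):
    """Return index arrays (train, val, test) drawn class by class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        # Shuffle the indices of all labelled pixels of this class.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Assumption: round up so that small classes still get a sample.
        n_train = max(1, int(np.ceil(train_frac * idx.size)))
        n_val = max(1, int(np.ceil(val_frac * idx.size)))
        train.append(idx[:n_train])
        val.append(idx[n_train:n_train + n_val])
        test.append(idx[n_train + n_val:])  # everything left over
    return tuple(np.concatenate(part) for part in (train, val, test))

# Toy labels: three classes with 100, 50, and 20 labelled pixels.
labels = np.repeat([0, 1, 2], [100, 50, 20])
train_idx, val_idx, test_idx = stratified_split(labels, 0.10, 0.10)
```

With fractions of 0.10 and 0.10 this mirrors the Indian Pines setting; 0.05 and 0.05 would mirror the other two datasets.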
Overall accuracy (OA), average accuracy (AA), and the kappa coefficient were adopted as the evaluation indicators [33] of the classification performance of each method. After experimental tuning, the fusion network was trained for 2000 epochs with a learning rate of 0.003. The MMFE and LSTM networks were trained for 400 epochs with a batch size of 128. The pooling layers used max pooling. The output layer used the SoftMax activation function, and all other layers used the rectified linear unit (ReLU) activation function. For the Indian Pines, Pavia University, and Salinas scene datasets, the numbers of guided filtering input images were 10, 5, and 5, respectively, and the guided filtering radii were 2, 4, and 6, respectively. Thus, we obtained fused feature dimensions of 30, 15, and 15 for the three datasets, respectively. To avoid biased estimates, we ran each experiment five times and report the average of the five results. All experiments were performed in Python on an NVIDIA 1080Ti graphics card.
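For reference, the three evaluation indicators can be computed from a confusion matrix as in the sketch below. It uses the standard definitions of OA, AA (mean of per-class accuracies), and Cohen's kappa; the function names and the toy labels are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)  # accumulate repeated index pairs
    return cm

def oa_aa_kappa(cm):
    cm = cm.astype(np.float64)
    total = cm.sum()
    oa = np.trace(cm) / total                      # overall accuracy
    # Assumes every class has at least one test sample (no empty rows).
    aa = (np.diag(cm) / cm.sum(axis=1)).mean()     # average accuracy
    # Agreement expected by chance, from the row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

# Toy example with three classes (hypothetical labels).
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 2])
oa, aa, kappa = oa_aa_kappa(confusion_matrix(y_true, y_pred, 3))
```

Averaging these values over the five runs then gives the reported figures.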

The Effectiveness of Fusion Features
To verify the validity of the fusion features, we compared the experimental results of the fusion features with those of each single kind of feature on the three datasets. As shown in Table 2, the principal component features (PFs), edge features (EFs), and fusion features (FFs) proposed in this paper were each input into the MMFE and then combined with the spectral features obtained by the LSTM model for classification. The effects of the different features were compared in terms of OA, AA, and the kappa coefficient. It can be seen from Table 2 that the fusion features outperformed each single kind of feature on all three evaluation indicators. For example, on the Indian Pines dataset, using only the principal component features or only the edge features for classification gave an AA of 95.32% or 97.19%, respectively. After fusing the two kinds of features, the AA reached 99.08%, an improvement of about 4 or 2 percentage points, respectively, which showed that the fused features performed well in the classification task. For the Pavia University and Salinas scene datasets, the results of FFs were also better than those of PFs and EFs. The results in Table 2 demonstrate that feature fusion is effective because it exploits the complementarity between different features. The multi-feature fusion captured a richer correlation between spectral information and spatial information, thereby improving the classification performance of the network.

The Effectiveness of Introducing Spectral Features by LSTM
To verify the necessity and effectiveness of introducing spectral features by LSTM, three classification methods were compared: the method that directly inputs HSIs into LSTM (LSTM), the method that uses only MMFE classification after the U-shaped network that generates the fusion features (U-shaped and MMFE), and the SSDF method proposed in this paper. The results are given in Table 3. It can be seen from Table 3 that the SSDF method had the best overall classification accuracy on the three datasets, whereas the results were not satisfactory when only LSTM was used for classification. Because extracting spectral features alone uses the information of the HSI only shallowly, the spatial relationship between pixels was underused and edge pixels were not judged reliably. The second method significantly improved the classification results by introducing principal component features, edge features, and a multi-scale and multi-level classification structure. However, this method mainly processes the spatial context of pixels and ignores the correlation between pixels and different bands. In contrast, SSDF achieved better results by combining the features of the two methods. The experimental results showed that the further introduction of spectral features by LSTM was necessary and that it further improved classification performance.
The SVM, as a classic machine learning method for classification, was used as a baseline for comparison. The PCA method here refers to classification with the first 20 principal components and an SVM with a radial basis function kernel (RBF-SVM). LSTM [22] used a long short-term memory model to extract and utilize spectral features across image bands. 3DCNN [21] used a three-dimensional convolution cube to extract spectral and spatial features from the original HSIs. SSRN [24] fused the spectral features obtained by a 3D convolution kernel and the spatial features obtained by a 2D convolution kernel in a tandem manner, allowing the model to obtain both spectral and spatial features. DFFN [26] combined features extracted from different levels of residual networks for classification. MSCNN [27] proposed the use of cross-domain convolutional neural networks for feature extraction and classification. For the sake of fairness, we tuned the parameters of these comparison methods so that they achieved their best performance, and we trained all models in exactly the same experimental environment.
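The dimensionality-reduction step of the PCA baseline can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function name `pca_reduce`, the SVD-based formulation, and the synthetic cube are assumptions made for illustration, and the RBF-SVM classifier that would be trained on the reduced pixels is not shown.

```python
import numpy as np

def pca_reduce(cube, n_components=20):
    """Project an (H, W, B) cube onto its first n_components principal components."""
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(np.float64)  # one row per pixel (a copy)
    pixels -= pixels.mean(axis=0)                    # centre each band
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(pixels, full_matrices=False)
    return (pixels @ vt[:n_components].T).reshape(h, w, n_components)

# Synthetic stand-in for an HSI cube (random values, not real data).
rng = np.random.default_rng(0)
cube = rng.normal(size=(16, 16, 50))
features = pca_reduce(cube, n_components=20)
```

Each pixel of `features` is then a 20-dimensional vector that a classifier such as an RBF-SVM could take as input.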

Results on Indian Pines
Table 4 shows the classification results of the eight methods on the Indian Pines dataset. Figure 9 shows the false-color image, the ground truth, and the classification maps of all methods on the Indian Pines dataset.
As can be seen from Table 4, the OA of SSDF was 18.19% and 22.98% higher than those of SVM and PCA, respectively. The OA of SSDF was also 17.8%, 8.99%, 1.15%, 0.31%, and 0.86% higher than those of the state-of-the-art LSTM, 3DCNN, SSRN, DFFN, and MSCNN methods, respectively. It can be observed from Figure 9 that the classification maps of SVM, PCA, LSTM, and 3DCNN suffered from serious "salt and pepper" noise, whereas the classification maps of DFFN and our SSDF were most similar to the ground truth. Additionally, the SSDF method still performed well on categories with few samples. As shown in Table 4, the classification accuracy of SSDF reached 100% on both the first category (Alfalfa) and the seventh category (Grass-P-M), which exceeded most other classification algorithms. At the same time, SSDF also obtained satisfactory results on the 9th and 16th categories.

Results on Pavia University
Table 5 provides the classification results of the eight methods on the Pavia University dataset. Figure 10 shows the false-color image, the ground truth, and the classification maps of all methods on the Pavia University dataset.
As can be seen from Table 5, the OA of SSDF was 6.3% and 6.51% higher than those of the traditional SVM and PCA methods, respectively. The OA of SSDF was also 7.04%, 3.42%, 0.17%, 0.41%, and 0.62% higher than those of the state-of-the-art LSTM, 3DCNN, SSRN, DFFN, and MSCNN methods, respectively. At the same time, SSDF achieved better classification results than the other methods in most categories. It can be observed from Figure 10 that the classification maps of SSRN, DFFN, MSCNN, and SSDF were very close to the ground truth. In contrast, the other methods suffered from serious "salt and pepper" noise.

Results on Salinas Scene
Table 6 gives the classification results of the eight methods on the Salinas scene dataset. Figure 11 shows the false-color image, the ground truth, and the classification maps of all methods. It can be seen from Table 6 that the OA of SSDF was 10.46% and 8.43% higher than those of the traditional SVM and PCA methods, respectively. The OA of SSDF was also 7.43%, 4.25%, 0.8%, 0.43%, and 0.77% higher than those of LSTM, 3DCNN, SSRN, DFFN, and MSCNN, respectively. Specifically, for the 8th (Grapes_untrained) and 15th (Vinyard_untrained) classes, which were difficult to classify, SSDF achieved the best results. At the same time, the standard deviations of the evaluation indexes in Table 6 show that the performance of SSDF was more stable. It can be observed from Figure 11 that the classification map of SSDF was much more similar to the ground truth than those of the other methods. In particular, the red area in the upper left corner of each image in Figure 11 distinctly demonstrates the superiority of SSDF.

Discussion
To test the generalization ability and robustness of SSDF, we randomly selected 5%, 10%, 15%, and 20% of the labeled samples from the Indian Pines dataset and 3%, 4%, 5%, and 6% of the labeled samples from the Pavia University and Salinas scene datasets as the training data. The curves in Figure 12 show the overall accuracies of the eight methods versus the percentage of training samples. It can be seen from Figure 12 that when fewer training data were available, SSDF still achieved a much higher classification accuracy than SVM, PCA, LSTM, and 3DCNN, and it was also superior to the other state-of-the-art methods, namely SSRN, DFFN, and MSCNN.
As can be seen from all the above experimental results, the proposed SSDF achieved the best classification performance in most categories, and it also obtained the best results on the three evaluation indicators of OA, AA, and kappa. The reasons for this performance improvement are as follows:
(1) The principal component features and edge features were used as the input and label of the U-shaped network, respectively, so that the network adaptively generated new fusion features. Through network training, these features learned the correlation and complementarity of the two different features and provided richer information for classification.
(2) The MMFE model combined low-level features with high-level features, making the model perform better.
Compared with the simple spectral–spatial combination of SSRN, the proposed SSDF introduced the idea of merging multiple features, thus fully exploiting the rich correlation and dissimilarity between two different features. At the same time, compared with the MSCNN network, SSDF not only used the U-shaped network to adaptively generate advanced features but also introduced a multi-scale and multi-level classification network. This structure could extract the original information of HSIs more deeply and further improve the classification effect.

Conclusions
The authors of this paper have proposed a hyperspectral image spectral–spatial classification method based on deep adaptive feature fusion (SSDF). Unlike other existing network models, the U-shaped structure in SSDF uses special inputs and labels, i.e., the principal component features and edge features are used as the input and label of the U-shaped network, respectively. Training the deep network on these inputs and labels can effectively extract and fuse the two elementary features to generate advanced features. Moreover, compared with a network model with a single feature input, SSDF was found to largely retain the complementarity and rich correlation among the features by making full use of various features. Additionally, the proposed SSDF model contains a multi-scale and multi-level network for extracting deep features that, to some extent, fuses elementary features with advanced features, thus making our method more generalizable. The experimental results showed that the performance of SSDF on the three datasets was better than that of other existing state-of-the-art methods and that SSDF obtained good classification results under different training conditions, which further validated the generalization ability and robustness of the proposed SSDF.
Though multi-feature fusion brings higher classification accuracy, it also increases the computational complexity of the model. In the future, we will try to simplify the proposed model. Furthermore, although the fusion of edge features and principal component features has proven effective in improving classification accuracy, we will investigate fusing other types of features, which may result in better performance if a better combination of features is found.
are obtained. It can be seen from Equation (1) that the output image and the guided image have a linear relationship in the window, that is, $\nabla c_i = a_k \nabla I_i$. Therefore, when the guided image I contains edge information, the output image c retains the edge information at the corresponding position, so the output image c is a feature map with the edge features of the HSI.
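The local linear model behind this property can be illustrated with a compact NumPy sketch of the standard guided filter (the formulation of He et al.), assuming single-band images. The names `box_mean` and `guided_filter`, the input image `p`, the offset `b`, the regularizer `eps`, and the edge-replicating border handling are assumptions of this sketch and are not taken from the paper.

```python
import numpy as np

def box_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window; borders use edge replication (a simplification)."""
    k = 2 * r + 1
    h, w = img.shape
    padded = np.pad(img, r, mode="edge")
    # Integral image with an extra zero row and column at the top-left.
    ii = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    ii[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    window_sum = ii[k:k + h, k:k + w] - ii[:h, k:k + w] - ii[k:k + h, :w] + ii[:h, :w]
    return window_sum / (k * k)

def guided_filter(I, p, r, eps):
    """Filter input p with guidance I; output is c = mean(a) * I + mean(b)."""
    mean_I, mean_p = box_mean(I, r), box_mean(p, r)
    var_I = box_mean(I * I, r) - mean_I ** 2
    cov_Ip = box_mean(I * p, r) - mean_I * mean_p
    a = cov_Ip / (var_I + eps)   # per-window slope a_k
    b = mean_p - a * mean_I      # per-window offset b_k
    return box_mean(a, r) * I + box_mean(b, r)

rng = np.random.default_rng(0)
I = rng.random((32, 32))                      # synthetic guidance image
c_flat = guided_filter(I, np.full_like(I, 3.0), r=2, eps=1e-3)
c_lin = guided_filter(I, 2.0 * I + 1.0, r=2, eps=1e-12)
```

When `p` is an exact linear function of `I`, the slope is recovered in every window and the output follows the edges of the guidance image, which is the behaviour the paragraph above describes.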

Figure 1. Flow chart of guided filtering to extract edge features. PCA: principal component analysis; ICA: independent component analysis; MNF: minimum noise fraction rotation.

Remote Sens. 2021, 13
correlation and difference between model inputs and model labels by training on a network, thereby generating fusion features that contain both edge features and principal component features. Thus, the authors of this paper designed a U-shaped deep fusion network model with the principal component features as the model input and the edge features as the model label. The final output of the model comprises the fusion features that contain edge features and principal component features. The U-shaped network structure is shown in Figure 2.

Figure 12. The curves of overall accuracies versus different percentages of training samples obtained by different methods on three datasets: (a) Indian Pines dataset, (b) Pavia University dataset, and (c) Salinas scene dataset.

Table 1. Information of each category in three datasets.

Table 2. The results of different features on different datasets. OA: overall accuracy; AA: average accuracy; PFs: principal component features; EFs: edge features; FFs: fusion features.

Table 3. The overall accuracy (%) of different methods.

Table 5. The classification results of all methods on the Pavia University dataset with OA, AA, and Kappa data given in the form of mean ± standard deviation.

Table 6. The classification results of all methods on the Salinas scene dataset with OA, AA, and Kappa data given in the form of mean ± standard deviation.
