Alternately Updated Spectral–Spatial Convolution Network for the Classification of Hyperspectral Images

The connection structure in the convolutional layers of most deep learning-based algorithms used for the classification of hyperspectral images (HSIs) has typically been in the forward direction. In this study, an end-to-end alternately updated spectral–spatial convolutional network (AUSSC) with a recurrent feedback structure is used to learn refined spectral and spatial features for HSI classification. The proposed AUSSC includes alternating updated blocks in which each layer serves as both an input and an output for the other layers. The AUSSC can refine spectral and spatial features many times under fixed parameters. A center loss function is introduced as an auxiliary objective function to improve the discrimination of features acquired by the model. Additionally, the AUSSC utilizes smaller convolutional kernels than other convolutional neural network (CNN)-based methods to reduce the number of parameters and alleviate overfitting. The proposed method was implemented on four HSI data sets, as follows: Indian Pines, Kennedy Space Center, Salinas Scene, and Houston. Experimental results demonstrated that the proposed AUSSC outperformed the HSI classification accuracy obtained by state-of-the-art deep learning-based methods with a small number of training samples.


Introduction
Hyperspectral images (HSIs) contain both spectral and spatial information and generally consist of hundreds of spectral bands for the same observed scene [1]. Due to the vast amounts of information they contain, HSIs have found important applications in a variety of fields, such as the non-contact analysis of food materials [2], the detection and identification of plant diseases [3], multispectral change detection [4], and medicine [5]. HSI classification is the core technology in these applications. However, since HSIs include inherently high-dimensional structures, their classification remains a challenging task in the remote sensing community.
Traditional classification methods combine feature engineering with a classifier. Feature engineering aims to extract or select features from the original HSI data, and the classifier is then trained on the resulting low-dimensional features. Support vector machines (SVMs) were the most commonly used method in the early stages of HSI classification, due to their low sensitivity to high dimensionality [6]. Spectral-spatial classification methods have become predominant in recent years [7]. Mathematical-morphology-based techniques [8], Markov random fields (MRFs) [9], and sparse representations [10] are also commonly used branches. However, many of these techniques suffer from low classification accuracy due to shallow feature extraction.
The main contributions of this study are as follows:
(1) The proposed method includes a recurrent feedback spectral-spatial structure with fixed parameters, in order to learn not only deep but also refined spectral and spatial features to improve HSI classification accuracy.
(2) The effectiveness of the center loss function is validated as an auxiliary loss function used to improve the results of hyperspectral image classification.
(3) The AUSSC decomposes a large 3D convolutional kernel into three smaller 1D convolutional kernels, thereby saving a large number of parameters and reducing overfitting.
(4) The AUSSC achieves state-of-the-art classification accuracy across four widely used HSI data sets, using limited training data with a fixed spatial size.
The remainder of this paper is organized as follows. Section 2 presents the framework of the proposed AUSSC. Section 3 describes the experimental data sets. The details of the experimental results and a discussion are given in Section 4. Conclusions and suggestions for future work are presented in Section 5.

Methods
In this section, an alternately updated spectral-spatial convolutional network is proposed for HSI classification. Figure 1 shows an overview of the proposed method. For HSI data with L channels and a size of H × W, a patch with a spatial size of s × s is selected from the raw HSI data and used as the input to the AUSSC network. First, the AUSSC uses three smaller convolutional kernels to learn spectral and spatial features from the original HSI patch. Second, the alternately updated spectral and spatial blocks refine the deep spectral and spatial features using recurrent feedback. Finally, the model parameters are optimized using the cross-entropy loss and center loss functions. Details of each stage are elaborated in the following subsections.
Figure 1. An overview of the proposed end-to-end alternately updated spectral-spatial convolutional network (AUSSC). "Conv" refers to the convolution operation. The operations denoted by "Some operations" are presented in detail in Section 2.4. "Logits" refers to the output of the last fully connected layer. Classification results are acquired after the Softmax operation.

Learning Spectral and Spatial Features with Smaller Convolutional Kernels
During HSI classification, deep CNN-based methods typically apply a preprocessing step such as principal component analysis (PCA). This is often followed by several convolutional layers with activation functions and a classifier that produces the classification map. The convolution and activation can be formulated as

$$X_i^{l+1} = f\left(\sum_{j=1}^{N} X_j^{l} * k_{ji}^{l+1} + b_i^{l+1}\right) \tag{1}$$

where $X_j^{l}$ is the jth input feature map for the (l + 1)th layer, $X_i^{l+1}$ is the ith output feature map of that layer, N is the number of feature maps input to the (l + 1)th layer, * is the convolution operation, f(·) is an activation function, and $k_{ji}^{l+1}$ and $b_i^{l+1}$ are learnable parameters that can be fine-tuned using the back-propagation (BP) algorithm.
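The layer formula above can be illustrated with a minimal sketch in plain Python. It uses 1D feature maps for brevity, and the names `relu`, `correlate_valid`, and `conv_layer` are hypothetical helpers introduced only for this illustration. As in most CNN libraries, the "convolution" is computed as a sliding dot product without flipping the kernel, and "valid" padding is assumed.

```python
def relu(value):
    """Rectified linear unit, used here as the activation f."""
    return max(0.0, value)


def correlate_valid(signal, kernel):
    """Slide the kernel over the signal without padding ('valid' mode)."""
    span = len(kernel)
    return [sum(signal[t + u] * kernel[u] for u in range(span))
            for t in range(len(signal) - span + 1)]


def conv_layer(inputs, kernels, biases, activation=relu):
    """Compute every output map i as f(sum_j inputs[j] * kernels[j][i] + biases[i]).

    inputs  -- list of N input feature maps (lists of floats)
    kernels -- kernels[j][i] connects input map j to output map i
    biases  -- one bias per output map
    """
    outputs = []
    for i, bias in enumerate(biases):
        responses = [correlate_valid(x, kernels[j][i]) for j, x in enumerate(inputs)]
        summed = [sum(column) + bias for column in zip(*responses)]
        outputs.append([activation(v) for v in summed])
    return outputs


if __name__ == "__main__":
    maps = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]   # N = 2 input feature maps
    weights = [[[1.0, 1.0]], [[1.0, -1.0]]]     # one output map
    print(conv_layer(maps, weights, [0.5]))     # [[2.5, 6.5]]
```

The same pattern extends to 3D feature maps by replacing the 1D sliding window with a 3D one.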
The 3D CNN, SSRN, and FDSSC algorithms all demonstrate that an end-to-end 3D-CNN-based framework outperforms 2D-CNN-based methods that include preprocessing or post-processing, as well as other deep learning-based methods. One reason for this is that an end-to-end framework reduces pre-processing and subsequent post-processing, keeping the connection between the original input and the final output as close as possible. The model then has more room to be adjusted automatically by the data, thereby improving the fit. Additionally, when applied to HSIs with a 3D structure, 1D convolution operations focus on spectral features, 2D convolution operations focus on spatial features, and 3D convolution operations can learn both spatial and spectral features. However, a 3D kernel has more parameters than a 2D or 1D kernel when the number of convolutional layers and kernels is the same, and a large number of model parameters can lead to overfitting.
As such, we propose an end-to-end CNN-based framework that uses smaller convolutional kernels than other CNN-based methods. Figure 2 compares the convolutional kernels used by different methods for HSI classification, ignoring other architectural details. The 3D CNN method uses two similar convolutional kernels with sizes of a × a × m1 and a × a × m2, which differ only in the spectral dimension. SSRN uses a spectral kernel with a size of 1 × 1 × m and a spatial kernel with a size of a × a × d to learn spectral and spatial representations, respectively. Convolutional kernels dictate the model parameters and determine which features are learned by the CNN. In contrast, we adopt the idea of factorization into smaller convolutions from InceptionV3 [26]. In this process, a larger 3D convolutional kernel with a size of a × a × m is divided into three smaller convolutional kernels with sizes of 1 × 1 × m, 1 × a × 1, and a × 1 × 1. This substantially reduces the number of parameters, accelerates the operation, and reduces the possibility of overfitting. As shown in Table 1, in the absence of bias (with all other conditions remaining the same), the convolutional kernel with a size of a × a × m includes a²m parameters. The three smallest convolutional kernels together include only m + 2a parameters, which is more economical than the other two designs. The factorization also increases the nonlinear representation capability of the model due to the use of multiple nonlinear activation functions.

Table 1. Parameters for different convolutional kernels.
Convolutional Kernels    Parameters
a × a × m                a²m
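The parameter saving from factorization can be checked with a short calculation. The sketch below counts the weights of a single-channel kernel without bias; the function names are hypothetical, and the values a = 3 and m = 7 are chosen only as an example.

```python
def kernel_params(shape):
    """Number of weights in one single-channel kernel without bias."""
    count = 1
    for extent in shape:
        count *= extent
    return count


def full_3d_params(a, m):
    """One a x a x m kernel."""
    return kernel_params((a, a, m))


def factorized_params(a, m):
    """Three small kernels: 1 x 1 x m, 1 x a x 1, and a x 1 x 1."""
    return (kernel_params((1, 1, m))
            + kernel_params((1, a, 1))
            + kernel_params((a, 1, 1)))


if __name__ == "__main__":
    a, m = 3, 7
    print(full_3d_params(a, m))     # 63
    print(factorized_params(a, m))  # 13
```

For any a and m the factorized count equals m + 2a, so the saving grows with both the spatial and the spectral extent of the original kernel.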

Refining Spectral and Spatial Features via Alternately Updated Blocks
Deep CNN architectures have been used for HSI classification and have produced competitive classification results [17]. However, the connection structure in the convolutional layers is typically in the forward direction. Additionally, the convolutional kernels in SSRN and FDSSC increase with depth. Alternately updated cliques have a recurrent feedback structure and go deeper into the convolutional layers with a fixed number of parameters [25]. Therefore, we propose combining small convolutional kernels with this loop structure and design two alternately updated blocks to learn refined spectral and spatial features separately from HSIs.
As shown in Figure 3, there are two stages in the alternately updated spectral block. In the initialization stage (stage 1), the 3D convolutional layers use k kernels with a size of 1 × 1 × m to learn deep spectral features. In stage 2, the 3D convolutional layers use k kernels with a size of 1 × 1 × m to learn refined spectral features. The input to the alternately updated spectral block consists of n feature maps with a size of s × s × b. This input is denoted as $X_0^{(1)}$, where the subscript 0 indicates that the feature map is in the initial position of the alternately updated spectral block and the superscript (1) indicates that it belongs to the first stage of the alternately updated process. In stage 1, the input of every convolutional layer is the output of all the previous convolutional layers. Stage 1 can be formulated as follows:

$$X_l^{(1)} = f\left(\sum_{j<l} W_{jl} * X_j^{(1)}\right) \tag{2}$$

where $X_l^{(1)}$ is the output of the lth (l ≥ 1) convolutional layer in stage 1 of an alternately updated spectral block, f(·) is a nonlinear activation function, * is the convolution operation with padding, and $W_{jl}$ is a parameter that is reused in stage 2.

Remote Sens. 2019, 11, 1794
In the looping stage (stage 2), each convolutional layer (except the input convolutional layer) is alternately updated to refine features. Stage 2 has a recurrent feedback structure, meaning that the feature map can be refined several times using the same weights. Therefore, any two convolutional layers in the alternately updated spectral block are connected bi-directionally. Stage 2 can then be formulated as follows:

$$X_l^{(r)} = f\left(\sum_{j<l} W_{jl} * X_j^{(r)} + \sum_{k>l} W_{kl} * X_k^{(r-1)}\right) \tag{3}$$

where r ≥ 2, since the feature map is in stage 2 and can be updated multiple times by the recurrent feedback structure. Similarly, l ≥ 1, since the input feature map is not updated.
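The two stages can be sketched with a toy example in plain Python. To keep it short, each feature map is reduced to a single number and each convolution to a scalar multiplication, so this only illustrates the connection pattern and the weight sharing, not the actual 3D convolutions. The function name and the weight layout `weights[(j, l)]` are hypothetical. Following the clique structure of [25], the sketch assumes that the input layer takes part only in stage 1.

```python
def relu(value):
    return max(0.0, value)


def alternately_updated_block(x0, weights, num_layers, loops=1, activation=relu):
    """Toy alternately updated block with scalar features.

    weights[(j, l)] is the weight from layer j to layer l (j != l).
    The same dictionary is used in both stages, so the number of
    parameters does not depend on the number of loops.
    """
    # Stage 1: layer l receives the outputs of all previous layers.
    feats = [x0]
    for l in range(1, num_layers + 1):
        feats.append(activation(sum(weights[(j, l)] * feats[j] for j in range(l))))

    # Stage 2: each layer is updated in turn from all other updated layers.
    # Layers j < l already hold their new values, layers k > l still hold
    # the values from the previous pass, because the update is in place.
    for _ in range(loops):
        for l in range(1, num_layers + 1):
            total = sum(weights[(j, l)] * feats[j]
                        for j in range(1, num_layers + 1) if j != l)
            feats[l] = activation(total)

    # The block output concatenates the input and the updated layers.
    return feats


if __name__ == "__main__":
    w = {(0, 1): 0.5, (0, 2): 0.5, (1, 2): 1.0, (2, 1): 1.0}
    print(alternately_updated_block(1.0, w, num_layers=2, loops=1))  # [1.0, 1.0, 1.0]
```

Increasing `loops` refines the features further while the weight dictionary stays the same size.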
After learning refined deep spectral features, the input convolutional layer and the updated convolutional layers are concatenated in the alternately updated spectral block and transferred to the next block. Once spectral information from the HSI has been learned, the high dimensions of the feature map can be reduced by valid convolution and reshaping operations (see the figure in Section 2.4). The resulting input to the alternately updated spatial block consists of n feature maps with a size of t × t × 1.
As shown in Figure 4, there are two different convolutional kernels in the alternately updated spatial block. The 3D convolutional layers use k kernels with a size of a × 1 × 1 and k kernels with a size of 1 × a × 1 to learn deep refined spatial features, using the same alternately updated structure as the spectral block. In the spatial block, the two convolutional kernels learn spatial features in parallel rather than in series. The connections among the convolutional layers in the spatial block are the same as in the spectral block.
These alternately updated blocks achieve spectral and spatial attention due to the presence of refined features obtained in the looping stages. Densely connected forward and feedback structures allow the spectral and spatial information to flow in convolutional layers within the blocks. These alternately updated blocks also include weight sharing. In stage 1, the weights increase linearly as the number of convolutional layers increases. However, in stage 2, the weights are fixed since they are shared. The partial weights from stage 1, such as $W_{12}$, $W_{13}$, and $W_{23}$ (see Figure 2), are reused in stage 2. As features are cycled repeatedly in stage 2, the number of parameters remains unchanged.

Optimization by the Cross-Entropy Loss and Center Loss Functions
HSI classification is inherently a multi-class task, and the cross-entropy loss with a softmax layer is a well-known objective function for such problems. The softmax cross-entropy loss can be written in the following form:

$$L_s = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T} x_i + b_j}} \tag{4}$$

where m is the size of the mini-batch, n is the number of classes, $x_i$ is the ith deep feature belonging to the $y_i$th class, $W_j$ is the jth column of the weights W in the last fully connected layer, and b is the bias. The last layer of a CNN-based model is typically fully connected, as it is difficult to make the dimensions of the last layer equal to the number of categories without a fully connected layer. Intuitively, one would expect that learning more discriminative features would improve the generalization performance. As such, we introduce an auxiliary loss function [24] to improve the discrimination of features acquired by the model. This function can be formulated as follows:

$$L_c = \frac{1}{2} \sum_{i=1}^{m} \left\| x_i - c_{y_i} \right\|_2^2 \tag{5}$$

where $c_{y_i}$ is the central feature of the $y_i$th class. The function decreases the sum of squared distances between the feature of each sample in a batch and the center of its class, which decreases the intra-class distance. The center $c_{y_i}$ is updated through iterative training. When the two loss functions are used together for HSI classification, the softmax cross-entropy loss is considered to be responsible for increasing the inter-class distance. The center loss is then responsible for reducing the intra-class distance, thus increasing the discrimination and generalization abilities of the learned features. Consequently, the objective function for the AUSSC can be written in the following form:

$$L = L_s + \lambda L_c \tag{6}$$

where λ ∈ [0, 1) controls the proportion of center loss, and the value of λ is determined experimentally, as discussed in the following section. In summary, the cross-entropy loss is the principal objective function and is responsible for increasing the inter-class distance, whereas the center loss is an auxiliary function used to reduce the intra-class distance.
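The combined objective can be sketched in plain Python as follows. This is a minimal illustration rather than the training code of the paper: the function names are hypothetical, the logits are assumed to be the outputs of the last fully connected layer, the softmax cross-entropy is summed over the mini-batch, and the center loss is taken as half the sum of squared distances between each feature and its class center.

```python
import math


def softmax_cross_entropy(logits, labels):
    """Sum over the mini-batch of -log softmax(logits)[label]."""
    total = 0.0
    for scores, label in zip(logits, labels):
        peak = max(scores)  # subtract the maximum for numerical stability
        log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_sum - scores[label]
    return total


def center_loss(features, labels, centers):
    """Half the sum of squared distances between features and class centers."""
    total = 0.0
    for feature, label in zip(features, labels):
        total += sum((f - c) ** 2 for f, c in zip(feature, centers[label]))
    return 0.5 * total


def joint_loss(logits, features, labels, centers, lam):
    """Cross-entropy plus lam times the center loss, with 0 <= lam < 1."""
    return softmax_cross_entropy(logits, labels) + lam * center_loss(features, labels, centers)


if __name__ == "__main__":
    logits = [[0.0, 0.0]]
    features = [[1.0, 2.0]]
    labels = [0]
    centers = {0: [0.0, 0.0], 1: [1.0, 1.0]}
    print(joint_loss(logits, features, labels, centers, lam=0.1))
```

Setting `lam` to zero recovers the plain softmax cross-entropy, which makes it easy to compare training with and without the auxiliary term.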

Alternately Updated Spectral-Spatial Convolutional Network
A flowchart (Figure 5) is used to explain the steps in the AUSSC end-to-end network. Considering the cost and time required to collect labeled HSI samples, we propose a 3D CNN-based framework that maximizes the flow and circulation of spectral and spatial information. Figure 5 shows a 9 × 9 × L cube, which is used as input in our technique, where L is the number of HSI bands. To limit the computational cost, two convolutional layers were used in the alternately updated blocks and a single loop was used in stage 2. L2 regularization and batch normalization (BN) [27] were used to regularize the model. In a broad sense, L2 and other regularization terms added to the loss function in machine learning are essentially weighted norms of the parameters. The L2 term reduces the magnitude of the parameter values in the model, whereas BN normalizes the input values of the neurons to a distribution with a mean of zero and a variance of one. In Figure 5, the blue layers and blue lines both refer to BN, rectified linear units (ReLU), and the convolution operation. The first convolutional layer has neither BN nor ReLU.
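The two regularization tools mentioned above can be illustrated with a minimal sketch in plain Python. The function names and the `decay` coefficient are hypothetical, and the batch normalization shown here covers only the normalization step; the learnable scale and shift used in practice, as well as the running statistics used at inference time, are omitted.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


def l2_penalty(weights, decay):
    """L2 term added to the loss: decay times the sum of squared weights."""
    return decay * sum(w * w for w in weights)


if __name__ == "__main__":
    normalized = batch_norm([1.0, 2.0, 3.0])
    print(normalized)                    # roughly [-1.22, 0.0, 1.22]
    print(l2_penalty([3.0, 4.0], 0.5))   # 12.5
```

The small constant `eps` only guards against division by zero when all activations in a batch are equal.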
The original HSI input, which has a size of 9 × 9 × L, flows to the first convolutional layer with a kernel size of (1, 1, 7) and a stride of (1, 1, 2) to generate 64 feature maps with a size of 9 × 9 × b. The number of kernels in the convolutional layers of the alternately updated spectral block was 36, the kernel size was (1, 1, 7), and "same" padding was used. As a result, the output of each layer remained 36 feature maps with a size of 9 × 9 × b, which was unchanged in stage 1 and stage 2. After concatenating the input and updated feature maps, the output of the alternately updated spectral block consisted of 136 feature maps with a size of 9 × 9 × b.
A valid convolutional layer with 48 channels and a kernel size of 1 × 1 × b was included between alternately updated spectral and spatial blocks. This reduced the dimensions of the output of alternately updated spectral blocks, resulting in 48 feature maps with a size of 9 × 9 × 1. After reshaping the third dimension and the channel dimension, 48 channels with a size of 9 × 9 × 1 were merged into a single 9 × 9 × 48 channel. A valid convolutional layer with a kernel size of 3 × 3 × 48 and 64 kernels transformed the feature map into 64 channels with a size of 7 × 7 × 1.
Similar to the alternately updated spectral block, the alternately updated spatial block featured two convolutional kernels with sizes of 1 × 3 × 1 and 3 × 1 × 1. In stage 1 and stage 2, the output of each layer was 36 feature maps with a size of 7 × 7 × 1. The results of the two convolutional kernels were concatenated into 272 feature maps with a size of 7 × 7 × 1. Finally, the output passed through a 3D average pooling layer with a pooling size of 7 × 7 × 1, which converted it into 272 feature maps with a size of 1 × 1 × 1. After the flattening operation, a vector with a size of 1 × 1 × C was produced by the fully connected layer, where C is the number of classes. The trainable AUSSC parameters were optimized by iterative training, with Equation (6) used to compute the loss between the predicted and real values.
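The feature map sizes described in this subsection can be traced with a short shape calculation. The sketch below is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the first convolutional layer is assumed to use "valid" padding (the text does not state this), its output is assumed to have 64 feature maps (consistent with 136 = 64 + 2 × 36), and the average pooling is assumed to cover the whole 7 × 7 × 1 map.

```python
def conv_extent(size, kernel, stride=1, padding="valid"):
    """Output length of a convolution along one axis."""
    if padding == "same":
        return (size + stride - 1) // stride   # ceil(size / stride)
    return (size - kernel) // stride + 1


def trace_aussc_shapes(bands, num_classes, first_padding="valid"):
    """Return (stage, number of feature maps, map size) for each step."""
    trace = []
    b = conv_extent(bands, 7, stride=2, padding=first_padding)
    trace.append(("first conv", 64, (9, 9, b)))
    # Spectral block: input maps plus two updated layers of 36 maps each.
    spectral_maps = 64 + 2 * 36
    trace.append(("spectral block", spectral_maps, (9, 9, b)))
    trace.append(("1 x 1 x b valid conv", 48, (9, 9, conv_extent(b, b))))
    trace.append(("reshape", 1, (9, 9, 48)))
    side = conv_extent(9, 3)
    trace.append(("3 x 3 x 48 valid conv", 64, (side, side, conv_extent(48, 48))))
    # Spatial block: two parallel branches, each with 64 + 2 * 36 maps.
    spatial_maps = 2 * (64 + 2 * 36)
    trace.append(("spatial block", spatial_maps, (side, side, 1)))
    trace.append(("average pooling", spatial_maps, (1, 1, 1)))
    trace.append(("fully connected", 1, (1, 1, num_classes)))
    return trace


if __name__ == "__main__":
    for stage, maps, size in trace_aussc_shapes(bands=220, num_classes=16):
        print(f"{stage:24s} {maps:4d} maps of size {size}")
```

Passing `first_padding="same"` shows how the spectral depth b changes under the other padding convention, while the 136 and 272 feature map counts stay the same.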
The advantages of the proposed AUSSC architecture are summarized as follows. First, the use of three different small convolutional kernels reduces both the number of parameters and overfitting, while increasing the nonlinear representation ability of the model and the diversity of features. Compared with symmetric splitting into several identical small convolutional kernels, this asymmetric splitting can handle more and richer features. Second, refined deep features learned by both forward and feedback connections between convolutional layers are more robust and contain more high-level spectral and spatial information. Additionally, SSRN and FDSSC learn deeper features by increasing the number of convolutional layers in the blocks. Unlike these conventional models, the AUSSC can go deeper with fixed parameters due to its loop structure and shared weights. Finally, an auxiliary loss function is used to reduce the intra-class distance and increase the distinction between features of different categories.

Description of Experimental Data Sets
Three common HSI data sets were used to validate the proposed AUSSC, as follows: Indian Pines (IP; northwestern Indiana, USA), Kennedy Space Center (KSC; Merritt Island, FL, USA), and Salinas Scene (SS; Salinas Valley, CA, USA). The IP data were obtained by the NASA Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. The size of the IP data was 145 × 145, with 220 bands containing 16 kinds of ground cover. The KSC data were collected by the AVIRIS sensor in 1996 and had a size of 512 × 614, with 176 bands and 13 ground truth classes. The SS data were also collected by the AVIRIS sensor and had a size of 512 × 217, with 204 bands and 16 ground truth classes. Table 2 lists these classes and the corresponding false-color composite maps for the three data sets.
However, with the development of state-of-the-art algorithms for hyperspectral image classification, these three data sets are now easily classified. When the number of training samples was more than 800, SSRN and FDSSC achieved an accuracy higher than 98% for the three HSI data sets, and the difference between the classification accuracies of these methods was less than 1%. Therefore, in addition to the three data sets discussed above, this study included the Houston (Houston, TX, USA) data set, which was distributed for the 2013 GRSS Data Fusion Contest [28]. The Houston data are more difficult, as conventional algorithms (SSRN, FDSSC, etc.) have been unable to achieve a classification accuracy above 90% with 200 labeled training samples. The size of the Houston data was 349 × 1905, with 144 bands containing 15 kinds of ground cover. Table 3 lists the classes and corresponding false-color composite maps for this data set.

Description of Experimental Data Sets
Three common HSI data sets were used to validate the proposed AUSSC, as follows: The Indiana Pines (IP; northwestern Indiana, USA), Kennedy Space Center (KSC; Merritt Island, FL, USA), and Salinas Scene (SS; Salinas Valley, CA, USA). These IP data were obtained by the NASA Airborne Visible Imaging Spectrometer (AVIRIS) sensor. The size of the IP data was 145 × 145, with 220 bands containing 16 kinds of ground cover. The KSC data were collected by the AVIRIS sensor in 1996 and had a size of 512 × 614, with 176 bands and 13 ground truth classes. The SS data were also collected by the AVIRIS sensor and had a size of 512 × 217, with 204 bands and 9 ground truth classes. Table  2 lists these classes and the corresponding false-color composite maps for three data sets.

Description of Experimental Data Sets
Three common HSI data sets were used to validate the proposed AUSSC, as follows: The Indiana Pines (IP; northwestern Indiana, USA), Kennedy Space Center (KSC; Merritt Island, FL, USA), and Salinas Scene (SS; Salinas Valley, CA, USA). These IP data were obtained by the NASA Airborne Visible Imaging Spectrometer (AVIRIS) sensor. The size of the IP data was 145 × 145, with 220 bands containing 16 kinds of ground cover. The KSC data were collected by the AVIRIS sensor in 1996 and had a size of 512 × 614, with 176 bands and 13 ground truth classes. The SS data were also collected by the AVIRIS sensor and had a size of 512 × 217, with 204 bands and 9 ground truth classes. Table  2 lists these classes and the corresponding false-color composite maps for three data sets.

Description of Experimental Data Sets
Three common HSI data sets were used to validate the proposed AUSSC, as follows: The Indiana Pines (IP; northwestern Indiana, USA), Kennedy Space Center (KSC; Merritt Island, FL, USA), and Salinas Scene (SS; Salinas Valley, CA, USA). These IP data were obtained by the NASA Airborne Visible Imaging Spectrometer (AVIRIS) sensor. The size of the IP data was 145 × 145, with 220 bands containing 16 kinds of ground cover. The KSC data were collected by the AVIRIS sensor in 1996 and had a size of 512 × 614, with 176 bands and 13 ground truth classes. The SS data were also collected by the AVIRIS sensor and had a size of 512 × 217, with 204 bands and 9 ground truth classes. Table  2 lists these classes and the corresponding false-color composite maps for three data sets.

However, with the development of state-of-the-art algorithms for hyperspectral image classification, these three data sets are easily classified. When the number of training samples was more than 800, SSRN and FDSSC achieved an accuracy higher than 98% for the three HSI data sets, and the difference between the classification accuracies of these methods was less than 1%. Therefore, in addition to the three data sets discussed above, this study included the Houston (Houston, TX, USA) data set, which was distributed for the 2013 GRSS Data Fusion Contest [28]. The Houston data are more difficult, as conventional algorithms (SSRN, FDSSC, etc.) have been unable to achieve a classification accuracy above 90% with 200 labeled training samples. The size of the Houston data was 349 × 1905, with 144 bands containing 15 kinds of ground cover. Table 3 lists the classes and corresponding false-color composite maps for this data set.
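To give a sense of the scale of the four scenes, the sizes quoted in the text can be collected in a small lookup table. This is only an illustrative sketch: the `Scene` type and the `SCENES` name are hypothetical, and the values are copied from the description above rather than read from the data files.

```python
from typing import NamedTuple


class Scene(NamedTuple):
    """Spatial size and band count of one HSI scene (values taken from the text)."""
    height: int
    width: int
    bands: int

    @property
    def pixels(self) -> int:
        # Number of spatial positions in the scene.
        return self.height * self.width

    @property
    def values(self) -> int:
        # Total number of spectral measurements in the data cube.
        return self.pixels * self.bands


# Hypothetical registry keyed by the abbreviations used in the text.
SCENES = {
    "IP": Scene(145, 145, 220),
    "KSC": Scene(512, 614, 176),
    "SS": Scene(512, 217, 204),
    "Houston": Scene(349, 1905, 144),
}

for name, scene in SCENES.items():
    print(f"{name}: {scene.pixels} pixels, {scene.values} spectral values")
```

Only a small fraction of these pixels carries a ground truth label, so the pixel counts are an upper bound on the number of usable samples.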
A valid convolutional layer with 48 channels and a 1 × 1 spatial kernel that spanned the remaining spectral depth was placed between the alternately updated spectral and spatial blocks. This reduced the dimensions of the output of the alternately updated spectral blocks, resulting in 48 feature maps with a size of 9 × 9 × 1. By reshaping the third dimension and the channel dimension, the 48 channels with a size of 9 × 9 × 1 were merged into a single 9 × 9 × 48 channel. A valid convolutional layer with a kernel size of 3 × 3 × 48 and 64 kernels transformed the feature map into 64 channels with a size of 7 × 7 × 1.
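The shape bookkeeping in this step follows the usual rule for a valid (unpadded) convolution: along each axis, the output length is (input − kernel) / stride + 1. The helper below is a minimal sketch of that rule. The function name is hypothetical, and the kernel depth of 48 is an assumption, chosen because it is the value that maps a 9 × 9 × 48 input to the stated 7 × 7 × 1 output.

```python
def valid_conv_output(input_size, kernel_size, stride=(1, 1, 1)):
    """Per-axis output size of a 'valid' (no padding) convolution."""
    return tuple(
        (i - k) // s + 1
        for i, k, s in zip(input_size, kernel_size, stride)
    )


# Merged 9 x 9 x 48 feature map, assumed 3 x 3 x 48 kernel, stride 1.
print(valid_conv_output((9, 9, 48), (3, 3, 48)))  # (7, 7, 1)
```

The number of kernels (64 here) only sets the number of output channels and does not affect the per-channel size.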
Similar to the alternately updated spectral blocks, the alternately updated spatial block featured two convolutional kernels with sizes of 1 × 3 × 1 and 3 × 1 × 1. In stage 1 and stage 2, the output of each layer was 36 feature maps with a size of 7 × 7 × 1. The results of the two convolutional blocks were concatenated into 272 feature maps with a size of 7 × 7 × 1. Finally, the output passed through an average pooling layer with a pooling size of 21 × 1 × 1, which converted it into 272 feature maps with a size of 1 × 1 × 1. After the flattening operation, a vector with a size of 1 × 1 × C was produced by the fully connected layer, where C is the number of classes. The trainable AUSSC parameters were optimized by iterative training using Equation (6), which was used to compute the loss between the predicted and real values.
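To illustrate why small asymmetric kernels keep the parameter count low, the sketch below counts the weights of a 3D convolutional layer. The 3 × 3 × 1 baseline and the choice of 36 input and 36 output channels are assumptions made for illustration, not figures reported in the paper, and bias terms are ignored.

```python
from math import prod


def conv3d_weights(kernel, in_channels, out_channels):
    """Number of weights in a 3D convolutional layer (biases ignored)."""
    return prod(kernel) * in_channels * out_channels


# Hypothetical baseline: one full 3 x 3 x 1 spatial kernel.
full = conv3d_weights((3, 3, 1), 36, 36)

# Asymmetric split into the 1 x 3 x 1 and 3 x 1 x 1 kernels named in the text.
split = conv3d_weights((1, 3, 1), 36, 36) + conv3d_weights((3, 1, 1), 36, 36)

print(full, split)  # 11664 7776
```

Under these assumptions the split pair needs two thirds of the weights of the full kernel while covering the same 3 × 3 spatial extent once the two layers are stacked.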
The following is a summary of the advantages of the proposed AUSSC architecture. First, the use of three different small convolutional kernels reduced both the number of parameters and overfitting, thereby increasing the nonlinear representation ability of the model and the diversity of features. Compared with symmetric splitting into several identical convolutional kernels, this asymmetric splitting can handle more and richer features. Second, deep features learned by both forward and feedback connections between convolutional layers are more robust and have more high-level spectral and spatial information. Additionally, SSRN and FDSSC learn deeper features by increasing the number of convolutional layers in the blocks. However, unlike these conventional models, AUSSC can go deeper with fixed parameters due to its recurrent structure and shared weights. Finally, an auxiliary loss function was used to reduce the intraclass distance and increase the distinction between features of different categories.
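The effect of the auxiliary loss on intraclass distance can be illustrated with a small sketch. It assumes the commonly used form of center loss, half the summed squared distance between each feature vector and the center of its class. The paper's exact formulation and weighting are not shown in this section, so the function name, the toy features, and the fixed centers below are all hypothetical.

```python
def center_loss(features, labels, centers):
    """Half the summed squared distance between each feature and its class center."""
    total = 0.0
    for x, y in zip(features, labels):
        c = centers[y]
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return 0.5 * total


# Toy 2-D features for three samples from two classes (illustrative values only).
features = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
labels = [0, 0, 1]
centers = {0: [2.0, 3.0], 1: [0.0, 1.0]}

print(center_loss(features, labels, centers))  # 2.5
```

Minimizing this term pulls features of the same class toward a shared center. In training, the centers would typically be updated alongside the network rather than fixed as they are here.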

Description of Experimental Data Sets
Three common HSI data sets were used to validate the proposed AUSSC, as follows: The I Pines (IP; northwestern Indiana, USA), Kennedy Space Center (KSC; Merritt Island, FL, USA Salinas Scene (SS; Salinas Valley, CA, USA). These IP data were obtained by the NASA A Visible Imaging Spectrometer (AVIRIS) sensor. The size of the IP data was 145 × 145, with 220 containing 16 kinds of ground cover. The KSC data were collected by the AVIRIS sensor in 19 had a size of 512 × 614, with 176 bands and 13 ground truth classes. The SS data were also co by the AVIRIS sensor and had a size of 512 × 217, with 204 bands and 9 ground truth classes 2 lists these classes and the corresponding false-color composite maps for three data sets. between alternately updated spectral and spatial blocks. This reduced the dimensions of the of alternately updated spectral blocks, resulting in 48 feature maps with a size of 9 × 9 × 1 reshaping the third dimension and the channel dimension, 48 channels with a size of 9 × 9 × merged into a single 9 × 9 × 48 channel. A valid convolutional layer with a kernel size of 3 × and 64 kernels transformed the feature map into 64 channels with a size of 7 × 7 × 1. Similar to the alternately updated spectral blocks, the alternately updated spatial block fe two convolutional kernels with sizes of 1 × 3 × 1 and 3 × 1 × 1. In stage 1 and stage 2, the ou each layer 367 × 7 × 1 was 36 kernels with a size of 7 × 7 × 1. The results of two convolutional were concatenated into 272 kernels with a size of 7 × 7 × 1. Finally, the output passed throug average pooling layer with a pooling size of 21 × 1 × 1, which was converted into 272 featur with a size of 1 × 1 × 1. After the flattening operation, a vector with a size of 1 × 1 × was pro by the fully connected layer, where C is the number of classes. Trainable AUSSC parameter optimized by iterative training using Equation (6) and used to compute the loss betwe predicted and real values.
The following sections provide a summary of the advantages of this proposed A architecture. First, the use of three different small convolutional kernels reduced both the num parameters and overfitting, thereby increasing the nonlinear representation ability of the mod the diversity of features. Compared with symmetric splitting into several identical convolutional kernels, this asymmetric splitting can handle more and richer features. Second, deep features learned by both forward and feedback connections between convolutional lay more robust and have more high-level spectral and spatial information. Additionally, SSR FDSSC learn deeper features by increasing the number of convolutional layers in the blocks. Ho unlike these conventional models, AUSSC can go deeper with fixed parameters due to i structure and shared weights. Finally, an auxiliary loss function was used to reduce the intr distance and increase the distinction between features of different categories.

Description of Experimental Data Sets
Three common HSI data sets were used to validate the proposed AUSSC, as follows: The I Pines (IP; northwestern Indiana, USA), Kennedy Space Center (KSC; Merritt Island, FL, USA Salinas Scene (SS; Salinas Valley, CA, USA). These IP data were obtained by the NASA A Visible Imaging Spectrometer (AVIRIS) sensor. The size of the IP data was 145 × 145, with 220 containing 16 kinds of ground cover. The KSC data were collected by the AVIRIS sensor in 19 had a size of 512 × 614, with 176 bands and 13 ground truth classes. The SS data were also co by the AVIRIS sensor and had a size of 512 × 217, with 204 bands and 9 ground truth classes 2 lists these classes and the corresponding false-color composite maps for three data sets. A valid convolutional layer with 48 channels and a kernel size of 1 × 1 × was in between alternately updated spectral and spatial blocks. This reduced the dimensions of the of alternately updated spectral blocks, resulting in 48 feature maps with a size of 9 × 9 × 1 reshaping the third dimension and the channel dimension, 48 channels with a size of 9 × 9 × merged into a single 9 × 9 × 48 channel. A valid convolutional layer with a kernel size of 3 × and 64 kernels transformed the feature map into 64 channels with a size of 7 × 7 × 1.
Similar to the alternately updated spectral blocks, the alternately updated spatial block fe two convolutional kernels with sizes of 1 × 3 × 1 and 3 × 1 × 1. In stage 1 and stage 2, the ou each layer 367 × 7 × 1 was 36 kernels with a size of 7 × 7 × 1. The results of two convolutional were concatenated into 272 kernels with a size of 7 × 7 × 1. Finally, the output passed throug average pooling layer with a pooling size of 21 × 1 × 1, which was converted into 272 featur with a size of 1 × 1 × 1. After the flattening operation, a vector with a size of 1 × 1 × was pro by the fully connected layer, where C is the number of classes. Trainable AUSSC parameter optimized by iterative training using Equation (6) and used to compute the loss betwe predicted and real values.
The following sections provide a summary of the advantages of this proposed A architecture. First, the use of three different small convolutional kernels reduced both the num parameters and overfitting, thereby increasing the nonlinear representation ability of the mod the diversity of features. Compared with symmetric splitting into several identical convolutional kernels, this asymmetric splitting can handle more and richer features. Second, deep features learned by both forward and feedback connections between convolutional lay more robust and have more high-level spectral and spatial information. Additionally, SSR FDSSC learn deeper features by increasing the number of convolutional layers in the blocks. Ho unlike these conventional models, AUSSC can go deeper with fixed parameters due to i structure and shared weights. Finally, an auxiliary loss function was used to reduce the intr distance and increase the distinction between features of different categories.

Description of Experimental Data Sets
Three common HSI data sets were used to validate the proposed AUSSC, as follows: The I Pines (IP; northwestern Indiana, USA), Kennedy Space Center (KSC; Merritt Island, FL, USA Salinas Scene (SS; Salinas Valley, CA, USA). These IP data were obtained by the NASA A Visible Imaging Spectrometer (AVIRIS) sensor. The size of the IP data was 145 × 145, with 220 containing 16 kinds of ground cover. The KSC data were collected by the AVIRIS sensor in 19 had a size of 512 × 614, with 176 bands and 13 ground truth classes. The SS data were also co by the AVIRIS sensor and had a size of 512 × 217, with 204 bands and 9 ground truth classes 2 lists these classes and the corresponding false-color composite maps for three data sets. A valid convolutional layer with 48 channels and a kernel size of 1 × 1 × was in between alternately updated spectral and spatial blocks. This reduced the dimensions of the of alternately updated spectral blocks, resulting in 48 feature maps with a size of 9 × 9 × 1 reshaping the third dimension and the channel dimension, 48 channels with a size of 9 × 9 × merged into a single 9 × 9 × 48 channel. A valid convolutional layer with a kernel size of 3 × and 64 kernels transformed the feature map into 64 channels with a size of 7 × 7 × 1.
Similar to the alternately updated spectral blocks, the alternately updated spatial block fe two convolutional kernels with sizes of 1 × 3 × 1 and 3 × 1 × 1. In stage 1 and stage 2, the ou each layer 367 × 7 × 1 was 36 kernels with a size of 7 × 7 × 1. The results of two convolutional were concatenated into 272 kernels with a size of 7 × 7 × 1. Finally, the output passed throug average pooling layer with a pooling size of 21 × 1 × 1, which was converted into 272 featur with a size of 1 × 1 × 1. After the flattening operation, a vector with a size of 1 × 1 × was pro by the fully connected layer, where C is the number of classes. Trainable AUSSC parameter optimized by iterative training using Equation (6) and used to compute the loss betwe predicted and real values.
The following sections provide a summary of the advantages of this proposed A architecture. First, the use of three different small convolutional kernels reduced both the num parameters and overfitting, thereby increasing the nonlinear representation ability of the mod the diversity of features. Compared with symmetric splitting into several identical convolutional kernels, this asymmetric splitting can handle more and richer features. Second, deep features learned by both forward and feedback connections between convolutional lay more robust and have more high-level spectral and spatial information. Additionally, SSR FDSSC learn deeper features by increasing the number of convolutional layers in the blocks. Ho unlike these conventional models, AUSSC can go deeper with fixed parameters due to i structure and shared weights. Finally, an auxiliary loss function was used to reduce the intr distance and increase the distinction between features of different categories.

Description of Experimental Data Sets
Three common HSI data sets were used to validate the proposed AUSSC, as follows: The I Pines (IP; northwestern Indiana, USA), Kennedy Space Center (KSC; Merritt Island, FL, USA Salinas Scene (SS; Salinas Valley, CA, USA). These IP data were obtained by the NASA A Visible Imaging Spectrometer (AVIRIS) sensor. The size of the IP data was 145 × 145, with 220 containing 16 kinds of ground cover. The KSC data were collected by the AVIRIS sensor in 19 had a size of 512 × 614, with 176 bands and 13 ground truth classes. The SS data were also co by the AVIRIS sensor and had a size of 512 × 217, with 204 bands and 9 ground truth classes 2 lists these classes and the corresponding false-color composite maps for three data sets. A valid convolutional layer with 48 channels and a kernel size of 1 × 1 × was in between alternately updated spectral and spatial blocks. This reduced the dimensions of the of alternately updated spectral blocks, resulting in 48 feature maps with a size of 9 × 9 × 1 reshaping the third dimension and the channel dimension, 48 channels with a size of 9 × 9 × merged into a single 9 × 9 × 48 channel. A valid convolutional layer with a kernel size of 3 × and 64 kernels transformed the feature map into 64 channels with a size of 7 × 7 × 1.
Similar to the alternately updated spectral blocks, the alternately updated spatial block fe two convolutional kernels with sizes of 1 × 3 × 1 and 3 × 1 × 1. In stage 1 and stage 2, the ou each layer 367 × 7 × 1 was 36 kernels with a size of 7 × 7 × 1. The results of two convolutional were concatenated into 272 kernels with a size of 7 × 7 × 1. Finally, the output passed throug average pooling layer with a pooling size of 21 × 1 × 1, which was converted into 272 featur with a size of 1 × 1 × 1. After the flattening operation, a vector with a size of 1 × 1 × was pr by the fully connected layer, where C is the number of classes. Trainable AUSSC parameter optimized by iterative training using Equation (6) and used to compute the loss betwe predicted and real values.
The following sections provide a summary of the advantages of this proposed A architecture. First, the use of three different small convolutional kernels reduced both the num parameters and overfitting, thereby increasing the nonlinear representation ability of the mod the diversity of features. Compared with symmetric splitting into several identical convolutional kernels, this asymmetric splitting can handle more and richer features. Second, deep features learned by both forward and feedback connections between convolutional lay more robust and have more high-level spectral and spatial information. Additionally, SSR FDSSC learn deeper features by increasing the number of convolutional layers in the blocks. H unlike these conventional models, AUSSC can go deeper with fixed parameters due to i structure and shared weights. Finally, an auxiliary loss function was used to reduce the intr distance and increase the distinction between features of different categories.

Description of Experimental Data Sets
Three common HSI data sets were used to validate the proposed AUSSC, as follows: The I Pines (IP; northwestern Indiana, USA), Kennedy Space Center (KSC; Merritt Island, FL, USA Salinas Scene (SS; Salinas Valley, CA, USA). These IP data were obtained by the NASA A Visible Imaging Spectrometer (AVIRIS) sensor. The size of the IP data was 145 × 145, with 220 containing 16 kinds of ground cover. The KSC data were collected by the AVIRIS sensor in 19 had a size of 512 × 614, with 176 bands and 13 ground truth classes. The SS data were also co by the AVIRIS sensor and had a size of 512 × 217, with 204 bands and 9 ground truth classes 2 lists these classes and the corresponding false-color composite maps for three data sets. A valid convolutional layer with 48 channels and a kernel size of 1 × 1 × was in between alternately updated spectral and spatial blocks. This reduced the dimensions of the of alternately updated spectral blocks, resulting in 48 feature maps with a size of 9 × 9 × 1 reshaping the third dimension and the channel dimension, 48 channels with a size of 9 × 9 × merged into a single 9 × 9 × 48 channel. A valid convolutional layer with a kernel size of 3 × and 64 kernels transformed the feature map into 64 channels with a size of 7 × 7 × 1.
Similar to the alternately updated spectral blocks, the alternately updated spatial block fe two convolutional kernels with sizes of 1 × 3 × 1 and 3 × 1 × 1. In stage 1 and stage 2, the ou each layer 367 × 7 × 1 was 36 kernels with a size of 7 × 7 × 1. The results of two convolutional were concatenated into 272 kernels with a size of 7 × 7 × 1. Finally, the output passed throug average pooling layer with a pooling size of 21 × 1 × 1, which was converted into 272 featur with a size of 1 × 1 × 1. After the flattening operation, a vector with a size of 1 × 1 × was pr by the fully connected layer, where C is the number of classes. Trainable AUSSC parameter optimized by iterative training using Equation (6) and used to compute the loss betwe predicted and real values.
The following sections provide a summary of the advantages of this proposed A architecture. First, the use of three different small convolutional kernels reduced both the num parameters and overfitting, thereby increasing the nonlinear representation ability of the mod the diversity of features. Compared with symmetric splitting into several identical convolutional kernels, this asymmetric splitting can handle more and richer features. Second, deep features learned by both forward and feedback connections between convolutional lay more robust and have more high-level spectral and spatial information. Additionally, SSR FDSSC learn deeper features by increasing the number of convolutional layers in the blocks. Ho unlike these conventional models, AUSSC can go deeper with fixed parameters due to i structure and shared weights. Finally, an auxiliary loss function was used to reduce the intr distance and increase the distinction between features of different categories.

Description of Experimental Data Sets
Three common HSI data sets were used to validate the proposed AUSSC, as follows: The I Pines (IP; northwestern Indiana, USA), Kennedy Space Center (KSC; Merritt Island, FL, USA Salinas Scene (SS; Salinas Valley, CA, USA). These IP data were obtained by the NASA A Visible Imaging Spectrometer (AVIRIS) sensor. The size of the IP data was 145 × 145, with 220 containing 16 kinds of ground cover. The KSC data were collected by the AVIRIS sensor in 19 had a size of 512 × 614, with 176 bands and 13 ground truth classes. The SS data were also co by the AVIRIS sensor and had a size of 512 × 217, with 204 bands and 9 ground truth classes 2 lists these classes and the corresponding false-color composite maps for three data sets. blocks had a size of 136 9 × 9 × . A valid convolutional layer with 48 channels and a kernel size of 1 × 1 × was in between alternately updated spectral and spatial blocks. This reduced the dimensions of the of alternately updated spectral blocks, resulting in 48 feature maps with a size of 9 × 9 × 1 reshaping the third dimension and the channel dimension, 48 channels with a size of 9 × 9 × merged into a single 9 × 9 × 48 channel. A valid convolutional layer with a kernel size of 3 × and 64 kernels transformed the feature map into 64 channels with a size of 7 × 7 × 1.
Similar to the alternately updated spectral blocks, the alternately updated spatial block fe two convolutional kernels with sizes of 1 × 3 × 1 and 3 × 1 × 1. In stage 1 and stage 2, the ou each layer 367 × 7 × 1 was 36 kernels with a size of 7 × 7 × 1. The results of two convolutional were concatenated into 272 kernels with a size of 7 × 7 × 1. Finally, the output passed throug average pooling layer with a pooling size of 21 × 1 × 1, which was converted into 272 featur with a size of 1 × 1 × 1. After the flattening operation, a vector with a size of 1 × 1 × was pr by the fully connected layer, where C is the number of classes. Trainable AUSSC parameter optimized by iterative training using Equation (6) and used to compute the loss betwe predicted and real values.
The following sections provide a summary of the advantages of this proposed A architecture. First, the use of three different small convolutional kernels reduced both the num parameters and overfitting, thereby increasing the nonlinear representation ability of the mod the diversity of features. Compared with symmetric splitting into several identical convolutional kernels, this asymmetric splitting can handle more and richer features. Second, deep features learned by both forward and feedback connections between convolutional lay more robust and have more high-level spectral and spatial information. Additionally, SSR FDSSC learn deeper features by increasing the number of convolutional layers in the blocks. Ho unlike these conventional models, AUSSC can go deeper with fixed parameters due to i structure and shared weights. Finally, an auxiliary loss function was used to reduce the intr distance and increase the distinction between features of different categories.

Description of Experimental Data Sets
Three common HSI data sets were used to validate the proposed AUSSC: Indian Pines (IP; northwestern Indiana, USA), Kennedy Space Center (KSC; Merritt Island, FL, USA), and Salinas Scene (SS; Salinas Valley, CA, USA). The IP data were obtained by the NASA Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. The size of the IP data was 145 × 145, with 220 bands, containing 16 kinds of ground cover. The KSC data were collected by the AVIRIS sensor in 1996 and had a size of 512 × 614, with 176 bands and 13 ground truth classes. The SS data were also collected by the AVIRIS sensor and had a size of 512 × 217, with 204 bands and 16 ground truth classes. Table 2 lists these classes and the corresponding false-color composite maps for the three data sets.
However, with the development of state-of-the-art algorithms for hyperspectral image classification, these three data sets are now easily classified. When enough training samples were available, SSRN and FDSSC achieved an accuracy higher than 98% for the three HSI data sets, and the difference between the classification accuracies of these methods was less than 1%. Therefore, in addition to the three data sets discussed above, this study included the Houston (Houston, TX, USA) data set, which was distributed for the 2013 GRSS Data Fusion Contest [28]. The Houston data are more difficult to classify, and conventional algorithms (SSRN, FDSSC, etc.) have been unable to achieve high classification accuracy with 200 labeled training samples. The size of the Houston data was 349 × 1905, with 144 bands, containing 15 kinds of ground cover. Table 3 lists the classes and the corresponding false-color composite map.
A subset of 200 labeled samples was used for training and 100 labeled samples were used for validation. Training sets of 400, 600, 800, and 1000 samples were then used to test the robustness and generalizability of the proposed AUSSC under different conditions.
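The sampling protocol (200 training samples, 100 validation samples, and the remaining labeled samples for testing) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the uniform (non-stratified) sampling, and the total of 10,000 labeled pixels are assumptions made for the example.

```python
import random

def split_indices(n_labeled, n_train, seed):
    """Randomly split labeled sample indices into train/validation/test.

    The validation set is half the size of the training set and all
    remaining labeled samples form the test set.
    """
    n_val = n_train // 2
    if n_train + n_val > n_labeled:
        raise ValueError("not enough labeled samples")
    rng = random.Random(seed)          # fixed seed gives a reproducible split
    indices = list(range(n_labeled))
    rng.shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

# 10000 is a placeholder count of labeled pixels, not a value from the text.
train, val, test = split_indices(n_labeled=10000, n_train=200, seed=0)
print(len(train), len(val), len(test))
```

Repeating the call with `n_train` set to 400, 600, 800, and 1000, and with several seeds, would reproduce the structure of the experiments, although whether the original sampling was stratified by class is not stated.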

Framework Setting
The framework for all data sets was established as follows. Using 10 random seeds, each data set was randomly divided into three groups: a training set, a validation set, and a test set. The training sets were used to optimize model parameters. The validation sets were not directly used in the training process and were only included to verify whether the model was overfitting. The test sets were used to evaluate the performance of the model after training was completed. The number of validation samples was half the number of training samples, and the remaining samples formed the test set. The batch size was set to 16 and the Adam [29] optimizer was used for stochastic optimization. Model weights were initialized using the He normal distribution method [30] for all 3D convolutional layers and the Xavier normal distribution method [31] for the fully connected layer. We used a variable learning rate, which was gradually reduced during the optimization process, because smaller steps are needed as the optimizer approaches a minimum of the loss. The number of training epochs was set to 400 and the initial learning rate was set to 0.0001 for the IP, KSC, and SS data sets and 0.0003 for the Houston data set. The learning rate was halved when the validation loss did not decrease after 10 epochs.
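The learning-rate rule described above can be sketched in plain Python. The class name and the exact reading of "did not decrease after 10 epochs" (10 consecutive epochs without a new best validation loss) are assumptions; the losses in the example are synthetic.

```python
class HalveOnPlateau:
    """Halve the learning rate when validation loss stops decreasing."""

    def __init__(self, initial_lr, patience=10, factor=0.5):
        self.lr = initial_lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr

scheduler = HalveOnPlateau(initial_lr=0.0001)
losses = [1.0, 0.9] + [0.95] * 10  # synthetic: improvement, then a plateau
for loss in losses:
    lr = scheduler.step(loss)
print(lr)
```

Deep learning frameworks ship comparable callbacks (for example, Keras `ReduceLROnPlateau`), whose patience semantics may differ slightly from this sketch.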
In addition to these basic settings, four key factors were used to configure the AUSSC framework for HSI classification: (1) the number of convolutional layers and loops in one block of stage 2; (2) the number of convolutional kernels in alternately updated blocks; (3) the spatial size of input cubes; and (4) the coefficient of the center loss function. The effect of each factor on the overall accuracy (OA) for the IP, KSC, and SS data sets is discussed below.
First, the number of convolutional layers and loops in each block of stage 2 determined the depth of the entire network, which consequently affected classification accuracy and runtime. As shown in Figure 6, appropriately increasing the number of convolutional layers and the number of loops improved classification. However, training time increased almost linearly with network depth. Therefore, we used two convolutional layers and only one loop in each block to conserve training time.
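The trade-off between loops and parameters can be illustrated with a toy model. The sketch below is a strong simplification and not the AUSSC block itself: scalar values stand in for feature maps, the update rule (each layer recomputed from all other layers) is an assumed reading of the alternately updated structure, and the weight values are hypothetical. It only shows that extra loops add computation while the number of shared weights stays fixed.

```python
def relu(x):
    return max(0.0, x)

def alternately_updated_block(features, weights, loops):
    """Refine each layer's feature from all other layers, `loops` times.

    features: list of floats, one toy "feature" per layer.
    weights:  weights[i][j] connects layer j to layer i (i != j) and is
              reused in every loop, so the parameter count is fixed.
    """
    features = list(features)
    n = len(features)
    for _ in range(loops):
        for i in range(n):
            total = sum(weights[i][j] * features[j] for j in range(n) if j != i)
            features[i] = relu(total)
    return features

weights = [[0.0, 0.5], [0.8, 0.0]]  # two layers, hypothetical values
n_params = sum(1 for i in range(2) for j in range(2) if i != j)
for loops in (1, 2, 3):
    out = alternately_updated_block([1.0, 1.0], weights, loops)
    print(loops, n_params, out)
```

In this toy setting the work per forward pass grows linearly with the number of loops, which is consistent with the near-linear growth in training time reported above.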
Remote Sens. 2019, 11, x FOR PEER REVIEW
Second, increasing the number of convolutional kernels often allowed richer features to be extracted. If enough convolutional kernels were provided, abstract high-order structures could be efficiently learned from the convolutional layer. As shown in Figure 7, the overall accuracy (OA) of the AUSSC was weakly positively related to the number of convolutional kernels, which had little effect on training time. Considering the performance of the AUSSC on the three data sets, the number of kernels in the first convolutional layer was set to 64 and the number of kernels in the two blocks was set to 36.
Figure 7. The overall accuracy of the AUSSC with different numbers of convolutional kernels in the first layer using two blocks. The a + b notation on the x-axis denotes an AUSSC with a kernels in the first convolutional layer and b kernels in the two blocks.
Third, a larger input spatial size allowed more spatial information to be extracted. Input samples with spatial sizes of 5 × 5, 7 × 7, 9 × 9, and 11 × 11 were used for the three data sets. As shown in Figure 8, the OAs for the IP, KSC, and SS data sets increased with increasing input spatial size. However, for inputs with spatial sizes greater than or equal to 9 × 9, the increase in OA was less than 1%. Considering the computational cost, the 9 × 9 spatial size was selected for all data sets to test the performance of the AUSSC framework.
Finally, the coefficient of the center loss also played an important role in the proposed AUSSC. The coefficient of the L2 loss was set to 0.0001 and the candidate values of the center loss coefficient were 0, 0.1, 0.01, and 0.001. As shown in Figure 9, the center loss could not be used directly as an objective function. However, as an auxiliary objective function, the center loss slightly increased the overall classification accuracy. When the coefficient of center loss was set to 0.001, the OA of the AUSSC on the IP and SS data sets increased slightly, whereas the OA on the KSC data set increased by nearly 1%. As such, the coefficient of center loss was set to 0.001.
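The way the coefficients combine the loss terms can be sketched for a single sample. The sketch assumes the commonly used center-loss definition (half the squared Euclidean distance between a feature vector and its class center) and a sum-of-squares L2 penalty; the normalization used in the paper may differ. All function names and numerical inputs are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])

def center_loss(feature, center):
    """Half the squared Euclidean distance to the class center (assumed form)."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def l2_penalty(weights):
    """Sum of squared weights (assumed form of the L2 loss)."""
    return sum(w * w for w in weights)

def total_loss(probs, label, feature, centers, weights,
               center_coef=0.001, l2_coef=0.0001):
    """Cross-entropy plus weighted center-loss and L2 terms for one sample."""
    return (cross_entropy(probs, label)
            + center_coef * center_loss(feature, centers[label])
            + l2_coef * l2_penalty(weights))

# Synthetic one-sample example; all numbers are made up for illustration.
probs = [0.7, 0.2, 0.1]
feature = [1.0, 2.0]
centers = {0: [0.0, 0.0], 1: [1.0, 1.0], 2: [2.0, 2.0]}
weights = [0.5, -0.5]
loss = total_loss(probs, 0, feature, centers, weights)
print(loss)
```

With a coefficient of 0.001 the center term contributes only a small correction to the cross-entropy, which fits its role as an auxiliary rather than primary objective.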

Experimental Results
In this section, we compare the proposed AUSSC framework with deep learning-based methods, including SAE-LR [14], CNN [18], SSRN [21], 3D-GAN [24], and FDSSC [22]. As SSRN, FDSSC, and the proposed AUSSC are all 3D CNN-based methods, the input spatial size was fixed at 9 × 9 to allow a fair comparison. Ten groups of 200 training samples were randomly selected from the IP, KSC, SS, and Houston data sets. The classification accuracy indices for the experiment included the OA, average accuracy (AA), and kappa coefficient (K). The results for these three metrics are reported as mean ± standard deviation. The original hyperspectral data were normalized to zero mean and unit standard deviation. The dimensions of the image block were the same as those of the original hyperspectral data. Figures 10-13 show classification results obtained from the IP, KSC, SS, and Houston data sets using different algorithms.

Tables 4-7 display the OA, AA, kappa coefficient, and accuracy of each category for the IP, KSC, SS, and Houston data sets; the best accuracy is shown in bold. These experimental results show that our proposed AUSSC method is superior to early deep learning methods (SAE-LR and CNN), the novel 3D-GAN, and recent 3D CNN-based methods (SSRN and FDSSC).
Table 4. Overall accuracy (OA), average accuracy (AA), kappa coefficient (K), and accuracy for each HSI category in the Indian Pines (IP) data set. Data are given as mean ± standard deviation.
Table 5. OA, AA, K, and accuracy for each HSI category in the Kennedy Space Center (KSC) data set.

SAE-LR CNN SSRN 3D-GAN FDSSC AUSSC
For the KSC data set, AUSSC achieved the best accuracy in 7 of 13 categories, producing results similar to those of 3D-GAN for Class 10 (Cattail marsh) and Class 11 (Salt marsh). 3D-GAN achieved significantly better results for Class 5 (Oak), Class 6 (Hardwood), and Class 7 (Swamp). However, its accuracy for Class 2 (Willow swamp) and Class 8 (Graminoid marsh) was approximately 20% lower than that of our method. As shown in Table 6, the values of OA and K obtained using AUSSC were 0.63% and 0.71% higher than those produced by FDSSC, which exhibited the second-best performance for the SS data set. AUSSC achieved similar or better results than FDSSC across all 16 categories in the SS data set. As shown in Table 7, the values of OA, AA, and K obtained using AUSSC were 1.81%, 1.91%, and 1.95% higher than those obtained by FDSSC, which exhibited the second-best performance for the Houston data set. CNN achieved excellent results for Class 2 (Grass Stressed), Class 5 (Soil), and Class 15 (Running Track). However, the accuracy of CNN for Class 10 (Highway), Class 12 (Parking Lot 1), and Class 13 (Parking Lot 2) was approximately 40% lower than that of our method.
These experimental results indicate that AUSSC achieved the best performance in terms of OA and K for all four HSI data sets. Other methods, especially CNN, were superior to our method in some categories but performed poorly in others, and these poorly performing categories dramatically reduced their OA, AA, and K.
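The three accuracy indices follow their standard definitions and can be computed from a confusion matrix, as in the sketch below. The function name and the two-class confusion matrix are illustrative only and are not taken from the experiments.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    diagonal = sum(confusion[i][i] for i in range(n_classes))

    # OA: fraction of all samples classified correctly.
    oa = diagonal / total

    # AA: mean of the per-class accuracies (recall of each class).
    aa = sum(confusion[i][i] / sum(confusion[i])
             for i in range(n_classes)) / n_classes

    # Kappa: observed agreement corrected for chance agreement.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa

confusion = [[8, 2], [1, 9]]  # synthetic two-class example
oa, aa, kappa = classification_metrics(confusion)
print(oa, aa, kappa)
```

Because AA weights every class equally, a method that fails on a few small classes can lose far more AA than OA, which is why a handful of poorly classified categories can pull all three indices down.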
With the exception of the 3D-GAN results, which were obtained from the literature, the classification results shown in Tables 4-7 were obtained by training and testing on a desktop computer with 32 GB of memory equipped with an NVIDIA GTX 1080Ti GPU. Table 8 shows the mean and standard deviation of the training and testing times over 10 runs for the CNN-based methods; the minimum time is shown in bold. As shown in Table 8, the training times for deep 3D CNN-based methods were longer than those of other deep learning-based methods, and the AUSSC required a longer training time than SSRN or FDSSC. For AUSSC applied to the IP data set, the number of floating-point operations (FLOPs) was 5362.386 K and the number of parameters was 761.064 K.
To corroborate the robustness and generalizability of the proposed method, Figures 14 and 15 show the OA obtained using different methods for different numbers of training samples. When the number of training samples was higher than 400, our method performed similarly to SSRN and FDSSC. This is because the OA of SSRN and FDSSC reached more than 98%, leaving only a small gap between our method and these conventional techniques. This also demonstrates that the three data sets published more than 10 years ago are easily classified by state-of-the-art methods. The Houston data set, provided by the University of Houston for the 2013 IEEE GRSS Data Fusion Contest, is more challenging. As shown in Figure 15, it discriminates between AUSSC and the other methods more clearly than the other three data sets: the resulting difference in OA between AUSSC, FDSSC, and SSRN was more than 1%.

Discussion
In this study, a highly limited number of training samples (200) was used to demonstrate that our proposed method can reduce data dependence. Insufficient labeled data are unavoidable in remote sensing applications. Additionally, the collection and labeling of remote sensing data are complex and expensive. Thus, it is very difficult to build large-scale, high-quality labeled sets. The number of labeled samples used for training is the most important factor in supervised deep-learning methods, as data dependence is one of the most serious problems in deep learning. Compared with traditional machine-learning methods, deep learning relies heavily on large-scale training data, which are necessary to understand potential patterns. Semi-supervised 3D-GANs also require approximately 200 training samples; however, their classification accuracy is significantly lower.
The proposed method offers three principal benefits. First, it provides an end-to-end framework for HSI classification, whereas SAE-LR, CNN, and 3D-GAN all require PCA to preprocess hyperspectral data. Second, in 3D CNN-based methods, the CNN architecture and the convolutional kernels determine classification accuracy. Early 3D CNN-based networks [20] include only two convolutional layers with 3 × 3 × m convolutional kernels. SSRN and FDSSC use residual blocks, dense blocks, and two different convolutional kernels to learn deep spectral and spatial features. The biggest difference between AUSSC and the 3D CNN-based methods discussed above is its use of recurrent CNN architectures and three 1D convolutional kernels. Alternately updated blocks can learn not only deep spectral and spatial features but also refined spectral and spatial features. In addition, three 1D convolutional kernels can be combined to generate more abundant features. As a result, AUSSC achieved better classification accuracy than current state-of-the-art deep learning-based methods. Finally, whereas these other methods use only a cross-entropy objective function, AUSSC also introduces center loss as an auxiliary objective function to learn more discriminating features.
Although the proposed method provides better performance than conventional architectures (especially SSRN and FDSSC), it has a much higher computational requirement (see Table 8). There are three primary reasons for this. First, AUSSC uses more convolutional kernels in two blocks than SSRN and FDSSC. Second, the use of the center loss function increases the computational cost. Finally, and most importantly, more training epochs are used in AUSSC than in SSRN and FDSSC. In fact, the training time for one epoch in AUSSC is only slightly longer than in FDSSC or SSRN, but AUSSC requires far more training epochs. Upgrading to newer, high-performance graphics cards, such as the NVIDIA GeForce RTX 2080Ti, could effectively alleviate this problem.

Conclusions
In this study, refined spectral and spatial features in HSIs were used as core concepts to design an end-to-end CNN-based framework for HSI classification. This alternately updated spectral-spatial convolutional network utilizes alternately updated spectral and spatial blocks and primarily includes small convolutional kernels in three different dimensions to learn HSI features, combining them into advanced features.
The learning of deep refined spectral and spatial features by alternately updated blocks allows our method to achieve higher classification accuracy than other deep learning-based methods. Furthermore, experimental results demonstrated that the center loss function can slightly improve the classification accuracy of hyperspectral images. When 200 training samples were used from each HSI data set, the AUSSC achieved the highest classification accuracy among the deep learning-based methods for all four data sets. Additionally, with different numbers of training samples, the AUSSC was also found to be the best method in terms of OA for all HSI data sets. However, the AUSSC has a longer training time than other conventional algorithms. In a future study, network pruning will be used to reduce the computational cost of the deep model.