A Lightweight Spectral–Spatial Feature Extraction and Fusion Network for Hyperspectral Image Classification

Abstract: Hyperspectral image (HSI) classification accuracy has been greatly improved by employing deep learning. Current research mainly focuses on how to build a deep network to improve accuracy. However, these networks tend to be complex and have many parameters, which makes the models difficult to train and easy to overfit. Therefore, we present a lightweight deep convolutional neural network (CNN) model called S2FEF-CNN. In this model, three S2FEF blocks are used for joint spectral–spatial feature extraction. Each S2FEF block uses 1D spectral convolution to extract spectral features and 2D spatial convolution to extract spatial features, and then fuses the spectral and spatial features by multiplication. Instead of a fully connected layer, two pooling layers follow the three blocks for dimension reduction, which further reduces the number of training parameters. We compared our method with several state-of-the-art deep-network-based HSI classification methods on three commonly used hyperspectral datasets. The results show that our network achieves comparable classification accuracy with significantly fewer parameters than the above deep networks, which reflects its potential advantages in HSI classification.


Introduction
Hyperspectral remote sensing is a focus of the remote sensing field and has been applied to crop management, image segmentation, object recognition, etc. [1][2][3][4]. Hyperspectral image classification often plays the most important role in these applications. In an HSI, each pixel is considered a high-dimensional vector, so HSI classification is essentially the prediction of a specific category for each pixel according to its characteristics [5].
According to the way in which features are acquired, we divide HSI classification methods into two categories: those that extract HSI features manually and those that extract them automatically.
The traditional HSI classification methods belong to the first category; most of them analyze the HSIs and extract shallow features for classification. The most prominent feature of HSI is its rich spectral information, so early research concentrated on acquiring accurate and efficient spectral characteristics [6][7][8][9][10]. For example, the authors of [6] and [7] used spectral angle or spectral divergence for pixel matching. In addition, the authors of [8][9][10] used statistics-based methods, completing the classification by learning from labeled samples. In [11], a more accurate feature extraction based on Principal Component Analysis (PCA) was used. However, HSI data include not only the spectral features of each pixel but also the spatial relationships between pixels, and using only spectral information for classification leads to low accuracy. Therefore, current research on HSI classification increasingly exploits spectral and spatial features jointly.

Related Work
One-dimensional-CNN uses spectral features for classification, in which the data input is usually a single pixel. The feature map extracted by 1D convolution reflects the characteristics of the spectrum. A typical 1D convolution and 1D-CNN classification framework [24] are shown in Figure 1. In this framework, there are several layers: a 1D convolutional layer, a pooling layer, a fully connected (FC) layer, and an output layer.
Two-dimensional-CNN uses spatial features for classification. With spatial feature extraction, there is redundancy between spectral bands. Therefore, this framework is usually combined with PCA to reduce the spectral dimension before spatial feature extraction [18,36]. First, the HSIs are preprocessed by PCA for dimension reduction. Then, diverse spatial features are extracted by 2D convolution kernels. A typical 2D convolution and the classification framework of 2D-CNN [15] are shown in Figure 2. After PCA dimension reduction, the framework contains two 2D convolutional layers, two pooling layers, and an output layer.
As described above, in current HSI classification research there are two kinds of spectral-spatial feature-based CNN classification methods: 3D-CNN and hybrid CNN. The 3D-CNN method extracts features in the spectral and spatial dimensions simultaneously through 3D convolution, whereas the hybrid CNN selects some of the above-mentioned types of CNNs and fuses them in a sequential way or in a multi-channel way. Figure 3 shows the schematic diagram of these two classes: (a) a 3D-CNN framework [30]; (b) a sequential 3D-2D hybrid CNN framework [38]; (c) a multi-channel 1D-2D hybrid CNN framework [37].
In summary, 1D-CNN makes full use of the spectral features of the HSI but lacks spatial features. Two-dimensional-CNN has to deal with the high spectral dimension before extracting spatial features; PCA is often used to reduce the spectral dimension, which simplifies the process but causes information loss and destroys the internal structure of the data. Three-dimensional-CNN makes feature extraction and fusion easier, but the prominent problem is that the model suffers from a heavy parameter load. If only the number of network parameters is taken into account, the dual-channel hybrid CNN method in Figure 3c seems more suitable and effective.

Proposed Methodology
In this section, we first illustrate the structure of the elementary block of our model, and then show in detail how the block extracts and fuses the features. Finally, we elaborate on the architecture of the S2FEF-CNN.

S2FEF Block
Details of the basic S2FEF block are demonstrated in Figure 4.

The S2FEF block contains two stages: one is for feature extraction and the other is for feature fusion.
In the first stage, spectral and spatial features are extracted by 1D/2D convolutional kernels in the spectral and spatial channels, respectively. This step is formulated as follows:

f_et^j(x_i^j) = x_i^j ⊛ W_et^j + b_et^j, (1)

f_at^j(x_i^j) = x_i^j ⊛ W_at^j + b_at^j, (2)

where x_i^j denotes the input HSI data, i = 1, …, I (I is the number of inputs); W_et^j, b_et^j and W_at^j, b_at^j are the weights and biases of spectral/spatial kernel t in layer j, respectively, t = 1, …, K_k (K_k is the number of kernels); j is the layer index; f is the feature extractor; ⊛ denotes convolution; and the subscripts e and a denote spectral and spatial, respectively.
In the following stage, f_et^j(x_i^j) and f_at^j(x_i^j) are fused in three steps. (1) Features from the two channels are fused by element-wise multiplication.
After the feature extraction above, we directly fuse the spectral and spatial features by element-wise multiplication (EWM). Compared with feature concatenation, EWM does not increase the feature dimension, and it adjusts the spectral features by the spatial information to a certain extent.
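The dimensional effect can be seen concretely in a small NumPy sketch (the cube shape 9 × 9 × 200 is a hypothetical example, not a value from the paper): EWM preserves the feature shape, while concatenation doubles the band dimension.

```python
import numpy as np

# Hypothetical feature cubes from the two channels: height x width x bands
f_spe = np.random.rand(9, 9, 200)   # spectral-channel features
f_spa = np.random.rand(9, 9, 200)   # spatial-channel features

fused_ewm = f_spe * f_spa                            # element-wise multiplication: shape unchanged
fused_cat = np.concatenate([f_spe, f_spa], axis=-1)  # concatenation: band dimension doubles

print(fused_ewm.shape)  # (9, 9, 200)
print(fused_cat.shape)  # (9, 9, 400)
```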
(2) The maximum element across the different feature cubes is selected to produce the final feature cube.
Remote Sens. 2020, 12, 1395 6 of 17

(3) The maximal feature cube selected by Equation (5) is added to the original input cube x_i^j to form a more accurate output cube x_i^(j+1). Finally, a rectified linear unit (ReLU) is used for activation.
The above process is summarized in Algorithm 1.

Algorithm 1: Feature Extraction with S2FEF Block
Input: A joint spectral-spatial feature map F, spectral/spatial kernel sizes S_pe/S_pa, and kernel number k.
Output: A new joint spectral-spatial feature map F'.
1. begin
2. Extract spectral/spatial features f_spe/f_spa with k spectral/spatial kernels (sizes 1 × 1 × S_pe and S_pa × S_pa × 1).
3. Fuse the spectral and spatial features by element-wise multiplication (f_spe × f_spa) to get the joint features f_joint.
4. Select the max value from the corresponding pixels in f_joint to form a special feature map F'.
5. Return F'.
6. end
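The steps above can be sketched in NumPy as follows. This is an illustrative re-implementation under our own assumptions (random stand-in weights, 'same' padding, a single input cube of shape H × W × B); the actual network learns the kernel weights and processes batches.

```python
import numpy as np

def s2fef_block(x, k=4, s_pe=3, s_pa=3, seed=0):
    """Sketch of one S2FEF block on a single HSI cube x of shape (H, W, B)."""
    rng = np.random.default_rng(seed)
    H, W, B = x.shape
    p_e, p_a = s_pe // 2, s_pa // 2
    fused = []
    for _ in range(k):
        # 1D spectral convolution along the band axis ('same' padding), cf. Eq. (1)
        w_spe = rng.standard_normal(s_pe) * 0.1
        xe = np.pad(x, ((0, 0), (0, 0), (p_e, p_e)))
        f_spe = np.stack(
            [(xe[:, :, b:b + s_pe] * w_spe).sum(axis=-1) for b in range(B)], axis=-1)
        # 2D spatial convolution applied band-by-band ('same' padding), cf. Eq. (2)
        w_spa = rng.standard_normal((s_pa, s_pa)) * 0.1
        xa = np.pad(x, ((p_a, p_a), (p_a, p_a), (0, 0)))
        f_spa = np.zeros((H, W, B))
        for i in range(H):
            for j in range(W):
                f_spa[i, j, :] = np.tensordot(
                    w_spa, xa[i:i + s_pa, j:j + s_pa, :], axes=([0, 1], [0, 1]))
        # step 1: fuse the two channels by element-wise multiplication
        fused.append(f_spe * f_spa)
    # step 2: element-wise max over the k fused feature cubes
    f_max = np.max(np.stack(fused, axis=0), axis=0)
    # step 3 (from the block description): residual addition of the input, then ReLU
    return np.maximum(f_max + x, 0.0)
```

Because the output shape equals the input shape, blocks of this form can be stacked, which is what allows three S2FEF blocks to be chained in the full network.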


S2FEF-CNN Architecture
The proposed network mainly consists of three steps: (1) Step 1: extracting spectral-spatial joint features by three S2FEF blocks; (2) Step 2: reducing spectral and spatial dimensions of the joint features by two pooling layers; (3) Step 3: determining the pixel label via a softmax layer after flattening the joint features from Step 2 into a vector.
The architecture of the proposed S2FEF-CNN is shown in Figure 5. Define x_i ∈ R^(m_1 × m_1 × N_1) (i = 1, 2, …, K) as the input HSI cube and ŷ_i as the output label of x_i. The process is defined as follows:

ŷ_i = δ_s(δ_p(S2FEF(x_i))),

where S2FEF denotes the operator of the proposed spectral-spatial feature extraction and fusion, δ_p denotes the two max pooling operators, and δ_s denotes the softmax classification.
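A minimal NumPy sketch of the composition δ_s(δ_p(S2FEF(·))) follows. The feature-cube size, the 16-class softmax layer, and the split of the two pooling layers into one spatial and one spectral pooling are our own illustrative assumptions.

```python
import numpy as np

def spatial_max_pool(x, s=2):
    # non-overlapping max pooling over the spatial axes of an (H, W, B) cube
    H, W, B = x.shape
    return x.reshape(H // s, s, W // s, s, B).max(axis=(1, 3))

def spectral_max_pool(x, s=2):
    # non-overlapping max pooling over the band axis
    H, W, B = x.shape
    return x.reshape(H, W, B // s, s).max(axis=3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
feat = rng.random((16, 16, 8))                       # stand-in for the S2FEF output
pooled = spectral_max_pool(spatial_max_pool(feat))   # delta_p -> (8, 8, 4)
v = pooled.reshape(-1)                               # flatten; no FC layer is used
W_s = rng.standard_normal((v.size, 16)) * 0.1        # softmax-layer weights, 16 classes assumed
y_hat = softmax(v @ W_s)                             # delta_s: class probabilities
```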
Most of the commonly used 2D-CNN classification methods use PCA to preprocess HSIs for dimension reduction, but PCA introduces the problem of spectral information loss. In our architecture, we abandoned PCA and used the full spectrum of HSIs as input, and two pooling layers were used for feature dimension reduction.
CNN-based classification frameworks usually have one or two FC layers to integrate the features before the final classification. However, the FC layer is a heavily weighted layer, which may account for 80% of the parameters of a network. Hence, we also dropped the FC layer in our architecture to reduce the number of network parameters.
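A back-of-the-envelope count makes the point; the sizes below (flattened feature length, FC width, class count) are hypothetical, not taken from the paper.

```python
# Hypothetical sizes: flattened feature length F, FC width D, C output classes
F, D, C = 2000, 1024, 16

with_fc = F * D + D + D * C + C  # FC layer (weights + biases) plus output layer
no_fc = F * C + C                # softmax output layer only

print(with_fc)  # 2065424
print(no_fc)    # 32016
```

Dropping the FC layer removes the dominant F*D term, which is why the layer can account for most of a network's parameters.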
These architectural improvements make our proposed network lightweight. Experimental results show that, even with only a few thousand parameters, our network achieves classification accuracy comparable to that of heavyweight state-of-the-art deep networks.
The above steps are summarized in Algorithm 2.

Datasets
We evaluated our work on three public hyperspectral datasets, Indian Pines (IP), Salinas (SA), and Pavia University (PU), captured by two different sensors: AVIRIS and ROSIS-03. AVIRIS provides HSIs with 224 contiguous spectral bands covering wavelengths from 0.4 to 2.5 µm at a spatial resolution of 20 m/pixel, while ROSIS-03 delivers HSIs in 115 bands covering 0.43 to 0.86 µm at a spatial resolution of 1.3 m/pixel. IP and SA are two commonly used AVIRIS datasets: IP was captured over Northwestern Indiana and is 145 × 145 pixels in size, and SA was recorded over Salinas Valley and is 512 × 217 pixels; both contain 16 ground-truth classes. The PU scene was captured by ROSIS-03, is 610 × 340 pixels in size, and contains nine classes. In our experiments, we used corrected versions with 200 and 204 bands for IP and SA, respectively, and 103 bands for PU, after removing the noisy bands and a blank strip.

Parameters Setting
In our network, three S2FEF blocks were used for all the three datasets. In each block, the number and size of convolutional kernels were the same for 1D convolution and 2D convolution. We empirically set the parameters just as the other networks do. The spectral kernel size was 1×3, the spatial kernel size was 3 × 3, and the kernel number was 4.
The input HSI cube size was set differently for each dataset. Unlike some networks that use a small input, we wanted the original input cube to contain enough spatial information, and therefore chose a large input size. For the IP dataset, the size was 19 × 19 × N (i.e., m = 19), where N is the number of bands. For the PU dataset, the size was 15 × 15 × N, while for the SA dataset it was 21 × 21 × N.
We compared our experimental results with four well-performing deep-network-based HSI classification methods: SAE [39], 1D-CNN [24], 3D-CNN [30], and DC-CNN [36]. First, we compare the number of parameters in Section 4.3; the classification accuracy results are described in the following sections.

Comparison of Parameter Numbers
In our S2FEF-CNN architecture, the parameters are contained in the S2FEF blocks and in the final output layer. Detailed analyses are given in Tables 1 and 2, respectively. Each S2FEF block has the same number of parameters. For the entire network, a large share of the parameters lies in the softmax layer, whose size depends on the characteristics of the dataset: the Indian Pines and Salinas datasets have more input spectral bands and more ground-truth classes, so they require more parameters, while Pavia University has fewer bands and fewer classes, so it requires fewer.
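The per-block and output-layer counts can be reproduced with short helpers. The counting convention (each kernel pairing a spectral and a spatial filter, each with its own bias) and the flattened feature length used below are our assumptions for illustration.

```python
def s2fef_block_params(k=4, s_pe=3, s_pa=3):
    # each of the k kernels pairs a 1 x s_pe spectral filter with an
    # s_pa x s_pa spatial filter, each carrying its own bias term
    return k * (s_pe + 1) + k * (s_pa * s_pa + 1)

def softmax_layer_params(flat_len, n_classes):
    # weights plus biases of the final softmax (output) layer
    return flat_len * n_classes + n_classes

blocks = 3 * s2fef_block_params()      # 3 * 56 = 168
head = softmax_layer_params(2000, 16)  # hypothetical flattened length and class count
print(blocks + head)  # 32184
```

Under this convention the blocks contribute only a few hundred parameters; the softmax layer dominates, and its size scales with the band and class counts, which matches the dataset-dependent totals described above.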
For SAE, 1D-CNN, 3D-CNN, and DC-CNN, we used the same architectures and parameter settings as in their papers. For settings not explicitly given in a paper, we adopted values commonly used in HSI classification (e.g., a pooling stride of 2). Table 3 shows the comparison results in detail, and Table 4 gives the parameter counts of S2FEF-CNN as percentages of those of the other networks. Our proposed network clearly works well with the fewest parameters: in most cases its parameter count is no more than 5% of that of the other deep networks, and at most it is below 8%. This appears to be a feasible way to address the problem of heavy weights when training a deep network.

Results of the Indian Pines Dataset
This dataset differs somewhat from the others. Its most notable characteristic is that some classes do not have enough samples; a few have fewer than 20 labeled pixels. Therefore, we adopted the method of [30] and split the labeled data in a 1:1 ratio for training and testing.
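A 1:1 per-class split can be sketched as follows (our own illustrative helper, not the exact procedure of [30]):

```python
import numpy as np

def split_per_class(labels, ratio=0.5, seed=0):
    """Split labeled pixel indices class-by-class (stratified train/test split)."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        n_train = max(1, int(len(idx) * ratio))  # keep at least one sample per class
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return np.array(train), np.array(test)
```

Splitting within each class keeps even the smallest classes (such as Oats) represented in both the training and test sets.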
The classification results are shown in Figures 6 and 7.
In this paper, three commonly used metrics were adopted to evaluate the classification performance: the overall accuracy (OA), the average accuracy (AA), and the Kappa coefficient. From the results listed above, we can see that S2FEF-CNN works well even with only a few thousand parameters. For all of the methods, the OAs of the 16 classes vary widely; for example, the class Oats has a significantly lower OA because it has fewer labeled samples than the other categories. Figure 6 shows the OA curves clearly as a line diagram.
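For reference, the three metrics can be computed from a confusion matrix as follows (a standard formulation, shown here as a sketch):

```python
import numpy as np

def oa_aa_kappa(conf):
    """Compute OA, AA, and Kappa from a confusion matrix (rows = true classes)."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    oa = np.trace(conf) / n                         # overall accuracy
    aa = (np.diag(conf) / conf.sum(axis=1)).mean()  # mean of per-class accuracies
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

OA weights every sample equally, AA weights every class equally, and Kappa corrects OA for agreement expected by chance, which is why all three are reported together for imbalanced datasets like Indian Pines.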


Results of the Pavia University Dataset
In this dataset, there are enough labeled samples, so we set the split ratio to 2:8 for training and testing. The results are shown in Figures 8-10, where Figure 9 shows the training accuracy and loss curves. As can be seen from Figure 8, S2FEF-CNN also performs well on the PU dataset, with good classification ability for all nine classes: the accuracy of each class exceeds 94%, and there is little difference in OA between categories. Figure 9 shows the training dynamics on the PU dataset. All of the methods converged after 100 epochs; DC-CNN and SAE converged fastest, while the curve of S2FEF-CNN oscillated a bit during testing.


Results of the Salinas Dataset
The proportions of training and test samples for Salinas are the same as for PU. Figures 11 and 12 show the results.
The performance of S2FEF-CNN on the SA dataset was similar to that on the IP and PU datasets, and the OA curve also looks stable. Several methods did not achieve high accuracy on the Grapes_untrained and Vinyard_untrained classes, most of whose samples were misclassified. From Figure 12 we can see that the two classes are very close geographically; in addition, their spectral curves are very similar. This may be the cause of the misclassification.

Parameter Influence
In this section, we discuss how the parameters influence the classification performance, as shown in Figures 13-15. Some vital hyperparameters, such as the kernel number, kernel size, and network depth, are discussed, using the classification results on the Indian Pines dataset as an example. Figure 13 shows the influence of cube size on OA, AA, and Kappa. As expected, when the spatial size of the cube increased from 7 × 7 to 19 × 19, the results improved significantly. Three-dimensional-CNN [30] uses 3D cubes as input and sets the spatial size to 5 × 5. As mentioned before, we need a large spatial cube size to extract sufficient features for classification; of course, we also made a tradeoff between performance and cost, and finally set the size to 19 × 19.

Figure 14 shows the results for different network depths. In general, deeper networks are expected to perform better, because a deeper network can extract more high-level features that benefit classification. However, our results are not proportional to network depth; in other words, deeper is not better in the S2FEF architecture. In fact, networks with four or more blocks perform only as well as the one with three. Considering the balance between performance and cost, we eventually set the number of blocks to three.
Another important parameter is the number of convolutional kernels. Although there is no universal setting, powers of two are preferred. Figure 15 shows the accuracy comparison of different kernel numbers in each layer. We tried eight combinations, [2,2,2], [2,2,4], [2,4,4], [2,4,2], [4,2,2], [4,4,2], [4,2,4], and [4,4,4] (the notation [k1, k2, k3] means k1 kernels in layer 1, k2 in layer 2, and k3 in layer 3), and the combination [4,4,4] worked best. Unexpectedly, the suboptimal combination was [2,2,2].


Discussion
From the above results, we can draw the following conclusions. Firstly, it is obvious that the deep network using spectral-spatial features can achieve better classification accuracy than those using only spectral features. The results strongly prove that spectral-spatial features benefit HSI classification.
Secondly, deep learning performs outstandingly in some remote sensing fields. However, the trend of making networks more complex and deeper brings a heavy parameter load during training. More parameters may give a model better classification ability; indeed, the above results show that DC-CNN achieves the best accuracy. But when the highest precision is not required, we can trade a small, acceptable loss in accuracy for a large reduction in network parameters. Our method is therefore a good attempt to simplify the network, for example with fewer kernels and/or fewer convolutional layers. PCA is an effective preprocessing method for dimension reduction, and the fully connected layer is commonly used in CNNs before classification, but both of these common components are replaceable. Our experimental results indicate that even simple and shallow networks can work well given effective strategies.
Finally, we have to talk about batch normalization, which is important for deep learning. Without batch normalization, the network can still work, but converges slowly. The explicit use of batch normalization forces the distribution of data to be more reasonable, which not only speeds up the convergence rate, but also smooths the accuracy curves.
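The normalization step itself is simple; the sketch below shows the standard per-feature formulation (training-mode batch statistics only, without the learned running averages that a full implementation also keeps):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize each feature over the batch axis, then scale and shift
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

batch = np.random.default_rng(0).random((32, 10)) * 100  # badly scaled activations
normed = batch_norm(batch)  # per-feature mean ~ 0, standard deviation ~ 1
```

Forcing each feature toward zero mean and unit variance is what makes the input distribution to the next layer "more reasonable," speeding up convergence and smoothing the accuracy curves.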

Conclusions
In this paper, we explored an effective and novel spectral-spatial feature extraction and fusion block for HSI classification. Based on this block, we put forward a lightweight CNN-based network. Compared to state-of-the-art deep networks, the most impressive advantage of the proposed network is that it obtains considerable classification accuracy with very few parameters, which is a potentially feasible way to simplify a network while maintaining classification accuracy. Inevitably, we encountered problems in training, e.g., convergence: sometimes bad initial variables led to very slow convergence or even nonconvergence. The computational cost of the multiplication-based fusion is also nontrivial.
Our future research will focus on two points: 1) we will concentrate on an efficient fusion method for the spectral and spatial features to further reduce computation; and 2) we will study semi-supervised methods for HSI classification. There is a great deal of unlabeled data in HSI that may provide valuable features, and how a semi-supervised approach can make full use of these unlabeled samples is worth studying.