## 1. Introduction

Hyperspectral remote sensing is an active focus of the remote sensing field and has been applied to crop management, image segmentation, object recognition, and other tasks [1,2,3,4]. Hyperspectral image (HSI) classification often plays the most important role in these applications. In an HSI, each pixel is treated as a high-dimensional vector, so HSI classification essentially predicts a specific category for each pixel according to its characteristics [5].

According to the way in which hyperspectral image features are acquired, we divide hyperspectral image (HSI) classification methods into two categories: methods that extract the HSI features manually, and methods that extract them automatically.

The traditional HSI classification methods belong to the first category; most of them analyze HSIs and extract shallow features for classification. The most prominent feature of an HSI is its rich spectral information, so early research concentrated on acquiring accurate and efficient spectral characteristics [6,7,8,9,10]. For example, the authors of [6] and [7] used spectral angle or spectral divergence for pixel matching. The authors of [8,9,10] used statistics-based methods that complete the classification by learning from labeled samples. In [11], a more accurate feature extraction based on Principal Components Analysis (PCA) was used. However, HSI data includes not only the spectral features of each pixel but also the spatial relationships between pixels, and using only the spectral information for classification leads to low accuracy. Therefore, current research on HSI classification mostly uses spectral–spatial features [12,13,14,15,16,17,18,19]. In [16], a controlled random sampling strategy was used to obtain more accurate training and testing sets. In [17], the spectrum was first partitioned into several groups, and band-specific spectral–spatial features were then extracted by a convolutional neural network (CNN). In [18], Gabor features obtained through Gabor filters were stacked with spectral features for classification. All of this research has shown that using joint spectral–spatial features can effectively improve HSI classification accuracy.

CNN is a representative deep learning model [20,21,22,23]. CNN-based HSI classification methods fall into three categories according to the features that the CNN processes. The first category is 1D-CNN, which is based on spectral features only. In 2015, Wei et al. [24] applied a five-layer CNN to HSI classification and proposed the 1D-CNN network to extract spectral features. Li et al. [14] proposed the idea of pixel pairs, learning the spectral features of a pair of pixels and predicting the final classification results by voting. Yue [25] used preprocessing before feature extraction. Mei et al. [26] preprocessed each pixel with the mean and standard deviation of its neighborhood. The second category is the spatial feature-based method, called 2D-CNN. The authors of [15] introduced L2 regularization and virtual samples. In [27], the Bayesian method was introduced. In [28], spatial features were expressed by sparse representation. The authors of [29] proposed a classification network for specific scenes. The last category is the spectral–spatial feature-based method, which processes features in one of two ways. One is 3D-CNN: three-dimensional convolution was first used for video processing and is now widely used in HSI classification [30,31,32,33,34]. The other is hybrid CNN, of which various variants have been proposed [35,36,37,38]. For example, different ways of combining 1D-CNN and 2D-CNN were presented in [35,36,37], while the authors of [38] proposed a hybrid network of 3D-CNN and 2D-CNN.

In addition to CNN, other deep models have been proposed and applied to HSI classification, such as the stacked autoencoder (SAE) [39], deep belief network (DBN) [40], deep restricted Boltzmann machine (DRBM) [41], capsule networks [42], dense block [43], and others.

Current research on deep-learning-based HSI classification mainly focuses on how to build deep networks that improve accuracy. However, the more complex these networks are, the more trainable parameters they have. For example, there are about 360,000 training parameters in the classification network proposed in [31]. In [44], the proposed 3D-1D hybrid CNN method used a maximum of 61,949 parameters. The network in [38], a 3D–2D hybrid CNN, used 5,122,176 parameters. Using so many training parameters makes a network difficult to train and prone to overfitting.

In this paper, we present a lightweight spectral–spatial feature extraction and fusion convolutional neural network (S2FEF-CNN). In this model, three S2FEF blocks are concatenated to provide the joint spectral–spatial features. Each S2FEF block uses 1D and 2D convolution for spectral and spatial feature extraction, respectively, and then fuses the spectral and spatial features by multiplication. Pooling layers are then used for dimension reduction before the final prediction of the classification results.

The main contributions of our work are as follows. (1) We propose a lightweight end-to-end network model for hyperspectral image classification. Compared with several state-of-the-art HSI classification methods based on deep networks, our network can, in most cases, achieve comparable classification accuracy using no more than 5% of the parameters of those networks. (2) We propose a dual-channel feature extraction and feature fusion method for HSI classification. (3) By dispensing with the fully connected (FC) layer and PCA, we further reduce the number of network parameters considerably.

The rest of this paper is organized as follows. Section 2 is a brief introduction of various related CNN frameworks. The proposed S2FEF-CNN is illustrated in detail in Section 3. Experimental results and analysis are given in Section 4. Section 5 presents conclusions.

## 3. Proposed Methodology

In this section, we first illustrate the structure of the elementary block of our model, and then show in detail how the block extracts and fuses the features. Finally, we elaborate on the architecture of the S2FEF-CNN.

#### 3.1. S2FEF Block

Details of the basic S2FEF block are shown in Figure 4.

The S2FEF block contains two stages: one is for feature extraction and the other is for feature fusion.

In the first stage, spectral and spatial features are extracted by 1D and 2D convolutional kernels in the spectral and spatial channels, respectively. This step is formulated as follows:

where ${x}_{i}^{j}$ denotes the input HSI data, i = 1,…,I (I is the input number); ${W}_{{e}_{t}}^{j}$ and ${b}_{{e}_{t}}^{j}$ (${W}_{{a}_{t}}^{j}$ and ${b}_{{a}_{t}}^{j}$) are the weights and bias of spectral (spatial) kernel t in layer j; t = 1,…,${K}_{k}$ (${K}_{k}$ is the number of kernels); j is the layer index; f is the feature extractor; and the subscripts e and a denote spectral and spatial, respectively.
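As a sketch consistent with the definitions above, the two extractors can be written in the following form. The symbol $*$ is notation assumed here for illustration: it stands for convolution, 1D along the spectral axis in the spectral channel and 2D over the spatial axes in the spatial channel.

$$
{f}_{{e}_{t}}^{j}({x}_{i}^{j}) = {W}_{{e}_{t}}^{j} * {x}_{i}^{j} + {b}_{{e}_{t}}^{j}, \qquad {f}_{{a}_{t}}^{j}({x}_{i}^{j}) = {W}_{{a}_{t}}^{j} * {x}_{i}^{j} + {b}_{{a}_{t}}^{j}
$$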

In the following stage, ${f}_{{e}_{t}}^{j}({x}_{i}^{j})$ and ${f}_{{a}_{t}}^{j}({x}_{i}^{j})$ are fused in three steps.

(1) Features from two channels are fused by element-wise multiplication.

After the former feature extraction, we directly fuse the spectral features and spatial features by element-wise multiplication (EWM). Compared with feature concatenation, the EWM will not increase the feature dimension but will adjust the spectral features by spatial information to a certain extent.

(2) The maximum element across the different feature cubes is selected to produce the final feature cube.

(3) The maximal feature cube selected by Equation (5) is added to the original input cube ${x}_{i}^{j}$ to form a more accurate output cube ${x}_{i}^{j+1}$. Finally, the rectified linear unit (ReLU) is used for activation.
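The three fusion steps can be collected into one sketch. The symbols $\odot$ (element-wise product), ${g}_{t}^{j}$ (fused cube of kernel t), and ${g}_{\max}^{j}$ (element-wise maximum over the ${K}_{k}$ fused cubes) are notation assumed here for illustration and may differ from the numbered equations referred to in the text.

$$
{g}_{t}^{j} = {f}_{{e}_{t}}^{j}({x}_{i}^{j}) \odot {f}_{{a}_{t}}^{j}({x}_{i}^{j}), \qquad
{g}_{\max}^{j} = \max_{t = 1,\dots,{K}_{k}} {g}_{t}^{j}, \qquad
{x}_{i}^{j+1} = \mathrm{ReLU}\left({g}_{\max}^{j} + {x}_{i}^{j}\right)
$$

Because the product and the maximum are both element-wise, ${x}_{i}^{j+1}$ has the same size as ${x}_{i}^{j}$, which is what allows the addition in step (3).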

The above process is summarized in Algorithm 1.

**Algorithm 1:** Feature Extraction with S2FEF Block

Input: A joint spectral–spatial feature map F,

Spectral/Spatial kernel size S_{pe}/S_{pa} and kernel number k.

Output: A new joint spectral–spatial feature map F’.

1. **begin**

2. Extract spectral/spatial features f_{spe}/f_{spa} with k spectral/spatial kernel (size 1 × 1 × S_{pe}/S_{pa} × S_{pa} × 1).

3. Fuse the spectral and spatial features together by element-wise multiplication (f_{spe} × f_{spa}) to get the joint features f_{joint}.

4. Select the max value from the corresponding pixel in f_{joint} to form a special feature F’.

5. Return F’.

6. **end**
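The block described in this section can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used in the experiments. It assumes stride 1, zero "same" padding (so that the output cube can be added to the input cube), odd kernel sizes, cross-correlation in place of flipped convolution, and one bias per kernel. The function names `spectral_conv`, `spatial_conv`, and `s2fef_block` are hypothetical.

```python
import numpy as np


def spectral_conv(x, w, b):
    """'Same' 1D correlation along the band axis; w has odd length S_pe."""
    pad = len(w) // 2
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad)))  # zero-pad the band axis only
    out = np.zeros(x.shape, dtype=float)
    for d in range(len(w)):
        out += w[d] * xp[:, :, d:d + x.shape[2]]
    return out + b


def spatial_conv(x, w, b):
    """'Same' 2D correlation over rows and columns; w is S_pa x S_pa (odd)."""
    pad = w.shape[0] // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero-pad rows and columns
    out = np.zeros(x.shape, dtype=float)
    for r in range(w.shape[0]):
        for c in range(w.shape[1]):
            out += w[r, c] * xp[r:r + x.shape[0], c:c + x.shape[1], :]
    return out + b


def s2fef_block(x, w_spe, b_spe, w_spa, b_spa):
    """x: (m, m, N) cube; w_spe: (k, S_pe); w_spa: (k, S_pa, S_pa); biases: (k,)."""
    fused = [
        spectral_conv(x, w_spe[t], b_spe[t]) * spatial_conv(x, w_spa[t], b_spa[t])
        for t in range(len(w_spe))
    ]
    f_max = np.max(np.stack(fused), axis=0)  # element-wise max over the k fused cubes
    return np.maximum(f_max + x, 0.0)        # add the input cube, then ReLU


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 8))           # toy cube: 5 x 5 pixels, 8 bands
k = 4
out = s2fef_block(
    x,
    rng.standard_normal((k, 3)), np.zeros(k),
    rng.standard_normal((k, 3, 3)), np.zeros(k),
)
print(out.shape)  # (5, 5, 8): same size as the input cube
```

Since the fused cube keeps the input size, several such blocks can be chained directly, with the output of one block serving as the input of the next.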

#### 3.2. S2FEF-CNN Architecture

The proposed network mainly consists of three steps:

(1) Step 1: extracting spectral–spatial joint features by three S2FEF blocks;

(2) Step 2: reducing spectral and spatial dimensions of the joint features by two pooling layers;

(3) Step 3: determining the pixel label via a softmax layer after flattening the joint features from Step 2 into a vector.

The architecture of the proposed S2FEF-CNN is shown in Figure 5.

Define ${x}_{i}\in {\mathbb{R}}^{{m}_{1}\times {m}_{1}\times {N}_{1}}$ (i = 1,2,…,K) as the input HSI cube and ${\widehat{y}}_{i}$ as the output label of ${x}_{i}$. The process is defined as follows:

where $S2FEF$ denotes the operator of the proposed spectral–spatial feature extraction and fusion, ${\delta}_{p}$ denotes two max pooling operators, and ${\delta}_{s}$ denotes the softmax classification.
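Combining the three steps listed above with these operators, the whole mapping can be sketched as a composition. The nesting below assumes the three S2FEF blocks of Step 1 are applied in sequence and that flattening is absorbed into ${\delta}_{s}$; it is an illustration built from the definitions in this section rather than a quotation of the original equation.

$$
{\widehat{y}}_{i} = {\delta}_{s}\left({\delta}_{p}\left(S2FEF\left(S2FEF\left(S2FEF\left({x}_{i}\right)\right)\right)\right)\right)
$$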

Most of the commonly used 2D-CNN classification methods use PCA to preprocess HSIs for dimension reduction, but PCA introduces the problem of spectral information loss. In our architecture, we abandoned PCA and used the full spectrum of HSIs as input, and two pooling layers were used for feature dimension reduction.

CNN-based classification frameworks usually have one or two FC layers to integrate the features before the final classification. However, the FC layer is a heavily weighted layer, which may account for 80% of the parameters of a network. Hence, we also dropped the FC layer in our architecture for network parameter reduction.

These architectural changes make our proposed network lightweight. Experimental results show that, even with only a few thousand parameters, our network can achieve classification accuracy comparable to that of state-of-the-art deep networks with far heavier weights.

The above steps are summarized in Algorithm 2.

**Algorithm 2:** S2FEF-CNN Classification

Input: Hyperspectral image cube size m, S2FEF block number K_{k}.

Output: The class label L of each pixel.

1. **begin**

2. Create input data set I in which each pixel input cube is size m × m × N (N is the spectrum band number).

3. For each pixel in training set from I.

4. Extract spectral–spatial features by K_{k} S2FEF blocks.

5. Pool the joint feature with two max pooling layers, after which the feature size is m’ × m’ × N’.

6. Flatten the feature into a vector v.

7. Compute the softmax output L.

8. Return L.

9. **end**

## 4. Experimental Results

#### 4.1. Datasets

We evaluated our work on three public hyperspectral datasets, Indian Pines (IP), Salinas (SA), and Pavia University (PU), captured by two different sensors: AVIRIS and ROSIS-03. AVIRIS provides HSIs with 224 contiguous spectral bands, covering wavelengths from 0.4 to 2.5 μm with a spatial resolution of 20 m/pixel, while ROSIS-03 delivers HSIs in 115 bands with a spectral coverage ranging from 0.43 to 0.86 μm and a spatial resolution of 1.3 m/pixel. IP and SA are two commonly used datasets captured by AVIRIS. IP was captured in Northwestern Indiana and is 145 × 145 pixels in size, and SA was recorded over Salinas Valley and includes 512 × 217 pixels. Both consist of 16 ground-truth classes. The PU scene was captured by ROSIS-03, has a size of 610 × 340 pixels, and contains nine different classes. In our experiments, we used the corrected versions with 200 and 204 bands for IP and SA, respectively, and 103 bands for PU, obtained after removing the noisy bands and a blank strip.

#### 4.2. Parameters Setting

In our network, three S2FEF blocks were used for all three datasets. In each block, the number and size of convolutional kernels were the same for 1D convolution and 2D convolution. We set the parameters empirically, as other networks do. The spectral kernel size was 1 × 3, the spatial kernel size was 3 × 3, and the kernel number was 4.

The input HSI cube size was set differently for each dataset. Unlike some networks that use a small input, we wanted the original input cube to contain enough spatial information, so we chose a large input size. For the IP dataset, the size was 19 × 19 × N (i.e., m = 19), where N is the band number. For the PU dataset, the size was 15 × 15 × N, while for the SA dataset, the size was 21 × 21 × N.

We compared our experimental results with four well-performing HSI classification methods based on deep networks: SAE [39], 1D-CNN [24], 3D-CNN [30], and DC-CNN [36]. First, we compare the number of parameters in Section 4.3, and the results of classification accuracy are described in Section 4.4, Section 4.5, Section 4.6 and Section 4.7.

#### 4.3. Comparison of Parameter Numbers

In our S2FEF-CNN architecture, the parameters are contained in the S2FEF block and in the final output layer. Detailed analysis results are given in Table 1 and Table 2, respectively.

Each S2FEF block has the same number of parameters. For the entire network, a large share of the parameters are in the softmax layer, and their number depends on the characteristics of the dataset. The Indian Pines and Salinas datasets have more input spectral bands and more ground-truth classes, so their networks have more parameters. Pavia University has fewer bands and fewer classes, so its network has fewer parameters.
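The dependence on band and class counts can be illustrated with a short parameter count. This is a sketch under stated assumptions, not a reproduction of Table 1 or Table 2: each block is assumed to see a single-channel cube and to use one bias per kernel, the softmax layer is assumed to be a dense layer on the flattened pooled feature, and the pooled sizes passed below are hypothetical values chosen only to show the trend. All function names are hypothetical.

```python
def s2fef_block_params(spectral_size=3, spatial_size=3, kernels=4):
    """Weights plus one bias per kernel, assuming a single-channel input cube."""
    spectral = kernels * (spectral_size + 1)
    spatial = kernels * (spatial_size * spatial_size + 1)
    return spectral + spatial


def softmax_layer_params(m_out, n_out, num_classes):
    """Dense softmax layer on a flattened m_out x m_out x n_out feature cube."""
    features = m_out * m_out * n_out
    return features * num_classes + num_classes


def network_params(m_out, n_out, num_classes, blocks=3):
    """Total for `blocks` identical S2FEF blocks followed by the softmax layer."""
    return blocks * s2fef_block_params() + softmax_layer_params(m_out, n_out, num_classes)


print(s2fef_block_params())  # 56 under the assumptions above

# Hypothetical pooled sizes: more bands and more classes give a larger softmax layer.
many_bands = network_params(m_out=2, n_out=50, num_classes=16)
few_bands = network_params(m_out=2, n_out=25, num_classes=9)
print(many_bands > few_bands)  # True
```

With the kernel settings of Section 4.2 as defaults, the blocks contribute a small fixed amount, and the total is driven by the flattened feature length multiplied by the number of classes.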

For SAE, 1D-CNN, 3D-CNN, and DC-CNN, we used the same architecture and the same parameter settings as in their papers. For those settings that are not explicitly given in the paper, we adopted the commonly used values in HSI classification (e.g., the pooling stride was 2).

Table 3 shows the comparison results in detail.

Table 4 gives the parameter percentage of S2FEF-CNN compared with other networks.

Our proposed network uses the fewest parameters: in most cases no more than 5% of the parameters used by the other deep networks, and never more than 8%. This appears to be a potentially feasible way to address the problem of heavy weights when training a deep network.

#### 4.4. Results of the Indian Pines Dataset

This dataset is a bit different from the other datasets. Its most notable characteristic is that some classes do not have enough samples; a few of them have fewer than 20 labeled samples. Therefore, we adopted the method of [30] and split the labeled data in a 1:1 ratio for training and testing.
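A ratio-based split of labeled pixels can be sketched as below. The source does not state whether the split is drawn per class; this sketch assumes a per-class (stratified) split so that classes with very few samples still appear in both sets. The function name `stratified_split`, the seed, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        # Keep at least one training sample for every class.
        n_train = max(1, round(len(idxs) * train_fraction))
        train.extend(idxs[:n_train])
        test.extend(idxs[n_train:])
    return train, test


# Toy labels: a small class with 20 samples and a larger one with 100.
labels = [0] * 20 + [1] * 100
train_idx, test_idx = stratified_split(labels, train_fraction=0.5)  # 1:1 ratio
print(len(train_idx), len(test_idx))  # 60 60
```

The same function covers other ratios by changing `train_fraction`; for instance, `train_fraction=0.2` gives a 2:8 split.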

In this paper, three commonly-used metrics were adopted to evaluate the classification performance, which are the overall accuracy (OA), the average accuracy (AA), and the Kappa coefficient.
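For reference, the three metrics can be computed from a confusion matrix as sketched below, using their standard definitions: OA is the fraction of correctly classified samples, AA is the mean of the per-class accuracies, and Kappa corrects OA for chance agreement. The function name and the toy matrix are hypothetical; Kappa is undefined when chance agreement equals 1.

```python
def classification_metrics(conf):
    """conf[i][j] = number of samples of true class i predicted as class j."""
    k = len(conf)
    n = sum(sum(row) for row in conf)
    row_tot = [sum(conf[i]) for i in range(k)]
    col_tot = [sum(conf[i][j] for i in range(k)) for j in range(k)]

    # Overall accuracy: correctly classified samples over all samples.
    oa = sum(conf[i][i] for i in range(k)) / n

    # Average accuracy: mean per-class accuracy over classes that have samples.
    per_class = [conf[i][i] / row_tot[i] for i in range(k) if row_tot[i] > 0]
    aa = sum(per_class) / len(per_class)

    # Kappa: agreement beyond the chance agreement pe.
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa


toy = [[8, 2],
       [1, 9]]
oa, aa, kappa = classification_metrics(toy)
print(round(oa, 3), round(aa, 3), round(kappa, 3))  # 0.85 0.85 0.7
```

AA weights every class equally, so it is more sensitive than OA to classes with few labeled samples.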

Figure 6 plots the OA curves as a line diagram.

From the results listed above, we can find that the S2FEF-CNN works well even with only a few thousand parameters. For all these methods, the OAs of 16 classes vary widely. For example, the class Oats has significantly lower OA because it has fewer labeled samples than the other categories.

#### 4.5. Results of the Pavia University Dataset

In this dataset, there are enough labeled samples, so we set the split ratio to 2:8 for training and testing. The results are shown in Figure 8, Figure 9 and Figure 10, where Figure 9 shows the training accuracy and loss curves.

As can be seen from Figure 8, S2FEF-CNN also performs well on the PU dataset, and it has a good classification ability for all nine classes. The accuracy of each class is more than 94%. There is little difference in the OA value between categories.

Figure 9 shows how training evolved on the PU dataset. All the methods converged after 100 epochs. DC-CNN and SAE converged fastest. The curve of S2FEF-CNN oscillated a bit during the test.

#### 4.6. Results of the Salinas Dataset

The proportion of training and test samples for Salinas is the same as that for PU. Figure 11 and Figure 12 show the results.

The performance of S2FEF-CNN on the SA dataset was similar to that on the IP and PU datasets, and the OA curve also looks stable. Several methods did not achieve high accuracy on the Grapes_untrained and Vinyard_untrained classes, and most of them were misclassified. From Figure 12 we can see that the two classes are very close geographically. In addition, the spectral curves of the two classes are also very similar. This may be the cause of the misclassification.

#### 4.7. Parameter Influence

In this section, we discuss how the parameters influence the classification performance, as shown in Figure 13, Figure 14 and Figure 15. Some vital hyperparameters such as kernel number, kernel size, and network depth are discussed. We use the classification results of the Indian Pines dataset as an example.

Figure 13 shows the influence of cube size on OA, AA, and Kappa. As expected, when the spatial size of the cube increased from 7 × 7 to 19 × 19, the results improved significantly. 3D-CNN [30] uses 3D cubes as the input and sets the spatial size to 5 × 5. As mentioned before, we need a large spatial size to extract sufficient features for the subsequent classification. We also made a tradeoff between performance and cost, and finally set the size to 19 × 19.

Figure 14 shows the results for different network depths. In general, deeper networks are expected to perform better, because a deeper network can extract more high-level features that are beneficial for classification. However, our results do not improve in proportion to the depth of the network. In other words, deeper is not necessarily better in the S2FEF architecture. In fact, networks with four or more blocks perform only as well as the network with three. Considering the balance between performance and cost, we set the number of S2FEF blocks to three.

Another important parameter is the number of convolutional kernels. Although there is no universal setting, ${2}^{n}$ is preferred. Figure 15 shows the accuracy comparison of different kernel numbers in each layer. We tried eight different combinations, [2,2,2], [2,2,4], [2,4,4], [2,4,2], [4,2,2], [4,4,2], [4,2,4], and [4,4,4], where the notation [${k}_{1}$, ${k}_{2}$, ${k}_{3}$] means ${k}_{1}$ kernels in layer 1, ${k}_{2}$ kernels in layer 2, and ${k}_{3}$ kernels in layer 3. The combination [4,4,4] worked the best. Unexpectedly, the suboptimal combination was [2,2,2].
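The eight combinations are exactly the ways of choosing 2 or 4 kernels independently for each of the three layers, which can be enumerated as a small search grid. This is only an illustrative sketch; the variable names are hypothetical, and training and evaluation of each configuration are left out.

```python
from itertools import product

# Every assignment of 2 or 4 kernels to the three layers: 2**3 = 8 combinations.
kernel_choices = [2, 4]
combinations = [list(c) for c in product(kernel_choices, repeat=3)]

for combo in combinations:
    print(combo)  # each combo is [k1, k2, k3]
```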

## 5. Discussion

From the above results, we can draw the following conclusions.

Firstly, deep networks using spectral–spatial features achieve better classification accuracy than those using only spectral features. The results strongly support the view that spectral–spatial features benefit HSI classification.

Secondly, deep learning performs outstandingly in some remote sensing fields. However, the trend of making networks more complex and deeper brings a heavy load of parameters during training. More parameters may give the model better classification ability. It can be seen from the above results that DC-CNN shows the best accuracy. Sometimes we do not require high precision, but want to reduce the network parameters, so we can make some tradeoffs within an acceptable error range. Therefore, our method is a good attempt to simplify the network, such as with fewer kernels and/or fewer convolutional layers. PCA is an effective preprocessing method for dimension-reduction, while the fully connected layer is commonly used in CNN before classification, but these two common methods are replaceable. Our experimental results indicate that even simple and shallow networks can work well if we can come up with some effective strategies.

Finally, we have to talk about batch normalization, which is important for deep learning. Without batch normalization, the network can still work, but converges slowly. The explicit use of batch normalization forces the distribution of data to be more reasonable, which not only speeds up the convergence rate, but also smooths the accuracy curves.