Spatial-Spectral Network for Hyperspectral Image Classification: A 3-D CNN and Bi-LSTM Framework

Recently, deep learning methods based on the combination of spatial and spectral features have been successfully applied in hyperspectral image (HSI) classification. To improve the utilization of the spatial and spectral information from the HSI, this paper proposes a unified network framework using a three-dimensional convolutional neural network (3-D CNN) and a band grouping-based bidirectional long short-term memory (Bi-LSTM) network for HSI classification. In the framework, extracting spectral features is regarded as a procedure of processing sequence data, and the Bi-LSTM network acts as the spectral feature extractor of the unified network to fully exploit the close relationships between spectral bands. The 3-D CNN has a unique advantage in processing 3-D data; therefore, it is used as the spatial-spectral feature extractor in this unified network. Finally, in order to optimize the parameters of both feature extractors simultaneously, the Bi-LSTM and the 3-D CNN share a loss function to form a unified network. To evaluate the performance of the proposed framework, three datasets were tested for HSI classification. The results demonstrate that the performance of the proposed method is better than that of current state-of-the-art HSI classification methods.


Introduction
With the rising potential of remote-sensing applications in real life, research in remote-sensing analysis is increasingly necessary [1,2]. Hyperspectral imaging is commonly used in remote sensing. A hyperspectral image (HSI) is obtained by collecting tens or hundreds of spectral bands over an identical region of the Earth's surface with an imaging spectrometer [3,4]. In an HSI, each pixel in the scene includes a sequential spectrum, which can be analyzed by its reflectance or emissivity to identify the type of material in each pixel [5,6]. Owing to the subtle differences among HSI spectra, HSIs have been applied in many fields, for instance, hydrological science [7], ecological science [8,9], geological science [10,11], precision agriculture [12,13], and military applications [14].
In recent decades, the classification of HSIs has become a popular field of research for the hyperspectral community. While the abundant spectral information is useful for improving classification accuracy compared to natural images, the high dimensionality presents new difficulties [15,16]. The HSI classification task has the following challenges: (1) HSI has high intra-class variability and inter-class diversity. These are influenced by many factors, such as changes in lighting, environment, atmosphere, and temporal conditions. (2) The available training samples are limited in relation to the high dimensionality of HSIs. As the dimension of HSIs increases, the required training samples also keep increasing, while the available samples of HSIs are limited. Therefore, these factors can result in an unsuitable methodology, reducing the classifier's ability for generalization.
In early HSI classification studies, most approaches focused on the influence of HSI spectral features on classification results. Therefore, several existing methods are based on pixel-level HSI classification, for instance, multinomial logistic regression [17], support vector machines (SVM) [18][19][20], K-nearest neighbor (KNN) [21], neural networks [22], linear discriminative analysis [23][24][25], and maximum likelihood methods [26]. SVM is mainly dedicated to the transformation of linearly inseparable problems into linearly separable problems by finding the optimal hyperplane (such as the radial basis kernel and composite kernel [19]), which finally completes the classification task. Since these methods utilize the spatial context information insufficiently, the classification results obtained by these pixel classifiers using only spectral features are unsatisfactory. Recently, researchers have found that spatial feature-based classification methods have significantly improved the representation of hyperspectral data and classification accuracy [27,28]. Thus, more researchers are combining spectral-spatial features into pixel classifiers to exploit the information of HSIs completely and improve the classification results. For example, multiple kernel learning uses various kernel functions to extract different features separately, which are fed into the classifier to generate a map of classification results. In addition, researchers in [29,30] segmented HSIs into multiple superpixels to obtain similar spatial pixels based on intensity or texture similarity. Although these methods have achieved sufficient performance, hand-crafted filters extract limited features, and most can only extract shallow features. The hand-crafted features depend on the expert's experience in setting parameters, which limits the development and applicability of these methods. Therefore, for HSI classification, the extraction of deeper and more easily discernible features is the key.
In recent decades, deep learning [31][32][33] has been extensively adopted in computer vision, for instance, in image classification [34][35][36], object detection [37][38][39][40], natural language processing [41], and has obtained remarkable performance in HSI classification. In contrast to traditional algorithms, deep learning extracts deep information from input data through a range of hierarchical structures. In detail, some simple line and shape features can be extracted at shallow layers, while deeper layers can extract abstract and complex features. The deep learning process is fully automatic without human intervention and can extract different feature types depending on the network; therefore, deep learning methods are suitable for handling various situations.
At present, there are various deep-learning-based approaches for HSI classification, including deep belief networks (DBNs) [42], stacked auto-encoders (SAEs) [43], recurrent neural networks (RNNs) [44,45], convolutional neural networks (CNNs) [46,47], residual networks [48], and generative adversarial networks (GANs) [49]. The SAEs consist of multiple auto-encoder (AE) units that use the output of one layer as the input to subsequent layers. Li et al. [50] used active learning techniques to enhance the parameter training of SAEs. Guo et al. [51] reduced the dimensionality by fusing principal component analysis (PCA) and kernel PCA to optimize the standard training process of DBNs. Although these methods achieve adequate classification performance, the number of model parameters is large. In addition, the HSI cube data are vectorized, so the spatial structure can be corrupted, which leads to inaccurate classification.
The CNN can extract local two-dimensional (2-D) spatial features of images, and the weight-sharing mechanism of a CNN can effectively decrease the number of network parameters. Therefore, CNNs are widely used in HSI classification. Hu et al. [52] proposed a deep CNN with five one-dimensional (1-D) layers, which receives pixel vectors as input data and classifies HSI data in the spectral domain only. However, this method loses spatial information, and the network depth is shallow, limiting the extraction of complex features. Zhao et al. [53] proposed a CNN2D architecture in which multi-scale convolutional AEs based on Laplacian pyramids obtain a series of deep spatial features, while PCA extracts three principal components. Then, logistic regression is used as a classifier that connects the extracted spatial features and spectral information. However, the method does not fully consider spectral features, which limits the improvement in classification performance. To extract the spatial-spectral information, Chen et al. [54] proposed three convolutional models for creating input blocks of their CNN3D model using full-pixel vectors from the original HSI. This method extracts spectral, spatial, and spatial-spectral features, which generates data redundancy. In addition, Liu et al. [55] proposed a bidirectional-convolutional long short-term memory (Bi-CLSTM) network, in which convolutional operators across the spatial domain are combined into a bidirectional long short-term memory (Bi-LSTM) network to obtain spatial features while fully incorporating spectral contextual information.
In summary, sufficiently exploiting the features of HSI data while minimizing the computational burden is the key to HSI classification. This paper proposes a joint unified network operating in the spatial-spectral domain for HSI classification. The network uses three 3-D convolutional layers to extract the spatial-spectral features of the HSI and subsequently adds a 2-D convolutional layer to further extract spatial features. For spectral feature extraction, the network treats all spectral bands as a sequence and enhances the interactions between spectral bands using a Bi-LSTM. Finally, two fully connected (FC) layers are combined, and the softmax function is used for classification, forming a unified neural network. The major contributions of our proposed method are listed below.

1. A Bi-LSTM framework based on band grouping is proposed for extracting spectral features. The Bi-LSTM can obtain better performance in learning contextual features between adjacent spectral bands. In contrast to the general recurrent neural network, this framework can better adapt to a deeper network for HSI classification.
2. The proposed method adopts a 3-D CNN for extracting the spatial-spectral features. To reduce the computational complexity of the whole framework, PCA is used before the convolutional layers of the 3-D CNN to reduce the data dimensionality.
3. A unified framework named the Bi-LSTM-CNN is proposed, which integrates the two subnetworks into a unified network by sharing the loss function. In addition, the framework adds auxiliary loss functions, which balance the effects of the spectral and spatial-spectral features on the classification results to increase the classification accuracy.
The structure of the remaining part is as follows. Section 2 describes long short-term memory (LSTM), a 3-D CNN, and the framework of the Bi-LSTM-CNN. Section 3 introduces the HSI datasets, experimental configuration, and experimental results. Section 4 provides a detailed analysis and interpretation of the experimental results. Finally, conclusions are summarized in Section 5.

LSTM
Some tasks need to consider the information of previous and subsequent inputs when processing the current input. RNNs can solve these problems and handle the spectral contextual information of an HSI. Figure 1 shows the architecture of an RNN. Given a series of values x^(1), x^(2), ..., x^(T) as input data, the formula for each cell structure in the RNN is shown as Equations (1) and (2):

h^(t) = tanh(W h^(t-1) + U x^(t) + b_h)   (1)
o^(t) = V h^(t) + b_o   (2)

where U, W, and V denote the weight matrices that represent the relation of two nodes. In detail, W connects the previously hidden node and the currently hidden node, U connects the input node and the hidden node, and V connects the hidden node and the output node. Vectors b_h and b_o are bias vectors. At time t, x^(t) represents the input value, h^(t) represents the hidden value, and o^(t) represents the output value. The tanh is a nonlinear activation function. The initial value h^(0) in Equation (1) is set to zero. Equation (1) indicates that the output is jointly determined by the input x^(t) at time t and the hidden value h^(t-1) at time t - 1. As ||W|| < 1 or ||W|| > 1, h^(t) will approach zero or infinity, respectively, as time increases. This causes the gradient to vanish or explode in the backpropagation phase. In other words, when the relevant information is very far from the current location, the RNN cannot utilize this information effectively; that is, the RNN cannot solve the problem of long-term dependence. The LSTM network was proposed to solve this problem. Through its gating mechanism, the LSTM not only remembers past information but also filters out unimportant information. The LSTM is effective in solving the long-term dependency problem of RNNs.
The architecture of the LSTM is shown in Figure 2. The memory cell is a critical component of the LSTM, replacing the hidden unit of the RNN. The cell state runs throughout the cell, with few branches, to ensure that information flows unchanged through the network. The LSTM has a structure called a gate, which can delete or add information about the cell state. A gate combines the Hadamard product operation and the sigmoid function and can filter which information is allowed to pass. The LSTM has three gates: the input gate, which determines how the previous memory is combined with the new input information; the output gate, which controls whether the state of the cell at the next time step will affect other neurons; and the forget gate, which regulates the cell state, causing the cell to forget or remember a previous state. The candidate cell value stores updated information from the output of the input gate operation. At time t, the forward propagation of the LSTM is defined as Equations (3)-(8):

Input gate:            i^(t) = σ(W_xi x^(t) + W_hi h^(t-1) + b_i)      (3)
Forget gate:           f^(t) = σ(W_xf x^(t) + W_hf h^(t-1) + b_f)      (4)
Output gate:           o^(t) = σ(W_xo x^(t) + W_ho h^(t-1) + b_o)      (5)
Candidate cell value:  c̃^(t) = tanh(W_xc x^(t) + W_hc h^(t-1) + b_c)   (6)
Cell state:            c^(t) = f^(t) * c^(t-1) + i^(t) * c̃^(t)         (7)
LSTM output:           h^(t) = o^(t) * tanh(c^(t))                     (8)

where σ denotes the logistic sigmoid function and * represents the Hadamard product operation. The matrices W_xi, W_hi, W_xf, W_hf, W_xo, W_ho, W_xc, and W_hc are weight matrices. The vectors b_i, b_f, b_o, and b_c are bias vectors.
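The forward propagation above can be sketched in NumPy as a single LSTM step. This is an illustration of Equations (3)-(8), not the paper's implementation; the dictionary keys (Wxi, Whi, bi, and so on) are illustrative names for the eight weight matrices and four bias vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM forward step following Equations (3)-(8)."""
    i = sigmoid(p["Wxi"] @ x + p["Whi"] @ h_prev + p["bi"])        # input gate (3)
    f = sigmoid(p["Wxf"] @ x + p["Whf"] @ h_prev + p["bf"])        # forget gate (4)
    o = sigmoid(p["Wxo"] @ x + p["Who"] @ h_prev + p["bo"])        # output gate (5)
    c_tilde = np.tanh(p["Wxc"] @ x + p["Whc"] @ h_prev + p["bc"])  # candidate cell value (6)
    c = f * c_prev + i * c_tilde                                   # cell state (7)
    h = o * np.tanh(c)                                             # LSTM output (8)
    return h, c

# Toy dimensions: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
p = {k: rng.normal(scale=0.1, size=(d_h, d_in)) for k in ("Wxi", "Wxf", "Wxo", "Wxc")}
p.update({k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in ("Whi", "Whf", "Who", "Whc")})
p.update({k: np.zeros(d_h) for k in ("bi", "bf", "bo", "bc")})

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # run a length-5 input sequence through the cell
    h, c = lstm_step(x, h, c, p)
```

Because the output gate lies in (0, 1) and tanh is bounded, every component of h stays strictly inside (-1, 1), which is what keeps the recurrence numerically stable.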

CNN
CNNs are being applied with great success in many research areas. A CNN can extract various kinds of features from an image, such as color, texture, shape, and topology, so it has the advantage of processing 2-D images, such as identifying displacement, scaling, and other forms of distortion invariance. Similar to biological neural networks, the structure of the weight-sharing network of CNNs decreases the number of parameters, thus decreasing the complexity of the network model.
CNNs include 1-D CNN, 2-D CNN, and 3-D CNN. The 1-D CNN is mainly adopted for sequence data processing; the 2-D CNN is usually adopted for image recognition; the 3-D CNN is mainly used for medical image and video data recognition. A CNN consists of three structures: convolution layer, activation function, and pooling layer. There are no pooling layers in some CNNs. In detail, the purpose of the convolutional layer is for the extraction of the input data features; with more convolutional layers, the extracted features are more complex. The activation function increases the nonlinearity of the neural network model. The pooling layer preserves the main features while decreasing the number of parameters and calculations, preventing overfitting and improving model generalization. The schematic diagrams of the 2-D convolution and 3-D convolution are shown in Figure 3.
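To make the convolutional layer description concrete, here is a minimal NumPy sketch of a "valid" 2-D convolution (implemented, as in most CNN libraries, as cross-correlation); a 3-D convolution additionally slides the kernel along the spectral axis. This is an illustration only, not the paper's network code.

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2-D convolution (cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for r in range(oh):
        for c in range(ow):
            # Each output value is the sum of an element-wise product
            # between the kernel and the window it covers.
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

img = np.arange(16.0).reshape(4, 4)
k = np.ones((3, 3)) / 9.0   # a 3x3 mean filter
feat = conv2d(img, k)       # output feature map of shape (2, 2)
```

The output shrinks from 4 × 4 to 2 × 2 because no padding is used; a pooling layer would then downsample this feature map further.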

Framework of Proposed Method
The proposed Bi-LSTM-CNN network is based on the combination of a Bi-LSTM and a 3-D CNN for HSI classification. The method consists of two main parts: one extracts spectral features from the raw data through the Bi-LSTM, and the other extracts spatial-spectral features using the 3-D CNN after PCA dimension reduction of the data. To optimize the parameters of the two subnetworks simultaneously, we concatenate the last FC layers of the Bi-LSTM and the 3-D CNN to form a new FC layer, after which a softmax function is added. In this framework, the dimensionality of the raw data is decreased by the PCA to minimize the computational effort of the 3-D convolution, while the Bi-LSTM processes the original data to compensate for the spectral loss after dimensionality reduction. In addition, the Bi-LSTM can better handle the contextual information of the spectra and fully exploit the spectral features of the HSI. After the last FC layer of each subnetwork, auxiliary loss functions are added to balance the contributions of the two subnetworks to the whole framework. The framework diagram of the Bi-LSTM-CNN is shown in Figure 4.

Bi-LSTM
In Section 2.1.1, we discussed the use of the LSTM to process continuous HSI data and extract spectral information. The LSTM can retain only previous input information through its cell states; it cannot access future input information. To handle this problem, we extract spectral information using a Bi-LSTM instead of an LSTM. Unlike the LSTM, the Bi-LSTM preserves both future and previous information and therefore achieves more accurate results through a better understanding of the context.
The Bi-LSTM network focuses on spectral contextual information. In general, the Bi-LSTM network inputs one band at each time step. However, an HSI has hundreds of bands, making the Bi-LSTM network too deep to obtain a robust network under the condition of limited HSI samples. Therefore, the strategy used to group the spectral bands is crucial to improving HSI classification results. In the Bi-LSTM, t denotes the number of groups, B denotes the number of bands, and s = ⌊B/t⌋ represents the sequence length of each group, where ⌊·⌋ denotes rounding down. Z = [z_1, z_2, ..., z_i, ..., z_B] is the spectral vector of each pixel in the HSI, where z_i is the reflectance value of the i-th band. The grouping strategy is shown as Equation (9):

x^(i) = [z_i, z_{t+i}, z_{2t+i}, ..., z_{(s-1)t+i}],  i = 1, 2, ..., t   (9)

where x^(i) is the i-th group. In this strategy, the spectral distance between different groups is relatively shortened, and most of the spectral range is covered by each group. The framework diagram of the Bi-LSTM is shown in Figure 5. The Bi-LSTM contains information about the forward and backward directions of the input sequence. At time t of the input sequence, the forward LSTM layer contains information before time t, while the backward LSTM layer contains information after time t. The output vectors of the two LSTM layers are concatenated. In Figure 5, the colored squares represent each group. Each blue LSTM square represents an LSTM unit, and the red and green arrows indicate the forward LSTM and the backward LSTM, respectively; the two LSTMs pass information along the arrow directions. Meanwhile, the forward LSTM and the backward LSTM have the same input data.
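Assuming the interleaved grouping implied by Equation (9), in which group i collects every t-th band starting at band i so that each group spans most of the spectral range, the strategy can be sketched as follows (group_bands is a hypothetical helper name):

```python
import numpy as np

def group_bands(z, t):
    """Interleaved band grouping per Equation (9): group i takes
    bands i, i+t, i+2t, ..., i+(s-1)t, where s = floor(B/t)."""
    s = len(z) // t                       # sequence length of each group
    return [z[i:i + s * t:t] for i in range(t)]

z = np.arange(1, 201)                     # a 200-band spectral vector (as in the IP dataset)
groups = group_bands(z, t=3)              # three groups, as used in the experiments
```

With B = 200 and t = 3, each of the three sequences has length s = 66 and samples the spectrum at a stride of 3 bands, so all three groups cover nearly the full 0.4-2.5 μm range.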

3-D CNN
The 3-D CNN has unique advantages in processing spatial-spectral features. Since there are many bands in an HSI, the 3-D convolution has a large computational complexity when extracting spatial-spectral features, which influences the efficiency of HSI classification. Therefore, in the Bi-LSTM-CNN, the 3-D CNN is used for the HSI after PCA dimensionality reduction to reduce the computational complexity.
The HSI is denoted by X ∈ R^(W×H×B), where X represents the original data, W and H represent the width and the height, respectively, and B denotes the number of spectral bands. Each HSI pixel in X contains a value in each of the B spectral bands and a one-hot label vector Y = (y_1, y_2, ..., y_C) ∈ R^(1×1×C), where C denotes the number of land-cover categories. Before the convolution operation, to remove the spectral correlation and decrease the computational costs, the number of spectral bands is decreased from B to D by the PCA, while keeping the spatial dimensions W and H constant. The spectral bands are reduced, but the essential spatial information for HSI classification is preserved. We create 3-D patches centered on each pixel by extracting adjacent regions of size S × S × D, where S × S denotes the size of the window and D denotes the number of first principal components reserved by the PCA. The label of the central pixel determines the truth label of each patch. The patch dataset is represented by P ∈ R^(N×S×S×D), where N represents the number of samples.
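The PCA reduction and 3-D patch extraction described above can be sketched in NumPy as follows. The helper names pca_reduce and extract_patch are hypothetical, and the zero padding at image borders is an assumption of this sketch (the paper does not state how border pixels are handled).

```python
import numpy as np

def pca_reduce(X, D):
    """Project the B spectral bands of an HSI cube X (W x H x B)
    onto the first D principal components."""
    W, H, B = X.shape
    flat = X.reshape(-1, B)
    flat = flat - flat.mean(axis=0)               # center each band
    cov = np.cov(flat, rowvar=False)              # B x B band covariance
    vals, vecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    comps = vecs[:, np.argsort(vals)[::-1][:D]]   # D components, largest variance first
    return (flat @ comps).reshape(W, H, D)

def extract_patch(Xr, row, col, S):
    """S x S x D patch centered on pixel (row, col) of the reduced cube,
    with zero padding at the borders."""
    pad = S // 2
    Xp = np.pad(Xr, ((pad, pad), (pad, pad), (0, 0)))
    return Xp[row:row + S, col:col + S, :]

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 10, 20))   # toy cube: 10 x 10 pixels, 20 bands
Xr = pca_reduce(X, D=5)             # keep the first 5 principal components
patch = extract_patch(Xr, row=0, col=0, S=5)
```

Each patch inherits the label of its central pixel, so stacking one patch per labeled pixel yields the dataset P ∈ R^(N×S×S×D) used as the 3-D CNN input.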
To obtain the spatial-spectral feature maps in the 3-D CNN, 3-D convolution is executed three times. Considering the crucial role of 2-D convolution in spatial information extraction, we apply a 2-D convolution to further extract spatial feature maps before the flatten layer. To prevent overfitting and improve the generalization of the model, dropout is applied once after each FC layer. The structure of the 3-D CNN is shown in Figure 6.

Loss Function
In this framework, we adopt the softmax function after the final FC layer to determine the probability distributions over the pixel classes. In addition, to increase the nonlinearity and accelerate the convergence of the Bi-LSTM-CNN, we adopt the rectified linear units (ReLU) function after each layer.
To better train the parameters of the whole framework, after the final FC layer in the Bi-LSTM and the 3-D CNN, we adopt auxiliary loss functions. The complete loss function is defined as Equation (10):

L = L^F + L^B + L^C   (10)

where L represents the total loss function, each term being the classification loss computed from ŷ_i and y_i, the predicted label and the true label of the i-th training sample, respectively. The superscripts F, B, and C denote the whole framework, the Bi-LSTM network, and the 3-D CNN, respectively. The variable n denotes the number of training samples. The parameters of the Bi-LSTM-CNN were optimized by the mini-batch stochastic gradient descent (SGD) algorithm.
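Assuming each term is a standard cross-entropy loss and that the three terms are summed without weights (the garbled source does not spell out the individual terms, so both are assumptions of this sketch), the shared-plus-auxiliary loss can be written as:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross-entropy over n samples; labels are integer class indices."""
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

def unified_loss(p_joint, p_bilstm, p_cnn, labels):
    """Shared loss of the unified network plus one auxiliary loss per
    subnetwork; an unweighted sum is assumed here."""
    return (cross_entropy(p_joint, labels)    # L^F: concatenated-FC softmax output
            + cross_entropy(p_bilstm, labels) # L^B: Bi-LSTM auxiliary head
            + cross_entropy(p_cnn, labels))   # L^C: 3-D CNN auxiliary head

labels = np.array([0, 1])
p = np.array([[0.9, 0.1], [0.2, 0.8]])  # toy softmax outputs for two samples
loss = unified_loss(p, p, p, labels)
```

Because all three terms share the same labels, minimizing this sum with mini-batch SGD updates both subnetworks and the joint head at once, which is how the two feature extractors are optimized simultaneously.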
The implementation procedure of the proposed Bi-LSTM-CNN method is shown in Algorithm 1.
Input: the size of the patch S and the number of retained principal components D.
Step 1: For each pixel in the HSI, use Equation (9) to divide the hyperspectral cube into t sequences as the input of the Bi-LSTM network.
Step 2: Retain the first D principal components with PCA. Extract a patch of size S × S × D in the neighborhood of each pixel of the reduced-dimensional HSI as the input of the 3-D CNN.
Step 3: Initialize the weights of the Bi-LSTM-CNN with random values drawn from a Gaussian distribution, where the mean, the standard deviation, and the bias terms are initialized to 0, 0.1, and 0, respectively.
Step 4: Import the training samples into the Bi-LSTM-CNN. The Bi-LSTM and the 3-D CNN extract the spectral features and the spatial-spectral features of the HSI, respectively. Then, softmax is applied to classify the extracted features. Finally, the mini-batch SGD algorithm is exploited to optimize the parameters of the Bi-LSTM-CNN, which are adjusted by backpropagation to obtain the optimal values.
Step 5: For each pixel in the HSI, input the corresponding data from Step 1 and Step 2 to the Bi-LSTM-CNN to obtain the predicted value for the HSI.

Output
Prediction results for each pixel of the HSI.

Results
In this section, three open HSI datasets (available at http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes, accessed on 14 November 2020) are used to evaluate the performance of the Bi-LSTM-CNN with three evaluation metrics: overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa).
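The three metrics can all be computed from a confusion matrix. Here is a minimal sketch (it assumes every class appears at least once in the true labels, so no row of the matrix is empty):

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """OA, AA, and Cohen's kappa from true and predicted class labels."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                       # rows: true class, columns: predicted class
    n = cm.sum()
    oa = np.trace(cm) / n                   # fraction of all pixels classified correctly
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))  # mean of the per-class accuracies
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n ** 2  # expected chance agreement
    kappa = (oa - pe) / (1 - pe)            # agreement corrected for chance
    return oa, aa, kappa

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
oa, aa, kappa = classification_metrics(y_true, y_pred, n_classes=3)
```

OA weights classes by their sample counts, while AA treats every class equally, which is why both are reported for HSI datasets with strongly imbalanced classes.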

Dataset Description and Training Details
The Indian Pines (IP) dataset is an image acquired by the AVIRIS sensor in 1992 over the Agricultural and Forestry Hyperspectral Experiment site in northwestern Indiana. The scene is an agricultural region with geometrically regular crop areas but irregular forest areas. The image has a size of 145 × 145 pixels and 224 spectral reflectance bands, and the spatial resolution of each pixel is 20 m. Of these, 24 spectral bands were excluded because they cover the water absorption region; the wavelength range of the retained 200 bands is 0.4-2.5 μm. The available samples were divided into 16 categories, representing approximately half of the total data. The false-color composite image and the ground-truth map are shown in Figure 7a,b, respectively. In the experiment, the samples were selected randomly from the labeled parts of each category, and the ratio of the training set to the testing set was 1:9. Table 1 exhibits the details of the samples, as well as the corresponding colors of the ground-truth map.
The University of Pavia (PU) dataset, covering a university campus in northern Italy, was gathered by the ROSIS-03 sensor in 2001. This scene has a size of 610 × 340 × 115 and a wavelength range of 0.43-0.86 μm. The spatial resolution of each pixel is 1.3 m. The scene contains nine categories and 103 spectral bands after 12 noisy bands were discarded. The false-color composite image and the ground-truth map are shown in Figure 8a,b, respectively. Table 2 exhibits the details of the samples, as well as the corresponding colors of the ground-truth map.
Of the labeled pixels of the PU dataset, only 5% were used as the training set and the rest as the testing set. The Salinas Valley (SV) scene is an image of the Salinas Valley, California, collected by the AVIRIS sensor. The scene forms a cube of dimensions 512 × 217 × 224, and the spatial resolution of each pixel is 3.7 m. Similar to the IP dataset, after discarding 20 water absorption and noise bands, only 204 bands remained. This scene includes 16 different agricultural crop categories. The false-color composite image and the ground-truth map are shown in Figure 9a,b, respectively. Table 3 exhibits the details of the samples, as well as the corresponding colors of the ground-truth map. Among the labeled pixels of this scene, only 5% were used as the training set and the rest as the testing set. In all three datasets, the training samples were selected randomly.

Experimental Configuration
All the experiments were run with an NVIDIA GTX 1070 GPU and an Intel i7-6700 3.40-GHz CPU with 32 GB of RAM. For each experiment, the random partition into training and testing data was repeated 10 times. Based on several experiments, we chose 0.0001 as the learning rate. To optimize the Bi-LSTM-CNN, we adopted the mini-batch SGD algorithm with a batch size of 128. In the Bi-LSTM, the spectral bands were divided into three groups.
As Table 4 shows, the Bi-LSTM-CNN obtained the best results when the input window of the 3-D patches was 25 × 25. Table 5 shows the effect of the number of retained principal components D on the classification results for the IP dataset. When D is 30, the classification results are best; if D keeps increasing, the number of network parameters increases sharply. Therefore, the input size of the 3-D patches was set to 25 × 25 × 30. Figure 10 shows the curves of classification accuracy versus training epochs on the IP, PU, and SV datasets. At epoch 100, the classification accuracy was close to 1 but still unstable; it became stable when the number of epochs reached 300, so the Bi-LSTM-CNN was trained for 300 epochs. The parameters of the proposed Bi-LSTM-CNN method on the IP dataset are shown in Table 6.
Table 4. Impact of the input window size of the 3-D patches on the performance.

Classification Results
In this paper, we compared the Bi-LSTM-CNN with several state-of-the-art methods: CNN1D, CNN2D, and CNN3D [56], SSUN [57], and HybridSN [58]. CNN1D, CNN2D, and CNN3D use 1-D, 2-D, and 3-D convolution, respectively. SSUN uses an LSTM and a multiscale CNN to extract spectral and spatial features for joint spatial-spectral classification. HybridSN adopts a mixture of 3-D and 2-D convolutions to extract spatial-spectral features dominated by spatial information. All the comparison methods were run in the same environment. Tables 7-9 show the results acquired by the six methods on IP (10% of the labeled samples for training), PU (5%), and SV (5%), including OA, AA, Kappa, testing time, and the accuracy for each class; all results were obtained on the testing set. CNN1D had the worst classification results: all three evaluation metrics (OA, AA, and Kappa) of CNN1D were lower than those of the other methods, and its accuracy for each class is the lowest among the six methods. CNN2D had better classification results than CNN1D but still lagged considerably behind the other methods. Among the remaining methods, each achieves the best results in some classes. Overall, the Bi-LSTM-CNN obtained higher OA, AA, and Kappa than the other methods, as well as the highest accuracy in most classes. The testing times of CNN1D, CNN2D, and SSUN are shorter than those of the other methods.
The classification maps of the six methods on the three datasets are shown in Figures 7-9, which present the prediction results of the six methods for all labeled samples. It is obvious that the classification maps of CNN1D and CNN2D contain a large amount of salt-and-pepper noise. As the remaining four methods use both spatial and spectral information, their classification maps approximate the ground-truth maps more closely. In particular, the Bi-LSTM-CNN has very few pixels that differ from the ground-truth map.

Discussion
From the experimental results, it is obvious that the Bi-LSTM-CNN significantly outperforms the other methods. The OA of the CNN1D method did not exceed 94% on any of the considered datasets. Since the input data of CNN1D is a 1-D vector, the spatial information of the input is lost, resulting in the worst classification results among all methods. The CNN2D model considers the spatial information, which improves its classification results compared to CNN1D. This shows that spatial information is critical for HSI classification.
However, the CNN2D model has problems that usually result in degraded shapes of some objects and materials. The union of spatial and spectral information can address this issue, and the other methods (CNN3D, SSUN, HybridSN, and Bi-LSTM-CNN) all achieve classification results closer to the corresponding ground-truth maps. The SSUN model extracts spatial and spectral features separately, which are then integrated and sent to the classifier. As spatial features dominate the classification results, SSUN cannot effectively balance the two kinds of features, so the spectral features contribute little to the classification results. The CNN3D model directly extracts the spatial-spectral features of the HSI, but to reduce the computational complexity of the convolutional layers, PCA dimensionality reduction is performed on the input data; hence, a small amount of spectral information is lost. Despite this, CNN3D still spends considerably more time in the testing phase on the PU and SV datasets than HybridSN and the Bi-LSTM-CNN.
The HybridSN model replaces the final 3-D convolutional layer with a 2-D convolution, decreasing the number of network parameters while maintaining accuracy. However, in the PU dataset experiments, the OA of HybridSN is lower than that of CNN3D, and the generalizability of HybridSN is slightly worse. In the Bi-LSTM-CNN, the limited spectral processing of the 3-D CNN is compensated by the Bi-LSTM, and the experimental results after adding the Bi-LSTM are significantly better than those of the other methods.
In the classes with a small number of samples, the Bi-LSTM-CNN method also obtains better classification results. In the IP dataset, the number of labeled samples in some classes is very small, so the number of available training samples is extremely limited. For example, the number of training samples for C1, C7, C9, and C16 is not more than ten, which greatly increases the learning difficulty for these classes. Nevertheless, except for C1, the Bi-LSTM-CNN method obtains higher accuracy in these classes than the other methods. In the PU and SV datasets, the number of training samples for each class is sufficient for the Bi-LSTM-CNN method, although there is a large difference in the number of samples between classes.

Conclusions
This paper proposed a unified network framework that contains a band-grouping-based Bi-LSTM network and a 3-D CNN for HSI classification. In this network, the Bi-LSTM can extract high-quality spectral features by considering complete spectral contextual information, which compensates for the shortcomings of the 3-D CNN. The Bi-LSTM-CNN network is able to harness the strengths of both subnetworks by using auxiliary loss functions. Compared with the model using only the 3-D CNN, the Bi-LSTM-CNN can obtain better classification results by adding only a few parameters. On the PU and SV datasets, we validated the performance of the model using less training data (5%). The experimental results showed that the Bi-LSTM-CNN method significantly improves the accuracy of HSI classification. In future work, we will either replace the LSTM with the gated recurrent unit to improve the speed of the network or optimize the 3-D CNN to further improve the HSI classification results.
Author Contributions: Methodology, software and conceptualization, J.Y. and Q.C.; modification and writing-review and editing, C.Q.; investigation and data curation, J.Q. All authors have read and agreed to the published version of the manuscript.