Bidirectional-Convolutional LSTM Based Spectral-Spatial Feature Learning for Hyperspectral Image Classification

This paper proposes a novel deep learning framework named the bidirectional-convolutional long short-term memory (Bi-CLSTM) network to automatically learn the spectral-spatial features of hyperspectral images (HSIs). In the network, the issue of spectral feature extraction is considered as a sequence learning problem, and a recurrent connection operator across the spectral domain is used to address it. Meanwhile, inspired by the widely used convolutional neural network (CNN), a convolution operator across the spatial domain is incorporated into the network to extract the spatial feature. Besides, to sufficiently capture the spectral information, a bidirectional recurrent connection is proposed. In the classification phase, the learned features are concatenated into a vector and fed to a softmax classifier via a fully-connected operator. To validate the effectiveness of the proposed Bi-CLSTM framework, we compare it with several state-of-the-art methods, including the CNN framework, on three widely used HSIs. The obtained results show that Bi-CLSTM improves the classification performance compared with the other methods.


I. INTRODUCTION
Current hyperspectral sensors can acquire images with high spectral and spatial resolutions simultaneously. For example, the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor covers 224 contiguous spectral bands across the electromagnetic spectrum with a spatial resolution of 3.7 meters. Such rich information has been successfully used in various applications such as urban mapping, environmental management, crop analysis, and mineral detection.
For these applications, an essential step is image classification, whose purpose is to identify the label of each pixel. Hyperspectral image (HSI) classification is a challenging task, for which there exist two important issues [1], [2]. The first one is the curse of dimensionality. HSIs usually contain several hundred spectral bands. Such high-dimensional data with limited numbers of training samples can easily result in the Hughes phenomenon [3], which means that the classification accuracy starts to decrease once the number of features exceeds a threshold. The other one is the use of spatial information. The improvement of spatial resolutions may increase spectral variations among intra-class pixels while decreasing spectral variations among inter-class pixels [4], [5]. Thus, using spectral information alone is not enough to obtain a satisfying result.
To solve the first issue, a widely used method is to project the original data into a low-dimensional subspace in which most of the useful information is preserved. In the existing literature, a large number of feature extraction (FE) methods have been proposed [6], [7], [8]. They can be roughly divided into two categories: unsupervised FE methods and supervised ones. The unsupervised methods attempt to reveal low-dimensional data structures without using any label information of the training samples. Typical methods include, but are not limited to, principal component analysis (PCA) [6], neighborhood preserving embedding (NPE) [9], and independent component analysis (ICA) [10]. Different from them, the supervised methods take advantage of the label information to learn discriminative projections [11]. One typical method is linear discriminant analysis (LDA) [12], [13], which aims to maximize the inter-class distance and minimize the intra-class distance. In [7], a non-parametric weighted FE (NWFE) method was proposed. NWFE extends LDA by integrating nonparametric scatter matrices with training samples around the decision boundary [7]. Local Fisher discriminant analysis (LFDA) was proposed in [14]; it extends LDA by assigning greater weights to more closely connected samples.
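As a minimal sketch of unsupervised FE, the PCA projection mentioned above can be computed from the eigendecomposition of the band covariance matrix. The function name and the toy data below are illustrative assumptions, not part of any compared implementation.

```python
import numpy as np

def pca_project(pixels, n_components):
    """Project pixel spectra onto the top principal components.

    pixels: (num_pixels, num_bands) array of spectral vectors.
    Returns the (num_pixels, n_components) low-dimensional features.
    """
    centered = pixels - pixels.mean(axis=0)      # zero-mean each band
    cov = np.cov(centered, rowvar=False)         # (bands, bands) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]     # leading eigenvectors
    return centered @ top

# Toy example: 100 "pixels" with 20 spectral bands reduced to 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Z = pca_project(X, 3)
print(Z.shape)  # (100, 3)
```

By construction, the variance of the projected features is non-increasing across components, which is why keeping only the leading components preserves most of the useful information.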
To address the second issue, many works have been proposed to incorporate the spatial information into the spectral information [15], [16], [17]. This is because the coverage area of one kind of material or one object usually contains more than one pixel. Current spatial-spectral feature fusion methods can be categorized into three classes: feature-level fusion, decision-level fusion, and regularization-level fusion [2]. For feature-level fusion, one often extracts the spatial features and the spectral features independently and then concatenates them into a vector [4], [18], [19], [20]. However, the direct concatenation leads to a high-dimensional feature space. For decision-level fusion, multiple results are first derived using the spatial and spectral information respectively and then combined according to some strategy, such as majority voting [21], [22], [23]. For regularization-level fusion, a regularizer representing the spatial information is incorporated into the original objective function. For example, in [24] and [25], a Markov random field (MRF) modeling the joint prior probabilities of each pixel and its spatial neighbors was incorporated into the Bayesian classifier as a regularizer. Although this method works well in capturing the spatial information, optimizing the objective function of the MRF is time consuming, especially on high-resolution data.
Recently, deep learning (DL) has been introduced for HSI classification. The core idea of DL is to automatically learn high-level semantic features from the data itself in a hierarchical manner. In [29] and [30], the autoencoder model has been successfully used for HSI classification. In general, the input of the autoencoder model is a high-dimensional vector. Thus, to learn the spatial feature from HSIs, an alternative method is to flatten a local image patch into a vector and then feed it into the model. However, this method may destroy the two-dimensional (2D) structure of images, leading to the loss of spatial information. Similar issues can be found in the deep belief network (DBN) [31]. To address this issue, convolutional neural network (CNN) based deep models have been widely used [1], [32], [33].
They directly take the original image or a local image patch as the network input, and use locally-connected and weight-sharing structures to extract the spatial features from HSIs. In [1], the authors designed a CNN network with three convolutional layers and one fully-connected layer. Besides, the input of the network is the first principal component of the HSI extracted by PCA. Although the experimental results demonstrate that this model can successfully learn the spatial feature of HSIs, it may fail to extract the spectral features.
Recently, a three-dimensional (3D) CNN model was proposed in [32]. In order to extract the spectral-spatial features from HSIs, the authors consider 3D image patches as the input of the network. This complex structure inevitably increases the number of parameters, easily leading to the overfitting problem when the number of training samples is limited.
In this paper, we propose a bidirectional-convolutional long short-term memory (Bi-CLSTM) network to address the spectral-spatial feature learning problem. Specifically, we regard all the spectral bands as an image sequence and model their relationships using a powerful LSTM network [34]. Similar to other fully-connected networks such as the autoencoder and DBN, LSTM cannot capture the spatial information of HSIs. Inspired by CNNs, we replace the fully-connected operators in the network with convolutional operators, resulting in a convolutional LSTM (CLSTM) network. Thus, CLSTM can simultaneously learn the spectral and spatial features. Besides, LSTM assumes that only previous states affect the current state, while the spectral channels in the sequence are in fact correlated with each other in both directions. To address this issue, we further propose a Bi-CLSTM network. During the training process of the Bi-CLSTM network, we adopt two tricks to alleviate the overfitting problem: dropout and data augmentation.

II. METHODOLOGY
The flowchart of the proposed Bi-CLSTM model is shown in Fig. 1. Suppose an HSI can be represented as a 3D matrix X ∈ R^{m×n×l} with m × n pixels and l spectral channels. Given a pixel at the spatial position (i, j), where 1 ≤ i ≤ m and 1 ≤ j ≤ n, we can choose a small sub-cube X_ij ∈ R^{p×p×l} centered at it.
The goal of Bi-CLSTM is to learn the most discriminative spectral-spatial information from X_ij. Such information is the final feature representation for the pixel at the spatial position (i, j). If we split the sub-cube across the spectral channels, then X_ij can be considered as an l-length sequence (x^1_ij, x^2_ij, ..., x^l_ij), where each x^k_ij ∈ R^{p×p} is the image patch of the k-th spectral channel. The image patches in the sequence are fed into the CLSTM one by one to extract the spectral feature via a recurrent operator and the spatial feature via a convolution operator simultaneously.
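The sub-cube extraction and spectral splitting described above can be sketched as follows; the border handling, patch size, and toy cube are illustrative assumptions.

```python
import numpy as np

def spectral_sequence(hsi, i, j, p):
    """Extract a p x p x l sub-cube centered at pixel (i, j) and split it
    into an l-length sequence of p x p patches, one per spectral band.

    hsi: (m, n, l) hyperspectral cube; p is assumed odd so the patch can
    be centered on (i, j). Image-border handling is ignored for brevity.
    """
    r = p // 2
    cube = hsi[i - r:i + r + 1, j - r:j + r + 1, :]       # (p, p, l)
    return [cube[:, :, k] for k in range(cube.shape[2])]  # l patches of (p, p)

# Toy cube: 10 x 10 pixels, 5 bands; 3 x 3 patches around pixel (5, 5).
hsi = np.arange(10 * 10 * 5, dtype=float).reshape(10, 10, 5)
seq = spectral_sequence(hsi, 5, 5, 3)
print(len(seq), seq[0].shape)  # 5 (3, 3)
```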
CLSTM is a modification of LSTM. The structure of CLSTM is shown in Fig. 2, where the left side zooms in on its core computation unit, named a memory cell. For the k-th image patch x^k_ij in the sequence, CLSTM first decides what information to discard from the previous cell state via a forget gate F^k_ij, and then decides what new information to store via an input gate I^k_ij and a candidate state Ĉ^k_ij. After finishing these two parts, CLSTM multiplies the previous memory cell state C^{k-1}_ij by F^k_ij, adds the product to I^k_ij • Ĉ^k_ij, and updates the cell state to C^k_ij. Finally, CLSTM decides what information to output via the cell state C^k_ij and an output gate O^k_ij. The above process can be formulated as the following equations:

F^k_ij = σ(W_xf * x^k_ij + W_hf * H^{k-1}_ij + b_f)
I^k_ij = σ(W_xi * x^k_ij + W_hi * H^{k-1}_ij + b_i)
Ĉ^k_ij = tanh(W_xc * x^k_ij + W_hc * H^{k-1}_ij + b_c)
C^k_ij = F^k_ij • C^{k-1}_ij + I^k_ij • Ĉ^k_ij
O^k_ij = σ(W_xo * x^k_ij + W_ho * H^{k-1}_ij + b_o)
H^k_ij = O^k_ij • tanh(C^k_ij)

where σ is the logistic sigmoid function, '*' is a convolutional operator, '•' is an element-wise product, H^k_ij is the hidden state (the output of the cell), and b_f, b_i, b_c, and b_o are bias terms. The weight matrix subscripts have the obvious meaning; for example, W_hi is the hidden-input gate matrix and W_xo is the input-output gate matrix.
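For illustration, one CLSTM step following the gate equations above can be sketched with a single-channel hidden state and a single 3 × 3 kernel per gate. The actual network uses 32 kernels and multi-channel states; all names and sizes below are illustrative assumptions.

```python
import numpy as np

def conv2d_same(x, w):
    """'Same' 2D convolution (technically cross-correlation) of a
    single-channel map x with kernel w, zero-padded to keep x's size."""
    kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for a in range(x.shape[0]):
        for b in range(x.shape[1]):
            out[a, b] = np.sum(xp[a:a + kh, b:b + kw] * w)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def clstm_step(x, h_prev, c_prev, W):
    """One CLSTM step; W holds the input-to-gate and hidden-to-gate
    kernels plus biases for the f, i, c, and o gates."""
    f = sigmoid(conv2d_same(x, W['xf']) + conv2d_same(h_prev, W['hf']) + W['bf'])
    i = sigmoid(conv2d_same(x, W['xi']) + conv2d_same(h_prev, W['hi']) + W['bi'])
    c_tilde = np.tanh(conv2d_same(x, W['xc']) + conv2d_same(h_prev, W['hc']) + W['bc'])
    c = f * c_prev + i * c_tilde           # forget old state, add new candidate
    o = sigmoid(conv2d_same(x, W['xo']) + conv2d_same(h_prev, W['ho']) + W['bo'])
    h = o * np.tanh(c)                     # hidden state / output feature map
    return h, c

rng = np.random.default_rng(0)
p = 5
W = {k: rng.normal(scale=0.1, size=(3, 3)) for k in
     ('xf', 'hf', 'xi', 'hi', 'xc', 'hc', 'xo', 'ho')}
W.update({b: 0.0 for b in ('bf', 'bi', 'bc', 'bo')})
h = c = np.zeros((p, p))                   # zero-initialized states
for x in rng.normal(size=(4, p, p)):       # 4-band toy spectral sequence
    h, c = clstm_step(x, h, c, W)
print(h.shape)  # (5, 5)
```

Because the recurrence carries 2D feature maps rather than flattened vectors, the spatial structure of each patch is preserved across spectral steps.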
In the existing literature, LSTM has been well acknowledged as a powerful network for ordered sequence learning problems, based on the assumption that previous states affect future states.
However, different from the traditional sequence learning problem, the spectral channels in the sequence are correlated with each other in both directions. In [35], the bidirectional recurrent neural network (Bi-RNN) was proposed to exploit both past and future information when modeling sequential data. Motivated by it, we use the Bi-CLSTM network shown in Fig. 1 to sufficiently extract the spectral feature. Specifically, the image patches are fed into the CLSTM network one by one in a forward and a backward sequence, respectively. After that, we acquire two spectral-spatial feature sequences. In the classification stage, they are concatenated into a vector, and a softmax layer is used to obtain the probability of each class that the pixel belongs to.
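The bidirectional wrapper can be sketched generically: run any recurrent cell over the band sequence forwards and backwards, then concatenate the two final hidden states. The toy cell below only illustrates the control flow and is not the CLSTM itself.

```python
import numpy as np

def bidirectional_features(seq, step, h0, c0):
    """Run a recurrent step function over the sequence in both directions
    and concatenate the two final hidden states into one feature vector.

    step(x, h, c) -> (h, c) can be any recurrent cell (e.g. a CLSTM step).
    """
    h_f, c_f = h0, c0
    for x in seq:                  # forward pass over the spectral bands
        h_f, c_f = step(x, h_f, c_f)
    h_b, c_b = h0, c0
    for x in reversed(seq):        # backward pass over the same bands
        h_b, c_b = step(x, h_b, c_b)
    return np.concatenate([h_f.ravel(), h_b.ravel()])

# Toy cell: leaky accumulation of the patch mean, just to exercise the wrapper.
def toy_step(x, h, c):
    c = 0.5 * c + x.mean()
    return np.tanh(c), c

seq = [np.full((3, 3), v) for v in (1.0, 2.0, 3.0)]
feat = bidirectional_features(seq, toy_step, np.zeros((3, 3)), np.zeros((3, 3)))
print(feat.shape)  # (18,)
```

Note that the forward and backward halves of the feature vector differ whenever the sequence is not symmetric, which is exactly the extra spectral information the bidirectional connection captures.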
It is well known that the performance of DL algorithms depends on the number of training samples.
However, often only a small number of labeled samples are available in HSIs. To this end, we adopt two data augmentation methods: flipping and rotating operators. Specifically, we rotate the HSI patches by 90, 180, and 270 degrees anticlockwise and flip them horizontally and vertically. Furthermore, we rotate the horizontally and vertically flipped patches by 90 degrees separately. As a result, the number of training samples is increased eightfold. Besides data augmentation, dropout [36] is also used to improve the performance of Bi-CLSTM. We set some outputs of neurons to zero, which means these neurons neither propagate any information forward nor participate in the back-propagation learning algorithm.
Every time an input is sampled, the network randomly drops neurons to form a different structure. In the next section, we will validate the effectiveness of the data augmentation and dropout methods.
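The eightfold augmentation described above (the original patch, its three rotations, two flips, and the two 90-degree-rotated flips) can be sketched as:

```python
import numpy as np

def augment(patch):
    """Generate the eight augmented versions of a (p, p, l) HSI patch:
    the original, its 90/180/270-degree rotations, the horizontal and
    vertical flips, and the two flips rotated by 90 degrees."""
    lr, ud = patch[:, ::-1, :], patch[::-1, :, :]   # horizontal / vertical flip
    return [patch,
            np.rot90(patch, 1), np.rot90(patch, 2), np.rot90(patch, 3),
            lr, ud,
            np.rot90(lr, 1), np.rot90(ud, 1)]

# Toy 3 x 3 patch with 2 bands; an asymmetric patch yields 8 distinct versions.
patch = np.arange(18, dtype=float).reshape(3, 3, 2)
aug = augment(patch)
print(len(aug))  # 8
```

These eight transforms are exactly the symmetries of a square (the dihedral group D4), so class labels are unaffected while the spatial patterns vary.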

III. EXPERIMENTS

A. Datasets
We test the proposed Bi-CLSTM model on three HSIs, which are widely used to evaluate classification algorithms.

B. Experimental Setup
We compare the proposed Bi-CLSTM model with several FE methods, including PCA, LDA, NWFE, RLDE [11], MDA [2], and CNN [32]. Additionally, we also directly use the original pixels as a benchmark.
For LDA, the within-class scatter matrix S_W is replaced by S_W + εI, where ε = 10^{-3}, to alleviate the singularity problem. The optimal reduced dimensions for PCA, LDA, NWFE, and RLDE are chosen from the interval [2, 30]. For MDA, the optimal window size is selected from the set {3, 5, 7, 9, 11}. For Bi-CLSTM, we build a bidirectional network with two CLSTM layers to extract features. Similar to CNN, each convolution operation in Bi-CLSTM is followed by max-pooling, and we empirically set the size of the convolution kernels to 3 × 3 and the number of convolution kernels to 32. Without loss of generality, we initialize the state of CLSTM to zeros. The detailed configuration of Bi-CLSTM is listed in Table IV. For the Indian Pines and KSC datasets, we randomly select 10% of the pixels from each class as the training set and use the remaining pixels as the testing set. Following the experiments in [2], we randomly choose 3921 pixels as the training set and the rest of the pixels as the testing set for the Pavia University dataset. The detailed numbers of training and testing samples are listed in Table I to Table III. In order to reduce the effects of random selection, all the algorithms are repeated five times and the average results are reported. The classification performance is evaluated by the overall accuracy (OA), the average accuracy (AA), the per-class accuracy, and the Kappa coefficient κ. OA is the ratio of the number of correctly classified pixels to the total number of pixels in the testing set, AA is the average of the accuracies over all classes, and κ is the percentage of agreement corrected by the amount of agreement that would be expected purely by chance.
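For reference, OA, AA, and κ can be computed from a confusion matrix as sketched below; the 2-class matrix is a made-up example, not a reported result.

```python
import numpy as np

def classification_metrics(confusion):
    """Compute OA, AA, and the Kappa coefficient from a confusion matrix
    whose entry (r, c) counts test pixels of true class r predicted as c."""
    n = confusion.sum()
    oa = np.trace(confusion) / n                         # overall accuracy
    per_class = np.diag(confusion) / confusion.sum(axis=1)
    aa = per_class.mean()                                # average accuracy
    # Chance agreement from the true (row) and predicted (column) marginals.
    pe = np.sum(confusion.sum(axis=0) * confusion.sum(axis=1)) / n**2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy 2-class example: 90 + 80 correctly classified out of 200 test pixels.
cm = np.array([[90., 10.],
               [20., 80.]])
oa, aa, kappa = classification_metrics(cm)
print(round(oa, 3), round(aa, 3), round(kappa, 3))  # 0.85 0.85 0.7
```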

C. Parameter Selection
There are four important factors that influence Bi-CLSTM: dropout, data augmentation, the network framework, and the size of the input image patches. We first search for the optimal size of the image patches with the other factors fixed, and finally validate the effectiveness of the dropout and data augmentation operators.

D. Performance Comparison
To demonstrate the superiority of the proposed Bi-CLSTM model, we quantitatively and qualitatively compare it with the aforementioned methods.

IV. CONCLUSION
In this paper, we propose a novel bidirectional-convolutional long short-term memory (Bi-CLSTM) network to automatically learn the spectral-spatial features of hyperspectral images (HSIs). The input of the network is the whole set of spectral channels of an HSI, and a bidirectional recurrent connection operator across them is used to sufficiently explore the spectral information. Besides, motivated by the widely used convolutional neural network (CNN), the fully-connected operators in the network are replaced by convolution operators across the spatial domain to capture the spatial information. By conducting experiments on three HSIs collected by different instruments (AVIRIS and ROSIS), we compare the proposed method with several feature extraction methods, including CNN. The experimental results indicate that using spatial information improves the classification performance and yields more homogeneous regions in the classification maps compared with using spectral information alone. In addition, the proposed method improves the OA, AA, and κ on all three HSIs compared with CNN. We also evaluate the influences of different components in the network, including dropout, data augmentation, and the patch size.