1. Introduction
Current hyperspectral sensors can acquire images with high spectral and spatial resolutions simultaneously. For example, the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor covers 224 contiguous spectral bands across the electromagnetic spectrum with a spatial resolution of 3.7 m. Such rich information has been successfully used in various applications such as national defense, urban planning, precision agriculture, and environment monitoring [1].
For these applications, an essential step is image classification, whose purpose is to identify the label of each pixel. Hyperspectral image (HSI) classification is a challenging task, and two important issues exist [2,3]. The first is the curse of dimensionality. HSI provides very high-dimensional data with hundreds of spectral channels ranging from the visible to the short-wave infrared region of the electromagnetic spectrum. Such high-dimensional data with limited numbers of training samples can easily result in the Hughes phenomenon [4], in which the classification accuracy starts to decrease once the number of features exceeds a threshold. The second is the use of spatial information. Improved spatial resolution may increase spectral variations among intra-class pixels while decreasing spectral variations among inter-class pixels [5,6]. Thus, using spectral information alone is not enough to obtain satisfying results.
To solve the first issue, a widely used method is to project the original data into a low-dimensional subspace in which most of the useful information can be preserved. A large amount of work along this line has been proposed in the literature [7,8,9,10]. It can be roughly divided into two categories: unsupervised feature extraction (FE) methods and supervised ones. Unsupervised methods attempt to reveal low-dimensional data structures without using any label information of training samples. These methods retain the overall structure of the data and do not focus on the separating information of samples. Typical methods include, but are not limited to, principal component analysis (PCA) [7], neighborhood preserving embedding (NPE) [11], and independent component analysis (ICA) [12]. In contrast, supervised learning methods explore the information of labeled data to learn a discriminant subspace. One typical method is linear discriminant analysis (LDA) [13,14], which aims to maximize the inter-class distance and minimize the intra-class distance. In [8], a nonparametric weighted FE (NWFE) method was proposed. NWFE extends LDA by integrating nonparametric scatter matrices with training samples around the decision boundary [8]. Local Fisher discriminant analysis (LFDA) was proposed in [15], which extends LDA by assigning greater weights to more closely connected samples.
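As a quick illustration of unsupervised FE, the following sketch projects pixel spectra onto their top principal components with NumPy. This is a minimal PCA example, not the paper's pipeline; the pixel count, band count, and number of components are illustrative assumptions.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project pixel spectra onto the top principal components.

    X: (num_pixels, num_bands) matrix of spectra.
    Returns the (num_pixels, n_components) reduced representation.
    """
    X_centered = X - X.mean(axis=0)
    # Eigendecomposition of the band-by-band covariance matrix.
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return X_centered @ top

# Toy example: 100 "pixels" with 224 bands reduced to 10 features.
X = np.random.RandomState(0).randn(100, 224)
Z = pca_reduce(X, 10)
print(Z.shape)  # (100, 10)
```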
To address the second issue, many works have been proposed to incorporate spatial information into the spectral information [16,17,18]. This is because the coverage area of one kind of material or one object usually contains more than one pixel. Current spatial-spectral feature fusion methods can be categorized into three classes: feature-level fusion, decision-level fusion, and regularization-level fusion [3]. For feature-level fusion, one often extracts the spatial features and the spectral features independently and then concatenates them into a vector [5,19,20,21]. However, direct concatenation leads to a high-dimensional feature space. For decision-level fusion, multiple results are first derived using the spatial and spectral information, respectively, and then combined according to some strategy such as majority voting [22,23,24]. For regularization-level fusion, a regularizer representing the spatial information is incorporated into the original objective function. For example, in [25,26], Markov random field (MRF) modeling of the joint prior probabilities of each pixel and its spatial neighbors was incorporated into the Bayesian classifier as a regularizer. Although this method works well in capturing spatial information, optimizing the objective function in MRF is time-consuming, especially on high-resolution data.
Recently, deep learning (DL) has attracted much attention in the field of remote sensing [27,28,29,30]. The core idea of DL is to automatically learn high-level semantic features from the data itself in a hierarchical manner. In [31,32], the autoencoder model was successfully used for HSI classification. In general, the input of the autoencoder model is a high-dimensional vector. Thus, to learn spatial features from HSIs, an alternative method is to flatten a local image patch into a vector and then feed it into the model. However, this method may destroy the two-dimensional (2D) structure of images, leading to the loss of spatial information. Similar issues can be found in the deep belief network (DBN) [33]. To address this issue, convolutional neural network (CNN) based deep models have become popular [2,34]. They directly take the original image or a local image patch as the network input, and use locally-connected and weight-sharing structures to extract the spatial features from HSIs. In [2], the authors designed a CNN with three convolutional layers and one fully-connected layer, where the input of the network is the first principal component of the HSI extracted by PCA. Although the experimental results demonstrate that this model can successfully learn the spatial features of HSIs, it may fail to extract the spectral features. Recently, a three-dimensional (3D) CNN model was proposed in [34]. In order to extract spectral-spatial features from HSIs, the authors consider 3D image patches as the input of the network. This complex structure inevitably increases the number of parameters, easily leading to the overfitting problem with a limited number of training samples.
In this paper, we propose a bidirectional-convolutional long short-term memory (Bi-CLSTM) network to address the spectral-spatial feature learning problem. Specifically, we regard all the spectral bands as an image sequence, and model their relationships using a powerful LSTM network [35]. Similar to other fully-connected networks such as the autoencoder and DBN, LSTM cannot capture the spatial information of HSIs. Inspired by [36], we replace the fully-connected operators in the network with convolutional operators, resulting in a convolutional LSTM (CLSTM) network. Thus, CLSTM can simultaneously learn the spectral and spatial features. In addition, LSTM assumes that previous states affect future states, while the spectral channels in the sequence are correlated with each other in both directions. To address this issue, we further propose a Bi-CLSTM network. During the training process of the Bi-CLSTM network, we adopt two tricks, dropout and data augmentation, to alleviate the overfitting problem.
To sum up, the main contributions of this paper are as follows. First, we consider the images in all the spectral bands as an image sequence, and use LSTM to effectively model their relationships. Second, considering the specific characteristics of hyperspectral images, we further propose a unified framework that combines the merits of LSTM and CNN for spectral-spatial feature extraction.
2. Review of RNN and LSTM
The recurrent neural network (RNN) [37,38] is an extension of traditional neural networks used to address sequence learning problems. Unlike a feedforward neural network, an RNN adds recurrent edges that connect a neuron to itself across time, so that it can model a probability distribution over sequence data.
Figure 1 demonstrates an example of an RNN. The input of the network is a sequence $\{x_1, x_2, \dots, x_T\}$. The node updates its hidden state $h_t$, given its previous state $h_{t-1}$ and present input $x_t$, by

$$h_t = \sigma(W_{hx} \cdot x_t + W_{hh} \cdot h_{t-1} + b), \qquad (1)$$

where $W_{hx}$ is the weight between the input node and the recurrent hidden node, $W_{hh}$ is the weight between the recurrent hidden node and itself at the previous time step, and $b$ and $\sigma$ are the bias and nonlinear activation function, respectively.
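A minimal NumPy sketch of the recurrent update in Equation (1), unrolled over a short sequence; the dimensions, the tanh activation, and the random inputs are illustrative assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hx, W_hh, b):
    """One recurrent update: h_t = tanh(W_hx @ x_t + W_hh @ h_prev + b)."""
    return np.tanh(W_hx @ x_t + W_hh @ h_prev + b)

rng = np.random.RandomState(0)
input_dim, hidden_dim, T = 4, 8, 5
W_hx = rng.randn(hidden_dim, input_dim) * 0.1
W_hh = rng.randn(hidden_dim, hidden_dim) * 0.1
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for t in range(T):  # unroll the recurrence over the sequence
    h = rnn_step(rng.randn(input_dim), h, W_hx, W_hh, b)
print(h.shape)  # (8,)
```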
As an important branch of the deep learning family, RNNs have recently shown promising results in many machine learning and computer vision tasks [39,40]. However, it has been observed that training RNN models on long-term sequence data is difficult. As can be seen from Equation (1), the contribution of the recurrent hidden node $h_m$ at time $m$ to itself $h_n$ at time $n$ may approach zero or infinity as the time interval increases, depending on whether $W_{hh}<1$ or $W_{hh}>1$. This leads to the gradient vanishing and exploding problems [41]. To address this issue, Hochreiter and Schmidhuber proposed the LSTM, which replaces the recurrent hidden node with a memory cell. The memory cell contains a node with a self-connected recurrent edge of fixed weight one, ensuring that the gradient can pass across many time steps without vanishing or exploding. The LSTM unit consists of four important parts: the input gate $i_t$, output gate $o_t$, forget gate $f_t$, and candidate cell value $\tilde{C}_t$. Based on these parts, the memory cell and output can be computed by:

$$f_t = \sigma(W_{xf} \cdot x_t + W_{hf} \cdot h_{t-1} + b_f),$$
$$i_t = \sigma(W_{xi} \cdot x_t + W_{hi} \cdot h_{t-1} + b_i),$$
$$\tilde{C}_t = \tanh(W_{xC} \cdot x_t + W_{hC} \cdot h_{t-1} + b_C),$$
$$C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t,$$
$$o_t = \sigma(W_{xo} \cdot x_t + W_{ho} \cdot h_{t-1} + b_o),$$
$$h_t = o_t \circ \tanh(C_t),$$

where $\sigma$ is the logistic sigmoid function, ‘·’ is the matrix multiplication operator, ‘∘’ is the dot (element-wise) product operator, and $b_f$, $b_i$, $b_C$, and $b_o$ are bias terms. The weight matrix subscripts have the obvious meaning. For instance, $W_{hi}$ is the hidden-input gate matrix, and $W_{xo}$ is the input-output gate matrix, etc.
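The gate equations above can be sketched in NumPy as follows. The dictionary-based weight layout and the dimensions are illustrative assumptions for readability, not an actual implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM update following the gate equations above.

    W maps each gate name to a pair (W_x*, W_h*); b maps it to a bias.
    """
    f = sigmoid(W['f'][0] @ x_t + W['f'][1] @ h_prev + b['f'])   # forget gate
    i = sigmoid(W['i'][0] @ x_t + W['i'][1] @ h_prev + b['i'])   # input gate
    C_tilde = np.tanh(W['C'][0] @ x_t + W['C'][1] @ h_prev + b['C'])
    C = f * C_prev + i * C_tilde                                  # new cell state
    o = sigmoid(W['o'][0] @ x_t + W['o'][1] @ h_prev + b['o'])   # output gate
    h = o * np.tanh(C)
    return h, C

rng = np.random.RandomState(1)
d_in, d_h = 4, 6
W = {g: (rng.randn(d_h, d_in) * 0.1, rng.randn(d_h, d_h) * 0.1)
     for g in 'fiCo'}
b = {g: np.zeros(d_h) for g in 'fiCo'}
h, C = lstm_step(rng.randn(d_in), np.zeros(d_h), np.zeros(d_h), W, b)
print(h.shape, C.shape)  # (6,) (6,)
```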
3. Methodology
The flowchart of the proposed Bi-CLSTM model is shown in Figure 2. Suppose an HSI can be represented as a 3D matrix $\mathbf{X}\in\mathbf{R}^{m\times n\times l}$ with $m\times n$ pixels and $l$ spectral channels. Given a pixel at the spatial position $(i,j)$, where $1\le i\le m$ and $1\le j\le n$, we can choose a small sub-cube $\mathbf{X}_{ij}\in\mathbf{R}^{p\times p\times l}$ centered at it. The goal of Bi-CLSTM is to learn the most discriminative spectral-spatial information from $\mathbf{X}_{ij}$; this information is the final feature representation for the pixel at the spatial position $(i,j)$. If we split the sub-cube across the spectral channels, then $\mathbf{X}_{ij}$ can be considered an $l$-length sequence $\{(x_{ij}^{1}, x_{ij}^{2}, \cdots, x_{ij}^{l}) \mid x_{ij}^{k}\in\mathbf{R}^{p\times p\times 1}, 1\le k\le l\}$. The image patches in the sequence are fed into the CLSTM one by one to simultaneously extract the spectral feature via a recurrent operator and the spatial feature via a convolution operator.
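The splitting of a sub-cube into an $l$-length patch sequence can be sketched as follows; the cube dimensions are toy values, not those of a real HSI.

```python
import numpy as np

# A toy hyperspectral cube: m x n pixels, l spectral bands.
m, n, l, p = 20, 20, 10, 5
X = np.random.RandomState(0).randn(m, n, l)

def extract_sequence(X, i, j, p):
    """Take a p x p sub-cube centred at (i, j) and split it along the
    spectral axis into an l-length sequence of p x p patches."""
    r = p // 2
    sub = X[i - r:i + r + 1, j - r:j + r + 1, :]   # shape (p, p, l)
    return [sub[:, :, k] for k in range(sub.shape[2])]

seq = extract_sequence(X, 10, 10, p)
print(len(seq), seq[0].shape)  # 10 (5, 5)
```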
CLSTM is a modification of LSTM that replaces the fully-connected operators with convolutional operators [36]. The structure of CLSTM is shown in Figure 3, where the left side zooms in on its core computation unit, called a memory cell. In the memory cell, ‘⊗’ and ‘⊕’ represent the dot product and matrix addition, respectively. For the $k$-th image patch $x_{ij}^{k}$ in the sequence $\mathbf{X}_{ij}$, CLSTM first decides what information to throw away from the previous cell state $C_{ij}^{k-1}$ via the forget gate $F_{ij}^{k}$. The forget gate attends to $h_{ij}^{k-1}$ and $x_{ij}^{k}$, and outputs a value between 0 and 1 after an activation function, where 1 represents “keep the whole information” and 0 represents “throw away the information completely”. Second, CLSTM needs to decide what new information to store in the current cell state $C_{ij}^{k}$. This involves two parts: first, the input gate $I_{ij}^{k}$ decides what information to update in the same way as the forget gate; second, the memory cell creates a candidate value $\tilde{C}_{ij}^{k}$ computed from $h_{ij}^{k-1}$ and $x_{ij}^{k}$. After finishing these two parts, CLSTM multiplies the previous cell state $C_{ij}^{k-1}$ by $F_{ij}^{k}$, adds the product to $I_{ij}^{k}\circ\tilde{C}_{ij}^{k}$, and updates the cell state $C_{ij}^{k}$. Finally, CLSTM decides what information to output via the cell state $C_{ij}^{k}$ and the output gate $O_{ij}^{k}$. The above process can be formulated as the following equations:

$$F_{ij}^{k} = \sigma(W_{xf} \ast x_{ij}^{k} + W_{hf} \ast h_{ij}^{k-1} + b_f),$$
$$I_{ij}^{k} = \sigma(W_{xi} \ast x_{ij}^{k} + W_{hi} \ast h_{ij}^{k-1} + b_i),$$
$$\tilde{C}_{ij}^{k} = \tanh(W_{xc} \ast x_{ij}^{k} + W_{hc} \ast h_{ij}^{k-1} + b_c),$$
$$C_{ij}^{k} = F_{ij}^{k} \circ C_{ij}^{k-1} + I_{ij}^{k} \circ \tilde{C}_{ij}^{k},$$
$$O_{ij}^{k} = \sigma(W_{xo} \ast x_{ij}^{k} + W_{ho} \ast h_{ij}^{k-1} + b_o),$$
$$h_{ij}^{k} = O_{ij}^{k} \circ \tanh(C_{ij}^{k}),$$

where $\sigma$ is the logistic sigmoid function, ‘∗’ is the convolutional operator, ‘∘’ is the dot product, and $b_f$, $b_i$, $b_c$, and $b_o$ are bias terms. The weight matrix subscripts have the obvious meaning; for example, $W_{hi}$ is the hidden-input gate matrix, and $W_{xo}$ is the input-output gate matrix, etc. To implement the convolutional and recurrent operators in CLSTM simultaneously, the spatial sizes of $h_{ij}^{k-1}$ and $C_{ij}^{k-1}$ must be the same as that of $x_{ij}^{k}$ (we use zero-padding [42] to ensure that the input keeps its original spatial size after the convolution operation).
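A minimal NumPy sketch of one CLSTM step, assuming single-channel feature maps and a 3×3 kernel; the convolution is implemented here as same-padded cross-correlation, as in most DL frameworks, and all dimensions are illustrative.

```python
import numpy as np

def conv2d_same(x, w):
    """2-D (cross-)correlation with zero-padding so the output keeps x's size."""
    kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for r in range(x.shape[0]):
        for c in range(x.shape[1]):
            out[r, c] = np.sum(xp[r:r + kh, c:c + kw] * w)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def clstm_step(x_k, h_prev, C_prev, W, b):
    """One CLSTM update: the LSTM gates with '*' as 2-D convolution
    and the element-wise product for the cell-state updates."""
    F = sigmoid(conv2d_same(x_k, W['xf']) + conv2d_same(h_prev, W['hf']) + b['f'])
    I = sigmoid(conv2d_same(x_k, W['xi']) + conv2d_same(h_prev, W['hi']) + b['i'])
    C_tilde = np.tanh(conv2d_same(x_k, W['xc']) + conv2d_same(h_prev, W['hc']) + b['c'])
    C = F * C_prev + I * C_tilde
    O = sigmoid(conv2d_same(x_k, W['xo']) + conv2d_same(h_prev, W['ho']) + b['o'])
    return O * np.tanh(C), C

rng = np.random.RandomState(0)
p, k = 5, 3
W = {name: rng.randn(k, k) * 0.1
     for name in ('xf', 'hf', 'xi', 'hi', 'xc', 'hc', 'xo', 'ho')}
b = {g: 0.0 for g in 'fico'}
h, C = clstm_step(rng.randn(p, p), np.zeros((p, p)), np.zeros((p, p)), W, b)
print(h.shape, C.shape)  # (5, 5) (5, 5)
```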
In the existing literature [43,44,45], LSTM has been well acknowledged as a powerful network for the ordered sequence learning problem, based on the assumption that previous states affect future states. However, different from the traditional sequence learning problem, the spectral channels in the sequence are correlated with each other in both directions. In [46], the bidirectional recurrent neural network (BiRNN) was proposed to use both subsequent and previous information to model sequential data. Motivated by it, we use the Bi-CLSTM network shown in Figure 2 to sufficiently extract the spectral features. Specifically, the image patches are fed into the CLSTM network one by one in a forward and a backward sequence, respectively. After that, we acquire two spectral-spatial feature sequences. In the classification stage, they are concatenated into a vector denoted as $G$, and a Softmax layer is used to obtain the probability of each class that the pixel belongs to. The Softmax function ensures that the activations of the output units sum to 1, so that we can deem the output a set of conditional probabilities. Given the vector $G$, the probability that the input belongs to category $c$ equals

$$P(c \mid G) = \frac{\exp(W_c G + b_c)}{\sum_{c'} \exp(W_{c'} G + b_{c'})},$$

where $W$ and $b$ are the weights and biases of the Softmax layer and the summation is over all the output units. The pseudocode for the Bi-CLSTM model is given in Algorithm 1, where we use simplified variables to make the procedure clear.
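The Softmax computation can be sketched as follows; the class count and feature dimension are illustrative assumptions.

```python
import numpy as np

def softmax_probs(G, W, b):
    """Class probabilities for feature vector G: softmax(W @ G + b)."""
    scores = W @ G + b
    scores = scores - scores.max()   # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.RandomState(0)
num_classes, feat_dim = 4, 16
probs = softmax_probs(rng.randn(feat_dim),
                      rng.randn(num_classes, feat_dim) * 0.1,
                      np.zeros(num_classes))
print(probs.sum())  # 1.0 (up to floating-point error)
```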
It is well known that the performance of DL algorithms depends on the number of training samples, yet only a small number of samples are often available in HSIs. To this end, we adopt two data augmentation methods: flipping and rotating operators. Specifically, we rotate the HSI patches by 90, 180, and 270 degrees anticlockwise, and flip them horizontally and vertically. Furthermore, we rotate the horizontally and vertically flipped patches by 90 degrees separately. Figure 4 shows some examples of the flipping and rotating operators. As a result, the number of training samples is increased by a factor of eight. In addition to data augmentation, dropout [47] is also used to improve the performance of Bi-CLSTM. We set some outputs of neurons to zero, which means that these neurons do not propagate any information forward or participate in the backpropagation learning algorithm. Every time an input is sampled, the network drops neurons randomly to form a different structure. In the next section, we will validate the effectiveness of the data augmentation and dropout methods.
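The eight-fold augmentation described above can be sketched with NumPy's `rot90` and `flip`; it is shown on a 2D patch for brevity, and both functions operate on the first two axes of a 3D HSI patch as well.

```python
import numpy as np

def augment(patch):
    """Return the 8 variants described above: the original patch, its
    90/180/270-degree rotations, its horizontal and vertical flips, and
    the two flips each rotated by 90 degrees."""
    h_flip = np.flip(patch, axis=1)   # horizontal (left-right) flip
    v_flip = np.flip(patch, axis=0)   # vertical (up-down) flip
    return [patch,
            np.rot90(patch, 1), np.rot90(patch, 2), np.rot90(patch, 3),
            h_flip, v_flip,
            np.rot90(h_flip, 1), np.rot90(v_flip, 1)]

patch = np.arange(9).reshape(3, 3)   # a toy asymmetric patch
variants = augment(patch)
print(len(variants))  # 8
```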
Algorithm 1: Algorithm for the Bi-CLSTM model. 
