Convolutional Recurrent Neural Networks for Hyperspectral Data Classiﬁcation

Abstract: Deep neural networks, such as convolutional neural networks (CNN) and stacked autoencoders, have recently been successfully used to extract deep features for hyperspectral data classification. Recurrent neural networks (RNN) are another type of neural networks, which are widely used for sequence analysis because they are constructed to extract contextual information from sequences by modeling the dependencies between different time steps. In this paper, we study the ability of RNN for hyperspectral data classification by extracting the contextual information from the data. Specifically, hyperspectral data are treated as spectral sequences, and an RNN is used to model the dependencies between different spectral bands. In addition, we propose to use a convolutional recurrent neural network (CRNN) to learn more discriminative features for hyperspectral data classification. In CRNN, a few convolutional layers are first learned to extract middle-level and locally-invariant features from the input data, and the following recurrent layers are then employed to further extract spectrally-contextual information from the features generated by the convolutional layers. Experimental results on real hyperspectral datasets show that our method provides better classification performance compared to traditional methods and other state-of-the-art deep learning methods for hyperspectral data classification.


Introduction
In the last decade, with advances in computing power and the availability of large-scale datasets, deep learning [1] techniques, such as deep belief networks (DBN), deep convolutional neural networks (CNN) and deep recurrent neural networks (RNN), have achieved great success in a variety of machine learning tasks, such as computer vision, speech recognition and natural language processing. For example, deep CNNs have been widely used for various computer vision tasks, such as large-scale detection and classification of object categories [2][3][4][5][6], deep CNN-based transfer learning [7,8] and speech recognition [9]. RNNs, another important branch of the deep neural network family, were mainly designed for sequence modeling. The long short-term memory (LSTM) [10,11] network is a special type of RNN that is able to capture very long-term dependencies embedded in sequence data. Both regular RNNs and LSTM networks have been successfully used for time series data analysis, such as speech recognition [12][13][14][15], machine translation [16][17][18], etc.
CNN was first proposed in [19] for digit recognition, where the network had two convolutional layers coupled with pooling layers. With larger training datasets and more powerful computing hardware (GPUs), deep CNNs became popular after "AlexNet" [2], which had five convolutional layers, was used for ImageNet classification. The rest of this paper is organized as follows. In Section 2, we review deep CNNs and how they have been used for hyperspectral data classification, introduce RNNs for hyperspectral data classification and describe how to combine CNN with RNN to obtain the proposed method, CRNN, for hyperspectral data classification. A spatial constraint based on decision fusion is also described in that section for spectral-spatial classification. The details of the experiments conducted for validating the proposed methods are given in Section 3, which includes descriptions of the datasets, the experimental setup and results. Discussions on the results are presented in Section 4. Section 5 summarizes the key ideas in this work and provides concluding remarks.

Materials and Methods
In this section, we first review the current most related works for hyperspectral image classification based on deep CNNs in Section 2.1. Then, we introduce the proposed methods for hyperspectral data classification, which include RNNs in Section 2.2 and CRNNs/CLSTMs in Section 2.3. Finally, Section 2.4 describes a spatial constraint based on decision fusion for spectral-spatial hyperspectral image classification.

CNN
As shown in [29], 1D CNNs have been successfully used for hyperspectral image pixel-level classification. Moreover, [30,31] made use of 2D CNNs for classification by taking a neighborhood window of size w × w (where w usually takes values around 20) around each labeled pixel and treating the whole window as a training sample, which is essentially image-level classification instead of pixel-level classification. Additionally, this 2D CNN framework needs a large number of labeled pixels to generate enough 2D neighborhood regions to train a deep CNN. However, we usually do not have enough labeled pixels on a remotely-sensed hyperspectral image to generate so many neighborhood regions for training a deep network. Another concern is that a fixed neighborhood size cannot guarantee that each neighborhood region contains only one class of object. Since different classes tend to have different spatial sizes, the optimal neighborhood size should also be class specific. Thus, 1D CNNs are more practical for remotely-sensed hyperspectral image classification, and we will only focus on them in the following discussion.
A graphical illustration of the 1D CNN we used in this work is shown in Figure 1 where a hyperspectral vector is fed to the input layer and then propagated through several successive convolutional and pooling layers for feature extraction. Each convolutional layer has multiple 1D convolutional filters (kernels). The size of the kernel is a hyperparameter and data dependent. The pooling layers are used for subsampling to reduce the dimensionality of the network, which can help reduce computation and control overfitting.
Let us denote an input hyperspectral vector as x = (x_1, x_2, ..., x_T) ∈ R^T, where T is the length of the input vector. In the first convolutional layer, a set of d filters {φ_1, φ_2, ..., φ_d} of receptive field size (i.e., length of the convolutional filters) r is applied to the input vector via the convolution operation (*) to get the feature map:

f_t = g((φ_1 * x)_t, (φ_2 * x)_t, ..., (φ_d * x)_t),

where f_t ∈ R^d and g is a nonlinear activation function, such as tanh or the rectified linear unit (ReLU). ReLU is defined as g(x) = max(0, x); it has become the most widely used activation function in CNNs, and we also adopt it in this work. The max pooling layer subsamples every depth slice of the input by the max operation. The most common form is a pooling layer with filters of length two applied with a stride of two. After applying the pooling layer, the depth (dimension) remains unchanged, but the length is reduced by half.
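The convolution-ReLU-pooling pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the filter count, receptive field size and random weights are all placeholder values.

```python
import numpy as np

def conv1d_relu(x, filters, r):
    """Valid 1D convolution of a length-T spectrum with d filters of length r,
    followed by the ReLU activation g(x) = max(0, x)."""
    T, d = len(x), filters.shape[0]
    out = np.zeros((T - r + 1, d))
    for t in range(T - r + 1):
        out[t] = filters @ x[t:t + r]          # one feature vector f_t in R^d
    return np.maximum(out, 0.0)                # ReLU

def max_pool1d(f, size=2, stride=2):
    """Max pooling along the length axis; the depth (d) is unchanged."""
    steps = (f.shape[0] - size) // stride + 1
    return np.stack([f[i * stride:i * stride + size].max(axis=0)
                     for i in range(steps)])

x = np.random.randn(144)                       # e.g. a 144-band pixel
filters = np.random.randn(32, 5)               # d = 32 filters, receptive field r = 5
f = conv1d_relu(x, filters, r=5)               # feature map, shape (140, 32)
p = max_pool1d(f)                              # shape (70, 32): length halved, depth kept
```

Note how pooling with filters of length two and stride two halves the sequence length (140 to 70) while leaving the depth at 32, exactly as described above.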
Finally, the extracted high-level features are flattened into a fixed-dimensional vector, which is then fully connected to the output layer for classification, where a softmax activation function computes the predictive probabilities for all of the categories:

P(y = k | x) = exp(w_k^T x + b_k) / Σ_{k'=1}^{K} exp(w_{k'}^T x + b_{k'}),

where the w_k's and b_k's are the weight and bias vectors, and there are K categories. Training a neural network amounts to finding the best parameters (weights of the network) to minimize the loss function, which in a classification task measures the compatibility between a prediction (e.g., the class scores in classification) and the ground truth label. The loss takes the form of an average over the losses for every training example:

L = (1/N) Σ_{i=1}^{N} L_i,

where N is the number of samples and L_i is the loss for sample i. For the output layer with softmax activation, the cross-entropy loss (also known as the negative log likelihood) is most widely used:

L_i = -log P(y = y_i | x_i).

The network is trained with stochastic gradient descent (SGD), and gradients are calculated by the back-propagation algorithm. A mini-batch strategy is utilized in our implementation to reduce loss fluctuation, so the gradients are calculated with respect to mini-batches. The algorithm runs iteratively until the loss converges, i.e., until the change of the training and validation loss falls below some threshold.
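The softmax output and cross-entropy loss can be computed directly in NumPy; the sketch below uses two toy samples with three classes purely for illustration.

```python
import numpy as np

def softmax(scores):
    """Softmax over class scores; subtracting the row max improves numerical stability."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Average negative log likelihood: L = (1/N) * sum_i -log p_i[y_i]."""
    N = len(labels)
    return -np.mean(np.log(probs[np.arange(N), labels]))

scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.2]])          # N = 2 samples, K = 3 classes
probs = softmax(scores)                       # each row sums to one
loss = cross_entropy(probs, np.array([0, 1])) # ground truth labels y_1 = 0, y_2 = 1
```

In practice the loss would be averaged over a mini-batch (128 samples in our setup) rather than the full training set, and its gradient fed to SGD.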

RNN
Recurrent neural networks, which consist of successive recurrent layers, are sequential models that map one sequence to another. RNNs have a strong capability of capturing contextual information within a sequence, and these contextual cues are stable and useful for classifying hyperspectral data. What is more, an RNN is able to operate on sequences of arbitrary length, although this advantage is not utilized in this work. However, it is worth noting that our work can be extended to handle the problem where the input sequences have variable lengths.
The structure of a basic RNN with one recurrent layer is illustrated in Figure 2, where we have a sequence of vectors {x_1, x_2, ..., x_T} as input; {h_1, h_2, ..., h_T} is the sequence of hidden states, and {o_1, o_2, ..., o_T} is the sequence of outputs. A recurrent layer has a recursive function f, which takes as input one input vector x_t and the previous hidden state h_{t-1} and returns the new hidden state:

h_t = f(x_t, h_{t-1}) = tanh(W x_t + U h_{t-1}),

and the outputs are calculated as:

o_t = V h_t,

where W, U and V are weight matrices shared across all steps, and the activation function tanh is the hyperbolic tangent. This recursive function, however, is known to suffer from the vanishing gradient problem [37] for long input sequences, such as speech signals or text documents, which makes it difficult to learn long-term dependencies. To overcome this problem, the long short-term memory (LSTM) RNN [10,11] was introduced; it uses a more complicated function that learns to control the flow of information, allowing the recurrent layer to capture long-term dependencies more easily. The structure of a basic LSTM unit is illustrated in Figure 3.
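The recursion above is simple enough to unroll explicitly. The following NumPy sketch runs a basic recurrent layer over a toy sequence; the sequence length and layer sizes are illustrative, and the random matrices stand in for learned weights.

```python
import numpy as np

def rnn_forward(xs, W, U, V):
    """Unroll h_t = tanh(W x_t + U h_{t-1}) and o_t = V h_t over a sequence.
    W, U and V are shared across all time steps."""
    h = np.zeros(U.shape[0])                  # h_0 initialized to zeros
    hs, os = [], []
    for x_t in xs:
        h = np.tanh(W @ x_t + U @ h)          # new hidden state
        hs.append(h)
        os.append(V @ h)                      # output at step t
    return np.array(hs), np.array(os)

rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 4))                 # T = 10 input vectors of dimension 4
W = rng.normal(size=(8, 4))                   # input-to-hidden weights
U = rng.normal(size=(8, 8))                   # hidden-to-hidden weights
V = rng.normal(size=(3, 8))                   # hidden-to-output weights
hs, os = rnn_forward(xs, W, U, V)
```

Because tanh saturates, every entry of every hidden state lies in (-1, 1); repeated multiplication through U during back-propagation is what makes gradients vanish over long sequences.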
The LSTM unit consists of four sub-units: the input gate i_t, the output gate o_t, the forget gate f_t and the new memory g_t, which are computed by:

i_t = σ(W^(i) x_t + U^(i) h_{t-1}),
f_t = σ(W^(f) x_t + U^(f) h_{t-1}),
o_t = σ(W^(o) x_t + U^(o) h_{t-1}),
g_t = tanh(W^(g) x_t + U^(g) h_{t-1}),

where the activation functions σ and tanh are the logistic sigmoid and hyperbolic tangent functions, respectively. Based on these, the LSTM unit then computes the memory cell and output as:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t,
h_t = o_t ⊙ tanh(c_t),

where the point-wise multiplication of two vectors is denoted by ⊙. A graphical illustration of the RNN framework (regular RNN or LSTM RNN) used in this work is shown in Figure 4, where the input hyperspectral data x = (x_1, x_2, ..., x_T) are viewed as a sequence of 1D vectors, which is propagated through several recurrent layers to extract deep features. The hidden states of the last recurrent layer form a sequence of high-level features. As indicated in Figure 4, we only take the last hidden state of the last recurrent layer as the input to the classification layer, since the last hidden state should already contain all of the useful contextual information from the previous time steps. As in the CNN, the softmax activation function is applied at the output layer, and the loss function is formulated as the cross-entropy. Training is done by mini-batch gradient descent; however, the gradients of the loss function are calculated by the back-propagation through time (BPTT) algorithm [38].
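One LSTM step can be written out directly from these equations. The sketch below is a bare-bones illustration (bias terms are omitted and the 5-dimensional sizes and random weights are placeholders), not a production cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step. p holds the weight matrices for the input (i), forget (f)
    and output (o) gates and the new-memory candidate (g); '*' below is the
    point-wise (Hadamard) product from the equations above."""
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev)    # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev)    # forget gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev)    # output gate
    g = np.tanh(p["Wg"] @ x_t + p["Ug"] @ h_prev)    # new memory candidate
    c = f * c_prev + i * g                           # memory cell update
    h = o * np.tanh(c)                               # hidden state / output
    return h, c

rng = np.random.default_rng(1)
p = {k: rng.normal(size=(5, 5)) for k in
     ("Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wg", "Ug")}
h, c = lstm_step(rng.normal(size=5), np.zeros(5), np.zeros(5), p)
```

The additive update of c_t (rather than a pure matrix product as in the plain RNN) is what lets gradients flow across many steps without vanishing.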

CRNN
The CRNN is a hybrid of convolutional and recurrent neural networks [35,36]. It is composed of several convolutional (and pooling) layers followed by a few recurrent layers, as indicated in the graphical illustration of the CRNN used in this work in Figure 5. CRNN has the advantages of both convolutional and recurrent networks. First, the convolutional layers are able to efficiently extract middle-level, abstract and locally invariant features from the input sequence. The pooling layers help reduce computation and control overfitting. Second, the recurrent layers extract contextual information from the feature sequence generated by the previous convolutional layers. Contextual information captures the dependencies between different bands in the hyperspectral sequence, which is more stable and useful for classification. Third, recurrent layers can handle variable-length input sequences, though we are not making use of this benefit in this work. For the recurrent layers, we can either use the regular recurrent function or more complicated ones, like LSTM, which can capture very long dependencies more efficiently. However, since the input hyperspectral sequences are not very long (usually below 200 bands) and the pooling layers additionally reduce the length greatly (usually below 50), we will not have long-term dependency and vanishing gradient problems in most cases. Thus, a regular RNN should work as well as an LSTM network, which has been verified in our experiments.
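The hybrid pipeline can be sketched end-to-end in NumPy: a convolution-ReLU-pooling front end produces a shortened feature sequence, and a recurrent layer condenses it into a single hidden state. Layer sizes, the single conv layer and the random weights are all illustrative simplifications of the actual networks used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(2)

def conv_relu_pool(x, filters, r):
    """Convolution + ReLU + length-2 max pooling: the convolutional front end."""
    f = np.maximum(np.stack([filters @ x[t:t + r]
                             for t in range(len(x) - r + 1)]), 0.0)
    return f[: (len(f) // 2) * 2].reshape(-1, 2, filters.shape[0]).max(axis=1)

def rnn_last_state(seq, W, U):
    """Plain recurrent layer h_t = tanh(W x_t + U h_{t-1}); only the last
    hidden state is kept, since it summarizes the whole feature sequence."""
    h = np.zeros(U.shape[0])
    for x_t in seq:
        h = np.tanh(W @ x_t + U @ h)
    return h

x = rng.normal(size=144)                                   # one hyperspectral pixel
feats = conv_relu_pool(x, rng.normal(size=(16, 5)), r=5)   # feature sequence, (70, 16)
h = rnn_last_state(feats, rng.normal(size=(8, 16)),
                   rng.normal(size=(8, 8)))                # last hidden state, (8,)
scores = rng.normal(size=(15, 8)) @ h                      # class scores, e.g. 15 classes
```

Note the sequence seen by the recurrent layer is only 70 steps long rather than 144, which is exactly why the vanishing gradient issue is much milder in the CRNN.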
Finally, as in RNN, the last hidden state of the last recurrent layer will be fully connected to the classification layer where a softmax activation function is applied. For training, as in CNN and RNN, the loss function is chosen as cross-entropy, and mini-batch gradient descent is used to find the best parameters of the network. The gradients in the convolutional layers are calculated by the back-propagation algorithm, and gradients in the recurrent layers are calculated by the back-propagation through time (BPTT) algorithm [38].

Spatial Constraint by Decision Fusion
The neural networks described in Sections 2.1-2.3 all extract deep spectral features for each hyperspectral pixel independently. In order to further improve the classification performance, we integrate the spatial constraint by linear opinion pools (LOP) [39,40] based on the fact that, in pixel-based image classification, spatially neighboring pixels tend to have similar categories. The spatial constraint encourages piecewise smooth segmentation of images by smoothing out noisy predictions due to noisy data or outliers.
Based on the predictive probabilities computed by the output layer (softmax) of the neural networks, LOP-based decision fusion calculates the posterior probability for each pixel x_i as a weighted sum of the posterior probabilities of the spatial neighbors of that pixel:

P(y = k | x_i) = Σ_{j ∈ N_i} ω_j P(y = k | x_j),

where N_i is the set of spatial neighbors of pixel x_i and the ω_j's are per-neighbor weights that satisfy Σ_{j ∈ N_i} ω_j = 1. For simplicity, we use uniform weights for the neighbors in our implementation. A flowchart illustrating the proposed classification system is shown in Figure 6.
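With uniform weights, LOP fusion reduces to averaging the softmax outputs over a spatial window. A minimal sketch (the image size, class count and window radius are placeholder choices):

```python
import numpy as np

def lop_fuse(prob_map, i, j, radius=1):
    """Fuse the softmax outputs in a (2*radius+1)^2 window around pixel (i, j)
    with uniform weights w_j = 1/|N_i|, which sum to one."""
    H, W, K = prob_map.shape
    window = prob_map[max(i - radius, 0):i + radius + 1,
                      max(j - radius, 0):j + radius + 1]
    return window.reshape(-1, K).mean(axis=0)    # fused posterior for pixel (i, j)

rng = np.random.default_rng(3)
raw = rng.dirichlet(np.ones(4), size=(6, 6))     # toy 6x6 image, 4-class probabilities
fused = lop_fuse(raw, 2, 2)                      # still a valid distribution
```

Because each neighbor's posterior sums to one and the weights sum to one, the fused vector is again a valid probability distribution, and isolated noisy predictions are smoothed out by their neighbors.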

Experimental Setup and Results
This section is devoted to illustrating the capabilities of the presented deep neural networks (CNN, RNN and CRNN) on two hyperspectral image datasets in remote sensing. A traditional method based on support vector machines (SVM) with the radial basis function (RBF) kernel is also implemented for comparison.

Datasets
Two modern high resolution hyperspectral images are used in our study. One covers an urban area over the University of Houston in Houston, Texas, and the other covers a mixed vegetation site in the Indian Pines area in Indiana.
The first dataset used in this work, the University of Houston (UH) hyperspectral image, was acquired by the NSF-funded National Center for Airborne Laser Mapping (NCALM) over the University of Houston campus and the neighboring urban area using the ITRES-CASI (Compact Airborne Spectrographic Imager) 1500 hyperspectral imager in 2012. The hyperspectral image contains 15 labeled land cover classes and consists of 144 spectral bands over the 364 nm-1046 nm wavelength range. It has a spatial dimension of 1905 × 349 with a spatial resolution of 2.5 m. Figure 7 shows the true color image of the UH dataset with the ground truth for all 15 classes, and Figure 8 shows the corresponding mean spectral signatures (radiance) for each class. The spectral signatures show that different classes have different shapes and local structures. Thus, when we treat the hyperspectral pixels as sequences, RNNs can be used to extract discriminative contextual information from them, which is very useful for classification. This dataset is available online (http://hyperspectral.ee.uh.edu).

The second dataset, the Indian Pines hyperspectral image, was acquired using the ProSpecTIR instrument in May 2010 over an agricultural area in Indiana, USA. The image has a spatial dimension of 1342 × 1287 with a spatial resolution of 2 m. It consists of 180 spectral bands over the 400 nm-2500 nm wavelength range. There are 19 labeled classes contained in this dataset, and 16 of them belong to different vegetation types. Figure 9 shows the true color image with the corresponding ground truth for this dataset, and Figure 10 presents the mean spectral signatures (reflectance) of all 19 classes.

Experimental Setup
For both datasets, we randomly split the labeled samples into training and test sets. During training, 10% of the training samples are used as a validation set to learn the hyperparameters of the neural networks (i.e., layer size, number of layers, learning rate and mini-batch size) using a grid search strategy.
In our experiments, the CNN has four convolutional layers (each with a pooling layer); the RNN and LSTM have three recurrent layers; the CRNN and CLSTM have two convolutional layers (with pooling layers) and two recurrent layers. We also implemented the 2D CNN in the same way as [30,31] to extract spatial features from hyperspectral images. In particular, PCA was first employed on the whole image to reduce the dimensionality down to three, and then we take a neighborhood region of size 11 × 11 around each labeled pixel to form 2D images, which are fed to the input of the 2D CNN, which has three convolutional layers.
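The preprocessing for the 2D CNN baseline (PCA down to three components, then 11 × 11 patch extraction) can be sketched as follows. This is an illustrative NumPy version on a toy cube, not the actual pipeline code; `pca_reduce` and `extract_patch` are hypothetical helper names.

```python
import numpy as np

def pca_reduce(cube, n_components=3):
    """Project an (H, W, B) hyperspectral cube onto its first principal
    components, computed via SVD of the mean-centered pixel matrix."""
    H, W, B = cube.shape
    flat = cube.reshape(-1, B)
    flat = flat - flat.mean(axis=0)
    _, _, Vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ Vt[:n_components].T).reshape(H, W, n_components)

def extract_patch(img, i, j, size=11):
    """Cut the size x size neighborhood window around labeled pixel (i, j)."""
    r = size // 2
    return img[i - r:i + r + 1, j - r:j + r + 1]

rng = np.random.default_rng(4)
cube = rng.normal(size=(40, 40, 30))           # toy cube: 40x40 pixels, 30 bands
reduced = pca_reduce(cube)                     # (40, 40, 3)
patch = extract_patch(reduced, 20, 20)         # (11, 11, 3): one 2D CNN input sample
```

Each labeled pixel yields one such patch, which is why this baseline needs many labeled pixels, as discussed in Section 2.1.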
The configurations of all networks used in our experiments are summarized in Tables 1 and 2, where convolutional layers are denoted as "conv⟨receptive field size⟩-⟨number of filters⟩" and recurrent layers are denoted as "recur-⟨feature dimension⟩". We implemented the neural networks using the TensorFlow [41] and Keras [42] frameworks. Experiments are carried out on a workstation with a 3.0-GHz Intel(R) Core i7-5960X CPU and an NVIDIA(R) GeForce Titan X GPU. The training process starts with the weights of all networks randomly initialized, and the initial learning rate is set to 10^-4. For mini-batch stochastic gradient descent, we use a batch size of 128 in all experiments. During training, the learning rate is halved every 500 epochs until the loss converges.
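The step-decay schedule above (start at 10^-4, halve every 500 epochs) can be expressed as a one-line function, e.g. for use with a Keras-style per-epoch learning-rate callback:

```python
def learning_rate(epoch, base_lr=1e-4, drop_every=500):
    """Step decay: the initial rate base_lr is halved once per drop_every epochs."""
    return base_lr * (0.5 ** (epoch // drop_every))
```

For example, epochs 0-499 train at 1e-4, epochs 500-999 at 5e-5, and so on until the loss converges.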

Results
Since the datasets are highly unbalanced, we run our experiments with the same number of training samples for each class. For the University of Houston dataset, which has 15 classes, we run experiments with 50 samples per class (750 in total), 100 samples per class (1500 in total) and 200 samples per class (3000 in total). For the Indian Pines dataset, which has 19 classes, we run experiments with 100 samples per class (1900 in total), 200 samples per class (3800 in total) and 300 samples per class (5700 in total). The classification performances for the different models (SVM, CNN, 2D CNN, RNN, LSTM, CRNN, CLSTM) and the corresponding methods with the LOP spatial constraint on the University of Houston dataset are presented in Table 3, and results on the Indian Pines dataset are presented in Table 4. The classification results on both datasets show that our proposed method, CRNN, achieved the best performance in all scenarios. We also show the training and validation loss and accuracy when training the neural networks on the University of Houston and Indian Pines datasets in Figures 11 and 12, respectively. For every neural network, the training loss converged to a level close to zero, and the training accuracy converged to almost 100%. At the same time, the validation losses all converged to a low level, and the validation accuracies all reached a high level. It is also worth noting that CRNN and CLSTM converge faster than CNN and RNN. The corresponding number of parameters and training time (in minutes) for each network are summarized in Table 5. We can see that RNN and LSTM need much more training time than the others because the computational complexity of RNN/LSTM grows linearly with respect to the length of the input sequence, and most of the computations must be done sequentially. Finally, we show the classification maps for the different models used in this work in Figures 13 and 14.
It is easy to see that the classification map of the 2D spatial CNN is smoother than those of the 1D models, but many important details (like edges between different classes) got smoothed out as well, which resulted in incorrect predictions around the edges. When adding LOP as the spatial constraint, the maps become smoother, while the important edges are still preserved.

Discussion
The classification results in Tables 3 and 4 show that our proposed method, CRNN, achieved the best performance in all scenarios. In particular, a few interesting and practically relevant observations can be made, which readers will find helpful when applying deep learning techniques to extract deep features from hyperspectral images:
- CNN and CRNN/CLSTM achieved better classification results than the traditional RBF-SVM method in all scenarios, while the performances of RNN/LSTM are still worse than RBF-SVM.
- The 2D spatial CNN performs better than SVM on the University of Houston dataset, but worse on the Indian Pines dataset. The reason is that the University of Houston dataset contains urban objects that have much richer spatial features than the vegetation categories in the Indian Pines dataset.
- As expected, the performances of CRNN/CLSTM are better than that of CNN because CRNN/CLSTM have the advantages of both convolutional and recurrent networks.
- The fact that the performances of CRNN/CLSTM are significantly better than those of RNN/LSTM tells us that the middle-level features extracted by the convolutional layers in CRNN/CLSTM help the following recurrent layers to better capture the contextual information.
- The LSTM network has better performance than the regular RNN, especially on the University of Houston dataset, because LSTM networks are capable of capturing the long-term dependencies in the input sequence and thus avoid the vanishing gradient problem.
- CLSTM performs no better than CRNN in all cases, meaning that CRNN does not suffer from the long-term dependency and vanishing gradient problems, because the length of the sequence is already much reduced by the two pooling layers before the recurrent layers. The reason why CLSTM is even worse than CRNN when the training set is small is that CLSTM has many more parameters than CRNN, so it tends to overfit the training data and performs worse on test data.
-The LOP-based spatial constraint further improved the performances of all of the 1D models.
To better understand the classification power of each neural network, we take the high dimensional features extracted by all models trained on the University of Houston dataset with 1500 training samples and use the t-SNE [43] algorithm to reduce the dimensionality to two. The results are visualized in Figure 15, where different colors stand for the 15 classes in the University of Houston dataset. Similarly, the feature visualizations for the Indian Pines dataset are depicted in Figure 16. Features extracted by all models are more discriminative than the original hyperspectral data. Features extracted by the CRNN for different classes are better separated than those of the other models, which means that the features extracted by CRNN are the most discriminative. The reason is that the first few convolutional (and pooling) layers in CRNN extract middle-level features in which the spectral variation has been decreased, which makes it easier for the following recurrent layers to learn the contextual information. This can be verified in Figure 17, which shows an example of features extracted by the convolutional layers. Compared to the input signal, the extracted features are smoother, since convolutional layers are able to remove local variations. Furthermore, the convolutional layer also successfully captures some important information, such as the locations of peaks and slopes, which are quite discriminative between different classes. Thus, recurrent layers can capture the contextual information more effectively from the features extracted by convolutional layers.
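This 2D embedding step can be reproduced with scikit-learn's t-SNE implementation. The sketch below uses synthetic stand-in features (three Gaussian clusters) rather than the actual deep features extracted by the trained networks; the sample count and perplexity are placeholder choices.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)
# Stand-in for deep features of 60 pixels from 3 classes (the paper visualizes
# the real features extracted by each trained network instead).
feats = np.concatenate([rng.normal(loc=c, size=(20, 32))
                        for c in (0.0, 3.0, 6.0)])           # (60, 32)
embedded = TSNE(n_components=2, perplexity=15,
                random_state=0).fit_transform(feats)         # (60, 2)
```

Plotting `embedded` colored by class label then gives the kind of separability comparison shown in Figures 15 and 16.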

Conclusions
In this paper, we proposed to extract the contextual information in hyperspectral data by modeling the dependencies between different spectral bands using an RNN. In particular, we employed a CRNN model, which consists of several convolutional layers and a few recurrent layers, for hyperspectral data classification. The first convolutional layers are utilized to extract middle-level and locally invariant features, which are then fed to a few recurrent layers to additionally extract the contextual information between different spectral bands. By combining convolutional and recurrent layers, our CRNN model is able to extract more discriminative feature representations for classification, and it outperformed other state-of-the-art methods on real hyperspectral image datasets.
In the future, we would like to explore semi-supervised deep learning techniques for hyperspectral image classification because it is usually very expensive to get labeled data for remote sensing applications.