SPECTRAL-SPATIAL ATTENTION NETWORKS FOR HYPERSPECTRAL IMAGE CLASSIFICATION

Deep neural networks, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been successfully used to extract deep features for many hyperspectral tasks. In this study, we propose a spectral-spatial attention network for hyperspectral image classification. In our method, an RNN with attention learns inter-spectral correlations within a continuous spectrum, and a CNN with attention is designed to focus on similar features between neighboring pixels in the spatial dimensions. Experimental results demonstrate that our method can fully utilize spectral and spatial information to obtain competitive performance.


Introduction
Hyperspectral imaging, also known as imaging spectroscopy, captures the electromagnetic energy reflected or emitted from the same area over hundreds of narrow, continuous spectral bands from the visible to infrared wavelength ranges [1][2][3][4]. Hyperspectral images (HSIs) captured from land surface-observing aircraft or satellites have become increasingly important in environmental monitoring, urban planning, mining, defense and agriculture due to their rich spectral information [2,5,6]. These images are combined to form a three-dimensional (x, y, λ) hyperspectral data cube for processing and analysis, where x and y represent the two spatial dimensions of the scene, and λ represents the spectral dimension (comprising a range of wavelengths).
Hyperspectral image classification, which assigns every pixel vector to one of a set of classes, is one of the major tasks in the analysis of HSIs, and it has received much attention from researchers. Numerous traditional methods, such as support vector machine (SVM) [7] and k-nearest neighbor (KNN) [2], have been proposed. However, these approaches disregard the correlations among pixels along the spatial axes and thus waste spatial information. Jiang et al. [8] proposed an unsupervised superpixel-wise principal component analysis that learns the intrinsic low-level features of different homogeneous regions by segmenting the entire HSI into superpixels. It takes full advantage of the spatial information contained in HSIs. Thus, spectral-spatial methods improve classification performance because they incorporate additional spatial information from an HSI. For example, Roscher et al. [9] exploited spectral as well as spatial information through an incremental learning strategy for import vector machines and discriminative random fields. Another highlight of that work was the concept of self-training for sequential classification of HSIs, which comprised the inclusion of new training samples to increase classification accuracy and the deletion of non-informative samples to remain memory- and runtime-efficient. Li et al. [10] constructed a family of generalized composite kernels by utilizing spectral and spatial information from HSI data. Jiang et al. [11] developed a random label propagation algorithm, which constructed a spectral-spatial probability transfer matrix that simultaneously considered spectral similarity and superpixel-based spatial information to cleanse label noise under the label propagation framework.

Motivation
Deep learning algorithms have been introduced to modern HSI analysis due to their outstanding predictive power, and they can extract more discriminative features and achieve better performance than traditional shallow classifiers [2,12]. Deep models, such as networks with 1D [13,14], 2D [15], and 3D [16] convolutional layers, have been proposed for hyperspectral data analysis.
Methods with a 1D network take spectra as input and use only spectral information to learn features. Mou et al. [13] utilized a recurrent neural network (RNN) to model the pixel spectra in an HSI as 1D sequences for classification, and they found that the modified gated recurrent unit (GRU) outperforms traditional approaches and the baseline convolutional neural network (CNN). Given that spatial information has been proven useful in improving the interpretation of HSI classification results, the study of classification models based on deep spectral-spatial features has been promoted. For example, Yang et al. [15] designed a two-CNN model to learn spectral features and spatial features jointly. Cao et al. [17] used a CNN in combination with a Markov random field in a unified Bayesian framework to classify HSI pixel vectors. The spatial-spectral unified network [18] combined a long short-term memory (LSTM) model based on spectral band grouping with a 2D CNN for spatial features, and integrated spectral feature extraction (FE), spatial FE, and classifier training into a unified neural network. The result showed that the full use of spectral and spatial information can considerably improve accuracy.
The attention mechanism, which is a vital part of human perception, is based on the reasonable assumption that human vision does not process an entire image at once; instead, it focuses on specific parts of the visual space at "high resolution" while perceiving the surroundings at "low resolution" [19,20]. Hence, this mechanism heightens the sensitivity to features containing the most valuable information. Several attempts have been made to incorporate the attention mechanism into visual tasks as an effective technique for strengthening some features and thereby improving performance. It has proven productive in many applications, including image captioning [21], matching [22][23][24][25] and saliency detection [26].
The attention mechanism enables models to focus on key pieces of the feature space and to distinguish them from irrelevant information [27]. It was first introduced for language translation [28], where the model learned to focus on particular words or phrases when translating sentences, showing large performance gains, especially on long sequences. If the spectral dimension of an HSI is treated as sequence data, the attention mechanism can capture the high correlation between adjacent spectral bands in the same way.
Self-attention, proposed by Lin et al. [29], uses attention scores to weight all features and obtain salient features. Pei et al. [30] designed a spatially-indexed attention mechanism among the convolutional layers to extract the salient facial regions in each individual image, and a temporal attention layer to assign attention weights to each frame. Inspired by these works, and since local features at neighboring spatial positions are highly relevant to each other in the spatial domain of an HSI, we expect an attention mechanism to be helpful in learning spatial dependence and salient features.

Contribution
The major contributions in this paper involve the following three aspects:

• We design a joint network with a spectral attention bi-directional RNN branch and a spatial attention CNN branch to extract spectral-spatial features for HSI classification. An attention mechanism is used to emphasize meaningful features along the two branches, as shown in Figure 1. Our goal is to improve representation ability by using the attention mechanism, namely, to focus on the correlations between adjacent spectral bands and the spatial dependency in the spatial domain, as well as to suppress unnecessary features.

• A bi-directional RNN with an attention mechanism is designed to process spectral information in both the backward and forward directions. For each pixel, the spectral vector is decomposed into an ordered sequence of single values, which are fed into the GRU units one by one. Additional attention weights strengthen the spectral correlation between spectral channels. We compare the attention RNN with the ordinary bi-directional RNN, and the experimental results in Tables 7-9 demonstrate its effectiveness for classification with spectral information.

• For the spatial axes, we add attention to a 2D CNN and train this model on the image patch around each pixel. Rather than treating every image region equally, the attention parameter assigns a greater weight to the key parts so that the model focuses on the primary features. The classification results of the attention CNN and the CNN in Tables 7-9 show that the central pixel is classified better when attention weights are added.

The remainder of this paper is organized as follows. Section 2 briefly introduces the related works. Section 3 describes the proposed method for HSI classification, including the two-branch network and co-training. The datasets used in this work and the experimental results are given in Section 4. Finally, Section 5 concludes this paper briefly.

Related Works
In this section, we mainly recall the background information of bidirectional RNN, CNN and attention mechanism.

Bi-Directional Recurrent Network
RNN, which extends conventional feedforward neural networks with loops in connections, has gained significant attention for solving many challenging problems involving sequential data analysis, such as speech recognition and language modeling [31,32]. Unlike feedforward neural networks, RNN is called recurrent because of its recurrent hidden state, whose activation at each step depends on the previous computations. RNN has a memory function, which can remember the information about what has been calculated so far.
The architecture of RNN is illustrated in the left part of Figure 2. A hidden layer in RNN, which maintains a hidden state at each time step, receives the input vector x and generates the output vector y. The unfolded structure of a bi-directional RNN (Bi-RNN), shown in the right part of Figure 2, presents the calculation process. Bi-RNN connects two hidden layers running in opposite directions to a single output, allowing them to receive information from both past and future states. Neither of these output states is connected to inputs of the opposite direction. By simultaneously employing both directions of the input data, information from both the past and the future can be used to calculate the output. LSTM [33] and GRU [34] were introduced to learn long-term dependencies and alleviate the vanishing gradient problem. These two architectures do not differ fundamentally from RNN, but they use different functions to compute the hidden state. Compared with LSTM, GRU does not maintain a cell state C and uses two gates instead of three. GRUs have fewer parameters and thus may be trained a bit faster and need less data to generalize.
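To make the two-gate structure of the GRU concrete, the following NumPy sketch implements a single GRU update in its commonly used textbook form (update gate z, reset gate r, no separate cell state). The function name `gru_step`, the parameter layout, and all sizes are illustrative assumptions and are not taken from the implementation used in this paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update in the standard formulation (assumed, not the paper's code).

    params is a list [Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh].
    """
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)              # reset gate
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)   # candidate state
    return (1.0 - z) * h_prev + z * h_cand                # new hidden state

rng = np.random.default_rng(0)
input_dim, hidden_dim = 1, 8   # one reflectance value per step; sizes are illustrative
params = []
for _ in range(3):             # parameters for z, r and the candidate state
    params += [rng.normal(scale=0.1, size=(hidden_dim, input_dim)),
               rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
               np.zeros(hidden_dim)]

h = np.zeros(hidden_dim)
for value in [0.2, 0.5, 0.1]:  # a toy three-band "spectrum"
    h = gru_step(np.array([value]), h, params)
```

The hidden state is the only quantity carried from one step to the next, which is the sense in which the GRU does without the cell state C of the LSTM.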

CNN
Another popular deep learning model for vision tasks is CNN [35]. Fundamentally, the mammalian visual system has a spatial hierarchy. Inspired by this, CNN has a trainable multilayer architecture composed of a series of convolution layers, non-linearity layers, and pooling layers stacked alternately. It is used to learn low-level features such as edges or textures and high-level features with more discriminative information [36][37][38][39][40]. A typical CNN structure is shown in Figure 3. In the convolution layer, rather than being fully connected to the input, each hidden layer unit is connected via shared weights to the local receptive field around the input, which might be k two-dimensional feature maps of size m × n. The convolution layer computes the convolution of the input feature maps $X_j$ with a convolutional kernel $W_i$ of size $l \times l \times q$, followed by an element-wise nonlinear activation function. The activity of the $i$-th feature map in the $l$-th layer is $C_i = \sum_{j=1}^{q} W_i * X_j + b_i$, where $b_i$ is the bias term for the $i$-th feature map and $X_j$ is the $j$-th channel of the previous layer.
The non-linear activation function summarizes the responses at several input locations, and it computes the output feature map $p_i = f(C_i)$, commonly via a rectified linear unit (ReLU) $f(x) = \max(0, x)$. The pooling layer computes the maximum or average value within a small patch of each feature map, and the most common type is max pooling. The pooling operation offers invariance by reducing the resolution of feature maps. After the stacked layers, fully connected layers and a softmax layer are usually adopted to predict the classification labels. Compared with other neural networks, CNN is easier to train because weight sharing and the local connection scheme give it fewer connections and parameters.
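The three building blocks described above (convolution with shared weights, ReLU, and max pooling) can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: "valid" borders, stride 1 for the convolution, cross-correlation without kernel flipping (as is usual in deep learning libraries), and non-overlapping pooling. The function names are hypothetical.

```python
import numpy as np

def conv2d_valid(x, kernels, bias):
    """x: (H, W, q) input; kernels: (k, l, l, q); bias: (k,).

    Returns k feature maps of shape (H - l + 1, W - l + 1, k).
    """
    H, W, q = x.shape
    k, l, _, _ = kernels.shape
    out = np.zeros((H - l + 1, W - l + 1, k))
    for i in range(H - l + 1):
        for j in range(W - l + 1):
            window = x[i:i + l, j:j + l, :]          # local receptive field
            # The same kernels are applied at every position (weight sharing).
            out[i, j, :] = np.tensordot(kernels, window, axes=3) + bias
    return out

def relu(x):
    return np.maximum(0.0, x)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; the stride equals the window size."""
    H, W, k = x.shape
    H2, W2 = H // size, W // size
    x = x[:H2 * size, :W2 * size, :].reshape(H2, size, W2, size, k)
    return x.max(axis=(1, 3))

# Toy input: a 6 x 6 patch with 3 channels and two 3 x 3 kernels.
x = np.ones((6, 6, 3))
kernels = np.ones((2, 3, 3, 3))
bias = np.array([0.0, -30.0])
features = max_pool2d(relu(conv2d_valid(x, kernels, bias)))
```

In this toy case every receptive field sums to 27, so the first feature map stays at 27 while the second, whose bias pushes the response to -3, is zeroed by the ReLU.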

Attention Mechanism
Using attention mechanism, neural networks focus on a certain part of the given information, and every pixel has an independent weight, highlighting discriminative and effective features, and weakening information detrimental to classification.
Spatial Transformer Networks [41], which intelligently focus on a particular area of an image, are a special case of attention. Moreover, Kim et al. [42] used a joint residual attention model which utilized the attention mechanism to select the most valuable visual information so as to enhance language feature selection and feature extraction for visual question-answering problems. Additionally, Yang et al. [43] proposed an attention mechanism to extract additional meaningful information on the transition layer and pass it to the next feature extraction block for subsequent feature exploitation. As for HSI classification, a proposed network [20] was constructed by stacking the proposed attention inception module, and it could adaptively learn the network architecture by dynamically routing between the attention inception modules. They designed a novel neural network which has a "trunk branch" with a feedback attention mechanism and a "mask branch" with a gate control attention mechanism to perform pixel-wise classification for very high-resolution remote sensing images.
As noted in previous studies, many networks have adopted the attention mechanism to model the internal relationships and dependencies of the original data through global information, assigning higher priority to more informative areas. The acquired attention weight map can be used for feature recalibration. It simulates the biological process which causes human visual systems to be instantly attracted to a tiny piece of important information in an intricate image. We can relearn feature-based weights for more relevant and noteworthy information.

Methods
Our methodology has three main components: a bi-directional RNN-based spectral attention feature learner, a CNN-based spatial attention feature learner, and a co-training model.
Attention is central to our work. For spectral classification, each pixel can be represented as a continuous spectral curve that contains rich spectral characteristics, so we use attention to focus on the inter-band relationships of features. In the spatial dimensions, we regard spatial features as complements to spectral ones; this branch improves the representation of regions of interest and focuses on the inter-spatial relationships of features by adding spatial attention to the CNN. Then, we concatenate the two branches and feed them to the fully connected layers to learn high-level joint spectral-spatial features and obtain a predicted class after a softmax layer.

Attention with RNN for Spectral Classification
RNNs are popular architectures for modeling various sequential problems, and Bi-RNN is proposed to make full use of both the preceding and the following information. By considering all spectral bands of a hyperspectral pixel as a sequence, we develop a Bi-RNN model containing a forward GRU layer and a backward GRU layer, as illustrated in Figure 4. Our model processes the input in both the forward and backward directions with two separate hidden layers that feed the same output layer.
Its input is the spectral vector of one hyperspectral pixel, $x = (x_1, x_2, \ldots, x_n)$, and the bi-directional hidden vector is calculated as follows.
Forward hidden state:
$$\overrightarrow{h}_t = f\left(\overrightarrow{W} x_t + \overrightarrow{V}\, \overrightarrow{h}_{t-1}\right) \quad (1)$$
Backward hidden state:
$$\overleftarrow{h}_t = f\left(\overleftarrow{W} x_t + \overleftarrow{V}\, \overleftarrow{h}_{t+1}\right) \quad (2)$$
where $t$ ranges from the first spectral band 1 to the last, $n$-th, band; the coefficient matrices $\overleftarrow{W}$ and $\overrightarrow{W}$ act on the input at the present step; $\overrightarrow{V}$ acts on the hidden state $h_{t-1}$ at the previous step; $\overleftarrow{V}$ acts on $h_{t+1}$ at the succeeding step; and $f$ is the nonlinear activation of the hidden layer. The memory of the input, which is the output of this encoder, is $g_t$:
$$g_t = \mathrm{concat}\left(\overrightarrow{h}_t, \overleftarrow{h}_t\right) \quad (3)$$
where concat(·) is the concatenation of the forward hidden state and the backward hidden state. Bi-RNN allows the spectral vector to be fed in band by band so that continuous spectral features are learned in both the forward and backward directions. If we directly summed and averaged the data of each spectral band, each spectral channel would contribute equally to the classification task. In fact, the spectrum is a continuous curve with peaks and troughs, rather than a straight line with a fixed value. Therefore, some bands in the spectrum should have a smaller weight, while the key spectral bands should have a greater weight. By introducing the attention mechanism into the Bi-RNN, our model assigns an appropriate weight to each spectral channel, which allows it to capture inner spectral relationships and classify much better.
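The forward and backward recurrences and their concatenation into $g_t$ can be sketched as follows. For brevity the sketch uses a plain tanh recurrence rather than GRU units, omits bias terms, and treats each band as a scalar input; these are simplifying assumptions, and the function name `bi_rnn` and the weight shapes are hypothetical.

```python
import numpy as np

def bi_rnn(x, W_f, V_f, W_b, V_b, f=np.tanh):
    """x: (n,) spectrum of one pixel.

    W_f, W_b: (hidden,) input weights; V_f, V_b: (hidden, hidden) recurrent weights.
    Returns g of shape (n, 2 * hidden): forward and backward states per band.
    """
    n = x.shape[0]
    hidden = V_f.shape[0]
    h_fwd = np.zeros((n, hidden))
    h_bwd = np.zeros((n, hidden))

    h = np.zeros(hidden)
    for t in range(n):                    # band 1 -> band n
        h = f(W_f * x[t] + V_f @ h)
        h_fwd[t] = h

    h = np.zeros(hidden)
    for t in reversed(range(n)):          # band n -> band 1
        h = f(W_b * x[t] + V_b @ h)
        h_bwd[t] = h

    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
hidden = 4
W_f, W_b = rng.normal(size=hidden), rng.normal(size=hidden)
V_f, V_b = rng.normal(scale=0.5, size=(2, hidden, hidden))
spectrum = np.array([0.1, 0.4, 0.8, 0.5, 0.3])   # toy five-band pixel
g = bi_rnn(spectrum, W_f, V_f, W_b, V_b)
```

Each row of `g` therefore summarizes the bands before it (first half) and the bands after it (second half), which is what the attention layer subsequently re-weights.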
Compared with the traditional RNN model, which treats all inputs in the same manner, we add an attention layer that weights the spectral information differently in order to learn more discriminative characteristics. Our attention layer is defined as follows:
$$e_{it} = \tanh\left(W_i g_t + b_i\right) \quad (4)$$
$$\alpha_{it} = \mathrm{softmax}\left(W_i' e_{it} + b_i'\right) \quad (5)$$
where $W_i$ and $W_i'$ are transformation matrices, $b_i$ and $b_i'$ are bias terms, and softmax(·) maps the non-normalized output to a probability distribution and constrains the output to lie in the interval (0, 1).
The predicted label $y_t$ of pixel $x$ is then computed as follows:
$$y_t = U\left(g_t, \alpha_{it}\right) \quad (6)$$
where U(·) is a function that sums over all states weighted by their corresponding attention weights. Equation (4) is a one-layer neural network. This layer rearranges the state of the Bi-RNN in its current vector space, and the tanh activation then transforms it to obtain $e_{it}$ as a new hidden representation of the Bi-RNN state. The attention weight $\alpha$ is produced by the softmax layer, formulated as Equation (5), where we measure the importance of the input based on the relevance of $e_{it}$ to another channel-wise vector. After obtaining the newly learned attention weights, we update the label representation vector $y$ using the soft-attention operation shown in Equation (6).
With the attention mechanism, our model reflects the more reasonable view that some spectral bands play a key role while others carry little information. Meanwhile, our Bi-RNN model can better characterize the spectral features of the original hyperspectral data, pay more attention to the correlation of adjacent bands, and make the trained model more accurate.
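The soft-attention step over the Bi-RNN states can be illustrated with the NumPy sketch below: a one-layer tanh network produces a hidden representation of each state, a softmax over the bands turns the scores into weights, and the states are summed with those weights. Scoring each band with a single vector `w_score` is an assumption made for this illustration, as are the function name and the shapes.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def spectral_attention(g, W, b, w_score, b_score=0.0):
    """g: (n, d) Bi-RNN states, one row per spectral band.

    W: (a, d), b: (a,) one-layer network; w_score: (a,) scoring vector (assumed).
    Returns the attention-weighted sum of states (d,) and the weights (n,).
    """
    e = np.tanh(g @ W.T + b)             # new hidden representation, shape (n, a)
    scores = e @ w_score + b_score       # one scalar score per band
    alpha = softmax(scores)              # weights in (0, 1) that sum to 1
    context = alpha @ g                  # weighted sum over all bands
    return context, alpha

rng = np.random.default_rng(1)
g = rng.normal(size=(6, 8))              # toy states: 6 bands, 8 features each
W, b = rng.normal(size=(5, 8)), np.zeros(5)
context, alpha = spectral_attention(g, W, b, rng.normal(size=5))
```

If all scores are equal, the weights become uniform and the result reduces to the plain average over bands, i.e., the equal-contribution case that the attention layer is meant to improve upon.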

Attention with CNN for Spatial Classification
Our CNN model aims to extract robust spatial features. The attention mechanism we add to the spatial CNN focuses on the dependence between spatially neighboring pixels and on the significant features of the entire input patch. Experiments in Section 4 for CNN and attention CNN show that attention weights contribute greatly to spatial classification. The attention CNN architecture is shown in Figure 5. First, in order to fuse the spatial information of all bands and suppress noise, we reduce the dimensions of an HSI to a low-dimensional subspace via principal component analysis (PCA). The tighter the relationship between the target pixel and its neighborhood, the smaller the patch created for the target pixel. After PCA, for instance, the first three components of the Pavia University dataset are retained because they contain almost 99.3% of the information. Around each pixel, we create a patch of size k × k × 3 as a neighbor region, which serves as the input of the spatial branch. With the addition of the attention mechanism, our CNN model can estimate the salience of, and the correlation within, different image regions.
Unlike spectral attention, spatial attention focuses on where the informative part is, so the two are complementary. The CNN attention is added before the convolution layer. For the input neighbor region S of size m × n × c, we generate an efficient feature descriptor by utilizing the inner-spatial relationships of features. The spatial attention is denoted by a weight matrix α with the same size m × n as the feature map, and the element $\alpha_{ij}$ of α is the attention weight for the pixel vector $S_{ij}$ composed of c PCA channels located at (i, j) in the neighbor region.
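The preprocessing for the spatial branch (PCA over the spectral axis, then a k × k patch around each pixel) can be sketched as follows. The eigendecomposition-based PCA and the reflection padding at image borders are assumptions of this illustration; the paper does not state how border pixels are handled, and the function names are hypothetical.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """cube: (H, W, B) hyperspectral image.

    Returns the image projected onto the leading principal components,
    shape (H, W, n_components), and the fraction of variance they explain.
    """
    H, W, B = cube.shape
    X = cube.reshape(-1, B).astype(np.float64)     # one row per pixel (copy)
    X -= X.mean(axis=0)
    cov = X.T @ X / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    reduced = (X @ eigvecs[:, order]).reshape(H, W, n_components)
    ratio = eigvals[order].sum() / eigvals.sum()
    return reduced, ratio

def extract_patch(image, row, col, k):
    """Return the k x k neighborhood (k odd) centered on (row, col).

    Borders are padded by reflection, which is an assumption of this sketch.
    """
    r = k // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + k, col:col + k, :]

rng = np.random.default_rng(0)
cube = rng.normal(size=(8, 8, 10))                  # toy cube: 8 x 8 pixels, 10 bands
reduced, ratio = pca_reduce(cube, n_components=3)
patch = extract_patch(reduced, row=0, col=7, k=5)   # corner pixel, needs padding
```

On real data, `ratio` is the quantity one would inspect to decide how many components to keep, for example the value of roughly 99.3% quoted above for three components of the Pavia University dataset.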
In particular, the spatial attention weight map α is calculated in two steps. The first step obtains a distributed representation through a single-layer neural network, as in Equation (7). In the second step, a sigmoid function calculates $\alpha_{ij}$, which evaluates the importance of the position (i, j) within the patch. Moreover, more similar feature representations indicate greater relevance between the two positions concerned. In the definition of the spatial attention model, $W_s \in \mathbb{R}^{k \times C}$ and $W_z \in \mathbb{R}^{k}$ are transformation matrices that map the image visual features, $b_s \in \mathbb{R}^{k}$ and $b_z \in \mathbb{R}^{1}$ are model biases, and σ(·) is a sigmoid function, which also constrains the attention weight to lie in the interval (0, 1). For each patch, the convolutional layer slides a kernel window across the patch, and it can locate similar features in this patch by calculating the point-to-point inner product. The pooling layer selects values to reduce the feature map dimension. The kernels of the convolutional layers are 5 × 5, and the strides of the max pooling layers are 2. The fully connected (FC) layer has 1024 units. Table 1 lists the remaining settings of the spatial attention network.
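A minimal sketch of the two-step spatial attention is given below: a single-layer network maps each pixel vector $S_{ij}$ to a k-dimensional representation, and a sigmoid turns a linear score of that representation into a weight $\alpha_{ij}$ in (0, 1), which then rescales the patch before the convolution layers. The tanh activation in the first step and the element-wise rescaling are assumptions of this illustration, since only the parameter shapes and the sigmoid are stated in the text; the function name is hypothetical.

```python
import numpy as np

def spatial_attention(S, W_s, b_s, w_z, b_z):
    """S: (m, n, c) input patch with c PCA channels per pixel.

    W_s: (k, c), b_s: (k,) single-layer network; w_z: (k,), b_z: scalar.
    Returns the attention map alpha (m, n) and the re-weighted patch (m, n, c).
    """
    z = np.tanh(S @ W_s.T + b_s)                       # step 1 (tanh is assumed): (m, n, k)
    alpha = 1.0 / (1.0 + np.exp(-(z @ w_z + b_z)))     # step 2: sigmoid score, (m, n)
    weighted = S * alpha[..., None]                    # apply one weight per pixel
    return alpha, weighted

rng = np.random.default_rng(2)
S = rng.normal(size=(27, 27, 3))                       # toy 27 x 27 patch, 3 components
W_s, b_s = rng.normal(size=(16, 3)), np.zeros(16)      # k = 16 is an arbitrary choice
alpha, weighted = spatial_attention(S, W_s, b_s, rng.normal(size=16), 0.0)
```

Because a sigmoid rather than a softmax is used, the weights of different positions are independent of one another, so several positions in the same patch can receive high attention at once.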

Merge
In our method, the last step concatenates the two branches to co-train them, and the complete framework is shown in Figure 6. The proposed Bi-RNN with attention and the CNN with attention are adopted as the spectral feature learner and the spatial feature learner, respectively. In order to exploit both spectral correlation and spatial features and to extract integrated spectral-spatial features, we concatenate the last fully connected layer of the Bi-RNN with the one of the CNN to form a new fully connected layer, which is followed by another FC layer to represent the joint spectral-spatial features and a softmax regression layer to predict the probability distribution over the classes.
The spectral RNN with attention focuses more on distinguishable essential characteristics and the inner spectral correlations, while the attentive spatial CNN supplements the neighbor information with spatial structure features and internal spatial relevance, enabling a more homogeneous classification map and a higher accuracy. The merge layer fuses and balances the spatial and spectral information, and its result has the largest diversity in class probability estimation.
Compared with hand-crafted features, the deep joint spectral-spatial features trained in this end-to-end framework are more discriminative and robust. The co-training network, consisting of a Bi-RNN and a CNN that both include an attention mechanism, extracts features more effectively and improves hyperspectral classification accuracy.
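The merge step amounts to a concatenation followed by dense layers and a softmax, as in the following forward-pass sketch. The feature sizes, the ReLU on the joint FC layer, and the function names are illustrative assumptions; training (back-propagation through both branches) is not shown.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def merge_and_classify(spec_feat, spat_feat, W1, b1, W2, b2):
    """Fuse the two branch outputs and predict class probabilities.

    spec_feat: last FC output of the spectral branch; spat_feat: same for the
    spatial branch. W1/b1 form the joint FC layer, W2/b2 the softmax layer.
    """
    joint = np.concatenate([spec_feat, spat_feat])     # new fully connected input
    hidden = np.maximum(0.0, W1 @ joint + b1)          # joint features (ReLU assumed)
    return softmax(W2 @ hidden + b2)                   # probability per class

rng = np.random.default_rng(3)
spec_feat = rng.normal(size=16)                        # toy spectral features
spat_feat = rng.normal(size=32)                        # toy spatial features
W1, b1 = rng.normal(scale=0.1, size=(24, 48)), np.zeros(24)
W2, b2 = rng.normal(scale=0.1, size=(9, 24)), np.zeros(9)   # 9 classes, illustrative
probs = merge_and_classify(spec_feat, spat_feat, W1, b1, W2, b2)
predicted_class = int(np.argmax(probs))
```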

Experiment Results
In this section, we introduce the three public datasets used in our experiments and the configuration of the proposed spectral-spatial attention network (SSAN). In addition, the classification performance of the proposed method and of the comparative methods is presented. All experiments are implemented with an NVIDIA RTX 2080Ti GPU, tensorflow-gpu 1.9.0 and Keras 2.1.0 with Python 3.6.

Data Description
To evaluate our method, we train and test it on three public HSI classification datasets, namely, the Pavia University dataset, the Pavia Center dataset and the Indian Pines dataset, which are widely used to evaluate classification algorithms. In our experiment, the training set is generated randomly from the ground reference data, and the remaining reference samples constitute the testing set. For deep learning models, the training set consists of labeled samples and validation samples. To overcome the class imbalance problem, instead of splitting the dataset by a fixed percentage of each class, we randomly select 100 labeled samples and 100 validation samples of each annotated class for the training set in the Pavia Center dataset and the Pavia University dataset; details are listed in Tables 2 and 3. The same approach is used for the Indian Pines dataset, except that some classes of this dataset have fewer than 100 samples. Table 4 provides detailed information about the different classes and the corresponding training and testing sets. In the Pavia Center dataset, we choose four principal components, which contain 99% of the information of the original data, and then extract image patches as the input of the CNN branch. Similarly, three principal components are chosen in the Pavia University dataset and four principal components in the Indian Pines dataset.
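The per-class sampling described above can be sketched as follows. The function `split_per_class` is hypothetical, and the sketch does not reproduce the reduced sample counts used for Indian Pines classes with fewer than 100 samples (those are given in Table 4); for such classes the per-class counts would have to be lowered.

```python
import numpy as np

def split_per_class(labels, n_train=100, n_val=100, seed=0):
    """labels: 1D array with the class id of every annotated pixel.

    Returns index arrays (train, val, test): n_train labeled and n_val
    validation samples are drawn at random from each class, and all
    remaining samples of that class go to the test set.
    """
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return np.array(train), np.array(val), np.array(test)

# Toy ground truth: three classes with 300 annotated pixels each.
labels = np.repeat([1, 2, 3], 300)
train_idx, val_idx, test_idx = split_per_class(labels)
```

Sampling a fixed number per class, rather than a fixed percentage, gives small classes the same weight in training as large ones, which is the motivation stated above.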

Parameter Setting
There are three main parameters that have significant impact on our experiment: Learning rate, spatial size and dropout.In this section, we evaluate the sensitivity of performance to different parameter settings of our proposed model in detail.
(1) Learning rate: First, we test the impact of different learning rates. The learning rate controls the learning process and the amount of error allocated to the model weights at each update. At the extremes, a learning rate can be so large that training oscillates over the epochs, or so small that training fails to converge. The learning rate of our model is chosen from [0.0003, 0.0005, 0.0008, 0.001, 0.003, 0.005, 0.01], and the optimal learning rate based on the classification accuracy is 0.005 for the Pavia Center dataset, 0.0005 for the Pavia University dataset, and 0.0005 for the Indian Pines dataset.
(2) Spatial size: The spatial features learned by the CNN depend heavily on the size of the spatial neighbor region. As we have fixed the reduced channel number, we test spatial sizes [ 5, and all of them are acquired in 10,000 training iterations with batch size 128 and the optimal learning rate of each dataset. A larger spatial input offers more opportunity to learn spatial features. Nevertheless, a larger spatial region can also have a negative effect by introducing unnecessary information and possible over-smoothing. To make a fair comparison, we fix the spatial size at 27 × 27 in the different classification methods.
(3) Dropout: During training, a neural network develops co-dependency among neurons, which leads to over-fitting of the training data. Dropout is a regularization approach in neural networks which helps reduce interdependent learning and prevent over-fitting. We test different dropout proportions. The results in Table 6 show that the highest accuracy is obtained with 60% dropout for the Pavia University dataset and with 50% dropout for the Pavia Center dataset and the Indian Pines dataset.

Classification Results
To demonstrate the superiority and effectiveness of the proposed SSAN model, we compare it with traditional methods such as KNN and SVM, and with deep learning methods such as CNN, RNN, RNN with attention (ARNN), and CNN with attention (ACNN). For a fair comparison, we use the same training and testing sets for all methods, and all algorithms are executed twenty times. The mean and standard deviation over the 20 runs are reported to reduce the effect of random sample selection. Overall accuracy (OA), average accuracy (AA), and the kappa coefficient k are used as the evaluation measurements for the compared methods.
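For reference, the three evaluation measurements can be computed from a confusion matrix as in the sketch below, using their standard definitions: OA is the fraction of correctly classified samples, AA is the mean of the per-class accuracies, and kappa corrects OA for the agreement expected by chance. The function name is hypothetical, and the sketch assumes every class occurs at least once in the reference labels.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Return (OA, AA, kappa) for integer labels in the range [0, n_classes)."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                              # rows: reference, columns: prediction
    total = cm.sum()
    oa = np.trace(cm) / total                      # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)       # accuracy of each reference class
    aa = per_class.mean()                          # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 2, 0]
oa, aa, kappa = classification_scores(y_true, y_pred, n_classes=3)
# Here OA = 0.8 and AA = 0.75; the gap arises because the small class 2 is only
# half correct, which AA weights as heavily as the larger classes.
```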

Results on the Pavia Center Dataset
The classification maps of the Pavia Center dataset from the deep learning models and our proposed model are provided in Figure 7, and the corresponding accuracy indexes, including OAs, AAs and kappa coefficients, are presented in Table 7. The performance of our proposed method is clearly better than that of the other methods, and SSAN achieves the highest OA, AA and kappa and the best classification results. Comparing the OAs and AAs in Table 7, we can see that most results are unbalanced across classes, such as class Bitumen in SVM and class Self-Blocking Bricks in RNN. Our method SSAN produces smoother and more homogeneous results, which shows that using only spectral or only spatial information is insufficient for this task. Comparing ARNN and RNN, the accuracy of Self-Blocking Bricks is improved markedly from 50.46% to 76.69%, and the classification accuracies of Asphalt, Tiles and Shadows in ACNN are each higher than in CNN. Taking the accuracy of all classes into consideration, our method is more robust even with a small number of training samples and imbalance among classes.

Results on the Pavia University Dataset
Figure 8 shows the qualitative classification maps of the deep learning networks and our method. Table 8 lists the quantitative evaluation measurements. The proposed SSAN surpasses the other methods and achieves the highest accuracy on most classes; the exceptions are class Asphalt and class Meadows, for which our method is slightly less accurate than ACNN. By adding the attention mechanism, the classification accuracies of Meadows and Self-Blocking Bricks in ARNN are improved significantly compared with RNN, while Trees, Bitumen and Shadows are better classified by ACNN than by CNN. In the classification maps, most ground objects are classified well and the edges of houses and roads are clear. Nevertheless, a few scattered and diverse misclassifications appear inside the natural vegetation area and break up object integrity, especially for class Bitumen and class Bare Soil. By adding the attention mechanism and combining ARNN with ACNN, our SSAN model outperforms the other approaches in producing more homogeneous and favorable classification maps.

Results on the Indian Pines Dataset
The false-color image of the Indian Pines dataset and its corresponding ground-truth map, along with the classification maps of the models, are presented in Figure 9, and the corresponding accuracy indexes are shown in Table 9. The traditional approaches use only shallow spectral features and neglect the abundant spatial features, which leads to a fairly unimpressive classification performance. The classification maps of the other methods present many noisy points and confuse class Soybean-mintill and class Buildings-Grass-Trees-Drives with other classes. According to the results of ARNN and RNN, the attention layer in the RNN effectively improves the classification accuracy of almost all categories in this dataset. Similarly, with the attention mechanism, Alfalfa, Grass-pasture-mowed, Soybean-clean and Buildings-Grass-Trees-Drives are classified much better by ACNN than by CNN. From these classification maps we can see that some classes are hard to classify correctly, which challenges the effectiveness and robustness of the classifier. Comparing ARNN with RNN and ACNN with CNN, the attention weight, which captures spectral correlations between adjacent channels and inner spatial dependency, helps considerably in focusing on strongly related features and correcting severely misclassified pixels. Our proposed SSAN enhances the accuracy of indistinguishable classes and yields a more uniform and smooth result. One possible reason for misclassification is that some indistinguishable classes may have similar features in either the spectral or the spatial domain. Another point worth considering is that some classes in the Indian Pines dataset are too unbalanced for sufficient distinguishing features to be learned. Despite these difficulties, our method surpasses the other approaches in producing more homogeneous classification maps and achieves the highest accuracy.
The results indicate that the proposed method with the attention mechanism in two branches is effective in HSI classification. The traditional methods, such as SVM and KNN, demonstrate poor performance. Deep learning methods, such as CNN and RNN, are effective because of their discriminative features. A comparison of RNN with ARNN and of CNN with ACNN indicates that the attention mechanism plays a significant role in our method. With the attention weights, the CNN focuses more on salient features in the spatial domain, and the RNN learns spectral correlations between adjacent spectral bands. Our fusion network combines the spatial and spectral dimensions and exhibits well-balanced results among all compared methods in all scenarios.

Conclusions
In this study, a novel two-branch co-training method is proposed to extract spectral-spatial features based on ARNN and ACNN for HSI classification. Humans perceive images by emphasizing informative features and suppressing unnecessary information; inspired by this behavior, known as the attention mechanism, we incorporate attention into our model. ARNN and ACNN learn characteristics from spectral and spatial information, respectively: by adding attention weights, ARNN captures interspectral correlations in the continuous spectrum, and ACNN focuses on similar spatial features between neighboring pixels. Specifically, we use a bidirectional RNN in ARNN to learn forward and backward information in the spectra. The co-training network learns higher-level joint spatial-spectral characteristics and inherits features from both ARNN and ACNN. Experimental results on three public datasets demonstrate that our method not only performs better than the other methods, but also extracts more homogeneous and discriminative feature representations.
Our work demonstrates the effectiveness of the attention mechanism in HSI classification, and we plan to generalize our method to other, more complex remote sensing applications, such as unmixing and change detection, in the near future.

Figure 2. The left part is a brief illustration of an RNN, and the right part shows the unfolded bidirectional RNN (Bi-RNN) structure.

Figure 4. Bi-RNN model with attention mechanism for spectral classification. Every pixel vector is regarded as a sequence, and its hundreds of spectral bands are input into the gated recurrent unit (GRU) cell one by one. Both forward and backward features are captured by the Bi-RNN and re-weighted by the attention layer.
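The re-weighting step in this caption can be sketched as follows. This is a minimal illustration, not the model's implementation: it assumes that each band's score is the dot product of its Bi-RNN output with a learned context vector, which may differ from the scoring function actually used, and the names `attention_pool`, `hidden_states` and `context` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Re-weight per-band Bi-RNN outputs and sum them into one feature vector.

    hidden_states: list of T vectors, one per spectral band (forward and
                   backward outputs assumed to be already concatenated).
    context:       scoring vector of the same length (assumed learned parameter).
    """
    scores = [sum(h * c for h, c in zip(state, context)) for state in hidden_states]
    weights = softmax(scores)  # one attention weight per band, summing to 1
    dim = len(hidden_states[0])
    pooled = [sum(w * state[d] for w, state in zip(weights, hidden_states))
              for d in range(dim)]
    return weights, pooled


# Toy example: two bands with two-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0]]
uniform_w, uniform_pooled = attention_pool(states, [0.0, 0.0])  # equal scores
skewed_w, skewed_pooled = attention_pool(states, [5.0, 0.0])    # favors band 0
```

With a zero context vector every band receives the same weight and the pooled feature is the plain average; a non-zero context vector shifts the weight toward the bands it scores highly.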

Figure 5. CNN model with attention mechanism for spatial classification. The original HSI is first processed by principal component analysis (PCA) for dimensionality reduction. A spatial attention map is calculated from the initial input patch of the CNN and superimposed on the patch before it is passed to the subsequent network.
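One way to picture an attention map being superimposed on a patch is sketched below. It is an assumption-laden illustration rather than the network's actual attention layer: the map is taken to be the cosine similarity between each pixel and the center pixel of the patch, and the names `cosine` and `spatial_attention` are hypothetical. Each pixel is a short feature vector, for example a few principal components.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def spatial_attention(patch):
    """Return (attention_map, weighted_patch) for a square patch.

    patch: list of rows; each entry is one pixel's feature vector.
    The attention value of a pixel is its similarity to the center pixel
    (assumed scoring rule), and the weighted patch scales every pixel's
    features by that value.
    """
    size = len(patch)
    center = patch[size // 2][size // 2]
    att = [[cosine(pixel, center) for pixel in row] for row in patch]
    weighted = [[[a * x for x in pixel] for pixel, a in zip(row, att_row)]
                for row, att_row in zip(patch, att)]
    return att, weighted


# Toy 3 x 3 patch with two features per pixel (hypothetical values).
toy_patch = [
    [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]],
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]],
]
att_map, weighted_patch = spatial_attention(toy_patch)
```

Under this rule, pixels that resemble the center pixel keep their features, while dissimilar neighbors are damped before the patch reaches the convolutional layers.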

Figure 6. The whole structure of our proposed model. The spectral attention Bi-RNN branch and the spatial attention CNN branch are followed by a multi-layer merge network to extract conjoint spatial-spectral characteristics.

Table 1. Network settings in the spatial attention CNN.
• Pavia Center: The first dataset was acquired by the ROSIS sensor. We use 102 spectral bands after removing 13 noisy channels. The image has 1096 × 715 pixels and covers the center of Pavia. The available training samples contain nine urban land-cover classes.
• Pavia University: The second dataset was acquired by the ROSIS sensor during a flight campaign over Pavia. The ROSIS-03 sensor recorded the original image in 115 spectral channels ranging from 430 to 860 nm. After removing 12 noisy bands, the remaining 103 bands are used. The spatial size of the image is 610 × 340 pixels. The ground-truth map contains nine different urban land-cover types, with more than 1000 labeled pixels for each class.
• Indian Pines: The third dataset was gathered by the AVIRIS sensor over the Indian Pines test site in northwestern Indiana. After removing the bands that cover water absorption features, the remaining 200 bands with 145 × 145 pixels are used in this paper. The original data consist of observations from 16 identified classes representing the land-cover types.

Table 2. Number of training and testing samples in the Pavia Center dataset.

Table 3. Number of training and testing samples in the Pavia University dataset.

Table 4. Number of training and testing samples in the Indian Pines dataset.

Table 5. Overall accuracy (OA) of the proposed method with different spatial sizes.

Table 6. OA of the proposed method with different spatial sizes.

Table 7. Classification performance of different methods for the Pavia Center dataset. Bold indicates the best result.

Table 8. Classification performance of different methods for the Pavia University dataset. Bold indicates the best result.

Table 9. Classification performance of different methods for the Indian Pines dataset. Bold indicates the best result.