Hyperspectral Image Classification Using Similarity Measurements-Based Deep Recurrent Neural Networks

Abstract: Classification is a common objective when analyzing hyperspectral images, where each pixel is assigned to a predefined label. Deep learning-based algorithms have been introduced in the remote-sensing community successfully in the past decade and have achieved significant performance improvements compared with conventional models. However, research on the extraction of sequential features utilizing a single image, instead of multi-temporal images, still needs to be further investigated. In this paper, a novel strategy for constructing sequential features from a single image for long short-term memory (LSTM) is proposed. Two pixel-wise similarity measurements, pixel-matching (PM) and block-matching (BM), are employed for the selection of sequence candidates from the whole image. Then, the sequential structure of a given pixel can be constructed as the input of LSTM by utilizing the first several matching pixels with the highest similarities. The resulting PM-based LSTM and BM-based LSTM are appealing, as all pixels in the whole image are taken into consideration when calculating the similarity. In addition, BM-based LSTM also utilizes local spectral-spatial information, which has already shown its effectiveness in hyperspectral image classification. Two common distance measures, Euclidean distance and spectral angle mapping, are also investigated in this paper. Experiments with two benchmark hyperspectral images demonstrate that the proposed methods achieve marked improvements in classification performance relative to the other state-of-the-art methods considered. For instance, the highest overall accuracy achieved on the Pavia University image is 96.20% (using both BM-based LSTM and spectral angle mapping), which is an improvement compared with the 84.45% overall accuracy generated by 1D convolutional neural networks.


Introduction
Hyperspectral remote-sensing images (HSIs) contain both abundant spectral and spatial information, which generally provides an enhanced capability of distinguishing different objects from one another relative to multispectral images, and they play an important role in a variety of research domains, such as precision agriculture [1], land-use monitoring [2,3], change detection [4,5], and environmental measurements [6]. For such subfields, classification is a critical technology, where each pixel in an HSI is assigned to a predefined label. Among deep-learning approaches, RNN-based models have been applied in remote sensing to extract temporal features from multiple images. Ienco et al. [42] utilized RNN and LSTM to perform land-cover classification on multi-temporal satellite images. In [43], Sharma et al. proposed a patch-based RNN framework incorporating both spectral and spatial information within a local window to classify Landsat 8 images. Furthermore, single-image-based RNN methods have also been applied to HSIs. Mou et al. [44] proposed a novel RNN-based HSI classification algorithm using a parametric rectified hyperbolic tangent function (PRetanh). In this framework, each individual pixel in the HSI can be regarded as one sequential feature for the RNN input layer. Wu et al. [45] investigated the combination of CNN and RNN layers on the spectral feature domain and employed the convolutional RNN (CRNN) model for HSI classification. The utilization of a CNN can extract patch-level local invariant information among spectral bands, which provides spatial contextual features for the following RNN layers. Shi et al. [46] proposed another strategy for designing the sequential data in the RNN model: instead of taking the spectral vector from all bands as one sequence, they took advantage of spatial neighbors. In this method, local spectral-spatial features were first extracted by exploiting a 3DCNN on a local image patch, and then sequences were built based on an eight-directional construction.
Although the aforementioned RNN-based DL models have significantly contributed to HSI processing efforts, there are still some critical problems that need to be addressed. The first issue is the limited number of training samples. Acquiring sufficient labeled training data for HSI classification is often difficult and time-consuming. Moreover, satisfactory DL-based classification accuracy has always relied upon very large sets of training samples. Therefore, obtaining convincing HSI classification results by utilizing limited training data for DL models is a challenging task. However, unlabeled samples, which are relatively easier to acquire than labeled samples, have already been investigated for HSI classification purposes under semi-supervised classification frameworks [19][20][21][22]. Such investigations illustrate the potential effectiveness of unlabeled data for this purpose. Another critical issue involves the construction of sequential data for the RNN model. In [44,45], the respective authors analyzed the HSI from a sequential point of view, meaning that each pixel is considered to be a data sequence: since all pixels in the HSI are sampled densely from the entire spectrum, they are expected to have dependencies between different bands. Nonetheless, such dependencies still need to be explored in order to more fully exploit the integrity of the full spectral signature. In order to distinguish different classes, it is frequently advantageous to utilize the information encapsulated within the entire reflectance spectrum, as is the case with many conventional classification methods. Furthermore, exploiting the spectral feature directly in the RNN model introduces more parameters that need to be computed and optimized in the training step.
In this paper, we propose a novel LSTM-based HSI classification framework with spatial similarity measurements (SSM), inspired by [47], in which the LSTM model and spatial location are combined simultaneously. First, the sequential feature for each pixel is constructed by selecting candidates from the whole image based on the similarity between each candidate and the target pixel. This selection relies upon two different similarity measurements in which spectral and spatial information are considered: pixel-matching-based (PM) spatial similarity measurements and block-matching-based (BM) spatial similarity measurements, respectively. LSTM has a significant capability for handling sequential data, and it achieves outstanding performance in NLP. The proposed similarity-measuring strategies provide an innovative framework for extracting sequential features for HSI classification by employing all pixels in the entire HSI, regardless of whether the candidate pixels for a given sequential feature are labeled or not. The proposed sequential feature effectively encodes the dependency of the target pixel with regard to its contexts, where the pixel-level spectral similarities and the patch- or block-level contextual similarities are naturally encoded, respectively. More specifically, LSTM assumes that closer "time steps" (which here denote selected pixels/blocks with higher similarity) in general have stronger feature sharing, while also allowing for longer-term dependency that can account for non-local similarity and long-tail effects. The motivation for extending one pixel to its sequential features is essentially to find a new feature embedding of the pixel that makes reference to other similar pixels.
Re-ordering those pixels in terms of their similarity measures ensures that all obtained sequential features admit "comparable" formats in terms of monotonically-decreasing similarity to the original pixel (i.e., the target pixel, or the given pixel of interest). In this framework, the influence of unlabeled data in an HSI is enhanced compared with conventional supervised-learning methods due to our proposed spatial selection, where any pixel can be selected as a candidate to construct sequential features. Owing to the global search over the whole image, more supportive pixels are incorporated with a wider receptive domain. In addition, spatial contextual information, which has already been utilized to reduce the "salt-and-pepper" phenomenon in remote-sensing/HSI classification [48], is also investigated here in the BM-based framework, where the similarity measurement between two pixels is implemented by using their neighboring pixels instead of their spectral feature vectors alone. Compared with the PM-based method, the BM-based scheme can obtain a more representative sequential feature by combining both spectral and spatial features. In summary, this is the first study to propose such methods operating over the entire image. This is important because these novel methods can incorporate additional information collected throughout the whole image by exploiting unlabeled pixels, instead of utilizing only the limited prior information in the form of labeled pixels. Figure 1 illustrates the framework of our proposed method. The remainder of this paper is organized as follows: Section 2 provides a brief introduction to the original RNN and LSTM models. Section 3 describes the proposed LSTM classification framework based on pixel matching and block matching. Experimental results are discussed in Section 4. Section 5 presents the conclusions.

RNN
The recurrent neural network (RNN) has shown great capability in time-sequence data processing, including NLP [49] and speech recognition [40]. A significant characteristic of a time sequence is that there is typically a strong relationship between a given sample and the previous samples. In the hidden Markov model (HMM), which is a widely-utilized sequence model in language processing, the probability of a specific state depends only on its previous state, instead of all previous states. Let X = [x_1, x_2, ..., x_t] be the sequence data, where t is the index of the state; x_1 represents the data at the first state, and x_t represents the data at the t-th state. The Markov assumption can be formulated as

P(x_t | x_{t-1}, x_{t-2}, ..., x_1) = P(x_t | x_{t-1}),    (1)

where P(·) expresses the conditional probability. The RNN is quite similar to the HMM in that the computation at the current state relies on the previous state. In contrast to conventional ANNs, the RNN applies a circular processing to the sequential data, meaning that the same processing is applied to each data instance in the sequence, and the result at each state relies on the previous state. This circular processing also represents parameter sharing, which is a prevalent method to control the number of parameters in a DL scheme. Still given X = [x_1, x_2, ..., x_t] as sequential data, the hidden state s_t can be represented as

s_t = f_s(W_xs x_t + W_ss s_{t-1} + b_s),    (2)

where W_xs is the weight matrix from the input data to the hidden state, W_ss is the weight matrix from the current state to the next state, b_s is the bias variable, s_t denotes the hidden state at time step t, and f_s(·) represents the nonlinear activation function. The calculation of the output at state t is quite similar to Equation (2):

y_t = f_y(W_sy s_t + b_y),    (3)

where W_sy is the weight matrix from the hidden state to the output, b_y is the bias, and f_y(·) is the nonlinear activation function.
The hidden state s t can be viewed as the memory of the RNN model, as it is calculated based on the previous state through forward propagation. Meanwhile, sequential data in the previous states are taken into consideration as well. In such forward propagation, some parameters, including three different weight matrices W xs , W ss , and W sy , are shared across all steps, which is quite different from a traditional neural network. The parameter-sharing scheme reduces the number of trainable parameters, and makes the total computation more efficient.
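This forward propagation with shared weights can be sketched in a few lines (a minimal NumPy illustration under the notation above, not the authors' implementation; tanh is assumed for f_s and the identity for f_y):

```python
import numpy as np

def rnn_forward(X, W_xs, W_ss, W_sy, b_s, b_y):
    """Vanilla RNN forward pass over a sequence X of shape (T, D).

    The three weight matrices (W_xs, W_ss, W_sy) are shared across all
    time steps; each hidden state s_t depends on the input x_t and the
    previous state s_{t-1}, as in Equations (2) and (3).
    """
    s = np.zeros(W_ss.shape[0])          # initial hidden state s_0
    outputs = []
    for x_t in X:
        s = np.tanh(W_xs @ x_t + W_ss @ s + b_s)   # hidden-state update
        outputs.append(W_sy @ s + b_y)             # output at step t
    return np.stack(outputs), s
```

Because the same matrices are reused at every step, the parameter count is independent of the sequence length T, which is the efficiency benefit described above.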

LSTM
In Equation (2), the calculation of the hidden state depends on the previous state. However, as the length of the sequence data increases, gradient vanishing and gradient exploding will be introduced in this recurrent model due to the forward and backward propagation of the weight matrices. To address this issue, long short-term memory was developed with a more sophisticated recurrent neuron. In LSTM, each recurrent neuron can be regarded as a cell state. Similar to the conventional RNN, LSTM also employs the previous state as the input to the current state. However, LSTM has three gates, the forget gate, the update gate, and the output gate, to control the update of the current neuron. Figure 2 illustrates the basic structure of the LSTM recurrent unit. The first part of LSTM is the forget gate, which determines whether the previous state will be retained or not. Still given X = [x_1, x_2, ..., x_t] as sequential data, y_t and s_t are the output and the hidden state at step t, respectively. The common computation for the forget gate f_t is as follows:

f_t = σ(W_f [y_{t-1}, x_t] + b_f),    (4)

where the W_(·) terms denote the weight matrices, the b_(·) terms are the bias variables, and σ(·) is the logistic sigmoid function. The following step is to compute the update gate u_t and a new candidate state value s̃_t:

u_t = σ(W_u [y_{t-1}, x_t] + b_u),    (5)
s̃_t = tanh(W_s [y_{t-1}, x_t] + b_s),    (6)

where tanh(·) is the hyperbolic tangent function. Then, the new hidden state s_t can be updated by using the aforementioned quantities:

s_t = f_t ⊙ s_{t-1} + u_t ⊙ s̃_t,    (7)

where ⊙ denotes element-wise multiplication. Finally, the output gate o_t and the output of the current neuron yield:

o_t = σ(W_o [y_{t-1}, x_t] + b_o),    (8)
y_t = o_t ⊙ tanh(s_t).    (9)
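A single LSTM recurrent step with these three gates can be sketched as follows (a minimal NumPy illustration of the standard LSTM cell; the dictionary-based parameter layout and the gate keys are illustrative choices, not from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, y_prev, W, b):
    """One LSTM step: W and b hold the weight matrices and biases for
    the forget ('f'), update ('u'), candidate ('s'), and output ('o')
    transforms; each W acts on the concatenation of the previous
    output and the current input.
    """
    z = np.concatenate([y_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])       # forget gate
    u_t = sigmoid(W['u'] @ z + b['u'])       # update gate
    s_tilde = np.tanh(W['s'] @ z + b['s'])   # candidate state
    s_t = f_t * s_prev + u_t * s_tilde       # gated state update
    o_t = sigmoid(W['o'] @ z + b['o'])       # output gate
    y_t = o_t * np.tanh(s_t)                 # output of the neuron
    return s_t, y_t
```

The forget gate scales the previous state toward zero when past information should be discarded, while the additive state update is what mitigates the vanishing-gradient problem of the plain RNN.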

Spatial Similarity Measurements in LSTM
For the LSTM model, the sequential feature is a critical issue when training, since a representative feature will improve classification performance and reduce the training-time cost. In this section, the spatial similarity measurement-based LSTM model is introduced as a method to construct sequential features. First, the two strategies utilized in SSM, namely the PM-based and BM-based schemes, are discussed. For each of them, two distance measurements are investigated when computing the similarity between pixels: Euclidean distance (EU) and the spectral angle mapper (SAM). Furthermore, we introduce the way sequences are constructed as the input of the LSTM model.

Pixel Matching
Measuring the similarity in the pixel data vectors between different pixels is a common technology in many HSI analysis applications, such as endmember-based analysis [50], manifold learning [51,52], and graph-based semi-supervised learning [22]. Since HSIs have abundant spectral information, typically entailing hundreds of bands, spectral features collected from all bands have the most discriminative capability to distinguish different ground objects or materials encompassed within the given image, and they have been most widely utilized in HSI classification [53]. In the pixel-matching scheme, pairwise spectral similarity measurements are applied to all pixels. Suppose we have HSI data X ∈ R^{L×C×B} with L rows, C columns, and B bands. X can be rewritten as X = [x_1, x_2, ..., x_N] ∈ R^{B×N} in row-major order, where N is the total number of pixels in the HSI, which equals L × C. For any pixel x_i in X, the distances between x_i and all pixels of X will be computed as follows:

d(x_i, X) = [d(x_i, x_1), d(x_i, x_2), ..., d(x_i, x_N)],    (10)

where d(x_i, x_j) denotes the distance between x_i and x_j, and d(·) is the distance calculation function.
There are multiple methods available to compute the pairwise distance. In this paper, Euclidean distance and the spectral angle mapper (SAM) are utilized in the pixel-matching scheme. Euclidean distance is a well-known distance measurement, defined as follows:

d_EU(x_i, x_j) = ||x_i - x_j||_2 = sqrt( Σ_{b=1}^{B} (x_i^(b) - x_j^(b))^2 ).    (11)

The other distance measure adopted in this study is SAM, which has been investigated extensively in endmember-based HSI classification. It can be defined as follows:

d_SAM(x_i, x_j) = arccos( (x_i · x_j) / (||x_i||_2 ||x_j||_2) ).    (12)

In the remainder of this paper, we use d_(·)(·) to represent either the EU or SAM distance calculation function in the pixel-wise measurement. Therefore, Equation (10) can be rewritten as follows:

d_(·)(x_i, X) = [d_(·)(x_i, x_1), d_(·)(x_i, x_2), ..., d_(·)(x_i, x_N)].    (13)
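Both distance measures can be sketched in a few lines (a minimal NumPy illustration; the clipping of the cosine value is a numerical safeguard added here, not part of the definition):

```python
import numpy as np

def d_eu(x_i, x_j):
    """Euclidean distance between two spectral vectors."""
    return float(np.linalg.norm(np.asarray(x_i, float) - np.asarray(x_j, float)))

def d_sam(x_i, x_j):
    """Spectral angle mapper: the angle between two spectral vectors."""
    x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
    cos = np.dot(x_i, x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))  # guard against rounding
```

Note that SAM depends only on the direction of the spectral vectors, so it is invariant to a uniform scaling of a spectrum, which makes it less sensitive to illumination differences than the Euclidean distance.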

Block Matching
Although spectral features provide rich, significant information that facilitates discrimination of different ground objects in HSI, classification accuracy based on utilization of spectral features alone is not always satisfactory due to the "salt-and-pepper" phenomenon [48]. Given Tobler's First Law of Geography [54], incorporation of spatial contextual information has attracted increasing attention in the research literature in recent years and has exhibited the capability to reduce "salt-and-pepper" noise and improve classification performance, including yielding smoother classification maps. In the current study, we utilize the image patch distance (IPD), proposed in [55], for similarity measurement that considers local spatial information. The IPD method was developed based on the Hausdorff distance [56], also referred to as the Pompeiu-Hausdorff distance. Instead of using only spectral features to measure the pairwise similarity between two pixels, the spatial neighbors within the respective local windows of the two pixels are also employed. Let w be the local window size, and let s_i represent the block neighborhood in which pixel x_i is centered within the w × w spatial window. The set of all blocks S can be defined as follows:

S = {s_1, s_2, ..., s_N}.    (14)

Given ∀ s_i, s_j ∈ S, we first calculate the distances between one arbitrary pixel x_m from s_i and all pixels of s_j, and then select the minimum distance as follows:

d(x_m, s_j) = min_{x_n ∈ s_j} d(x_m, x_n),  x_m ∈ s_i,    (15)

where d(·) is the distance-measuring function. Correspondingly, the minimum distance between one arbitrary pixel x_m from s_j and s_i can be computed in the following manner:

d(x_m, s_i) = min_{x_n ∈ s_i} d(x_m, x_n),  x_m ∈ s_j.    (16)

Therefore, the definition of the directed block distance between s_i and s_j is:

d(s_i, s_j) = max_{x_m ∈ s_i} d(x_m, s_j).    (17)

Finally, the ultimate (symmetric) BM measurement is defined as:

d_BM(s_i, s_j) = max{ d(s_i, s_j), d(s_j, s_i) }.    (18)
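The Hausdorff-style block-matching measurement described above can be sketched as follows (a hedged illustration of the min/max structure; `d` stands for either of the pixel-wise distance functions, and the function names are illustrative):

```python
def directed_block_distance(s_i, s_j, d):
    """Directed block distance: for each pixel of block s_i, take its
    minimum distance to any pixel of block s_j, then keep the largest
    of these per-pixel minima (the Hausdorff-style directed distance)."""
    return max(min(d(x_m, x_n) for x_n in s_j) for x_m in s_i)

def d_bm(s_i, s_j, d):
    """Symmetric block-matching distance: the larger of the two
    directed distances, so that d_bm(s_i, s_j) == d_bm(s_j, s_i)."""
    return max(directed_block_distance(s_i, s_j, d),
               directed_block_distance(s_j, s_i, d))
```

Note that for w × w blocks, each comparison costs O(w^4) pixel-wise distance evaluations, which makes the BM scheme considerably more expensive than PM and motivates the use of high-performance computing for the matching stage.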

Sequential Feature Extraction
After measuring the pairwise similarity among pixels in the whole image, the sequential feature for each pixel is extracted so that it can be fed into the RNN model directly. Based on the aforementioned two matching schemes, given one pixel x_i ∈ X, its corresponding matching vector is

d_M(x_i, X) = [d_M(x_i, x_1), d_M(x_i, x_2), ..., d_M(x_i, x_N)],    (19)

where d_M(·) denotes the pairwise matching function introduced in the previous sections. Note that the order of d_M(x_i, X) is determined only by pixel location in the image. To characterize a representative sequential feature, d_M(x_i, X) is reordered based on the degree of similarity, and then the corresponding pixels are selected as the sequential representation. With the definition given in Equation (19), the sorted vector d_sf(x_i, X) is built as follows:

d_sf(x_i, X) = [d_sf^{i1}, d_sf^{i2}, ..., d_sf^{iN}],    (20)

where d_sf^{ij} denotes the distance-measuring result between x_i and x_j, and d_sf(x_i, X) is the ascending sort of d_M(x_i, X). More specifically, d_sf^{i1} is the minimum value among d_sf(x_i, X), and d_sf^{i2} is the second-smallest value. Through this ascending sorting, the pixels most similar to x_i among all pixels in the whole image will be selected as the sequential representation of x_i. Note that not all candidates in d_sf(x_i, X) will be considered; the parameter l is defined to control how many candidates are selected, i.e., the length of the sequence. The first l pixels, with distance-measuring results from d_sf^{i1} to d_sf^{il}, will be selected, and the first element of this sequence is x_i itself, since d_M(x_i, x_i) = 0. Given the sequence length l, the final sequential feature of x_i can be defined as:

x_i^{sf} = [x_sf^{i1}, x_sf^{i2}, ..., x_sf^{il}],    (21)

where x_sf^{ij} represents the pixel x_j whose distance-measuring result is located at the j-th place.
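The sorting-and-truncation procedure above can be sketched as follows (a minimal NumPy illustration; `build_sequence` is an illustrative name, and `d` stands for the chosen matching function d_M):

```python
import numpy as np

def build_sequence(i, X, d, l):
    """Build the length-l sequential feature for pixel X[i].

    X is an (N, B) array of all pixels, d is a pairwise distance
    function, and the returned rows are the l pixels most similar to
    X[i], ordered by ascending distance; X[i] itself comes first,
    since d(x_i, x_i) = 0.
    """
    distances = np.array([d(X[i], X[j]) for j in range(X.shape[0])])
    order = np.argsort(distances, kind="stable")  # ascending sort
    return X[order[:l]]                           # first l candidates
```

The resulting (l, B) array is exactly the kind of fixed-length sequence the LSTM input layer expects, with similarity playing the role of the "time" axis.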

Datasets
In this study, two benchmark HSI datasets were utilized, the Pavia University and Salinas images, as displayed in Figure 3 and Table 1. The Pavia University image was collected by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. It consists of 102 spectral bands, with a spectral range from 430 nm to 860 nm. The image spatial resolution is 1.3 m, and the total image size is 610 × 340 pixels. For the Pavia University image extent, nine (9) classes were considered in the classification experiments. The Salinas image was acquired via the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS), and the image contains 512 lines × 217 samples, with a spatial resolution of 3.7 m. After removing 20 noise and water-absorption bands, 204 spectral bands remained for subsequent analysis. The ground-reference data for the Salinas image comprise 16 classes.

Experimental Design
To evaluate the performance of the proposed SSM-based LSTM methods, three algorithms, SVM, 1DCNN [35], and 1DLSTM [44], are investigated as baselines. For SVM, the radial basis function (RBF) is utilized as the kernel function, and the parameters of SVM are acquired by cross validation. The two deep-learning baselines are 1D-based architectures, where spectral features are fed into the classifier directly. For the 1DCNN, two convolutional layers, two max-pooling layers, and one fully-connected (FC) layer are selected due to the limited available training samples. For the 1DLSTM architecture, three LSTM layers and one FC layer are adopted. Different model structures are implemented for the different images, and the specific parameter settings for the Pavia University and Salinas images are summarized in Table 2, where the convolutional layer is represented as "Conv(number of kernels)-(kernel size)", the max-pooling layer as "Maxpooling-(kernel size)", and the LSTM layer as "LSTM-(kernel size)". Regarding the proposed methods, both pixel matching and block matching are investigated, and, for each matching scheme, EU and SAM are employed as distance measurements. Therefore, four different LSTM-based classification frameworks are investigated here, named LSTM_PM_EU, LSTM_PM_SAM, LSTM_BM_EU, and LSTM_BM_SAM. For the LSTM structure, we use four recurrent layers and two fully-connected layers, and the length of the sequential feature is 20. The sizes of the recurrent layers are 32, 64, 128, and 256, respectively, and the size of the first fully-connected layer is 50. The second fully-connected layer is applied for the purpose of classification, and its length equals the number of classes. During training of the recurrent model, the batch size is set to 20, and the number of epochs is 500.
In order to evaluate classification performance quantitatively, all ground-reference data for each image are randomly split into training and testing sample sets. In our experiments, we randomly select 200 samples per class as training data, with the remaining ground-reference data used as testing data. Ten replications of the experiments with such random selections were performed, and all classification accuracies were averaged across the ten replications. Furthermore, three quantitative indicators were adopted for the evaluation: overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (Kappa) [57]. The pixel-matching and block-matching experiments were implemented on the Texas A&M High Performance Research Computing (HPRC) system, and the remaining experiments, such as training LSTM models and classification accuracy assessments, were carried out on a local workstation with a 3.2 GHz Intel(R) Core i7-8700 Central Processing Unit (CPU) and an NVIDIA(R) GeForce GTX 1070 graphics card. Table 2. Parameter settings for 1D-CNN and 1D-LSTM, where the convolutional layer is represented as "Conv(number of kernels)-(kernel size)", the max-pooling layer as "Maxpooling-(kernel size)", and the LSTM layer as "LSTM-(kernel size)".
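The per-class sampling protocol described above can be sketched as follows (an illustrative NumPy sketch of randomly drawing a fixed number of training samples per class; the function name and seed are assumptions, not from the paper):

```python
import numpy as np

def stratified_split(labels, n_train=200, seed=0):
    """Randomly select n_train samples per class as training data;
    the remaining labeled samples form the test set."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))  # shuffle class c
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)
```

Repeating this split with ten different seeds and averaging the resulting OA, AA, and Kappa values reproduces the replication protocol described above.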

Pavia University Image
Salinas Image

Classification Results: Pavia University Image
The first set of experiments is conducted on the Pavia University image. The quantitative results are shown in Table 3, where values in bold are the highest class-specific accuracies; the standard deviations, calculated from the OAs of the aforementioned ten experimental replications, are also presented. The classified images, obtained from the fifth trial, are displayed in Figure 4 for qualitative analysis. As shown in Table 3, the block-matching-based method LSTM_BM_SAM achieved the best performance, with 96.20% OA, 94.65% AA, and 94.91% Kappa. Among the three benchmark algorithms, the highest OA (i.e., 84.45%) is obtained from 1DCNN. Regarding our newly-proposed pixel-matching-based LSTM frameworks, the OA of LSTM_PM_SAM is 84.56%, exhibiting limited improvement relative to SVM, 1DCNN, and 1DLSTM, and the classification performance of LSTM_PM_EU even decreases relative to 1DCNN and 1DLSTM. However, after incorporating spatial information via similarity measurements, LSTM_BM_EU and LSTM_BM_SAM obtain marked improvements over all non-block-matching methods, with 95.96% and 96.20% OA, respectively. Within each matching method, the performance of SAM is always better than that of the Euclidean distance. Regarding the class-specific accuracies, class 7 (Bitumen, Red) is more difficult to discriminate than the other classes due to its mixed spectral features; for this class, the proposed LSTM_BM_SAM improves upon the original SVM accuracy by more than 35%.
From the classification maps shown in Figure 4, marked improvements in classification performance are visually apparent. In Figure 4b-f, "salt-and-pepper" noise is still obvious due to the lack of incorporation of spatial contextual information in the classification. Within the red-rectangle annotation, many class 2 (Meadow, Bright Green) pixels are misclassified as class 6 (Bare Soil, Yellow), and class 3 (Gravel, Brown), as shown in Figure 4b-f. However, the classification maps derived from LSTM_BM_EU and LSTM_BM_SAM (Figure 4g,h) are spatially smooth and generally correctly classified, where most discrete, spurious/misclassified points are eliminated if they are located within an otherwise homogeneous area. Therefore, combining spatial contextual information can yield marked alleviation of image misclassification. Similar to what is observed within the red-rectangle annotation, more accurate and homogeneous classification results can be achieved within the red-circle annotation as well. Such results demonstrate the validity and capability of combining spatial and spectral features together when measuring the similarity between two pixels, and the effectiveness of constructing a sequential feature for a specific pixel based on such similarity between that target pixel itself and candidates from the whole image.

Classification Results: Salinas Image
For the Salinas image, the results are largely similar to those attained and described in Section 4.3. The quantitative results are shown in Table 4, where, again, values in bold are the highest class-specific accuracies. Within the pixel-matching scheme, the OA of the SAM-based method is lower than that of its corresponding Euclidean distance-based method, and the performance of the block-matching strategy is always better (more accurate) than that of the pixel-matching scheme, where spatial contextual information is ignored. The best classification performance is still obtained from LSTM_BM_SAM, with OA = 90.63%, AA = 93.95%, and Kappa = 89.55%. Class 15 (Vineyard_untrained, Violet) is the class with the lowest accuracy due to the high spectral and thematic similarity between this vineyard class and other grape fields, and the best classification result for this class among the baselines (SVM, 1DCNN, and 1DLSTM) is acquired from 1DCNN, with 59.34% accuracy. However, LSTM_BM_EU and LSTM_BM_SAM markedly improve classification accuracy by utilizing spatial features, where class 15 accuracies increase by 7.99% and 10.51%, respectively, compared with the 1DCNN result.
Manual interpretation of the classification maps shown in Figure 5 enables us to determine why class 15 (Vineyard_untrained) entails the lowest OA. Note that many class 15 (Vineyard_untrained, Violet) pixels are misclassified as class 8 (Grapes_untrained, Baby Blue). The pixel-matching-based methods, LSTM_PM_EU and LSTM_PM_SAM, still yield considerable discrete noise within the red-circle annotation in Figure 5. However, LSTM_BM_EU and LSTM_BM_SAM produce more homogeneous and smoother classification results for the Grapes_untrained class, especially within the red-circle annotation, as illustrated in Figure 5g,h. Within the red-rectangle annotation, we can see that it is difficult to classify class 10 (Corn_senesced_green_weeds, Brown), for example, and many pixels are misclassified in Figure 5b-f. However, such misclassification is markedly minimized when applying the block-matching-based methods, i.e., LSTM_BM_EU (Figure 5g) and LSTM_BM_SAM (Figure 5h). Such improvement illuminates the advantages of utilizing spatial contextual information when measuring the pixel-wise distances, especially when it is challenging to discriminate between two classes with very similar spectral features.

Parameter Sensitivity Analysis
The influence of different parameter values associated with our proposed methods is investigated in this section, including the length of sequential feature l, and the size of the local window w, utilizing block-matching-based methods. The effect of varying the value of l is tested on LSTM_PM_EU, LSTM_PM_SAM, LSTM_BM_EU, and LSTM_BM_SAM, and the effect of varying the value of w is tested using LSTM_BM_EU and LSTM_BM_SAM.
For the first parameter, l, five different lengths (10, 20, 30, 40, and 50) are investigated, while the window size utilized in LSTM_BM_EU and LSTM_BM_SAM is fixed at 5. The results are shown in Figure 6. Regarding the Pavia University data, the best performances for the four proposed methods are obtained with different sequential feature lengths (Figure 6a). Sequential feature lengths of 20 and 10 result in the highest classification OAs for LSTM_PM_EU (82.70%) and LSTM_PM_SAM (86.68%), respectively. For the two block-matching-based methods, the highest OA for LSTM_BM_EU is 96.54% (when l is 50), and setting l to 20 yields the highest-accuracy result for LSTM_BM_SAM. For the pixel-matching-based algorithms, the classification performance of the Euclidean-distance measure is always better than that of SAM. Furthermore, selecting a smaller sequential length (i.e., 10 or 20) is suitable for these two methods. Regarding the block-matching-based methods, Euclidean distance performs better than SAM, except when l is 20. Smaller sequential lengths used with SAM result in higher OAs, but such lengths are not suitable when using the Euclidean distance measure. Nevertheless, the difference in the resultant classification accuracies when employing these two distance measurements is smaller in the block-matching scheme than in the pixel-matching scheme. Moreover, the standard deviations for the four methods, across the five sequential lengths, are 1.0020, 1.4280, 0.5342, and 0.4805 for LSTM_PM_EU, LSTM_PM_SAM, LSTM_BM_EU, and LSTM_BM_SAM, respectively, which illustrates that, for the Pavia University data, the block-matching scheme is less sensitive than the pixel-matching method to the sequential-length parameter value.
For the Salinas data, the best choices for l vary depending on the algorithm. As shown in Figure 6b, a length of 40 yields the highest accuracies for LSTM_PM_EU and LSTM_BM_SAM, LSTM_PM_SAM achieves its best OA when utilizing 20 as the sequential length, and a length of 30 results in the highest OA for LSTM_BM_EU. Different from what we observed in Figure 6a, within the pixel-matching-based schemes, SAM always performs better than Euclidean distance except when l = 20. Additionally, a smaller length provides better performance for LSTM_PM_SAM (l = 20), but this does not apply to LSTM_PM_EU. For the block-matching schemes, SAM is the more robust distance measurement, since it performs better at four lengths (10, 20, 40, and 50) and obtains its highest accuracy with l = 40. Euclidean distance only yields a better result than SAM at a length of 30, which is its highest OA among all lengths. The standard deviations of the four methods are 0.3658, 0.4739, 0.9334, and 0.8140, respectively, which shows that the pixel-matching-based methods are less sensitive than the block-matching-based ones. However, due to the higher OAs obtained by LSTM_BM_EU and LSTM_BM_SAM, the block-matching schemes are still the preferable methods for classifying the Salinas data. Another consideration regarding parameter l is its influence on the training time of the LSTM model, since a larger l introduces more parameters to be learned and results in more processing time. The training times of the different approaches are given in Table 5; these training times are averages over the 10 replications. Different methods with the same l have similar training times for both the Pavia University and Salinas images. However, the training time differs when a different l is applied within one LSTM model. As an example, consider the application of LSTM_BM_SAM to the Pavia University image: the training time is 32.00 min when l is 10.
The training time increases with larger l, reaching 142.07 min, more than four times the minimum training-time consumption. Fortunately, BM-based methods are less sensitive than PM-based methods to the selection of l, as can be seen in Figure 6. To balance computation cost and classification performance, choosing a smaller l (e.g., 10 or 20) is an appropriate strategy for our proposed methods, even though PM-based methods are relatively more sensitive with respect to parameter l. For the second parameter, w, four different window sizes (5 × 5, 7 × 7, 9 × 9, and 11 × 11) and two methods (LSTM_BM_EU and LSTM_BM_SAM) are chosen for the comparison experiments, where l is fixed at 20. These results are shown in Figure 7. For the Pavia University data (Figure 7a), the overall classification accuracy of LSTM_BM_SAM is generally higher than that of LSTM_BM_EU; LSTM_BM_EU obtains a higher OA only when w is set to 7, and that is also its best accuracy across all four window sizes. LSTM_BM_SAM achieves the best performance with a window size of 5 (the smallest considered), and its OA decreases as the window size increases. Regarding the Salinas results, given in Figure 7b, the classification accuracies for LSTM_BM_SAM are again higher than those for LSTM_BM_EU, and both methods achieve their most accurate results with a window size of 5. Compared with parameter l, the optimal value for w is easier to determine. It is expected that incorporating local spatial contextual information helps to better measure the similarity between two pixels, yielding improved classification performance. However, with increasing window size, too many neighboring pixels are included in the calculation, resulting in over-smoothed classification maps in which class spatial boundaries are not preserved.
As a consequence, selecting a relatively small window size introduces sufficient, but not excessive, spatial information, leading to higher classification accuracies.
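To make the role of w concrete, the sketch below (Python/NumPy; border handling is omitted, and centres are assumed to lie at least w//2 pixels from the image edge, an assumption of this sketch rather than of the paper) computes the block-matching distance between two w × w spectral blocks under either distance measure:

```python
import numpy as np

def bm_distance(img, p, q, w=5, metric="euclidean"):
    """Distance between the w x w spectral blocks centred at pixels p and q.

    img is an H x W x B hyperspectral cube; the two blocks are flattened
    into vectors of length w * w * B before the distance is computed.
    """
    r = w // 2
    bp = img[p[0] - r:p[0] + r + 1, p[1] - r:p[1] + r + 1].ravel()
    bq = img[q[0] - r:q[0] + r + 1, q[1] - r:q[1] + r + 1].ravel()
    if metric == "euclidean":
        return float(np.linalg.norm(bp - bq))
    # SAM applied to the stacked block vectors.
    cos = bp @ bq / (np.linalg.norm(bp) * np.linalg.norm(bq) + 1e-12)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```

With a 5 × 5 window each comparison already involves 25 spectra; this is why enlarging w both smooths the similarity estimate (risking over-smoothed maps) and increases the matching cost.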

Conclusions
In this paper, we propose a novel LSTM-based HSI classification framework in which unlabeled data are well-exploited to construct sequential features from a single HSI. Instead of using spectral features as the sequential data structure of the LSTM, similar pixels collected from the entire image are used to construct the respective sequential features. Specifically, when constructing a sequential feature, the similarity between a target pixel and all other pixels in the image is considered. To better depict the similarity between two pixels, two similarity-measuring strategies, pixel-matching and block-matching, are adopted here, where individual spectral features are utilized in the pixel-matching-based schemes, and both spatial and spectral information are employed in the block-matching-based schemes. Such schemes take full advantage of unlabeled data in the HSI, as labeled data are almost always limited and difficult to acquire for HSI classification. Moreover, the block-matching-based schemes also consider spatial contextual information in the classification process, and it is demonstrated in this research that such schemes are effective in increasing HSI classification accuracy. Our proposed methods produce markedly more accurate results on two well-known, extensively studied HSI datasets compared with the other selected baseline algorithms. In particular, for the Pavia University image, LSTM_BM_SAM achieves the best classification performance, with 96.20% OA, which is 11.75% higher than the best result obtained by the three benchmark algorithms (in this case 1DCNN, with 84.45% OA). Furthermore, that OA is also higher than those of the other three proposed methods (LSTM_PM_EU, LSTM_PM_SAM, and LSTM_BM_EU), with OA increases of 13.50%, 11.64%, and 0.24%, respectively.
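Once a sequence of l matched pixels has been assembled, it is consumed by an LSTM one matched pixel per time step. The toy forward pass below (NumPy, with randomly initialised weights; it is purely illustrative and is not the trained network used in the experiments) shows how an (l, B) sequence is reduced to class probabilities:

```python
import numpy as np

def init_params(B, H, C, seed=0):
    """Random LSTM + softmax weights (illustrative only, not trained)."""
    rng = np.random.default_rng(seed)
    return (rng.normal(0.0, 0.1, (4 * H, B)),   # input-to-gate weights
            rng.normal(0.0, 0.1, (4 * H, H)),   # hidden-to-gate weights
            np.zeros(4 * H),                    # gate biases
            rng.normal(0.0, 0.1, (C, H)),       # output-layer weights
            np.zeros(C))                        # output-layer biases

def lstm_classify(seq, params):
    """Run an LSTM over a sequence of matched pixels, then apply softmax."""
    Wx, Wh, b, Wo, bo = params
    H = Wh.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    for x in seq:                               # one time step per matched pixel
        z = Wx @ x + Wh @ h + b                 # stacked pre-activations: i, f, o, g
        i, f, o = 1.0 / (1.0 + np.exp(-z[:3 * H].reshape(3, H)))
        g = np.tanh(z[3 * H:])
        c = f * c + i * g                       # cell-state update
        h = o * np.tanh(c)                      # hidden-state update
    logits = Wo @ h + bo
    e = np.exp(logits - logits.max())
    return e / e.sum()                          # class probabilities
```

The sequence length l therefore only changes the number of recurrent steps, while B (bands) and H (hidden units) fix the size of the weight matrices.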
Additionally, in these experiments, BM-based methods always yield better results than their corresponding PM-based methods, which demonstrates the effectiveness of utilizing spatial contextual information.
Regarding the proposed block-matching method, fixed window sizes are applied for classification. In the future, we will explore adaptive window-size applications, intended to eliminate the phenomenon of over-smoothing in the classified images and to preserve the respective boundaries between different classes. In addition, measuring pixel-wise similarity from the entire HSI more efficiently still needs to be investigated in future research.
The proposed methods in this study combine similarity measurements and recurrent neural networks, and, although in the present study we focus on encoding spatial contextual information, future work may involve implementing these methods in a temporal context (i.e., in a true multi-temporal remote-sensing context).