A Bidirectional Deep-Learning-Based Spectral Attention Mechanism for Hyperspectral Data Classification

Abstract: Hyperspectral remote sensing presents a unique big data research paradigm through its rich information captured across hundreds of spectral bands, which embodies vital spatial and temporal information about the underlying land cover. Deep-learning-based hyperspectral data analysis methodologies have made significant advancements over the past few years. Despite their success, most deep learning frameworks for hyperspectral data classification tend to suffer in terms of computational and classification efficacy as the data size increases. This is largely because they place equal emphasis on all of the rich spectral information present in the data, even though not all of it is essential for hyperspectral data analysis. On the contrary, the redundant information present in the spectral bands can degrade the performance of hyperspectral data analysis techniques. Therefore, in this work, we propose a novel bidirectional spectral attention mechanism, which is computationally efficient and capable of adaptive spectral information diversification by selectively emphasizing the spectral bands that carry more information and suppressing the ones that carry less. 3D convolutions in tandem with a bidirectional long short-term memory (LSTM) network are used in the proposed architecture as the spectral attention mechanism. A feedforward neural network (FNN)-based supervised classification is then performed to validate the performance of our proposed approach. Experimental results reveal that the proposed hyperspectral data analysis model with the spectral attention mechanism outperforms the other spatial- and spectral-information-extraction-based hyperspectral data analysis techniques compared.


Introduction
The increase in data volume, velocity, and diversity has lately given rise to the term "Big Data", which symbolizes the multifaceted issues faced by many of the scientific and applied domains. In the context of remote sensing, the current data acquisition sources for Earth observation generate vast amounts of data, which are typically images acquired at various scales (high/low) and resolutions (i.e., spatial, spectral/temporal) [1]. Over the past decade, machine learning and deep learning methodologies have gained wide recognition for hyperspectral data analysis in remote sensing applications [2,3]. Deep-learning-based feature extraction and classification methodologies in hyperspectral remote sensing applications using convolutional neural networks (CNNs) [4], recurrent neural networks (RNNs) [5], and their variations foster the automation of processes due to their potential for progressively learning the attributes and information present in the high-dimensional hyperspectral data [6]. Needless to say, the more complex deep-learning-based classification/object detection frameworks are, the higher their subsequent computational overhead is expected to be. However, this high computational cost is not desirable as we gravitate towards more automated/real-time hyperspectral data analysis applications [7].
The Earth's land cover is a dynamic canvas on which human beings and natural systems are always interacting. Land use/land cover (LULC) classification and its dynamics, which partially result from land surface processes, have considerable effects on biotic diversity, soil degradation, terrestrial ecosystems, and the ability of biological systems to support human needs [8]. Thus, land cover classification, and its dynamics with remote sensing data, is an important field in environmental change research at different scales. The efficient assessment and monitoring of land cover changes are indispensable to advance our understanding of the mechanisms of change and model the effects of these changes on the environment and associated ecosystems at different scales [9].
Remote sensing techniques represent some of the most effective tools to obtain information on LULC classification and dynamics (i.e., temporal-spatial changes and the transformation of landscapes). Many methods can detect land cover changes based on optical and radar imagery with different spatial and spectral resolutions. Existing techniques for accomplishing land cover classification can be broadly grouped into three general types, namely supervised classification algorithms, unsupervised classification algorithms, and a mixture of supervised and unsupervised classification techniques [8]. A large amount of high-dimensional, high-spatial-spectral resolution hyperspectral remote sensing data is becoming available due to the fast development of satellite and sensor technology, and the above-mentioned supervised and unsupervised classification methods could swiftly obtain cardinal information from the remote sensing data, thus playing an important role in hyperspectral imagery applications [10]. This being said, over time, classification frameworks based on high spatial-spectral resolution hyperspectral remote sensing data using machine learning algorithms such as neural networks have made a great impact in the field of remote sensing and our work is directly related to this.
Conventional machine-learning-based hyperspectral imagery classification and object detection frameworks are heavily inclined towards operating on spectral information as features [11]. Most of the spectral-information-reliant frameworks suggested in the literature include some form of similarity- or dissimilarity-distance-measure-based band grouping [12], and traditional supervised classification paradigms using k-nearest neighbors [13], maximum likelihood criterion [14], logistic regression [15], random forest classification [16], bagging and boosting techniques like AdaBoost [17], etc., have proved to be effective in classifying HSI data. However, most of these spectral-information-based methodologies lack the potential to effectively capture and utilize the spectral variability and information that is readily available in high-dimensional hyperspectral data. The problem of extracting information from and processing such high-dimensional data is not new; however, it has gained importance lately due to the surge in the volume of data (big data) and in its acquisition methodologies. The big data attributes directly imply high dimensionality and data redundancy, which in turn exacerbate the curse of dimensionality [18].
As a consequence, dimensionality reduction (DR) plays a prominent role in hyperspectral data analysis [19]. In general, DR techniques help combat the intensive data learning overhead by projecting high-dimensional data from their original feature space to a lower dimensional subspace and preserving all the vital information present in the data. Additionally, DR also brings down the computational requirements by a considerable factor. In the literature, various DR techniques such as principal component analysis (PCA) [20], linear discriminant analysis (LDA) [21], random projections (RPs) [4,6,22], etc., have gained increased attention due to their demonstrated computational efficacy and ability to preserve vital information present in the hyperspectral data. In addition, recent literature has proven that the integration of any form of additional information, such as spatial or contextual, alongside an efficient DR technique, to the available spectral information can improve the efficacy of hyperspectral data analysis [23,24].
In hyperspectral imaging, the relationship between the acquired spectral information and underlying land cover material is inherently nonlinear. To combat this issue, deep learning and machine learning algorithms have generally been adopted as fundamental feature extraction tools for effectively addressing/modeling data with nonlinear intrinsic relationships in the past few years. As a result, such deep learning techniques have shown promising results in the realm of hyperspectral data learning and representation for classification [2], object recognition [25], and other remote sensing applications. However, one of the major shortcomings of these techniques is how the data are presented to the deep learning framework for generalization. Generally, the information extracted from each spectral band is assigned an equal emphasis or importance without any consideration to the significance of spectral information/features on the final data analysis outcome [26,27]. Moreover, this form of antiquated equal importance designation to all the spectral bands can lead to the inclusion of inherent noise or redundant spectral information, which can not only be detrimental to hyperspectral data analysis but also affect the generalization capability of the underlying deep-learning-based HSI analysis framework. Thereby, it can severely inhibit the automation capabilities and efficacy of the methodology [7,23,28].
Therefore, this work leverages the benefits of DR techniques in conjunction with the ability of spatial-spectral representation provided by deep learning techniques to formulate an adaptive spectral attention framework for hyperspectral data analysis. In this framework, the input high-dimensional hyperspectral cube is first reduced to its lower dimensional subspace using principal component analysis (PCA). PCA is an unsupervised linear feature extraction method that uses an orthogonal transformation to explore the correlation between the HSI spectral bands in order to extract their intrinsic properties. It works efficiently on HSI data because contiguous bands are highly correlated and typically convey the same information about the ground objects [29]. Following the PCA-based DR, the reduced dimensional data are input to the proposed 3D-convolution [30] and bidirectional LSTM [31] based spectral attention and classification mechanism. The proposed spectral attention model delivers enhanced hyperspectral data learning, which prioritizes spectral information that is significant for hyperspectral data analysis and suppresses the redundant spectral bands. In addition, an FNN-based supervised classification [32] is incorporated to analyze the performance of this automated hyperspectral data analysis model. Therefore, the novel contributions of the proposed work are summarized as follows:
• A lightweight spectral feature extraction methodology for hyperspectral data analysis is proposed using 3D-convolutions in conjunction with an effective dimensionality reduction technique using PCA.
• The acquired spectral features, which are now a better representation of the temporal information in a lower dimensional subspace, are fed into a bidirectional LSTM-based attention framework, followed by an FNN-based supervised classification.
• Hence, the proposed spectral-attention-driven classification framework is driven towards improved automated hyperspectral data analysis, while also addressing big data challenges such as high computational and memory overhead.
• This work also presents variations of the proposed deep-learning-based feature extraction and classification frameworks to include the spectral-only, spatial-only, and spectral-spatial information extraction models. A comprehensive performance study of the several spatial-spectral-information-based hyperspectral data analysis frameworks is also conducted.
The rest of the paper is organized as follows: the proposed spectral attention-based classification methodology is discussed in Section 2, followed by the deep-learning-based classification techniques used for comparison in Section 3. In Section 4, we experimentally demonstrate and validate the efficacy of the proposed spectral attention model, which offers enhanced hyperspectral data analysis through automated extraction of significant spectral information and suppression of the redundant spectral bands. Finally, we summarize the effectiveness of our proposed automated hyperspectral data analysis model in Section 5.

Proposed Classification Methodology BI-DI-SPEC-ATTN
The goal of our work is to improve the spectral-information-based classification network's representational capacity by explicitly modeling the significance of spectral bands. The motivation behind this is to employ a gating mechanism to recalibrate the strength of distinct spectral bands in the input, i.e., to selectively emphasize the information from beneficial spectral bands while suppressing less relevant ones. While the necessity of a gating methodology for spectral attention mechanism is important to revamp the underlying classification framework's efficacy, we strongly feel that the inclusion of a computationally effective DR technique to render high dimensional HSI data in a lower dimensional subspace, which enhances the representation of features in the projected data space, is equally cardinal.
Hence, in the proposed data analysis framework, PCA is employed as a DR technique to reduce the spectral dimension of the input raw hyperspectral data cube. PCA projects the high dimensional hyperspectral data onto a lower dimensional feature space while preserving the crucial information present in the data. It also directly provides DR benefits such as a reduction of the inherent noise and redundant information present in the data. Consequently, an input hyperspectral data cube X of spatial dimensions M × N and spectral dimension P is dimensionally reduced to size (M × N × D). The proposed model then extracts 3D patches of pixels of shape (3 × 3 × D) from the input data to preserve the spectral information, on which a 3D convolution operation with 32 kernels of shape (3 × 3 × 30) is applied to extract and preserve the corresponding local neighborhood interactions between pixels and their spectral correlation. The spatial dimension of the convolutional kernel was set to (3 × 3) so that the framework can inexpensively convert a spatially windowed input to a spectral vector, which is the input to the bidirectional LSTM in the successive stage of the HSI analysis framework. This choice was not made arbitrarily: the spatial size of (3 × 3) and the spectral size of 30 for the 3D-convolutional kernel were empirically compared against many other choices and were chosen because they produced the best trade-off between computational efficacy and execution time during experimentation. This is followed by another convolution operation with 32 kernels of shape (1 × 1 × 64). As a result, the output from this stage has a shape of (K × 1).
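The PCA reduction and patch extraction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function names `pca_reduce` and `extract_patches`, the reflect padding at the image border, and the toy cube size are all assumptions introduced here.

```python
import numpy as np

def pca_reduce(cube, d):
    """Project an (M, N, P) cube onto its first d principal components."""
    m, n, p = cube.shape
    flat = cube.reshape(-1, p).astype(np.float64)  # astype returns a copy
    flat -= flat.mean(axis=0)                      # center every spectral band
    # The right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ vt[:d].T).reshape(m, n, d)

def extract_patches(cube, window=3):
    """Return one (window, window, D) patch per pixel.

    Border pixels are handled with reflect padding (an assumption;
    the border treatment is not specified in the text).
    """
    r = window // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    m, n, d = cube.shape
    patches = np.empty((m * n, window, window, d), dtype=cube.dtype)
    for i in range(m):
        for j in range(n):
            patches[i * n + j] = padded[i:i + window, j:j + window, :]
    return patches

rng = np.random.default_rng(0)
toy_cube = rng.normal(size=(8, 6, 20))  # toy stand-in for X: M=8, N=6, P=20
reduced = pca_reduce(toy_cube, d=5)     # shape (M, N, D)
patches = extract_patches(reduced)      # shape (M*N, 3, 3, D)
print(reduced.shape, patches.shape)
```

Each patch is centered on the pixel it belongs to, so `patches[i * N + j][1, 1]` equals `reduced[i, j]`. For full-size scenes a vectorized windowing routine would be preferable to the explicit double loop.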
Successively, this pixel vector is passed through a bidirectional LSTM-based spectral attention gating mechanism as described in Equations (1)-(5). This attention gating mechanism selectively emphasizes the relevant, informative bands and suppresses the irrelevant ones. For any time step $t$, given a minibatch input $X_t \in \mathbb{R}^{n \times d}$ ($n$, number of samples; $d$, number of inputs in each example) and a hidden layer activation function $\phi$, let the forward and backward hidden states for this time step be $\overrightarrow{H}_t \in \mathbb{R}^{n \times h}$ and $\overleftarrow{H}_t \in \mathbb{R}^{n \times h}$, respectively, where $h$ is the number of hidden units. The mathematical representation of the attention gating mechanism is given by Equations (1)-(5).
Next, we obtain the final hidden state output $H_t$ by multiplying $\overrightarrow{H}_t$ and $\overleftarrow{H}_t$ as shown in Equation (1). The same operations described in Equations (1)-(3) are repeated twice before the final output of the attention gating mechanism $O_{t3}$ is obtained as shown in Equations (4) and (5).
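For orientation, a simplified (non-gated) bidirectional recurrence written in the notation above is sketched here. The weight matrices $W$, the biases $b$, the superscripts $(f)$ and $(b)$ for the forward and backward directions, and the output size $q$ are assumed symbols introduced for illustration. An LSTM cell would replace $\phi(\cdot)$ with its gated update, and these lines should not be read as the numbered equations of this paper.

```latex
\begin{align*}
\overrightarrow{H}_t &= \phi\left(X_t W_{xh}^{(f)} + \overrightarrow{H}_{t-1} W_{hh}^{(f)} + b_h^{(f)}\right), \\
\overleftarrow{H}_t  &= \phi\left(X_t W_{xh}^{(b)} + \overleftarrow{H}_{t+1} W_{hh}^{(b)} + b_h^{(b)}\right), \\
H_t &= \overrightarrow{H}_t \odot \overleftarrow{H}_t, \\
O_t &= H_t W_{hq} + b_q, \qquad O_t \in \mathbb{R}^{n \times q}.
\end{align*}
```

Here $\odot$ denotes the element-wise product, which matches the statement that the forward and backward states are multiplied; many bidirectional models concatenate the two states instead.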
The softmax activation function has been used on the output of the second bidirectional LSTM layer. The output of this softmax activation produces an activation map (which consists of probabilities ranging between 0 and 1), which directly reflects the importance of the output features. These probabilities are then multiplied with the output of the 3D convolution layer, which affects the weighting of the individual pixels in the (K × 1)-shaped input vector by selectively emphasizing the pixels that contain more information and suppressing the ones with less information.
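The gating step described above, in which a softmax activation map rescales the convolutional features, can be sketched as follows. This is a minimal sketch under stated assumptions: `spectral_gate` is a hypothetical helper name, and the `scores` vector stands in for the output of the second bidirectional LSTM layer, which is not computed here.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - np.max(z)  # shifting does not change the result
    e = np.exp(z)
    return e / e.sum()

def spectral_gate(features, scores):
    """Reweight a (K,) feature vector by softmax attention weights."""
    weights = softmax(scores)  # entries in (0, 1) that sum to 1
    return weights * features, weights

# Hypothetical values with K = 4, for illustration only.
features = np.array([0.5, 2.0, -1.0, 3.0])  # stand-in for the 3D-conv output
scores = np.array([0.1, 2.0, -3.0, 1.0])    # stand-in for the Bi-LSTM output
gated, weights = spectral_gate(features, scores)
print(weights.round(3), gated.round(3))
```

Entries with large scores keep most of their magnitude, while entries with small scores are driven towards zero, which is the selective emphasis and suppression behavior that the attention mechanism relies on.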
These constructed features are now used as an input for an FNN with 3 layers of 100, 50, and C nodes, respectively, with a dropout of 0.2 between the first two dense layers for supervised classification. Here, C denotes the number of classes in the dataset. The overall 3D-convolution and bidirectional LSTM-based spectral attention and classification framework, BI-DI-SPEC-ATTN, is illustrated in Figure 1.

PCA-3D-CNN
The PCA-3D-CNN deep learning methodology is considered to understand the effects of a spectral-only feature extraction framework, wherein a conventional DR technique such as PCA is used in tandem with supervised classification using a CNN for hyperspectral data analysis. The emphasis here is to understand the effect of conventional spectral feature extraction techniques such as PCA on CNN-based hyperspectral data analysis. In this approach, the hyperspectral data in its original dimensionality P is projected onto a D-dimensional subspace using PCA for the spectral feature extraction. The resultant low-dimensional data are windowed into patches of size (3 × 3 × D) and passed to a 3D-CNN model for supervised classification. All the network parameters were empirically estimated for optimal results [6]. In the PCA-3D-CNN model, the first layer is a 3D-convolutional layer with 16 filters of dimension (3 × 3 × 32), followed by a flatten layer that is carried forward into an FNN with 100, 50, and C nodes with a dropout of 0.2 between every two layers, where C denotes the number of classes in the dataset.

SPEC-3D-CNN
Convolutions in a 2D-CNN can only capture 2-dimensional spatial information, and disregard the information along the spectral/temporal dimension. To address this concern, Ji et al. extended the idea of 2D-CNN used for 2D images to a 3D convolution in both space (2D) and time for video classification [33] and this acted as an inspiration for the HSI data classification methodology SPEC-3D-CNN. This methodology is identical to PCA-3D-CNN in its motivation to understand the contribution of spectral features exclusively on the proposed hyperspectral data analysis framework. However, unlike PCA-3D-CNN, there is no DR or spectral feature extraction technique employed on the original hyperspectral data. Here, the hyperspectral data in its original spectral dimensions P are directly introduced to the 3D-CNN classification architecture discussed in Section 3.1. This implies that the shape of the input to the 3D-CNN in the SPEC-3D-CNN methodology is (3 × 3 × P). This SPEC-3D-CNN model was specifically considered to study the effects of DR techniques or lack thereof on the automation performance of hyperspectral data analysis.

SPAT-2D-CNN
The aim of this model is to understand the contribution of spatial information alone on the CNN architecture. In this exclusive spatial feature extraction methodology, spatial contextual information is exploited by constructing features for a data point around its spatial neighborhood with the aid of 2D convolutional kernels. In this model, a (3 × 3) spatial neighborhood is considered, which is consistent with the other comparison methodologies defined in Section 3. This windowed data are now introduced as inputs to a 2D-CNN-based classification architecture. All CNN model hyperparameters and layers were empirically estimated for best performance. As in the SPEC-3D-CNN architecture, no DR technique was used on the original hyperspectral data in this SPAT-2D-CNN framework.

SVM-CK
In this work, we validate our proposed CNN architectures against the traditional composite kernel SVM (SVM-CK) for an inclusive spatial-spectral information extraction framework. The spatial features are extracted by calculating the spatial mean over a (3 × 3) window surrounding the pixel under consideration, and the corresponding linear spatial kernel is computed [4,23]. For the spectral features, the hyperspectral pixel vectors are used directly as spectral feature vectors, and a radial basis function (RBF) kernel is used as the spectral kernel. Thus, the SVM-CK model incorporates both spatial and spectral features present in the hyperspectral data to provide enhanced classification performance. All the experiments related to the SVM-CK-model-based hyperspectral classification were conducted using LIBSVM on raw hyperspectral data without the use of any dimensionality reduction technique.
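A composite kernel of this kind can be sketched as below. The text does not state how the two kernels are combined, so the weighted sum with mixing weight `mu`, the RBF width `gamma`, the reflect padding, and all function names are assumptions made for illustration.

```python
import numpy as np

def spatial_mean_features(cube, window=3):
    """Mean spectrum over a window x window neighborhood of every pixel."""
    r = window // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    m, n, p = cube.shape
    total = np.zeros((m, n, p))
    for di in range(window):
        for dj in range(window):
            total += padded[di:di + m, dj:dj + n, :]
    return total / window ** 2

def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def composite_kernel(spec_a, spat_a, spec_b, spat_b, mu=0.5, gamma=0.1):
    """Weighted sum of a linear spatial kernel and an RBF spectral kernel."""
    k_spatial = spat_a @ spat_b.T
    k_spectral = rbf_kernel(spec_a, spec_b, gamma)
    return mu * k_spatial + (1.0 - mu) * k_spectral

rng = np.random.default_rng(0)
toy_cube = rng.normal(size=(5, 4, 6))                  # toy scene, P = 6
spectral = toy_cube.reshape(-1, 6)                     # raw pixel vectors
spatial = spatial_mean_features(toy_cube).reshape(-1, 6)
K = composite_kernel(spectral, spatial, spectral, spatial)
print(K.shape)
```

Because both terms are valid kernels, their non-negative weighted sum is also a valid kernel, so a matrix such as `K` can be handed to an SVM solver that accepts precomputed kernels.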

Experimental Results
In this section, all the datasets used for experimentation are briefly discussed alongside a detailed report on the experimental setup used for all the experiments conducted in this research work. Additionally, the efficiency of the proposed spectral attention and classification architecture BI-DI-SPEC-ATTN is validated and compared against four other models namely, PCA-3D-CNN, SPEC-3D-CNN, SPAT-2D-CNN, and SVM-CK as described in Section 3.

Datasets
All experiments were conducted on two airborne visible/infrared imaging spectrometer (AVIRIS) datasets, Salinas and Indian Pines, and one reflective optics system imaging spectrometer (ROSIS) dataset, Pavia University [34]. The Salinas dataset is composed of 224 spectral bands, out of which 20 water absorption bands have been discarded. It has a spatial dimension of (512 × 217). This dataset comprises 16 classes related to vegetables, vineyard fields, and bare soils. The Indian Pines dataset was acquired by an AVIRIS sensor over the Indian Pines test site in northwestern Indiana. This dataset has a spatial dimension of (145 × 145) and 224 spectral bands (200 after removal of the water-absorption bands) with a spatial resolution of 20 m, spanning 16 land cover classes. The Pavia University dataset has 103 spectral bands, each having a spatial dimension of (610 × 340) with a spatial resolution of 1.3 m, spanning nine land cover classes. For each dataset, the training set was chosen randomly, with the training fraction varied from 5% through 50%.

Parameter Tuning and Experimental Setup
For our proposed methodology to function optimally, several parameters need to be adjusted: the size D of the reduced dimensional space produced by DR, the learning rate, the optimizer, etc. The reduced dimension D for the PCA computation was empirically set to 100 for the Salinas and Indian Pines datasets and to 50 for the Pavia University dataset. The length of the LSTM input vector K was empirically set to 256 for the Salinas and Indian Pines datasets and to 128 for the Pavia University dataset. All parameters in the proposed approach were experimentally set to the values that produced the best classification results. The parameters of both the proposed methodology and the frameworks used for comparison were tuned until no further improvement in the classification results was observed on any of the three datasets.
The objective function used in all our experimentation was the categorical cross-entropy, optimized with a learning rate of 0.0001 and a decay of $10^{-6}$. The choice of categorical cross-entropy as the objective function was straightforward, as the problem we address in this work is multiclass classification. However, this was not the case when choosing an optimal learning rate during experimentation. Numerous learning rates, namely 0.00005, 0.0001, 0.0003, 0.0005, 0.001, and 0.005, were investigated. Upon rigorous experimentation, it was determined that a learning rate of 0.0001 with a decay of $10^{-6}$ produced optimal results on all three datasets and fluctuated the least when the results were averaged over three trials. Additionally, choosing a suitable batch size can effectively improve the memory utilization while training the classification model and improve the convergence accuracy of the architecture. We experimented with batch sizes of 16, 32, 64, and 128, with a batch size of 32 producing the optimal results on all three datasets.
All the experiments used the Adam optimizer, as it produced the best results on all the datasets discussed in this work. In a plain gradient descent optimizer, the weights are adjusted based only on the gradient calculated in the current iteration. With the Adam optimizer, however, the weights are adjusted based on exponential moving averages of the gradients (first moment) and of the squared gradients (second moment) accumulated over the current and previous iterations; these moment estimates are then used to update the weights. Gradient descent, RMSprop, and Adam optimizers, which are well known in the literature, were compared during experimentation, and the Adam optimizer produced the best classification results on the Salinas, Indian Pines, and Pavia University datasets.
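The Adam update with a decayed learning rate can be sketched as follows. The moment coefficients `beta1`, `beta2`, and `eps` are the commonly used defaults, and the inverse-time decay schedule `base_lr / (1 + decay * step)` is an assumption about how the reported decay of $10^{-6}$ was applied; neither is stated in the text.

```python
import numpy as np

def adam_step(w, grad, m, v, t, base_lr=1e-4, decay=1e-6,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    lr = base_lr / (1.0 + decay * (t - 1))     # assumed inverse-time decay
    m = beta1 * m + (1.0 - beta1) * grad       # moving average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # moving average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 101):
    w, m, v = adam_step(w, w, m, v, t)
print(w)
```

With a learning rate of 0.0001, each step moves every weight by roughly 0.0001 regardless of the gradient scale, which is why such a small rate still makes steady progress over many iterations.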
To avoid any bias induced by random sampling of pixels, the classification results were averaged over three trials, and the average accuracies along with the execution times of the models are presented. All experiments were implemented in Python on a machine with an Intel(R) Core(TM) i7-7700HQ processor and 16 GB of RAM, and no GPU training was involved. For training on all three datasets, samples were picked randomly from each class label in equal proportion, and experimental results across different training fractions spanning from 5% through 50% were documented.
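Per-class random sampling at a fixed training fraction can be sketched as follows. The helper name `stratified_split`, the rounding rule, and the guarantee of at least one training sample per class are assumptions made for illustration.

```python
import numpy as np

def stratified_split(labels, train_fraction, seed=0):
    """Pick the same fraction of samples at random from every class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Keep at least one training sample per class (an assumption).
        n_train = max(1, int(round(train_fraction * idx.size)))
        train_idx.extend(idx[:n_train].tolist())
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx

# Toy labels: three classes with 100, 50, and 20 labeled pixels.
labels = np.repeat([0, 1, 2], [100, 50, 20])
train_idx, test_idx = stratified_split(labels, train_fraction=0.10)
print(np.bincount(labels[train_idx]), test_idx.size)
```

Repeating the split with different values of `seed` and averaging the resulting accuracies mirrors the three-trial averaging described above.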

Discussion
Tables 1-3 denote the specific number of training and testing samples used for experimentation with 10% of training data across all three datasets discussed in this paper. Figures 2-4 illustrate the classification maps for 10% of training data for the proposed bidirectional LSTM-based spectral attention and classification analysis methodology BI-DI-SPEC-ATTN for the Salinas, Indian Pines, and Pavia University datasets, along with the frameworks used for comparison. It can be further inferred from Tables 4-6 that BI-DI-SPEC-ATTN gave superior classification performance over other frameworks that are discussed for both the Indian Pines and Pavia University datasets. Table 7 shows the overall execution time of all the models in this work for 10% of training data. It can be clearly noted from Figures 2-4 that our proposed BI-DI-SPEC-ATTN methodology has more coherent classification regions and fewer misclassifications, at a reasonable trade-off between computational time and classification performance, when compared to the other spatial-only, spectral-only, and spatial-spectral-information-based feature extraction models.
Table 4. Class-specific accuracies of Indian Pines dataset with 10% of training data for the proposed methodology and other models used for comparison.
Table 5. Class-specific accuracies of Pavia University dataset with 10% of training data for the proposed methodology and other models used for comparison.
Table 6. Class-specific accuracies of Salinas dataset with 10% of training data for the proposed methodology and other models used for comparison.
Our proposed framework produces the best classification results with an overall accuracy of 97.78%, 94.07%, and 97.80% on the Salinas, Indian Pines, and Pavia University datasets, respectively, with just 10% of the samples selected for training, which can be reaffirmed from the figures and tables documented in Section 4.
Even though the other classification methodologies discussed in this work are efficient, with many of them being state-of-the-art techniques, they are less able than the proposed BI-DI-SPEC-ATTN methodology to capture the distinctive features and information that separate the classes across the three datasets. While the state-of-the-art composite kernel SVM-based classification technique (SVM-CK) produced good classification results on the Salinas and Pavia University datasets with its ability to incorporate both spatial and spectral features present in the hyperspectral data through a 3 × 3 window-based average spatial kernel coupled with an RBF spectral kernel, it underperforms when applied to the Indian Pines dataset, producing larger misclassification regions compared to all the other methodologies. Additionally, the 3D-CNN-based classification methodologies discussed in our work, namely, PCA-3D-CNN and SPEC-3D-CNN, produced superior classification results overall against their counterparts, namely, the 2D-CNN-based SPAT-2D-CNN and the kernel-based SVM-CK, owing to their ability to effectively incorporate both spatial and temporal features that are critical for effective classification of hyperspectral data. Finally, the proposed bidirectional LSTM-based attention and classification framework outperformed all the methodologies discussed in this work, demonstrating the importance and feasibility of constructing the relationship between features and weighing them with the aid of an effective attention methodology. The constructed features were then classified by an FNN, whose results bolstered the efficacy of the proposed technique.
The efficacy of our proposed methodology BI-DI-SPEC-ATTN can be further affirmed from the overall classification accuracy plots as depicted in Figures 5-7 for the Salinas, Indian Pines, and Pavia University datasets, respectively. Our proposed approach BI-DI-SPEC-ATTN significantly outperformed all other methods compared, especially against the conventional principal-component-analysis-based spectral feature analysis model (PCA-3D-CNN), a 2D-convolutional-neural-network-based hyperspectral data classification model (SPAT-2D-CNN) and a conventionally used SVM-based spatial-spectral information inclusion model (SVM-CK). Our proposed methodology BI-DI-SPEC-ATTN presents a pragmatic and an efficient attention-based classification framework to automate the feature selection process through varied levels of importance/weighting assigned to spectral bands in a dataset, based on their quality of information. Thus, the BI-DI-SPEC-ATTN model provides superior classification performance not only with just 10% of training samples but also at various different training-testing ratios as demonstrated above in Figures 5-7. Therefore, our BI-DI-SPEC-ATTN model serves as an effective framework for automated decision making with excellent classification performance for hyperspectral data analysis applications.
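Overall and class-specific accuracies of the kind reported in this section can be computed from a confusion matrix, as sketched below. The sketch assumes that class-specific accuracy means the fraction of test samples of each class that are labeled correctly; the helper names and the toy labels are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # accumulate one count per sample
    return cm

def overall_accuracy(cm):
    """Fraction of all test samples that were classified correctly."""
    return np.trace(cm) / cm.sum()

def class_accuracies(cm):
    """Fraction of each true class that was classified correctly.

    Assumes every class has at least one test sample.
    """
    return np.diag(cm) / cm.sum(axis=1)

# Toy labels for ten test pixels and three classes.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 2])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(overall_accuracy(cm), class_accuracies(cm))
```

Averaging these quantities over the three random trials gives figures comparable in form to those in the accuracy tables and plots.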

Given the wide range of experiments and analyses that we conducted, it is worth noting the importance of PCA as a dimensionality reduction technique and as a principal feature extraction component in our work. It not only reduced the computational complexity of our spectral attention and classification methodology (BI-DI-SPEC-ATTN), but also acted as an efficient lightweight spectral feature extraction technique and a noise reduction component. The importance of PCA for DR, for information retrieval, and as a linear orthogonal transformation that maps the data to a new coordinate system has been justified in the literature over time in HSI applications. PCA is not explicitly designed for noise removal; instead, it is designed to reduce the dimensionality of the feature space on which the underlying deep learning regression/classification model operates. We can think of PCA as a tuning knob to smoothly decide how much information we want to retain, which is impossible to achieve if one works directly with the original features: because the original features have no order of priority or usability, we cannot directly decide which ones to retain and which ones to eliminate.
As a result, eliminating some of the PCs with lower variances, i.e., with lower eigenvalues, usually helps the model to generalize better. PCs with higher eigenvalues capture the principal information about the dataset, whereas adding more and more low-eigenvalue PCs mostly appends redundant information to the reduced dimensional data space. Thus, removing some PCs with lower eigenvalues acts as a regularization technique that minimizes the redundancy of the information present in the data. Hence, in this work, we aimed to alleviate the inherent process noise and data redundancy present in the hyperspectral data using PCA to enhance the data learning outcomes of deep learning methods [4,6].
Alongside PCA, the bidirectional LSTM-based feature importance weighting/attention module, which operates by selectively emphasizing the feature values and correlating the output sequence of feature vectors of the high dimensional hyperspectral data with the results of selective learning, constitutes our proposed attention and classification framework BI-DI-SPEC-ATTN.
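The "tuning knob" view of PCA can be made concrete by inspecting how much variance the leading PCs retain. The sketch below picks the smallest number of components that reaches a target fraction of the total variance. This threshold rule is only an illustration, since the values of D used in the experiments were chosen empirically, and the helper names and synthetic data are assumptions.

```python
import numpy as np

def explained_variance_ratio(pixels):
    """Fraction of total variance carried by each PC, in descending order."""
    centered = pixels - pixels.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2  # proportional to the covariance eigenvalues
    return variances / variances.sum()

def smallest_d(ratio, target):
    """Smallest number of leading PCs whose cumulative ratio reaches target."""
    cumulative = np.cumsum(ratio)
    d = int(np.searchsorted(cumulative, target)) + 1
    return min(d, ratio.size)

# Synthetic pixels: 10 bands driven by 3 latent factors plus weak noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
pixels = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

ratio = explained_variance_ratio(pixels)
d = smallest_d(ratio, target=0.99)
print(ratio.round(4), d)
```

In this toy setting the trailing components carry almost nothing but noise, so discarding them removes redundancy without losing the structure that a classifier needs.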
Additionally, Table 7 denotes the overall execution time (includes training, validation, and testing) for all the methodologies discussed in this work for the Salinas, Indian Pines and Pavia University datasets with 10% of training data. Overall, the experimental results presented in this paper demonstrate that the proposed bidirectional and 3D-CNN-oriented spectral-attention-based classification architecture (BI-DI-SPEC-ATTN) required only a small number of training samples for effective classification, while also providing robust performance with all the datasets used in the experimentation phase.

Conclusions
In this work, a novel deep-learning-based bidirectional spectral attention and classification mechanism was introduced. Compared to the traditional deep-learning-based hyperspectral data analysis approaches, our work explores the ability of a gated spectral attention mechanism to adaptively diversify spectral bands by selectively emphasizing the more informative bands and suppressing the less useful ones for a superior classification performance. Experimental results demonstrated that the proposed BI-DI-SPEC-ATTN methodology yielded outstanding classification performance while being robust when only a limited number of training samples is available, compared to other spatial- and spectral-only feature extraction and classification approaches. Our spectral-attention-based hyperspectral data analysis framework, BI-DI-SPEC-ATTN, further illustrated the efficacy and potential to learn and prioritize features in the high-dimensional HSI data and extract important relationships between the spectral features, which reinforced the goal of effective and efficient automation in hyperspectral remote sensing applications.