Article

Spectral-Spatial-Sensorial Attention Network with Controllable Factors for Hyperspectral Image Classification

1 School of Computer Science, Hubei University of Technology, Wuhan 430068, China
2 School of Geosciences, Yangtze University, Wuhan 430100, China
3 Institute of Geological Survey, China University of Geosciences, Wuhan 430074, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2024, 16(7), 1253; https://doi.org/10.3390/rs16071253
Submission received: 7 February 2024 / Revised: 24 March 2024 / Accepted: 29 March 2024 / Published: 1 April 2024
(This article belongs to the Special Issue Deep Learning for Spectral-Spatial Hyperspectral Image Classification)

Abstract

Hyperspectral image (HSI) classification aims to recognize categories of objects based on spectral–spatial features and has been used in a wide range of real-world applications. Attention mechanisms are widely used in HSI classification for their ability to focus automatically on important information in images. However, because of the approximate spectral–spatial features in HSI, mainstream attention mechanisms struggle to distinguish small differences accurately, which limits classification accuracy. To overcome this problem, a spectral–spatial-sensorial attention network (S3AN) with controllable factors is proposed to efficiently recognize different objects. Specifically, two controllable factors, dynamic exponential pooling (DE-Pooling) and adaptive convolution (Adapt-Conv), are designed to enlarge the differences in approximate features and enhance attention weight interaction. Attention mechanisms with controllable factors are then used to build the redundancy reduction module (RRM), feature learning module (FLM), and label prediction module (LPM) to process HSI spectral–spatial features. The RRM uses the spectral attention mechanism to select representative band combinations, and the FLM introduces the spatial attention mechanism to highlight important objects. Furthermore, the sensorial attention mechanism extracts location and category information from a pseudo label to guide the LPM in label prediction and to prevent details from being ignored. Experimental results on three public HSI datasets show that the proposed method accurately recognizes different objects, with overall accuracies (OA) of 98.69%, 98.89%, and 97.56%, respectively.

Graphical Abstract

1. Introduction

A hyperspectral image (HSI) is a three-dimensional cube composed of spectral and spatial information. Among them, the spectral information consists of hundreds of continuous narrow bands that record the reflectance values of light from visible to infrared [1]. The spatial information consists of pixels that describe the distribution of land cover [2]. The abundant spectral and spatial information improves the reliability and stability of object analysis [3]. Therefore, the interpretation of HSI is widely used in precision agriculture, land management, and environmental monitoring [4].
HSI classification attempts to assign a label to each pixel and thereby obtain the category of different objects [5]. In the early stages, classical machine learning models were proposed for HSI classification, such as k-means clustering [6], multinomial logistic regression (MLR) [7], random forest (RF) [8], and the support vector machine (SVM) [9], which extract representative features and assign categories given sufficient labeled samples [10]. However, these models struggle to capture the correlation between spectral and spatial information and to distinguish approximate features.
Deep learning models apply neural networks to extract local and global features automatically and fully consider contextual semantics to obtain abstract representations of spectral–spatial features [11]. As a commonly used model, the convolutional neural network (CNN), with its local connections and weight sharing, extracts spectral and spatial features simultaneously [12]. 2D-CNN and 3D-CNN are combined in HybridSN to capture the semantic features of HSI patches [13]. However, adjacent patch samples contain large overlapping areas, which entails expensive computational costs [14]. To reduce this overlapping computation, the spectral–spatial residual network (SSRN) [15], spectral–spatial fully convolutional network (SSFCN) [16], and fast patch-free global learning (FPGA) [17] were proposed to take HSI cubes or entire images as training samples and upgrade pixel-to-pixel classification to image-to-image classification. These deep learning models use convolutional layers to expand the receptive field and extract correlations among long-range features to improve classification accuracy [18]. However, they face the problem of detail loss during feature extraction, i.e., details spanning only a few pixels gradually weaken and are likely to disappear after several down-sampling operations.
To retain detailed features that provide sufficient information for label prediction, attention mechanisms have attracted widespread interest for their ability to emphasize important objects. Acting like the eyes of a model, they automatically capture important objects and ignore the background, significantly improving feature extraction at a small computational cost [19]. In addition, the pooling operation and multilayer perceptron (MLP) are used to quickly evaluate the importance of features and assign the corresponding weights [20]. Classical attention mechanisms, such as "squeeze and excitation" (SE) [21], the convolutional block attention module (CBAM) [22], and efficient channel attention (ECA) [23], measure the importance of spectral and spatial features with weights and delineate the attention regions. However, the pooling operation struggles to distinguish the continuous and approximate features of HSI, resulting in inaccurate attention regions, a problem we describe as "attention escape".
The inaccurate attention region is mainly caused by imprecise evaluation of the attention weights, and adjusting the evaluation manner can effectively mitigate this problem [24]. Therefore, controllable factors are introduced to enlarge the differences in approximate features and enhance the sensitivity of attention mechanisms [25]. For example, the deep square pooling operation was proposed to increase the difference in continuous features by using pixel-wise squares to generate more discriminative weights [26]. The coordinate attention mechanism (CAM) was proposed to locate important objects and efficiently capture the long-range dependency of features [27]. The residual attention mechanism (RAM) introduced residual branches to fuse shallow features and control the spatial semantic padding of trunk branches [28]. These approaches adjust the evaluation manner of attention mechanisms and obtain more accurate attention weights to emphasize important objects [29]. However, such controllable factors lack the ability to adjust dynamically to the complex and continuous feature environment of HSI.
In this paper, a spectral-spatial-sensorial attention network (S3AN) with controllable factors is proposed for HSI classification. In S3AN, attention mechanisms and convolutional layers are encapsulated in the redundancy reduction module (RRM), feature learning module (FLM), and label prediction module (LPM). Specifically, dynamic exponential pooling (DE-Pooling) and adaptive convolution (Adapt-Conv), as controllable factors, participate in weight sharing and convey information via interfaces to balance the control effect of the modules. In the RRM, the spectral attention mechanism converts spectral features into band weights, evaluates them by reconstructive convolution (Rec-Conv), and selects important bands to construct a dimension-reduced feature. In the FLM, a spatial attention mechanism with double branches pads details and emphasizes spatial features, and cross-level feature learning (CFL) extracts abstract representations of deep and shallow features. In the LPM, the sensorial attention mechanism searches for the coordinates of labeled pixels and guides transition convolution (Trans-Conv) for pixel-wise classification. Finally, a lateral connection fuses the three functional modules, gradually optimizing the representation of features and improving the classification accuracy. The main contributions of this paper are listed as follows:
  • An S3AN with controllable factors is proposed for HSI classification. The S3AN integrates the redundancy reduction, feature learning, and label prediction processes based on the spectral-spatial-sensorial attention mechanism, which refines the transformation of features and improves the adaptability of attention mechanisms in HSI classification;
  • Two controllable factors, DE-Pooling and Adapt-Conv, are developed to balance the differences in spectral–spatial features. The controllable factors are dynamically adjusted through backpropagation to accurately distinguish continuous and approximate features and improve the sensitivity of attention mechanisms;
  • A new sensorial attention mechanism is designed to enhance the representation of detailed features. The category information in the pseudo label is transformed into a sensorial attention map that highlights important objects and positions details, improving the reliability of label prediction.

2. Related Work and Motivations

HSI classification, as a pixel-wise classification task, relies on the contextual semantic extraction of spectral–spatial features [30]. To improve classification accuracy, CNN and attention mechanisms have attracted widespread interest for their ability to extract contextual semantics and emphasize important objects.

2.1. HSI Classification Based on CNN

The CNN-based methods utilize convolutional layers to automatically extract spectral–spatial features, implementing end-to-end HSI classification and achieving satisfactory performance [31]. A Conv-Deconv network (CDN) was proposed for HSI classification, which integrated feature extraction and feature recovery processes based on an encoder-decoder structure for unsupervised spectral–spatial feature learning [32]. To avoid overfitting, the spectral–spatial residual network (SSRN) takes 3D cubes as training samples, which avoids complex feature engineering of HSI [15]. To reduce model complexity, HybridSN fused 3D-CNN and 2D-CNN to analyze the correlation of spatial and spectral information and obtain a more abstract spatial representation [13]. Dynamic low-rank and sparse priors constrained deep autoencoders (DLRSPs-DAEs) fully utilized the low-rank and sparse properties of HSI, combining the low-rank sparse model (LRSM) with a deep autoencoder (DAE) to capture the important features of HSI [33]. Further, fast patch-free global learning (FPGA) was proposed for HSI classification, based on a global stochastic stratified (GS2) sampling strategy and FreeNet, to achieve end-to-end classification of the entire image [17]. However, these methods ignore the dimensional mutation problem in the prediction layer, i.e., the semantic feature map is suddenly transformed into a classification result of the same size as the original image. During up-sampling, the feature map is padded with many irrelevant elements, which introduces noise and even suppresses the representation of details [34].

2.2. HSI Classification Based on Attention Mechanism

The attention mechanism-based methods highlight important spectral–spatial features and play an active assistive role in HSI classification [35]. The double-branch dual-attention network (DBDA) constructed a branching framework based on spectral and spatial attention mechanisms to extract 3D cube features simultaneously [36]. To boost the interaction of features, the residual spectral–spatial network (RSSN) introduced residual blocks in the spatial and spectral networks and combined them with contextual semantics to optimize the representation of features [37]. Spectral–spatial attention networks (SSAN) embedded attention mechanisms into an RNN and a CNN, respectively, achieving sufficient learning of continuous spectral information and of the spatial correlation of adjacent pixels [38]. The deep self-representation learning framework (DLSF) adaptively removed anomalous pixels from HSI by an alternating optimization strategy and introduced a subspace recovery autoencoder (SRAE) to sense local anomalous pixels using spatial detail information [39]. In addition, band selection methods that contain spectral attention mechanisms, such as BS-Net [40] and TAttRecNet [41], have also become popular for HSI processing. However, the attention mechanisms in these methods are insensitive to the differences in approximate features, which affects the generation of attention weights and limits the final classification accuracy.

2.3. Motivations

Some CNN-based HSI classification methods tend to ignore detailed features during down-sampling, which makes it difficult for these details to provide knowledge for label prediction [42]. To address this problem, a spatial attention mechanism with double branches is applied to pad shallow features; skip connections are introduced in CFL to interact deep and shallow features; and a sensorial attention mechanism and Trans-Conv are added to emphasize important objects and retain sufficient details for label prediction.
The inaccurate evaluation manner decreases the sensitivity of attention mechanisms and makes it difficult for attention mechanism-based HSI classification methods to generate clear category boundaries [43]. To address this problem, controllable factors are introduced to dynamically adjust the differences in spectral–spatial features. Among them, DE-Pooling is utilized to enlarge the differences in approximate features and obtain more distinguishable feature weights, while Adapt-Conv is utilized to enhance the interaction efficiency of feature weights and capture the correlation of adjacent and long-range features. The main purpose of the controllable factors is to improve the sensitivity of attention mechanisms so that they generate accurate attention regions.
Hence, S3AN integrates the RRM, FLM, and LPM for redundancy reduction, feature learning, and label prediction, respectively. The controllable factors, DE-Pooling and Adapt-Conv, are utilized to adjust the delineation of attention regions, and their control effect is updated by backpropagation to balance the feature extraction ability. The RRM combines the spectral attention mechanism and Rec-Conv to select important bands for the construction of dimension-reduced features, reducing the computational cost of feature learning. The FLM fuses the spatial attention mechanism and CFL to extract spectral–spatial contextual semantics and learn abstract representations of features. In addition, the LPM introduces the sensorial attention mechanism and Trans-Conv to mitigate dimensional mutation and improve the final classification accuracy.

3. Materials and Methods

S3AN designs functional modules based on attention mechanisms and convolutional layers, where the attention mechanisms reweight feature maps to generate feature masks and are combined with convolutional layers to extract abstract representations. A lateral connection integrates these modules, and interfaces are applied to convey feature maps and feedback information. As shown in Figure 1, the HSI cube is defined as $X \in \mathbb{R}^{C \times S \times S}$, where S and C are the input size and the number of bands, respectively. The original image is first transformed by the RRM: the spectral feature is converted into a weight vector by the spectral attention mechanism; Rec-Conv is utilized to update the weights; and the top B bands with the highest weights are selected to construct the dimension-reduced feature that replaces the original image. The dimension-reduced feature is then delivered to the FLM via the interface, where the spatial attention mechanism is applied to pad shallow details, the spectral–spatial contextual semantic features are extracted by CFL, and the pseudo label is created through multi-scale feature fusion. Further, the pseudo label is transformed into a sensorial attention map by the sensorial attention mechanism, and Trans-Conv is guided to focus on the labeled pixels to optimize the representation of semantic features. Finally, the classification result is obtained by the LPM.

3.1. Controllable Factors

To improve the ability of attention mechanisms to distinguish approximate spectral–spatial features, two controllable factors are proposed: dynamic exponential pooling (DE-Pooling) and adaptive convolution (Adapt-Conv). Their detailed structure is shown in Figure 2.

3.1.1. DE-Pooling

DE-Pooling adds a dynamic exponent computation before the global average pooling to control the fluctuation of spectral–spatial features [44]. As shown in Figure 2, taking the DE-Pooling of the spectral feature as an example, the HSI cube is split into hundreds of bands, and each band $X_i$ is then enlarged by an exponential multiple. This enlargement highlights the differences in information between adjacent bands, so that each band is weighted independently. As seen from the change in the band weights, the approximate features become more distinguishable. To keep the spectral feature within a suitable range, the dynamic exponent is adjusted based on the spectral attention weights, and the adjustment process is written as
$$\gamma = f(A) = \frac{\sum_{i=1}^{C} A_i^C}{C \times A_{max}^C} + \mathrm{rand}(0,2)$$
where $f(A)$ denotes the mapping from the spectral attention weights to the dynamic exponent, $A_i^C$ denotes the i-th spectral attention weight, and $A_{max}^C$ denotes the maximum spectral attention weight. In addition, the original physical properties of the HSI spectral–spatial features are altered as the dynamic exponent $\gamma$ continues to increase. Therefore, to avoid infinite enlargement of the spectral features, and expecting the feature value x to vary within the interval $(x, x^3)$, $\mathrm{rand}(0,2)$ is added to limit the variation in the dynamic exponent.
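To make the pooling step concrete, the following is a minimal PyTorch sketch of DE-Pooling for the spectral branch, assuming a batched cube of shape (B, C, S, S); the class name, the initial exponent of 2 (Section 4.2), and the way γ is refreshed from the previous iteration's attention weights are illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class DEPooling(nn.Module):
    """Minimal sketch of dynamic exponential pooling (DE-Pooling)."""
    def __init__(self):
        super().__init__()
        self.gamma = 2.0  # initial dynamic exponent (Section 4.2)

    def update_gamma(self, attn):
        # attn: (B, C) spectral attention weights from the previous iteration;
        # gamma = sum(A_i) / (C * A_max) + rand(0, 2), following the rule above
        a_mean = attn.mean(dim=1)
        a_max = attn.max(dim=1).values.clamp(min=1e-6)
        rand = 2.0 * torch.rand(1, device=attn.device)
        self.gamma = (a_mean / a_max + rand).mean().item()

    def forward(self, x):
        # x: (B, C, S, S); enlarge each band by the dynamic exponent to widen the
        # gap between approximate bands, then squeeze with global average pooling
        x = x.abs().pow(self.gamma) * x.sign()
        return x.mean(dim=(2, 3))  # (B, C) band weights
```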

3.1.2. Adapt-Conv

To relate local and global features and improve the interaction efficiency of the band weights, Adapt-Conv utilizes a 1D convolution instead of an MLP and sets an adaptive convolutional kernel to control the range of information interaction [45]. Adapt-Conv for band weight interaction is shown in Figure 2, where the adaptive convolutional kernel size is set to 3. To achieve adaptive adjustment of the kernel size, a mapping is constructed to describe the relationship between the number of bands and the kernel size. Considering that there may also be a proportional mapping between the number of bands and the kernel size, the mapping is written as
$$C = 2^{\theta \times k + b}$$
where the kernel size k, θ, and b are controllable parameters. Since the number of bands C is usually close to a power of 2, the mapping relation is defined as the nonlinear function $2^{\theta \times k + b}$. Thus, for a given number of bands C, the kernel size k is calculated using the inverse function, which is written as
$$k = g(C) = \left| \frac{\log_2 C}{\theta} + \frac{b}{\theta} \right|_{odd}$$
where $g(C)$ denotes the mapping from the number of bands C to the kernel size k. Since the convolutional kernel slides with its center as an anchor point, an odd-sized kernel has a natural center point. Therefore, the odd operation $\left| t \right|_{odd}$ is used in Adapt-Conv, which takes the odd number closest to t.
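As a sketch under stated assumptions, the kernel-size rule and the 1D convolution it configures could look as follows in PyTorch; the default values θ = 2 and b = 1 are assumptions (the paper does not report them), chosen so that 32 selected bands yield the initial kernel size of 3 quoted in Section 4.2.

```python
import math
import torch.nn as nn

def adaptive_kernel_size(num_bands: int, theta: float = 2.0, b: float = 1.0) -> int:
    """Kernel size from the number of bands via the inverse mapping above."""
    t = math.log2(num_bands) / theta + b / theta
    k = int(round(t))
    return k if k % 2 == 1 else k + 1  # |t|_odd: force an odd kernel with a natural center

def make_adapt_conv(num_bands: int) -> nn.Conv1d:
    # Adapt-Conv: a single 1D convolution over the weight vector instead of an MLP,
    # with the adaptive kernel controlling the local interaction range
    k = adaptive_kernel_size(num_bands)
    return nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
```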

3.2. Redundancy Reduction Module (RRM)

RRM converts the reduction in spectral redundancy to a band reconstruction task, i.e., recovering the complete image with a few important bands [40]. Therefore, the band importance is evaluated by the spectral attention mechanism, and the spectral attention weights are updated by Rec-Conv, selecting the bands that are essential for spectral reconstruction to construct the dimension-reduced feature.
As shown in Figure 3, the HSI cube X is fed into the spectral attention mechanism, and DE-Pooling fully considers the differences in spectral features and assigns a unique band weight $w_i$ to each band. The band weights $W^C$ are then conveyed to Adapt-Conv for local and global feature interaction to obtain the final spectral attention map $A^C$. Therefore, the generation of the spectral attention map is written as
$$A^C = \sigma\left(\mathrm{AdaptConv}\left(\mathrm{DEPool}\left(X\right)\right)\right)$$
where σ is the sigmoid activation function, $\mathrm{AdaptConv}$ denotes the adaptive convolution, and $\mathrm{DEPool}$ denotes the dynamic exponential pooling. Furthermore, a band-wise multiplication is applied to create the interaction between the HSI cube and the spectral attention map, yielding the spectral feature mask $M^C = X \otimes A^C$.
Rec-Conv aims to increase the weights of important bands and suppress the representation of redundant bands. The structure consists of two $Conv_{3 \times 3}$ layers and a nearest-interpolation function for recovering the spectral feature mask. The loss between the original and recovered images is then calculated and used to update the band weights. After several iterations, the important bands obtain higher weights. Finally, the top B bands with the highest weights are selected by sorting, and their indexes are conveyed to the FLM.
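Putting the pieces together, a minimal sketch of the RRM built on the DE-Pooling and Adapt-Conv sketches above might look like the following; the strided layout of the two Rec-Conv convolutions, the MSE reconstruction loss, and the batch-averaged band ranking are assumptions about details the text does not pin down.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RRM(nn.Module):
    """Sketch of the redundancy reduction module: spectral attention + Rec-Conv + top-B selection."""
    def __init__(self, num_bands: int, top_b: int = 32):
        super().__init__()
        self.top_b = top_b
        self.de_pool = DEPooling()                      # from the DE-Pooling sketch above
        self.adapt_conv = make_adapt_conv(num_bands)    # from the Adapt-Conv sketch above
        # Rec-Conv: two 3x3 convolutions plus nearest interpolation to recover the cube
        self.rec_conv = nn.Sequential(
            nn.Conv2d(num_bands, num_bands, 3, stride=2, padding=1),
            nn.Conv2d(num_bands, num_bands, 3, padding=1),
        )

    def forward(self, x):                               # x: (B, C, S, S)
        w = self.de_pool(x)                             # band weights W^C, shape (B, C)
        a = torch.sigmoid(self.adapt_conv(w.unsqueeze(1)).squeeze(1))  # spectral attention A^C
        mask = x * a.unsqueeze(-1).unsqueeze(-1)        # spectral feature mask M^C
        recon = F.interpolate(self.rec_conv(mask), size=x.shape[-2:], mode="nearest")
        rec_loss = F.mse_loss(recon, x)                 # reconstruction loss that drives the weights
        idx = a.mean(dim=0).topk(self.top_b).indices    # indexes of the top-B bands
        return x[:, idx], idx, rec_loss                 # dimension-reduced feature + band indexes
```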

3.3. Feature Learning Module (FLM)

In FLM, the spatial attention mechanism emphasizes global spatial features by the trunk branch and pads the details by the residual branch. Then, CFL is applied to extract the contextual semantics by skip connection and multi-scale feature fusion.
As shown in Figure 4, the dimension-reduced feature $\hat{X} \in \mathbb{R}^{B \times S \times S}$ is received by the spatial attention mechanism and allocated to two branches. In the trunk branch, DE-Pooling is utilized to balance the differences in spatial features and assign a spatial attention weight to each pixel; spatial information interaction is then performed by a $Conv_{1 \times 1}+BN+ReLU$ combination to obtain the spatial attention map $A^S$. In the residual branch, the input feature $\hat{X}$ is fed into two $Conv_{3 \times 3}+BN+ReLU$ combinations to extract a shallow feature map $R \in \mathbb{R}^{B/4 \times S \times S}$. Therefore, the generation of the spatial attention map is written as
$$A^S = \sigma\left(F_{1 \times 1}\left(\mathrm{DEPool}\left(\hat{X}\right)\right)\right)$$
where $F_{1 \times 1}$ denotes a convolution operation with a filter size of 1 × 1. A pixel-wise multiplication is applied to create the interaction between the spatial attention map and the dimension-reduced feature, yielding the trunk feature mask $T = \hat{X} \otimes A^S$. The trunk feature mask T and the shallow feature map R are then combined by weighted fusion to obtain the spatial feature mask $M^S = T + \lambda R$.
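A minimal sketch of this double-branch spatial attention is given below using standard PyTorch layers; the channel-wise variant of DE-Pooling (here a fixed exponent followed by a per-pixel average over the bands), the fusion weight λ = 0.5, and the 1 × 1 projection that reconciles the channel counts of T and R before the weighted fusion are assumptions about details the text leaves open.

```python
import torch
import torch.nn as nn

class SpatialAttentionFLM(nn.Module):
    """Sketch of the FLM spatial attention: trunk branch + residual branch + weighted fusion."""
    def __init__(self, bands: int, lam: float = 0.5, gamma: float = 2.0):
        super().__init__()
        self.lam, self.gamma = lam, gamma
        # trunk branch: Conv1x1 + BN + ReLU for spatial information interaction
        self.trunk = nn.Sequential(nn.Conv2d(1, 1, 1), nn.BatchNorm2d(1), nn.ReLU())
        # residual branch: two Conv3x3 + BN + ReLU giving a B/4-channel shallow map R
        self.residual = nn.Sequential(
            nn.Conv2d(bands, bands // 4, 3, padding=1), nn.BatchNorm2d(bands // 4), nn.ReLU(),
            nn.Conv2d(bands // 4, bands // 4, 3, padding=1), nn.BatchNorm2d(bands // 4), nn.ReLU(),
        )
        self.project = nn.Conv2d(bands // 4, bands, 1)  # align R with T for the fusion

    def forward(self, x):                               # x: (B, bands, S, S)
        # spatial DE-Pooling stand-in: exponentiate, then average over the band axis per pixel
        pooled = x.abs().pow(self.gamma).mean(dim=1, keepdim=True)
        a_s = torch.sigmoid(self.trunk(pooled))         # spatial attention map A^S, one weight per pixel
        t = x * a_s                                     # trunk feature mask T
        r = self.project(self.residual(x))              # shallow feature map R, projected back to `bands`
        return t + self.lam * r                         # spatial feature mask M^S = T + lambda * R
```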
CFL following the encoder-decoder structure is mainly applied to extract contextual semantics. Among them, the encoder utilizes several C o n v 3 × 3 + B N + R e L U convolutional combinations for down-sampling and feature learning to obtain abundant semantic information. The decoder utilizes several nearest interpolation functions for up-sampling and spatial feature recovery to obtain the semantic feature map that is of the same size as the original image.
Furthermore, the pseudo label is generated by multi-scale feature fusion. The shallow feature map (feature1) and the deep feature maps (feature2, feature3) extracted by the encoder are selected for feature concatenation. Since the deep feature maps are 1/2 and 1/4 the size of the input image, respectively, they cannot be used directly to create the pseudo label. Therefore, two nearest-neighbor interpolation functions are used to recover the size of the deep feature maps; once the feature maps are of uniform size, they are concatenated using the concat operation. In this way, the obtained feature map preserves object locations similar to those of the input image. The argmax is then computed to obtain the most likely category for each pixel. In general, the pseudo label is a preliminary prediction result that provides the location and category information of ground objects.
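The fusion-and-argmax step can be summarized in a short sketch; the 1 × 1 classification convolution `head` used to turn the concatenated features into category scores is a hypothetical helper, since the paper does not name the layer that precedes the argmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_pseudo_label(feature1, feature2, feature3, head: nn.Conv2d):
    """Sketch of pseudo-label generation by multi-scale feature fusion."""
    size = feature1.shape[-2:]
    # recover the deep feature maps (1/2 and 1/4 of the input size) to the input size
    feature2 = F.interpolate(feature2, size=size, mode="nearest")
    feature3 = F.interpolate(feature3, size=size, mode="nearest")
    fused = torch.cat([feature1, feature2, feature3], dim=1)  # concat the aligned maps
    scores = head(fused)                                      # (B, class, H, W) category scores
    return scores.argmax(dim=1)                               # (B, H, W) per-pixel pseudo label
```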

3.4. Label Prediction Module (LPM)

In LPM, the sensorial attention mechanism is applied to search for the labeled pixels in the semantic feature map, and Trans-Conv is applied to transition the semantic feature map into the classification result. Therefore, the module improves the stability and reliability of label prediction by object position and feature transition.
As shown in Figure 5, the sensorial attention mechanism extracts the location and category information in the pseudo label and guides the model to focus on the locations where objects are likely to be present, so that important details are not ignored by the model. Specifically, the pseudo label is fed to the sensorial attention mechanism, and its row and column elements are extracted using DE-Pooling and Adapt-Conv. Then, the row and column sensorial attention weights are cross-multiplied to generate the complete sensorial attention map $A^L$. Thus, the generation of the sensorial attention map is written as
$$A^L = \sigma\left(\mathrm{AdaptConv}\left(\mathrm{DEPool}\left(Row\right)\right) \times \mathrm{AdaptConv}\left(\mathrm{DEPool}\left(Column\right)\right)\right)$$
where $Row$ and $Column$ indicate the row and column elements of the pseudo label, respectively. A pixel-wise multiplication is then applied to create the interaction between the sensorial attention map and the semantic feature map $\hat{Y}$, yielding the sensorial feature mask $M^L = \hat{Y} \otimes A^L$.
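A minimal sketch of this step, reusing the `make_adapt_conv` helper from the Adapt-Conv sketch above, is shown below; treating the pseudo label as a one-hot map and collapsing it with mean pooling as a stand-in for DE-Pooling over rows and columns is an assumption about how the row and column elements are formed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SensorialAttention(nn.Module):
    """Sketch of the sensorial attention map built from the pseudo label."""
    def __init__(self, num_classes: int, height: int, width: int):
        super().__init__()
        self.num_classes = num_classes
        self.row_conv = make_adapt_conv(height)   # 1D Adapt-Conv over the row weights
        self.col_conv = make_adapt_conv(width)    # 1D Adapt-Conv over the column weights

    def forward(self, pseudo_label, semantic_map):
        # pseudo_label: (B, H, W) integer categories; semantic_map: (B, D, H, W)
        onehot = F.one_hot(pseudo_label, self.num_classes).permute(0, 3, 1, 2).float()
        row = onehot.mean(dim=(1, 3))             # row elements -> (B, H)
        col = onehot.mean(dim=(1, 2))             # column elements -> (B, W)
        row_w = self.row_conv(row.unsqueeze(1))   # (B, 1, H) row sensorial attention weights
        col_w = self.col_conv(col.unsqueeze(1))   # (B, 1, W) column sensorial attention weights
        # cross-multiply the row and column weights into a full H x W map, then squash
        a_l = torch.sigmoid(row_w.transpose(1, 2) @ col_w).unsqueeze(1)  # (B, 1, H, W)
        return semantic_map * a_l                 # sensorial feature mask M^L
```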
Trans-Conv is applied to mitigate the dimensional mutation in the prediction layer and has a flexible design. Its depth is determined by the size of the semantic feature map $\hat{Y}$ and the dimensional reduction coefficient r. An appropriate number of Trans-Conv layers both increases the model depth and retains details for label prediction. Therefore, the mapping between the number of Trans-Conv layers l and the semantic feature dimension D is established as
$$l = \frac{1}{2} \log_r \left( D - class \right)$$
where r denotes the dimensional reduction coefficient. Since the number of channels of the semantic feature map is reduced to $class$, the dimensional distance is $D - class$. Finally, the output of Trans-Conv is used for label prediction to obtain the classification result P.
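The depth rule is simple enough to state in a few lines; the rounding to the nearest integer and the floor of one layer are assumptions about how a fractional l is handled.

```python
import math

def trans_conv_layers(feature_dim: int, num_classes: int, r: int = 2) -> int:
    """Number of Trans-Conv layers from the mapping above, l = 0.5 * log_r(D - class)."""
    return max(1, round(0.5 * math.log(feature_dim - num_classes, r)))

# Example: a 32-channel semantic map and 16 classes give round(0.5 * log2(16)) = 2 layers,
# matching the two-layer setting found to work best in Section 5.3 (r = 2 halves the dimension).
print(trans_conv_layers(32, 16))
```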

3.5. S3AN for HSI Classification

To meet the requirements of HSI classification, S3AN designs RRM, FLM, and LPM based on attention mechanisms with controllable factors to process the original image. Among them, controllable factors and detail processing approaches are utilized to address the problems of “attention escape” and detail loss. A lateral connection is applied to integrate three functional modules, and the interfaces are used to convey feature maps and feedback information between the modules. Therefore, the original image X is transformed by these modules to obtain the classification result P. To train the model and adjust the controllable factors, the cross-entropy is utilized as the loss function, which is written as
$$loss\left(Y, P\right) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log p_{i,j}$$
where Y denotes the ground truth and P denotes the classification result. M and N refer to the number of categories and the number of training samples, respectively. $y_{i,j}$ denotes the sign function, which takes the value 0 or 1, and $p_{i,j}$ denotes the probability that pixel i belongs to category j.
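In PyTorch terms this is the standard pixel-wise cross-entropy; the snippet below is a minimal sketch, and the use of ignore_index for unlabeled background pixels (index 0 here) is an assumed convention rather than something the paper states.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=0)             # skip unlabeled background pixels
logits = torch.randn(1, 16, 145, 145, requires_grad=True)   # (batch, class, H, W) scores from the LPM
ground_truth = torch.randint(0, 16, (1, 145, 145))          # per-pixel labels, 0 = unlabeled
loss = criterion(logits, ground_truth)
loss.backward()   # backpropagation also updates the controllable factors
```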

4. Experimental Results

4.1. Datasets Description

To comprehensively evaluate the performance of the proposed method, three public HSI datasets are used for comparative experiments [46]. Details of the number of samples and dataset division for each category are summarized in Table 1.
  • Indian Pines dataset: The Indian Pines dataset was collected by the AVIRIS imaging spectrometer over the Indian Pines test site in Indiana, USA, with a spatial resolution of 20 m. The image has 200 bands and 145 × 145 pixels and contains 16 different categories of land cover;
  • Salinas dataset: The Salinas dataset was collected by the AVIRIS imaging spectrometer in the Salinas Valley, California, USA, with a spatial resolution of 3.7 m. The image has 204 bands and 512 × 217 pixels and contains 16 different categories of land cover;
  • WHU-Hi-HanChuan dataset: The WHU-Hi-HanChuan dataset was collected by the Headwall Nano Hyperspec imaging spectrometer aboard the drone platform in Hanchuan, Hubei Province, China, with a spatial resolution of about 0.0109 m. The image has 274 bands and 1217 × 303 pixels and contains 16 different categories of land cover.

4.2. Experimental Setup

  • Operation environment: All experiments are based on the PyTorch library and run on Tesla M40 GPUs. The experimental results are the average of 10 independent runs;
  • Evaluation metrics: Five metrics are used to evaluate the performance of HSI classification with respect to classification accuracy and computational efficiency, namely the per-class accuracy, the overall accuracy ($OA$), the average accuracy ($AA$), the $Kappa$ coefficient, and the inference time (a short sketch computing these metrics from a confusion matrix follows this list). Specifically, the evaluation metrics are calculated as follows:
    $$Per\text{-}class\ Accuracy = \frac{TP}{TP + FP}$$
    $$OA = \frac{TP + TN}{TP + TN + FP + FN}$$
    $$AA = \frac{PerAcc_1 + PerAcc_2 + \cdots + PerAcc_N}{N}$$
    $$PE = \frac{(TP + FP)(TP + FN) + (FP + TN)(FN + TN)}{(TP + TN + FP + FN)^2}$$
    $$Kappa = \frac{OA - PE}{1 - PE}$$
    where $TP$ denotes true positives, $FP$ denotes false positives, $TN$ denotes true negatives, and $FN$ denotes false negatives. Note that $PE$ is an intermediate variable in the calculation of $Kappa$, and $PerAcc$ denotes the per-class accuracy. In addition, the inference time denotes the time required by the model to traverse the entire HSI; a shorter inference time indicates that the model is closer to practical application requirements;
  • Parameters setting: For the parameters of the controllable factors, the dynamic exponent is initially set to 2; the adaptive convolution kernel size is initially set to 3; and the number of selected bands is set to 32. In addition, the Adam optimizer trains the model with a learning rate of 0.001, the loss function is cross-entropy, and the number of training epochs is set to 200.
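As referenced in the evaluation-metrics item above, the following is a minimal NumPy sketch of these accuracy metrics computed from a confusion matrix; the convention that rows are ground-truth classes and columns are predictions is an assumption, and the per-class accuracy follows the TP/(TP + FP) definition given above.

```python
import numpy as np

def classification_metrics(confusion: np.ndarray):
    """OA, AA, and Kappa from an N x N confusion matrix (rows: ground truth, columns: predictions)."""
    total = confusion.sum()
    tp = np.diag(confusion)
    per_class = tp / confusion.sum(axis=0)    # TP / (TP + FP) per class
    oa = tp.sum() / total                     # overall accuracy
    aa = per_class.mean()                     # average accuracy
    pe = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)              # chance-corrected agreement
    return per_class, oa, aa, kappa
```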

4.3. Classification Results

S3AN is compared with some state-of-the-art methods for HSI classification, which include HybridSN [13], DBDA [36], SSRN [15], SSFCN [16], CBW [43], CTN [46], and FPGA [17]. The classification results of different methods on three datasets are detailed in Table 2, Table 3 and Table 4, and the classification maps obtained by different methods are illustrated in Figure 6, Figure 7 and Figure 8.

4.3.1. Classification Results on Indian Pines Dataset

Table 2 shows that the CNN-based methods obtain reasonable classification results, with the OA of S3AN and FPGA reaching 98.69% and 96.74%, respectively. This shows that CNN-based methods have advantages in capturing the correlation of spectral–spatial features. As for the attention mechanism-based methods, CBW and CTN achieve OAs of 95.66% and 96.59%, respectively. In addition, SSFCN is able to deal directly with the entire HSI and achieves an OA of 89.75% and an AA of 87.98%. However, for objects with few samples, such as categories 1 and 6, the classification accuracies of SSFCN are only 56.24% and 74.96%. This illustrates that the Indian Pines dataset has imbalanced training samples, and objects with a larger number of samples are beneficial for feature learning. In addition, S3AN introduces the RRM to transform the HSI into dimension-reduced features. Hence, the inference time for the entire image is reduced to 2.36 s, which is much lower than that of SSFCN and CTN.
As shown in Figure 6, there are fewer misclassifications in the classification map of S3AN compared with other methods. For example, in the Soybean-m and Corn-n regions, S3AN shows a good visualization and obtains better classification results than its competitors. Meanwhile, attention mechanism-based methods, such as CTN and CBW, have better visualization performance than SSFCN. However, some misclassification still occurs in DBDA and SSRN because the approximate features are difficult to distinguish. In contrast, S3AN introduces controllable factors to balance the differences in spectral–spatial features and enhance the sensitivity of attention mechanisms. Therefore, it is suitable for recognizing objects with small samples, such as the Corn and Soybean categories.

4.3.2. Classification Results on Salinas Dataset

Table 3 shows that the attention mechanism-based methods achieve better classification results than the CNN-based methods. Among them, the OA of S3AN is improved by about 10% compared to HybridSN and SSFCN, showing that attention mechanism-based methods are able to focus on important objects to improve classification accuracy. Moreover, S3AN introduces the sensorial attention mechanism to search for the labeled pixels and guides the model to focus on the objects of each category simultaneously. Therefore, compared to FPGA, S3AN is more stable in the classification of each category and achieves a 2.11% improvement in AA. Further, the inference time of S3AN for the entire image is 25.09 s, which is about 20 s faster than DBDA and SSRN.
As shown in Figure 7, the classification map of S3AN shows the distribution of different objects more clearly than those of the other methods. This indicates that the attention mechanism with controllable factors plays an active role in feature learning. Although DBDA and SSRN utilize attention mechanisms for HSI classification, their classification maps still contain some regions that are not accurately recognized. For approximate features that are difficult to distinguish, such as the Vineyard and Grape categories, these attention mechanism-based methods still suffer from some misclassification. We attribute this to unsuitable evaluation manners in the attention mechanisms, which lead to inaccurate attention regions and influence the classification results. In contrast, CTN and S3AN are able to accurately locate important objects and obtain classification maps that are close to the ground truth.

4.3.3. Classification Results on HanChuan Dataset

Table 4 shows that the methods with label processing, such as FPGA and S3AN, obtain better classification results than the other methods. Among them, GS2 is introduced into FPGA for label-stratified sampling, with the OA and Kappa coefficient reaching 96.62% and 0.9505, respectively. S3AN processes the pseudo label with the sensorial attention mechanism, which further improves the OA to 97.56%. This indicates that the location and category information of different objects in the pseudo label plays a positive role in HSI classification and contributes to the results of S3AN. Since the HanChuan dataset contains a large number of labeled samples, obtaining classification results requires a long inference time. S3AN performs redundancy reduction before feature learning and completes inference in only 121.09 s, with a much lower time cost than SSRN, SSFCN, and DBDA. In addition, S3AN introduces Trans-Conv to mitigate the dimensional mutation, which provides a ramp for channel transition and retains some details for label prediction; therefore, a better consistency is obtained, with a Kappa coefficient of 0.9769, compared to the other methods.
As shown in Figure 8, HybridSN and CTN show a few misclassifications in the Tree and Roof categories. Since these two categories of objects are in shadow and glare, it is difficult to recognize them directly even with the human eye, which brings challenges to classification. Compared to CTN, S3AN pays more attention to the processing of details and accurately distinguishes the Grass and Watermelon categories. Moreover, in the shadow areas, the classification result of S3AN is better than that of the other methods and has sharper category boundaries for details. This indicates that S3AN is efficient in enhancing the representation of details and is robust enough to recognize objects in shadow areas.

4.3.4. Confusion Matrix

As shown in Figure 9, the confusion matrices on the three datasets are visualized to show the classification ability of S3AN. From Figure 9b, it is seen that on the Indian Pines dataset, a high classification accuracy is obtained for all categories except the Corn category. Notice that S3AN also accurately recognizes the Grass-p category, which contains a small number of samples. For the Salinas dataset, the proposed method shows misclassification for Grape objects due to the extreme similarity of the Grape and Vineyard categories, but it is still able to accurately recognize ground objects in the other categories. From Figure 9c, it is observed that the HanChuan dataset has an imbalance of samples, where most of the samples are distributed in the Strawberry and Water categories, which increases the difficulty of HSI classification; nevertheless, S3AN still obtains satisfactory classification results. Moreover, the proposed method is able to accurately recognize objects with a small percentage of samples, such as the Sorghum, Watermelon, and Bright categories, which suggests that the detail-retention mechanisms in S3AN have a positive effect.

5. Discussion

5.1. Discussion of Controllable Factors

To analyze the influence of the controllable factors, S3AN without controllable factors is set as the baseline, and DE-Pooling and Adapt-Conv are added gradually to observe the change in classification results.
As shown in Table 5, the OA of the baseline is lower on all three datasets, suggesting that the spectral-spatial-sensorial attention mechanism without controllable factors struggles to distinguish continuous and approximate features, resulting in inaccurate delineation of attention regions. When DE-Pooling is added to the attention mechanisms, the OA increases from 65.03% to 96.18% on the Salinas dataset. This shows that DE-Pooling significantly improves the sensitivity of the attention mechanisms and controls the updating of attention weights to balance the differences in spectral–spatial features. When DE-Pooling and Adapt-Conv are added simultaneously, the classification results improve further. In particular, the OA reaches 98.41% on the Indian Pines dataset, which indicates that boosting the weight interaction enhances the sensitivity of the attention mechanism and improves the classification result.

5.2. Discussion of Sensorial Attention Mechanism

The sensorial attention mechanism mainly contributes to positioning the labeled pixels and emphasizing the details in the semantic feature map [47]. To verify its effectiveness, experiments are conducted based on S3AN with the presence or absence of the sensorial attention mechanism as the variable.
As shown in Figure 10, Figure 11 and Figure 12, in the semantic feature map without the sensorial attention mechanism, few pixels are highlighted as attention regions, and the difference between adjacent features is insufficient. In contrast, the semantic feature map guided by the sensorial attention mechanism expresses the important object areas and emphasizes details within a few pixels. Note that in the areas marked by the red boxes, the sensorial attention mechanism significantly highlights the labeled pixels and distinguishes approximate features with different degrees of attention. Specifically, for the HanChuan dataset, the sensorial attention mechanism also focuses on the Roof and Tree areas in the shadow. Although the shadow areas affect the representation of spatial features and increase the difficulty of distinguishing approximate features, the sensorial attention mechanism still accurately positions the objects based on the category information of the pseudo label. In addition, with the emphasis of the sensorial attention mechanism, the delineation of attention regions in the semantic feature map is close to the real situation. Therefore, the experimental results demonstrate that the sensorial attention mechanism is effective in emphasizing the details of semantic feature maps and is well suited to HSI classification.

5.3. Discussion of Trans-Conv Layers

Trans-Conv mitigates the dimensional mutation by adding convolutional layers in the prediction layer to achieve the transition of details. To analyze the effect of depth for Trans-Conv on the classification result, different numbers of Trans-Conv layers are added to the state-of-the-art methods, and the O A variations are observed to determine the appropriate depth of Trans-Conv.
As shown in Figure 13, for the Indian Pines dataset, there is an additional improvement of about 1% in OA for HybridSN, SSFCN, and S3AN when using only one Trans-Conv layer. The OA of S3AN reaches about 97% with the addition of two Trans-Conv layers, because the convolutional layers further extract details while transforming the feature dimensions. However, when the number of Trans-Conv layers is set to three, the OA decreases rapidly; inappropriate Trans-Conv layers change the abstract semantic information and weaken the representation of details. Moreover, for the Salinas and HanChuan datasets with two Trans-Conv layers, the OA of the different methods reaches about 96%. Appropriate Trans-Conv layers that gradually decrease the dimension are able to retain details and contribute to improving the classification accuracy. Therefore, the dimensional reduction coefficient r is set to two, which means the dimension of the feature map decays by half.

5.4. Discussion of Selected Bands

RRM selects the important bands to construct dimension-reduced features based on the spectral attention weights. To analyze the effect of the number of selected bands on the classification results, experiments are conducted with different numbers of selected bands, and the variation in classification accuracy is observed.
As shown in Table 6, when the number of selected bands is 8, the OA of S3AN on the three datasets is limited to about 60% because the important bands are not fully selected for feature learning. The AA increases significantly from 60% to about 80% when the number of selected bands is set to 16. Further, the OA reaches about 97% and the classification results gradually stabilize when the number of selected bands is increased to 36. Figure 14 illustrates the trend of the classification results for different numbers of selected bands. The variation shows that an insufficient number of selected bands makes it difficult to obtain a satisfactory OA, while too many bands increase the inference time. Therefore, an appropriate number of selected bands is beneficial for model convergence and for improving the speed of inference. Moreover, some continuous bands are selected by RRM, since the physical characteristics of the objects are preserved in these bands, which have better feature representation. S3AN applies redundancy reduction as pre-processing to improve the speed of inference while sacrificing as little OA as possible, so that the inference times on the three datasets are reduced to 2.95 s, 25.59 s, and 120.55 s, respectively.

6. Conclusions

In this paper, an effective S3AN is proposed for HSI classification. Driven by controllable factors (DE-Pooling and Adapt-Conv), attention mechanisms are able to distinguish differences in approximate spectral–spatial features and to generate more reliable regions of interest. To reduce the computational cost, the controllable spectral attention mechanism accurately highlights representative bands in the HSI and reduces spectral redundancy. The controllable spatial attention mechanism cooperates with cross-layer feature learning to automatically extract local contextual semantics, and enhances the ability of deep and shallow feature interaction. In addition, the controllable sensorial attention mechanism explores the location and category information of ground objects, which further enhances the HSI classification results. The experimental results on three public HSI datasets show that the proposed method enables fast and accurate HSI classification.
The experimental results show that the proposed controllable attention mechanisms are adaptable to the complex feature environment of HSI. However, all results are obtained on labeled datasets, which require a great deal of time for labeling. In contrast, producing unlabeled datasets reduces the workload, and it would be interesting to explore self-supervised HSI classification in future work.

Author Contributions

Conceptualization, methodology, software, validation, writing—original draft, S.L. Supervision, methodology, investigation, validation, funding acquisition, resources, writing—review & editing, M.W. Investigation, validation, data curation, C.C. Supervision, data curation, X.G. Supervision, writing—review & editing, Z.Y. Supervision, investigation, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Key Laboratory of Intelligent Health Perception and Ecological Restoration of Rivers and Lakes, Ministry of Education, Hubei University of Technology under Grant No. HGKFZP014, the National Natural Science Foundation of China under Grant No. 41901296, and the Hubei University of Technology Research and Innovation Program No. 21067.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, L.; Li, J.; Liu, C.; Li, S. Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines. IEEE Trans. Geosci. Remote Sens. 2018, 56, 1579–1597. [Google Scholar] [CrossRef]
  2. Paoletti, M.; Haut, J.; Plaza, J.; Plaza, A. Deep learning classifiers for hyperspectral imaging: A review. ISPRS J. Photogramm. Remote Sens. 2019, 158, 279–317. [Google Scholar] [CrossRef]
  3. Dong, Y.; Liang, T.; Zhang, Y.; Du, B. Spectral–spatial weighted kernel manifold embedded distribution alignment for remote sensing image classification. IEEE Trans. Cybern. 2021, 51, 3185–3197. [Google Scholar] [CrossRef] [PubMed]
  4. Zhou, Y.; Peng, J.; Chen, C. Dimension reduction using spatial and spectral regularized local discriminant embedding for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2014, 53, 1082–1095. [Google Scholar] [CrossRef]
  5. Zhou, Y.; Wei, Y. Learning hierarchical spectral–spatial features for hyperspectral image classification. IEEE Trans. Cybern. 2016, 46, 1667–1678. [Google Scholar] [CrossRef] [PubMed]
  6. Zhang, F.; Du, B.; Zhang, L.; Zhang, L. Hierarchical feature learning with dropout k-means for hyperspectral image classification. Neurocomputing 2016, 187, 75–82. [Google Scholar] [CrossRef]
  7. Wang, M.; Wu, C.; Wang, L.; Xiang, D.; Huang, X. A feature selection approach for hyperspectral image based on modified ant lion optimizer. Knowl. Based Syst. 2019, 168, 39–48. [Google Scholar] [CrossRef]
  8. Xia, J.; Ghamisi, P.; Yokoya, N.; Iwasaki, A. Random forest ensembles and extended multiextinction profiles for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 202–216. [Google Scholar] [CrossRef]
  9. Wang, M.; Yan, Z.; Luo, J.; Ye, Z.; He, P. A band selection approach based on wavelet support vector machine ensemble model and membrane whale optimization algorithm for hyperspectral image. Appl. Intell. 2021, 51, 7766–7780. [Google Scholar] [CrossRef]
  10. Fang, L.; Li, S.; Kang, X.; Benediktsson, J. Spectral–spatial hyperspectral image classification via multiscale adaptive sparse representation. IEEE Trans. Geosci. Remote Sens. 2014, 52, 7738–7749. [Google Scholar] [CrossRef]
  11. Zhou, H.; Zhang, X.; Zhang, C.; Ma, Q. Quaternion convolutional neural networks for hyperspectral image classification. Eng. Appl. Artif. Intell. 2023, 123, 106234. [Google Scholar] [CrossRef]
  12. Zhao, Z.; Hu, D.; Wang, H.; Yu, X. Convolutional transformer network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 19, 6009005. [Google Scholar] [CrossRef]
  13. Roy, S.; Krishna, G.; Dubey, S.; Chaudhuri, B. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2019, 17, 277–281. [Google Scholar] [CrossRef]
  14. Dalal, A.; Cai, Z.; Al-Qaness, M.; Dahou, A.; Alawamy, E.; Issaka, S. Compression and reinforce variation with convolutional neural networks for hyperspectral image classification. Appl. Soft Comput. 2022, 130, 109650. [Google Scholar]
  15. Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–spatial residual network for hyperspectral image classification: A 3-D Deep Learning Framework. IEEE Trans. Geosci. Remote Sens. 2018, 56, 847–858. [Google Scholar] [CrossRef]
  16. Xu, Y.; Du, B.; Zhang, L. Beyond the patchwise classification: Spectral-spatial fully convolutional networks for hyperspectral image classification. IEEE Trans. Big Data 2020, 6, 492–506. [Google Scholar] [CrossRef]
  17. Zheng, Z.; Zhong, Y.; Ma, A.; Zhang, L. FPGA: Fast patch-free global learning framework for fully end-to-end hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 5612–5626. [Google Scholar] [CrossRef]
  18. Yang, X.; Cao, W.; Lu, Y.; Zhou, Y. Hyperspectral image transformer classification networks. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5528715. [Google Scholar] [CrossRef]
  19. Huang, Z.; Wang, X.; Wei, Y.; Huang, L.; Shi, H.; Liu, W.; Huang, T. Criss-cross attention for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6896–6908. [Google Scholar] [CrossRef] [PubMed]
  20. Wang, X.; Zhu, J.; Feng, Y.; Wang, L. MS2CANet: Multiscale spatial–spectral cross-modal attention network for hyperspectral image and LiDAR classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5501505. [Google Scholar] [CrossRef]
  21. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed]
  22. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  23. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  24. Shi, C.; Wu, H.; Wang, L. A feature complementary attention network based on adaptive knowledge filtering for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5527219. [Google Scholar] [CrossRef]
  25. Xing, C.; Duan, C.; Wang, Z.; Wang, M. Binary feature learning with local spectral context-aware attention for classification of hyperspectral images. Pattern Recognit. 2023, 134, 109123. [Google Scholar] [CrossRef]
  26. Zhao, Z.; Wang, H.; Yu, X. Spectral-spatial graph attention network for semisupervised hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2021, 19, 5503905. [Google Scholar] [CrossRef]
  27. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  28. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164. [Google Scholar]
  29. Wang, W.; Liu, F.; Liu, J.; Xiao, L. Cross-domain few-shot hyperspectral image classification with class-wise attention. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5502418. [Google Scholar] [CrossRef]
  30. Roy, S.; Deria, A.; Shah, C.; Haut, J.; Du, Q.; Plaza, A. Spectral–spatial morphological attention transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5503615. [Google Scholar] [CrossRef]
  31. Zhao, F.; Li, S.; Zhang, J.; Liu, H. Convolution transformer fusion splicing network for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5501005. [Google Scholar] [CrossRef]
  32. Mou, L.; Ghamisi, P.; Zhu, X. Unsupervised spectral–spatial feature learning via deep residual Conv–Deconv network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2017, 56, 391–406. [Google Scholar] [CrossRef]
  33. Lin, S.; Zhang, M.; Cheng, X.; Shi, L.; Gamba, P.; Wang, H. Dynamic low-rank and sparse priors constrained deep autoencoders for hyperspectral anomaly detection. IEEE Trans. Instrum. Meas. 2024, 73, 2500518. [Google Scholar] [CrossRef]
  34. Pan, B.; Xu, X.; Shi, Z.; Zhang, N.; Luo, H.; Lan, X. DSSNet: A simple dilated semantic segmentation network for hyperspectral imagery classification. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1968–1972. [Google Scholar] [CrossRef]
  35. Pu, C.; Huang, H.; Yang, L. An attention-driven convolutional neural network-based multi-level spectral–spatial feature learning for hyperspectral image classification. Expert Syst. Appl. 2021, 185, 115663. [Google Scholar] [CrossRef]
  36. Li, R.; Zheng, S.; Duan, C.; Yang, Y.; Wang, X. Classification of hyperspectral image based on double-branch dual-attention mechanism network. Remote Sens. 2020, 12, 582. [Google Scholar] [CrossRef]
  37. Zhu, M.; Jiao, L.; Liu, F.; Yang, S.; Wang, J. Residual spectral–spatial attention network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 449–462. [Google Scholar] [CrossRef]
  38. Sun, H.; Zheng, X.; Lu, X.; Wu, S. Spectral–spatial attention network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 58, 3232–3245. [Google Scholar] [CrossRef]
  39. Cheng, X.; Zhang, M.; Lin, S.; Li, Y.; Wang, H. Deep self-representation learning framework for hyperspectral anomaly detection. IEEE Trans. Instrum. Meas. 2023, 73, 5002016. [Google Scholar] [CrossRef]
  40. Cai, Y.; Liu, X.; Cai, Z. BS-Nets: An end-to-end framework for band selection of hyperspectral image. IEEE Trans. Geosci. Remote Sens. 2019, 58, 1969–1984. [Google Scholar] [CrossRef]
  41. Nandi, U.; Roy, S.; Hong, D.; Wu, X.; Chanussot, J. TAttMSRecNet: Triplet-attention and multiscale reconstruction network for band selection in hyperspectral images. Expert Syst. Appl. 2023, 212, 118797. [Google Scholar] [CrossRef]
  42. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [Google Scholar] [CrossRef]
  43. Zhao, L.; Yi, J.; Li, X.; Hu, W.; Wu, J.; Zhang, G. Compact band weighting module based on attention-driven for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9540–9552. [Google Scholar] [CrossRef]
  44. Lee, H.; Kwon, H. Going deeper with contextual CNN for hyperspectral image classification. IEEE Trans. Image Process. 2017, 26, 4843–4855. [Google Scholar] [CrossRef] [PubMed]
  45. Gao, M.; Qian, P. Exponential linear units-guided Depthwise separable convolution network with cross attention mechanism for hyperspectral image classification. Signal Process. 2023, 210, 108995. [Google Scholar] [CrossRef]
  46. Yang, H.; Yu, H.; Zheng, K.; Hu, J.; Tao, T.; Zhang, Q. Hyperspectral image classification based on interactive transformer and CNN with multilevel feature fusion network. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5507905. [Google Scholar] [CrossRef]
  47. Pande, S.; Banerjee, B. Adaptive hybrid attention network for hyperspectral image classification. Pattern Recognit. Lett. 2021, 144, 6–12. [Google Scholar]
Figure 1. The overall architecture of S3AN. S3AN is mainly divided into three modules, i.e., RRM, FLM, and LPM. RRM selects important bands for redundancy reduction; FLM extracts contextual semantics for feature learning; LPM positions global objects for label prediction.
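To make the data flow in Figure 1 concrete, a minimal PyTorch-style sketch of the RRM → FLM → LPM pipeline is given below. The three stages are reduced to trivial placeholder layers, and the shapes (a 200-band patch, 16 classes, per-pixel logits) are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class S3ANSketch(nn.Module):
    """Minimal sketch of the Figure 1 pipeline (RRM -> FLM -> LPM).
    Each stage is a trivial stand-in for the corresponding module."""

    def __init__(self, in_bands=200, selected_bands=32, feat_dim=64, num_classes=16):
        super().__init__()
        # RRM: reduce the spectral dimension (placeholder: 1x1 conv instead of band selection)
        self.rrm = nn.Conv2d(in_bands, selected_bands, kernel_size=1)
        # FLM: learn spatial-semantic features (placeholder: a small conv block)
        self.flm = nn.Sequential(
            nn.Conv2d(selected_bands, feat_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(feat_dim),
            nn.ReLU(inplace=True),
        )
        # LPM: predict a label score per pixel (placeholder: 1x1 conv classifier)
        self.lpm = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

    def forward(self, x):
        x_hat = self.rrm(x)      # dimension-reduced cube
        y_hat = self.flm(x_hat)  # semantic feature map
        return self.lpm(y_hat)   # per-pixel class logits

# Usage: a fake 200-band patch, as in the Indian Pines setting.
logits = S3ANSketch()(torch.randn(2, 200, 9, 9))
print(logits.shape)  # torch.Size([2, 16, 9, 9])
```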
Figure 2. The details of the controllable factors. DE-Pooling balances the differences in approximate features, and Adapt-Conv enhances the interaction efficiency of the feature weights.
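The exact formulations of DE-Pooling and Adapt-Conv are defined in the Method section; the sketch below shows one plausible reading of Figure 2: a power-mean pooling with a learnable exponent, and an ECA-style 1-D convolution whose kernel size adapts to the channel count. Both the exponent and the kernel-size rule are assumptions made for illustration only.

```python
import math
import torch
import torch.nn as nn

class DEPooling(nn.Module):
    """Hedged sketch of dynamic exponential pooling: a power-mean pool with a
    learnable exponent p, so a larger p exaggerates small gaps between similar
    activations. The paper's exact formulation may differ."""

    def __init__(self, p_init=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p_init))
        self.eps = eps

    def forward(self, x):                                 # x: (B, C, H, W)
        x = x.clamp(min=self.eps).pow(self.p)
        return x.mean(dim=(2, 3)).pow(1.0 / self.p)       # (B, C) band descriptors

class AdaptConv(nn.Module):
    """Hedged sketch of adaptive convolution: a 1-D convolution over channel
    descriptors whose kernel size grows with the number of channels (an
    ECA-style rule assumed here, not taken from the paper)."""

    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))
        k = k if k % 2 else k + 1                         # force an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, w):                                 # w: (B, C) descriptors
        w = self.conv(w.unsqueeze(1)).squeeze(1)
        return torch.sigmoid(w)                           # attention weights in (0, 1)

# Usage: weights for 32 spectral descriptors.
w = AdaptConv(32)(DEPooling()(torch.randn(2, 32, 9, 9)))
print(w.shape)  # torch.Size([2, 32])
```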
Figure 3. The details of RRM, where X denotes the HSI cubes, W_C denotes the band weights, A_C denotes the spectral attention map, and M_C denotes the spectral feature mask.
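A hedged sketch of the band-weighting flow in Figure 3 (X to W_C to A_C to M_C, followed by band selection) is shown below; the pooling and scoring layers are placeholders for illustration, not the layers used in the paper.

```python
import torch
import torch.nn as nn

class RRMSketch(nn.Module):
    """Sketch of the RRM in Figure 3: pool the cube X to per-band descriptors
    W_C, turn them into a spectral attention map A_C, mask the input (M_C),
    and keep the top-k bands. Average pooling and a linear scorer are assumed."""

    def __init__(self, in_bands, keep_bands):
        super().__init__()
        self.keep_bands = keep_bands
        self.score = nn.Sequential(nn.Linear(in_bands, in_bands), nn.Sigmoid())

    def forward(self, x):                                  # x: (B, C, H, W)
        w_c = x.mean(dim=(2, 3))                           # band descriptors   (B, C)
        a_c = self.score(w_c)                              # spectral attention (B, C)
        m_c = x * a_c[:, :, None, None]                    # spectral feature mask
        idx = a_c.mean(0).topk(self.keep_bands).indices    # representative bands
        return m_c[:, idx], idx

x_hat, bands = RRMSketch(200, 32)(torch.randn(2, 200, 9, 9))
print(x_hat.shape, bands.shape)  # torch.Size([2, 32, 9, 9]) torch.Size([32])
```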
Figure 4. The details of FLM, where X̂ denotes the dimension-reduced feature, A_S denotes the spatial attention map, M_S denotes the spatial feature mask, and Ŷ denotes the semantic feature map.
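The spatial branch in Figure 4 can be sketched in the same spirit: a spatial attention map A_S is estimated from X̂, used to mask the features (M_S), and a small convolutional block produces the semantic feature map Ŷ. The mean/max pooling and the 7 × 7 scoring convolution below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FLMSketch(nn.Module):
    """Sketch of the FLM in Figure 4: score spatial positions from the
    dimension-reduced cube X_hat, mask the features (M_S), and extract a
    semantic feature map Y_hat. Layer choices are illustrative only."""

    def __init__(self, in_bands, feat_dim=64):
        super().__init__()
        self.spatial_score = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.features = nn.Sequential(
            nn.Conv2d(in_bands, feat_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(feat_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x_hat):                                # (B, C, H, W)
        pooled = torch.cat([x_hat.mean(1, keepdim=True),
                            x_hat.amax(1, keepdim=True)], dim=1)
        a_s = torch.sigmoid(self.spatial_score(pooled))      # spatial attention map
        m_s = x_hat * a_s                                    # spatial feature mask
        return self.features(m_s)                            # semantic feature map

y_hat = FLMSketch(32)(torch.randn(2, 32, 9, 9))
print(y_hat.shape)  # torch.Size([2, 64, 9, 9])
```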
Figure 5. The details of LPM, where row denotes the row sensorial attention weights, column denotes the column sensorial attention weights, A_L denotes the sensorial attention map, and M_L denotes the sensorial feature mask.
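The row/column weighting in Figure 5 resembles a coordinate-style attention. The sketch below pools the semantic feature map along each axis, combines the two directional weights into A_L, masks the features (M_L), and predicts per-pixel labels; the specific pooling and classifier layers are assumed, not taken from the paper.

```python
import torch
import torch.nn as nn

class LPMSketch(nn.Module):
    """Sketch of the LPM in Figure 5: row and column sensorial attention
    weights are combined into A_L, applied as a mask M_L, and a 1x1
    classifier produces per-pixel label scores."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.row_fc = nn.Conv2d(feat_dim, feat_dim, kernel_size=1)
        self.col_fc = nn.Conv2d(feat_dim, feat_dim, kernel_size=1)
        self.classifier = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

    def forward(self, y_hat):                                            # (B, C, H, W)
        row = torch.sigmoid(self.row_fc(y_hat.mean(3, keepdim=True)))    # (B, C, H, 1)
        col = torch.sigmoid(self.col_fc(y_hat.mean(2, keepdim=True)))    # (B, C, 1, W)
        a_l = row * col                                                  # sensorial attention map
        m_l = y_hat * a_l                                                # sensorial feature mask
        return self.classifier(m_l)                                      # per-pixel logits

logits = LPMSketch(64, 16)(torch.randn(2, 64, 9, 9))
print(logits.shape)  # torch.Size([2, 16, 9, 9])
```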
Figure 6. Classification maps of the different methods on the Indian Pines dataset.
Figure 7. Classification maps of the different methods on the Salinas dataset.
Figure 8. Classification maps of the different methods on the HanChuan dataset.
Figure 9. Visualization of the confusion matrices of S3AN on the three datasets.
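The overall accuracy (OA), average accuracy (AA), and kappa values reported in Tables 2–4 follow directly from confusion matrices such as those in Figure 9; a short sketch of the standard computation is given below.

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA, and Cohen's kappa from a confusion matrix
    (rows = reference classes, columns = predicted classes)."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                              # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))           # mean per-class accuracy
    pe = np.sum(conf.sum(axis=0) * conf.sum(axis=1)) / total ** 2
    kappa = (oa - pe) / (1 - pe)                             # chance-corrected agreement
    return 100 * oa, 100 * aa, kappa

# Toy 2-class example: 95 + 90 correct out of 200 samples.
print(classification_metrics([[95, 5], [10, 90]]))  # approx. (92.5, 92.5, 0.85)
```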
Figure 10. Visualization of attention regions for semantic feature map on Indian Pines dataset: (a) false-color image, (b) ground truth, (c) visualization of feature map without sensorial attention mechanism, (d) visualization of feature map with sensorial attention mechanism. Blue indicates lower attention values and red indicates higher attention values.
Figure 11. Visualization of attention regions for semantic feature map on Salinas dataset: (a) false-color image, (b) ground truth, (c) visualization of feature map without sensorial attention mechanism, (d) visualization of feature map with sensorial attention mechanism.
Figure 12. Visualization of attention regions for semantic feature map on HanChuan dataset: (a) false-color image, (b) ground truth, (c) visualization of feature map without sensorial attention mechanism, (d) visualization of feature map with sensorial attention mechanism.
Figure 13. The OA of state-of-the-art methods with different numbers of Trans-Conv layers on the three datasets.
Figure 14. The classification results of different numbers of selected bands on three datasets.
Table 1. Number of train samples and test samples for the Indian Pines, Salinas, and HanChuan datasets.

Class | Indian Pines (Name / Train / Test) | Salinas (Name / Train / Test) | HanChuan (Name / Train / Test)
1 | Alfalfa / 5 / 41 | Brocoli-1 / 201 / 1808 | Strawberry / 4473 / 40,262
2 | Corn-n / 143 / 1285 | Brocoli-2 / 373 / 3353 | Cowpea / 2275 / 20,478
3 | Corn-m / 83 / 747 | Fallow / 198 / 1778 | Soybean / 1029 / 9258
4 | Corn / 24 / 213 | Fallow-r / 139 / 1255 | Sorghum / 535 / 4818
5 | Grass-p / 48 / 435 | Fallow-s / 268 / 2410 | Water-s / 120 / 1080
6 | Grass-t / 73 / 657 | Stubble / 396 / 3563 | Watermelon / 453 / 4080
7 | Grass-m / 3 / 25 | Celery / 358 / 3221 | Greens / 590 / 5313
8 | Hay-w / 48 / 430 | Graps-u / 1127 / 10,144 | Trees / 1798 / 16,180
9 | Oats / 2 / 18 | Soil-v-d / 620 / 5583 | Grass / 947 / 8522
10 | Soy-n / 97 / 875 | Corn-w / 328 / 2950 | Red roof / 1052 / 9464
11 | Soy-m / 245 / 2210 | Lettuce-4 / 107 / 961 | Gray roof / 1691 / 15,220
12 | Soy-c / 59 / 534 | Lettuce-5 / 193 / 1734 | Plastic / 368 / 3311
13 | Wheat / 20 / 185 | Lettuce-6 / 92 / 824 | Bare soil / 912 / 8204
14 | Woods / 126 / 1139 | Lettuce-7 / 107 / 963 | Road / 1856 / 16,704
15 | Buildings / 39 / 347 | Vinyardu / 727 / 6541 | Bright-o / 114 / 1022
16 | Stone-s / 9 / 84 | Vinyardv / 181 / 1626 | Water / 7540 / 67,861
Total | 924 / 9325 | 5415 / 48,714 | 25,753 / 257,530
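The train counts in Table 1 are close to 10% of each class (e.g., 5 of 46 Alfalfa samples and 143 of 1428 Corn-n samples). A per-class split at an assumed 10% ratio, sketched below, reproduces these counts; the actual sampling protocol is the one described in the experimental setup and may differ.

```python
import numpy as np

def stratified_split(labels, train_ratio=0.1, seed=0):
    """Per-class random split at a fixed ratio (10% assumed for illustration)."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = max(1, round(train_ratio * idx.size))   # at least one training sample
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# Toy labels sized like Alfalfa (46 samples) and Corn-n (1428 samples).
labels = np.array([0] * 46 + [1] * 1428)
tr, te = stratified_split(labels)
print(len(tr), len(te))  # 148 1326  (5 + 143 train samples, as in Table 1)
```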
Table 2. Classification results by different methods on the Indian Pines dataset.

Class | HybridSN | DBDA | SSRN | SSFCN | CBW | CTN | FPGA | S3AN
1 | 90.25 ± 0.86 | 95.15 ± 0.20 | 89.07 ± 0.95 | 56.24 ± 2.60 | 93.33 ± 1.64 | 95.23 ± 0.53 | 92.37 ± 0.15 | 99.16 ± 0.42
2 | 87.74 ± 0.29 | 93.76 ± 1.15 | 93.59 ± 0.63 | 89.65 ± 0.75 | 94.71 ± 0.85 | 94.28 ± 0.20 | 96.55 ± 0.12 | 91.55 ± 0.15
3 | 89.19 ± 0.92 | 89.61 ± 0.17 | 92.25 ± 0.99 | 95.45 ± 0.44 | 97.24 ± 0.23 | 94.61 ± 1.41 | 92.38 ± 0.37 | 96.33 ± 0.33
4 | 85.15 ± 1.94 | 92.89 ± 0.42 | 90.53 ± 1.36 | 92.62 ± 0.68 | 99.03 ± 0.07 | 98.14 ± 0.16 | 96.85 ± 0.25 | 98.18 ± 0.48
5 | 91.18 ± 0.08 | 94.66 ± 1.94 | 94.31 ± 2.56 | 95.44 ± 1.25 | 89.61 ± 1.21 | 98.62 ± 0.27 | 94.26 ± 0.31 | 97.75 ± 0.07
6 | 93.53 ± 0.25 | 96.45 ± 0.76 | 89.75 ± 0.72 | 74.96 ± 1.92 | 95.75 ± 2.91 | 97.60 ± 0.32 | 95.08 ± 1.25 | 99.60 ± 0.36
7 | 89.29 ± 0.88 | 95.33 ± 1.13 | 93.09 ± 1.02 | 85.66 ± 0.96 | 99.62 ± 0.15 | 96.15 ± 0.21 | 98.33 ± 0.88 | 95.89 ± 0.20
8 | 84.41 ± 0.64 | 97.01 ± 0.46 | 95.66 ± 1.79 | 93.50 ± 0.45 | 99.89 ± 0.04 | 99.30 ± 0.06 | 98.74 ± 0.65 | 97.03 ± 0.15
9 | 90.96 ± 1.12 | 95.62 ± 0.21 | 91.15 ± 0.56 | 91.75 ± 2.21 | 99.33 ± 0.21 | 89.99 ± 2.49 | 99.51 ± 0.42 | 96.96 ± 0.26
10 | 95.54 ± 1.28 | 92.74 ± 0.85 | 90.69 ± 0.21 | 84.85 ± 0.95 | 91.69 ± 1.35 | 99.18 ± 0.25 | 90.52 ± 1.13 | 99.51 ± 0.12
11 | 91.78 ± 2.21 | 94.11 ± 1.15 | 88.26 ± 2.24 | 89.15 ± 0.41 | 94.28 ± 0.84 | 96.83 ± 0.30 | 93.66 ± 0.16 | 95.45 ± 1.65
12 | 92.76 ± 0.86 | 96.35 ± 0.06 | 97.75 ± 1.09 | 93.96 ± 0.50 | 97.95 ± 0.39 | 98.50 ± 0.33 | 91.77 ± 3.71 | 96.33 ± 3.74
13 | 95.10 ± 0.42 | 94.18 ± 0.51 | 90.33 ± 0.57 | 89.31 ± 0.61 | 99.49 ± 0.28 | 99.45 ± 0.15 | 96.05 ± 0.85 | 98.45 ± 0.58
14 | 69.89 ± 2.98 | 89.72 ± 0.29 | 91.51 ± 0.66 | 91.99 ± 1.15 | 99.57 ± 0.04 | 99.21 ± 0.38 | 93.33 ± 2.59 | 99.76 ± 0.23
15 | 95.41 ± 0.68 | 88.96 ± 1.86 | 89.36 ± 0.30 | 92.25 ± 1.18 | 98.55 ± 0.16 | 98.86 ± 0.04 | 95.00 ± 0.23 | 98.50 ± 0.95
16 | 90.17 ± 0.11 | 95.99 ± 0.58 | 96.95 ± 0.29 | 90.96 ± 0.86 | 96.34 ± 0.31 | 94.99 ± 2.75 | 96.15 ± 0.58 | 96.88 ± 2.21
OA (%) | 90.85 ± 0.15 | 95.29 ± 0.27 | 93.63 ± 0.05 | 89.75 ± 0.54 | 95.66 ± 0.24 | 96.59 ± 0.34 | 96.74 ± 0.17 | 98.69 ± 0.13
AA (%) | 89.52 ± 0.42 | 93.91 ± 0.21 | 92.14 ± 0.73 | 87.98 ± 0.42 | 96.64 ± 0.50 | 96.93 ± 0.20 | 95.03 ± 0.20 | 97.33 ± 0.45
Kappa | 0.9095 ± 0.004 | 0.9431 ± 0.002 | 0.9308 ± 0.003 | 0.8585 ± 0.002 | 0.9505 ± 0.002 | 0.9559 ± 0.004 | 0.9524 ± 0.004 | 0.9841 ± 0.002
Time (s) | 8.78 ± 2.04 | 6.51 ± 1.03 | 7.62 ± 1.85 | 19.35 ± 2.55 | 3.58 ± 0.87 | 12.26 ± 1.46 | 5.4 ± 1.27 | 2.36 ± 0.99
Note that the values in bold are the highest.
Table 3. Classification results by different methods on the Salinas dataset.

Class | HybridSN | DBDA | SSRN | SSFCN | CBW | CTN | FPGA | S3AN
1 | 99.70 ± 0.04 | 94.89 ± 0.59 | 99.01 ± 0.16 | 63.82 ± 2.30 | 98.66 ± 0.05 | 99.50 ± 0.21 | 95.99 ± 0.54 | 99.32 ± 0.40
2 | 95.85 ± 0.12 | 98.22 ± 0.72 | 71.74 ± 3.56 | 99.55 ± 0.42 | 99.85 ± 0.11 | 94.96 ± 0.85 | 99.49 ± 0.36 | 98.75 ± 0.38
3 | 93.34 ± 0.37 | 92.85 ± 1.31 | 99.48 ± 0.12 | 96.74 ± 0.56 | 98.80 ± 0.25 | 99.39 ± 0.14 | 99.16 ± 0.60 | 99.93 ± 0.06
4 | 75.66 ± 2.64 | 95.89 ± 0.32 | 99.13 ± 0.14 | 87.99 ± 1.12 | 98.58 ± 0.60 | 97.71 ± 0.60 | 86.43 ± 1.59 | 99.51 ± 0.18
5 | 86.84 ± 1.95 | 87.26 ± 2.27 | 85.16 ± 1.57 | 86.56 ± 2.37 | 96.89 ± 0.32 | 97.47 ± 0.59 | 95.62 ± 0.55 | 96.65 ± 1.65
6 | 85.29 ± 0.62 | 83.79 ± 0.30 | 84.10 ± 2.92 | 90.55 ± 0.85 | 87.28 ± 1.74 | 99.27 ± 0.33 | 94.97 ± 0.20 | 95.74 ± 0.33
7 | 90.34 ± 0.58 | 85.75 ± 0.49 | 98.92 ± 0.53 | 90.24 ± 0.95 | 99.64 ± 0.03 | 87.38 ± 0.89 | 95.79 ± 0.56 | 98.41 ± 0.40
8 | 90.17 ± 0.62 | 86.11 ± 1.65 | 89.85 ± 0.36 | 93.61 ± 0.44 | 98.80 ± 0.16 | 99.61 ± 0.17 | 90.61 ± 1.65 | 95.22 ± 2.85
9 | 89.59 ± 0.66 | 91.18 ± 0.38 | 99.15 ± 0.39 | 95.35 ± 0.07 | 96.23 ± 0.85 | 98.77 ± 0.05 | 97.55 ± 0.97 | 98.79 ± 0.78
10 | 87.68 ± 3.31 | 89.99 ± 1.78 | 91.59 ± 0.51 | 85.49 ± 0.38 | 95.35 ± 0.64 | 90.61 ± 0.61 | 98.43 ± 0.43 | 95.46 ± 0.25
11 | 82.55 ± 0.47 | 92.35 ± 0.28 | 87.99 ± 0.83 | 89.71 ± 0.23 | 91.32 ± 0.79 | 96.89 ± 0.55 | 94.24 ± 0.45 | 98.22 ± 1.12
12 | 87.60 ± 2.80 | 98.96 ± 0.22 | 94.15 ± 0.95 | 79.75 ± 0.39 | 93.38 ± 0.85 | 94.37 ± 0.34 | 95.33 ± 1.92 | 97.75 ± 0.51
13 | 95.16 ± 1.15 | 95.19 ± 0.65 | 97.82 ± 0.55 | 95.07 ± 1.15 | 85.92 ± 2.38 | 99.43 ± 0.19 | 96.87 ± 0.45 | 96.36 ± 0.33
14 | 86.33 ± 0.22 | 90.55 ± 0.44 | 97.71 ± 0.60 | 99.51 ± 0.17 | 93.91 ± 0.33 | 98.06 ± 0.22 | 95.61 ± 0.16 | 99.73 ± 0.09
15 | 83.46 ± 0.58 | 91.76 ± 0.19 | 92.87 ± 0.71 | 90.96 ± 1.68 | 95.57 ± 0.45 | 85.39 ± 2.38 | 97.79 ± 0.66 | 97.55 ± 0.38
16 | 92.32 ± 0.14 | 94.33 ± 0.61 | 96.85 ± 0.28 | 89.65 ± 0.20 | 93.55 ± 0.20 | 97.77 ± 0.25 | 99.03 ± 0.22 | 99.85 ± 0.25
OA (%) | 90.49 ± 0.55 | 93.55 ± 0.32 | 92.68 ± 0.39 | 90.92 ± 0.14 | 95.08 ± 0.25 | 96.08 ± 0.23 | 97.96 ± 0.65 | 98.59 ± 0.18
AA (%) | 88.87 ± 0.86 | 91.82 ± 0.35 | 92.85 ± 0.25 | 89.66 ± 0.30 | 95.23 ± 0.34 | 96.03 ± 0.18 | 95.84 ± 0.35 | 97.95 ± 0.32
Kappa | 0.9205 ± 0.004 | 0.933 ± 0.005 | 0.9033 ± 0.002 | 0.8993 ± 0.004 | 0.9413 ± 0.003 | 0.9468 ± 0.005 | 0.9774 ± 0.004 | 0.9792 ± 0.004
Time (s) | 40.48 ± 5.45 | 40.33 ± 7.53 | 53.88 ± 5.95 | 102.99 ± 9.88 | 32.69 ± 4.50 | 42.68 ± 9.06 | 37.99 ± 3.37 | 25.09 ± 2.70
Note that the values in bold are the highest.
Table 4. Classification results by different methods on the HanChuan dataset.

Class | HybridSN | DBDA | SSRN | SSFCN | CBW | CTN | FPGA | S3AN
1 | 97.14 ± 0.74 | 95.41 ± 0.51 | 99.33 ± 0.32 | 99.50 ± 0.32 | 99.04 ± 0.37 | 99.32 ± 0.07 | 99.18 ± 0.27 | 99.41 ± 0.25
2 | 98.26 ± 0.36 | 99.71 ± 0.20 | 97.43 ± 0.62 | 98.17 ± 0.45 | 93.90 ± 0.88 | 97.43 ± 0.82 | 98.60 ± 0.55 | 96.04 ± 0.77
3 | 90.58 ± 1.18 | 84.97 ± 0.95 | 97.28 ± 0.95 | 99.75 ± 0.22 | 96.58 ± 1.78 | 97.27 ± 1.12 | 99.39 ± 0.64 | 99.69 ± 0.12
4 | 99.90 ± 0.06 | 86.97 ± 0.48 | 99.83 ± 0.31 | 98.39 ± 0.17 | 96.12 ± 1.10 | 99.83 ± 0.09 | 99.15 ± 0.41 | 99.75 ± 0.17
5 | 89.29 ± 0.95 | 98.02 ± 0.26 | 66.65 ± 3.75 | 12.03 ± 5.89 | 86.44 ± 2.40 | 86.65 ± 3.85 | 99.62 ± 0.23 | 99.96 ± 0.02
6 | 70.72 ± 2.25 | 81.72 ± 2.27 | 99.79 ± 0.15 | 78.83 ± 1.33 | 93.89 ± 0.56 | 97.99 ± 0.25 | 98.41 ± 0.51 | 98.40 ± 0.28
7 | 89.65 ± 0.60 | 99.82 ± 0.03 | 95.59 ± 0.29 | 99.06 ± 0.15 | 96.75 ± 0.49 | 95.59 ± 0.55 | 99.45 ± 0.13 | 99.92 ± 0.05
8 | 96.87 ± 0.49 | 93.40 ± 0.65 | 96.59 ± 0.66 | 97.16 ± 0.26 | 99.97 ± 0.02 | 96.59 ± 0.38 | 98.13 ± 0.19 | 97.99 ± 1.12
9 | 97.12 ± 0.85 | 94.37 ± 1.21 | 99.22 ± 0.54 | 99.68 ± 0.14 | 94.99 ± 0.38 | 99.22 ± 0.32 | 99.58 ± 0.26 | 98.70 ± 0.55
10 | 99.57 ± 0.31 | 98.68 ± 0.33 | 97.94 ± 0.36 | 96.51 ± 0.25 | 96.67 ± 0.65 | 97.94 ± 0.39 | 99.33 ± 0.75 | 99.59 ± 0.13
11 | 91.34 ± 0.25 | 89.54 ± 0.50 | 82.97 ± 0.89 | 95.26 ± 0.27 | 71.84 ± 2.78 | 82.97 ± 0.77 | 99.06 ± 0.50 | 99.78 ± 0.18
12 | 73.88 ± 4.62 | 86.46 ± 0.95 | 90.09 ± 1.44 | 85.57 ± 1.58 | 93.70 ± 0.49 | 90.09 ± 0.60 | 88.03 ± 2.42 | 83.7 ± 1.42
13 | 90.31 ± 0.37 | 88.17 ± 1.69 | 42.76 ± 5.12 | 97.92 ± 1.40 | 96.83 ± 0.71 | 96.49 ± 0.59 | 98.09 ± 1.37 | 97.71 ± 0.65
14 | 97.58 ± 0.15 | 98.35 ± 0.78 | 99.47 ± 0.31 | 99.77 ± 0.21 | 95.99 ± 0.67 | 99.46 ± 0.33 | 99.57 ± 0.25 | 99.65 ± 0.16
15 | 98.68 ± 0.63 | 99.13 ± 0.51 | 84.05 ± 0.55 | 65.87 ± 3.35 | 75.41 ± 3.81 | 84.04 ± 1.55 | 7.96 ± 3.89 | 68.11 ± 2.85
16 | 99.85 ± 0.05 | 99.73 ± 0.07 | 99.39 ± 0.16 | 97.56 ± 0.57 | 95.53 ± 0.55 | 99.93 ± 0.04 | 99.32 ± 0.25 | 99.58 ± 0.07
OA (%) | 94.61 ± 0.31 | 93.4 ± 0.65 | 92.42 ± 0.75 | 93.43 ± 0.46 | 95.45 ± 0.46 | 96.35 ± 0.29 | 96.62 ± 0.56 | 97.56 ± 0.52
AA (%) | 92.55 ± 0.28 | 92.32 ± 0.44 | 90.52 ± 0.66 | 88.81 ± 0.60 | 92.78 ± 0.35 | 95.05 ± 0.46 | 92.68 ± 0.60 | 96.23 ± 0.31
Kappa | 0.9447 ± 0.004 | 0.9204 ± 0.004 | 0.9191 ± 0.006 | 0.9288 ± 0.007 | 0.9429 ± 0.006 | 0.9668 ± 0.004 | 0.9505 ± 0.006 | 0.9769 ± 0.004
Time (s) | 209.37 ± 9.96 | 404.83 ± 20.38 | 371.49 ± 10.44 | 508.42 ± 20.51 | 246.42 ± 16.40 | 313.26 ± 17.65 | 169.79 ± 3.15 | 121.09 ± 6.98
Note that the values in bold are the highest.
Table 5. Classification results by different controllable factors.

Dataset | Method | DE-Pooling | Adapt-Conv | OA (%) | AA (%) | Kappa
Indian Pines | Baseline | - | - | 83.19 ± 1.14 | 81.32 ± 1.68 | 0.7816 ± 0.075
Indian Pines | DE-Pooling | ✓ | - | 96.57 ± 0.65 | 96.85 ± 1.60 | 0.9379 ± 0.097
Indian Pines | DE-Pooling + Adapt-Conv | ✓ | ✓ | 98.41 ± 0.44 | 97.55 ± 0.60 | 0.9591 ± 0.059
Salinas | Baseline | - | - | 65.03 ± 3.35 | 63.77 ± 2.70 | 0.5952 ± 0.031
Salinas | DE-Pooling | ✓ | - | 95.18 ± 0.65 | 94.35 ± 0.89 | 0.9331 ± 0.016
Salinas | DE-Pooling + Adapt-Conv | ✓ | ✓ | 98.09 ± 0.38 | 97.42 ± 0.55 | 0.9799 ± 0.012
HanChuan | Baseline | - | - | 71.83 ± 4.78 | 69.52 ± 3.96 | 0.6607 ± 0.036
HanChuan | DE-Pooling | ✓ | - | 95.61 ± 0.55 | 95.22 ± 0.27 | 0.9494 ± 0.031
HanChuan | DE-Pooling + Adapt-Conv | ✓ | ✓ | 97.51 ± 0.11 | 96.89 ± 0.28 | 0.9673 ± 0.003
Table 6. Classification results of different numbers of selected bands on three datasets.

Dataset | Number | Selected Band | OA (%) | AA (%) | Kappa | Time (s)
Indian Pines | 8 | [102,104,…,198] | 61.33 ± 5.56 | 60.51 ± 4.70 | 0.5933 ± 0.093 | 1.66 ± 0.30
Indian Pines | 16 | [56,102,…,199] | 82.60 ± 2.33 | 83.95 ± 1.16 | 0.7827 ± 0.062 | 1.85 ± 0.31
Indian Pines | 24 | [18,56,…,199] | 97.45 ± 0.18 | 96.37 ± 0.25 | 0.9615 ± 0.005 | 2.07 ± 0.29
Indian Pines | 32 | [12,18,…,199] | 97.63 ± 0.21 | 96.59 ± 0.09 | 0.9689 ± 0.003 | 2.41 ± 0.29
Indian Pines | 36 | [12,17,…,199] | 97.25 ± 0.30 | 94.35 ± 0.27 | 0.9500 ± 0.003 | 2.95 ± 0.30
Salinas | 8 | [37,38,…,197] | 53.77 ± 5.53 | 59.73 ± 3.95 | 0.5630 ± 0.052 | 12.32 ± 1.56
Salinas | 16 | [12,19,…,200] | 79.09 ± 2.61 | 82.14 ± 3.35 | 0.7949 ± 0.079 | 15.61 ± 2.05
Salinas | 24 | [8,9,12,…,200] | 97.51 ± 0.22 | 96.75 ± 0.37 | 0.9665 ± 0.012 | 18.44 ± 3.32
Salinas | 32 | [4,5,6,…,200] | 98.05 ± 0.16 | 98.00 ± 0.11 | 0.9811 ± 0.007 | 25.89 ± 2.98
Salinas | 36 | [4,5,6,…,200] | 98.16 ± 0.29 | 95.58 ± 0.36 | 0.9503 ± 0.015 | 29.59 ± 3.75
HanChuan | 8 | [0,3,10,…,254] | 65.30 ± 2.49 | 67.72 ± 3.60 | 0.6447 ± 0.071 | 64.35 ± 8.83
HanChuan | 16 | [0,3,10,…,272] | 76.11 ± 3.55 | 79.55 ± 3.70 | 0.7605 ± 0.063 | 89.05 ± 7.99
HanChuan | 24 | [1,3,10,…,272] | 96.89 ± 0.19 | 95.80 ± 0.23 | 0.9578 ± 0.015 | 96.16 ± 10.80
HanChuan | 32 | [1,3,10,…,273] | 97.32 ± 0.25 | 98.11 ± 0.19 | 0.9790 ± 0.010 | 120.55 ± 9.55
HanChuan | 36 | [1,2,3,…,273] | 96.09 ± 0.51 | 94.65 ± 1.01 | 0.9532 ± 0.026 | 164.89 ± 10.66
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
