1. Introduction
Hyperspectral images (HSIs) [1] have been used in many applications, such as military reconnaissance [2], precision agriculture [3], wetland dynamic monitoring [4], and so on [5,6,7]. Particularly, among HSI analysis and processing technologies, classification has become a very important and challenging information acquisition technology.
Filtering-based algorithms are a useful way to extract spatial–spectral features by manually designing filters that interact directly with HSIs [8]. Common filter-based feature extraction algorithms include the 3D Gabor filter [8,9], morphological profiles [10], scattering wavelet transform [11], scale-invariant feature transform [12], and local binary patterns [13], which can provide low-level and interpretable spatial–spectral features for HSI classification. A suitable classifier can then be used to classify land covers with the spatial–spectral features extracted from HSIs, where typical methods include multinomial logistic regression [14], composite kernel support vector machine (SVM-CK) [15], and sparse representation [16]. However, these methods need to be designed manually by experienced researchers for specific tasks.
As deep neural network (DNN)-based classification methods present advantages such as strong self-learning ability and excellent model generalization, various DNNs have been applied in the remote sensing field, including convolutional neural networks (CNNs) [17,18], graph convolutional networks (GCNs) [19,20], and recurrent neural networks (RNNs) [21], obtaining better performance than other classification methods. RNNs have been widely used to analyze and process sequential data because of their unique directional feedback structure, where long short-term memory (LSTM) [22] and convolutional LSTM [23] are two special RNNs designed to solve learning problems related to the interdependence of long-distance sequential data. Following [24], the convolutional LSTM is referred to as ConvLSTM2D for convenience. Because LSTM and ConvLSTM2D can model the long-term correlation of sequential data, they are commonly used in two ways for the spatial–spectral feature extraction and classification of HSIs. On the one hand, LSTM or ConvLSTM2D can be utilized alone as a basic unit to build RNNs for HSI classification, such as SSLSTMs [25], bidirectional-ConvLSTM2D [26], the spatial–spectral ConvLSTM2D neural network (SSCL2DNN) [24], and the two-branch multidirectional spatial–spectral LSTM attention network [27]. On the other hand, LSTM or ConvLSTM2D can be integrated with CNNs or attention mechanisms to build a hybrid network, such as capsule network+ConvLSTM2D [28], tensor attention-driven ConvLSTM2D neural network [29], and nonlocal-dependent learning fully convolutional network [30]. To better preserve the intrinsic structure information of HSIs, Hu et al. [24] further extended the ConvLSTM2D to its 3D version (namely ConvLSTM3D) to directly extract the spatial–spectral features for HSI classification. Inspired by this, some ConvLSTM3D-based models have been gradually developed for HSI classification, such as SSCL3DNN [24], ConvLSTM3D+3D CNN (SSCRN) [31], ConvLSTM3D+attention mechanism (Dual-Channel A³CLNN) [32], and the regularized spatial–spectral global learning (RSSGL) framework [33]. However, the network structure designs and working mechanisms of these methods lack accurate interpretation, and there is no obvious correlation between the extracted deep semantic features and the classification results of HSIs. To improve the interpretability of deep semantic features and make up for their lack of detailed information, some researchers have tried to integrate traditional features into the training process of DNNs, where the complementary information and interaction of the two feature modalities can dynamically optimize the training direction of the whole network for various applications, such as action recognition, object detection, lane detection, locomotion mode recognition, HSI change detection, and HSI classification [6,34], reducing the overfitting of the DNNs and improving their performance to a certain extent.
In particular, the special imaging mechanism with which HSIs are collected introduces rich and highly correlated spectral information, motivating the design of different spectral information learning submodules to be plugged into DNNs for spatial–spectral feature extraction. Among these, the attention mechanism is one of the most commonly used and effective approaches. Li et al. [32] designed a spectral attention module by using the ConvLSTM2D as a basic unit to assist the dual-channel A³CLNN with adaptive learning of long-term spectral correlation, improving the classification accuracy of HSIs. An improved complex-valued deformable ConvLSTM2D was proposed to improve the ability of the whole model to learn scale information and spectral correlations [6]. Sun et al. [35] built a large kernel spectral–spatial attention network to learn the long-range 3D properties of HSIs. A dense spectral convolution module was designed for exploring the intrinsic similarity between spectral bands for HSI super-resolution [36]. In addition, the transformer has gained widespread attention and success in the field of natural language processing due to its ability to handle global long-term dependencies in sequence data [37]. Subsequently, considering the sequence properties presented by the rich spectral information of HSIs, some researchers have attempted to introduce it into the task of HSI classification, achieving promising results. Xu et al. [38] proposed a double branch convolution–transformer network, where a convolution-spectral projection unit and a convolutional multihead self-attention network are applied for exploring the spectral correlation among spectral bands and local–global features for HSI classification. A center-masked transformer with a regularized center-masked pretraining task was built to learn the dependencies between the central land cover and its neighboring objects without labels during the pretraining process [39]. Shi et al. [40] developed a parallel dual-branch multiscale transformer, containing the spectral convolution, channel shrink soft split, and token-to-token modules for multiscale spatial–spectral feature extraction, which was followed by a pooled activation fusion module for feature fusion and the classification of HSIs. A spatial–spectral wavelet transformer was introduced, which unifies the downsampling with wavelet transforms for a lossless compression of features, preserving data integrity and improving the interaction between structural and shape information for HSI classification [41]. To address the information loss of the transformer during propagation, a memory-augmented spectral–spatial transformer was constructed with a memory tokenizer and memory-augmented transformer encoder to effectively mix the spectral and spatial information for HSI classification. To fully eliminate the influence of multimodal heterogeneity, Hu et al. [42] viewed the global information as an intermediate agent to propose a new cross-memory quaternion transformer network, effectively improving the classification accuracy of land covers. Although the above spectral learning submodules can adaptively model the spectral correlations to obtain spectral-enhanced spatial–spectral features for HSI classification, some problems remain to be solved. First, these modules usually explore only the long-range dependencies between different spectral bands and ignore the local correlations between adjacent spectral bands, limiting further improvements in the classification performance of HSIs. Second, these modules are usually treated as additional submodules that assist the feature extraction process of DNNs and are rarely designed as basic network units for building new backbones for HSI classification.
Following our previous work in [6,24], to solve the above-mentioned problems, a novel adaptive spectral correlation learning neural network (ASLNN) is proposed for the spatial–spectral feature extraction and classification of HSIs. By considering the strong spectral correlations that exist between the adjacent spectral bands and between the nonadjacent spectral bands, a new adaptive spectral correlation learning block (ASBlock) is first designed by utilizing the group convolutional (GConv) layer and the ConvLSTM3D layer, jointly learning the short- and long-term spectral correlations. Then, by taking the designed ASBlock as a basic network unit, a fully convolution-based spatial–spectral feature extraction network is constructed, adaptively modeling the unique spectral characteristics of different land covers and extracting enhanced deep spatial–spectral features to accurately discriminate between them. Furthermore, to improve the ability of deep semantic features to describe details and enhance their interpretability, inspired by the conclusions in [6,8,43], the 3D Gabor filter is utilized as the heterogeneous feature extractor, and a simple but effective gated asymmetric fusion block (GAFBlock) is built to align and integrate these two modalities of spatial–spectral features, achieving competitive classification accuracy for HSIs. Experimental results on four common HSI data sets show the superiority of the proposed ASLNN model. The main contributions of this work can be summarized as follows.
(1) To adaptively learn the spectral information of HSIs, a new ASBlock is designed by integrating the GConv and ConvLSTM3D layers, which can be used to construct the backbones of other DNNs for joint learning of the short- and long-range correlations.
(2) Based on the designed ASBlock, a convolutional network is constructed as the backbone for adaptively extracting spectral-enhanced spatial–spectral features.
(3) A GAFBlock is built to align and integrate the heterogeneous spatial–spectral features. Together with the designed ASBlock-based backbone, it forms a novel ASLNN model that extracts adaptive spectral-enhanced spatial–spectral features while accounting for both the interpretability of spatial–spectral feature extraction and the ability to describe detailed information, leading to better classification of HSIs.
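To make the distinction between short- and long-range spectral correlations more concrete, the following sketch splits the spectral axis of a local patch into groups of adjacent bands. This is only an illustration under stated assumptions: the function name `group_bands`, the (s, s, K) patch layout, and the requirement that K be divisible by the number of groups are hypothetical and are not taken from the ASBlock definition. Correlations inside one group are local, whereas the ordered sequence of groups is the kind of input a recurrent layer could scan for long-range dependencies.

```python
import numpy as np

def group_bands(patch, n_groups):
    """Reshape an (s, s, K) patch into (n_groups, s, s, K // n_groups).

    Hypothetical helper: the grouping rule is an assumption for illustration.
    """
    s1, s2, k = patch.shape
    if k % n_groups != 0:
        raise ValueError("K must be divisible by n_groups in this sketch")
    per_group = k // n_groups
    # Consecutive bands stay together, so each group holds adjacent bands.
    grouped = patch.reshape(s1, s2, n_groups, per_group)
    # Move the group axis to the front so that it can act as a sequence axis.
    return np.moveaxis(grouped, 2, 0)

patch = np.arange(7 * 7 * 12, dtype=np.float32).reshape(7, 7, 12)
groups = group_bands(patch, n_groups=3)
print(groups.shape)
```

Each slice `groups[g]` contains the bands `g * (K // n_groups)` to `(g + 1) * (K // n_groups) - 1` of the original patch.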
The remainder of this article is organized as follows. In Section 2, the framework and optimization process of the proposed ASLNN are described in detail. Section 3 reports the experimental settings, results, and comparative analysis, which is followed by the conclusions in Section 4.
3. Experimental Results
In this section, to quantitatively and qualitatively evaluate the performance of our ASLNN, several spatial–spectral classification methods, namely APDCLNN [6], 3DG-CNN [43], SVM-CK [15], SSCL3DNN [24], 3DOC-SSAN [46], D²S²BoT [47], DBCTNet [38], HybridSN [48], and ISODATA [49], are selected as the comparison methods. Four public HSI data sets are used for performance analysis, i.e., WHU-Hi-LongKou, Indian Pines, 2013 Houston Data, and MUUFL Gulfport. The overall accuracy (OA), average accuracy (AA), and Kappa coefficient (κ) are the adopted quantitative metrics. To avoid the deviation caused by a random selection of samples, 10 Monte Carlo runs are executed, and the average value of each metric is reported in the following experimental results. Our ASLNN is built with the TensorFlow platform (i.e., TensorFlow-GPU 2.5.0, Python 3.8.0) and trained on a desktop with an Intel Core i7-12700F processor and an Nvidia GeForce RTX 3080 Ti GPU.
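For reference, the three metrics can be computed from a confusion matrix as sketched below. This is a minimal NumPy illustration, not the evaluation code used in the experiments; the function name `classification_metrics` and the convention that labels are integers in 0..num_classes-1 are assumptions. OA is the fraction of correctly classified samples, AA is the mean of the per-class accuracies, and the Kappa coefficient is (p_o - p_e) / (1 - p_e), where p_o equals OA and p_e is the chance agreement derived from the row and column marginals.

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Return (OA, AA, kappa) computed from integer class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    total = cm.sum()
    oa = np.trace(cm) / total
    # Per-class accuracy; classes absent from y_true are skipped in the mean.
    support = cm.sum(axis=1)
    present = support > 0
    aa = np.mean(np.diag(cm)[present] / support[present])
    # Chance agreement from the row and column marginals.
    # Note: kappa is undefined when pe equals 1 (a single class everywhere).
    pe = np.sum(cm.sum(axis=0) * support) / total**2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

oa, aa, kappa = classification_metrics([0, 0, 1, 1, 2, 2], [0, 0, 1, 0, 2, 1], 3)
print(oa, aa, kappa)
```

Averaging over Monte Carlo runs then amounts to calling this function once per run and taking the mean of each returned value.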
3.1. Hyperspectral Data
To compare the classification performance of the proposed ASLNN model with that of the above comparison methods, the WHU-Hi-LongKou, Indian Pines, Houston, and MUUFL Gulfport data sets are used, whose detailed information can be found in Figure 5 and Figure 6.
(1) WHU-Hi-LongKou: The WHU-Hi-LongKou data set, depicted in Figure 5, was acquired from a DJI Matrice 600 Pro platform during aerial surveys over Longkou Town, China, in 2018. The data set contains 550 × 400 pixels and 270 spectral bands covering the wavelength range from 0.4 to 1 μm [50,51]. The scene covers nine land-cover categories, including six crop species, with a total of 204,542 samples.
(2) Indian Pines: The Indian Pines data set was acquired in 1992 with the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over northwestern Indiana, USA. It consists of 145 × 145 pixels with a spatial resolution of 20 m per pixel (mpp) and 200 bands in the wavelength range from 0.40 to 2.50 μm after removing some unusable spectral bands [52]. In addition, after ignoring the background pixels, 10,249 samples from 16 classes are used for experimental analysis.
(3) Houston Data: The Houston data set was collected in 2012 with the Compact Airborne Spectrographic Imager (CASI) sensor over the University of Houston campus and its surrounding area. It is the official data set of the 2013 IEEE Geoscience and Remote Sensing Society (GRSS) Data Fusion Contest and is available online from the official website: http://dase.grss-ieee.org/, accessed on 22 May 2025. Its spatial size is 349 × 1905 pixels with a spatial resolution of 2.5 m, and 144 spectral bands in the wavelength range from 0.38 to 1.05 μm are retained after removing some noise-corrupted spectral bands [52]. Moreover, there are 15,029 samples from 15 labeled classes for experimental research.
(4) MUUFL Gulfport: It was collected in 2010 by using the ITRES CASI-1500 sensor over the University of Southern Mississippi's Gulf Park Campus, Long Beach, Mississippi, USA [53], and it is available online from the official website:
https://github.com/GatorSense/MUUFLGulfport, accessed on 22 May 2025. This data set contains 325 × 337 pixels with a spatial resolution of 1 m, of which 325 × 220 pixels are used for analysis, and 72 spectral bands with a wavelength range from 0.375 to 1.050 μm. After removing some noise bands and ignoring background pixels, 64 bands and 53,687 labeled samples from 11 urban land-cover classes are retained for experimental research.
3.2. Experimental Settings
The parameter settings of the compared methods follow [6,15,24,38,43,46,47,48,49]. In addition, 10 samples are randomly selected from each class of the Indian Pines, Houston, and MUUFL Gulfport data sets to build the training set, while the other samples are used for testing. For the WHU-Hi-LongKou data set, whose training and testing sets have been officially divided, we use 25 labeled training samples for each class, while the other samples are used for testing.
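The per-class random sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' data pipeline: the function name `split_per_class`, the use of label 0 for background pixels, and the safeguard that leaves at least one test sample per class are all assumptions made for this example.

```python
import numpy as np

def split_per_class(labels, n_train, rng, background=0):
    """Return (train_idx, test_idx) as flat indices into `labels`."""
    flat = np.asarray(labels).ravel()
    train_parts, test_parts = [], []
    for c in np.unique(flat):
        if c == background:
            continue  # unlabeled pixels are never sampled
        idx = rng.permutation(np.flatnonzero(flat == c))
        # Hypothetical safeguard: keep at least one test sample per class.
        k = min(n_train, idx.size - 1)
        train_parts.append(idx[:k])
        test_parts.append(idx[k:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Toy 3 x 4 label map with two classes and some background pixels.
labels = np.array([[0, 1, 1, 1],
                   [2, 2, 2, 2],
                   [1, 1, 0, 2]])
train_idx, test_idx = split_per_class(labels, n_train=2, rng=np.random.default_rng(0))
print(train_idx.size, test_idx.size)
```

Repeating the split with a different random generator state for each run gives the independent draws needed for Monte Carlo averaging.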
According to the structure design of the proposed ASLNN model, several key parameters need to be tuned. In particular, following [6,8,43], the frequency f of the 3D Gabor filter is fixed to 0.50, 0.25, and 0.125 for all four HSI data sets. In addition, to simplify the parameter analysis and ensure a fair comparison, the kernel size and the channel number in the whole model are fixed, as shown in Section 2. Therefore, only two key parameters need to be analyzed in detail, namely the spatial size (s) and the number (
) of the spectral groups or the number (
K) of the spectral bands. Compared with other methods, the complexity of parameter analysis of the proposed model is greatly reduced.
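As an illustration of how the frequency f enters a 3D Gabor filter, the sketch below builds a complex kernel from a Gaussian envelope and a plane-wave carrier whose frequency vector has magnitude f. This is a generic textbook form under stated assumptions, not necessarily the exact parameterization used in [8] or in Section 2: the function name `gabor_3d`, the orientation angles `theta` and `phi`, the isotropic width `sigma`, the odd kernel size, and the omission of any normalization constant are all choices made for this example.

```python
import numpy as np

def gabor_3d(size, f, theta, phi, sigma):
    """Complex 3D Gabor kernel of shape (size, size, size); size should be odd."""
    half = size // 2
    ax = np.arange(-half, half + 1)
    # x and y are spatial offsets, b is the offset along the spectral axis.
    x, y, b = np.meshgrid(ax, ax, ax, indexing="ij")
    # Frequency vector of magnitude f, oriented by the angles (theta, phi).
    u = f * np.sin(phi) * np.cos(theta)
    v = f * np.sin(phi) * np.sin(theta)
    w = f * np.cos(phi)
    envelope = np.exp(-(x**2 + y**2 + b**2) / (2.0 * sigma**2))
    carrier = np.exp(2j * np.pi * (u * x + v * y + w * b))
    return envelope * carrier

# A small bank using the three frequency values quoted in the text.
bank = [gabor_3d(5, f, theta=0.0, phi=np.pi / 2, sigma=1.5) for f in (0.50, 0.25, 0.125)]
print(len(bank), bank[0].shape)
```

Convolving an HSI cube with such a kernel and taking the magnitude of the response gives one spatial–spectral feature map per (f, theta, phi) combination.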
For the spatial size
s, the effect of its value on the classification performance of the whole model is further studied, where
s is searched from
. Specifically, because of memory limitations, larger local spatial windows cannot be supported. According to the experimental results in Figure 7, the proposed ASLNN model produces quasi-optimal classification accuracy when s is set to 7 for the four HSI data sets.
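The role of the spatial size s can be illustrated by the way a local window is cut around each labeled pixel. The sketch below is an assumption-laden example, not the authors' preprocessing code: the function name `extract_patch`, the (H, W, K) cube layout, and the reflect padding at the image border are hypothetical choices. Each sample holds s × s × K values, so the memory required per sample grows quadratically with s.

```python
import numpy as np

def extract_patch(cube, row, col, s=7):
    """Return the s x s x K neighbourhood centred on pixel (row, col)."""
    assert s % 2 == 1, "the window size is assumed to be odd"
    r = s // 2
    # Reflect-pad the two spatial axes only; the spectral axis is untouched.
    # Padding the full cube on every call is simple but wasteful; a real
    # pipeline would pad once and reuse the padded array.
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + s, col:col + s, :]

cube = np.random.default_rng(0).random((10, 12, 4))
patch = extract_patch(cube, row=0, col=0, s=7)
print(patch.shape)
```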
3.3. Classification Performance
Based on the experimental settings, Table 1, Table 2, Table 3 and Table 4 report the classification results (i.e., class-specific accuracy, OA, AA, and κ) of the proposed ASLNN model and the comparison methods on the Indian Pines, Houston Data, MUUFL Gulfport, and WHU-Hi-LongKou data sets, respectively.
Compared with the supervised algorithms, ISODATA failed to achieve comparable results, for which there are several possible reasons. First, as an unsupervised algorithm, it cannot utilize labeled training samples for model development, which inherently limits its ability to match the classification performance of supervised methods that leverage explicit class annotations. Second, the data sets used in this paper contain a large number of continuous spectral bands, leading to a high-dimensional and information-redundant data space. This poses significant challenges for distance-based algorithms like ISODATA, as traditional distance metrics become ineffective, inter-class discrimination becomes ambiguous, and the reliability of clustering results decreases. Finally, real-world ground objects exhibit inherent spectral variability (e.g., intra-class spectral differences caused by lighting conditions, humidity, or phenological stages). However, ISODATA relies on linear distance metrics and assumes simple clustering structures [49], making it unable to model the complex nonlinear distributions of such spectral variations, resulting in suboptimal clustering performance.
By integrating the multilayer residual convolution and spectral–spatial bottleneck transformer, D²S²BoT jointly models the local features and the long-range spectral correlation and adaptively fuses global spectral and spatial dependencies, thus yielding the second-best classification results on the four data sets. As a strong baseline for small-sample classification, SVM-CK incorporates the composite kernel into the SVM classifier to jointly use the spatial–spectral information [15]; it can process high-dimensional small-sample data and is insensitive to the “curse of dimensionality”, generating the third-best classification results on the MUUFL Gulfport data set. In particular, according to Figure 6c–f, compared with the other three HSI data sets, the MUUFL Gulfport data set exhibits category imbalance and large differences in the number of samples between classes, and SVM-CK obtains better classification accuracy than APDCLNN and DBCTNet on it. Since the complementary information between different modality features is not considered, the other DNNs (i.e., HybridSN, SSCL3DNN, 3DOC-SSAN, 3DG-CNN) yield relatively poor classification performance on the four HSI data sets. In contrast, on the basis of the spectral group and 3D Gabor filter, our ASLNN model adaptively learns the spectral correlations by jointly considering the relationships between the adjacent spectral bands and the nonadjacent spectral bands, and it integrates the fine-grained complementarity of deep and heterogeneous spatial–spectral features, thus achieving the best classification performance for the four HSI data sets under the premise of low complexity. All of these aspects verify the effectiveness and superiority of the proposed ASLNN model for HSI classification purposes.
More precisely, D²S²BoT and DBCTNet are built on the transformer to capture the local feature maps and model the long-range correlation of HSI pixels across both spectral and spatial dimensions. However, they only consider the long-range spectral correlation and ignore the local relationship between the adjacent spectral bands, limiting their classification performance. Compared with them, our ASLNN model jointly learns the short-range and long-range spectral correlations from both local and global perspectives, and it effectively integrates the heterogeneous spatial–spectral features, achieving an average improvement of 1.70%, 3.21%, 3.80%, and 2.49% over D²S²BoT and DBCTNet in OA on the four HSI data sets, respectively. Although APDCLNN utilizes the modified ConvLSTM2D layer and the heterogeneous feature fusion module to enhance the ability of spatial–spectral feature extraction for improving the classification accuracy of HSIs, its complexity is very high due to a large number of operations and complicated feature extraction and fusion processes. Compared with APDCLNN, our ASLNN model effectively integrates the heterogeneous spatial–spectral features with low storage and computing requirements, thus obtaining 3.30%, 4.57%, 7.59%, and 2.70% improvements in OA for the four HSI data sets, respectively. As for 3DG-CNN, the 3D Gabor filter is applied to build a 3D Gabor-modulated kernel that replaces the randomly initialized kernel to improve the representation ability and robustness of the extracted spatial–spectral features; however, it overlooks the influence of heterogeneous features on the classification performance. Compared with it, the gains in OA generated by our ASLNN model are 7.25%, 6.81%, 9.72%, and 4.76% for the four HSI data sets, respectively. When compared with SVM-CK, the proposed ASLNN model obtains 7.94%, 6.30%, 4.21%, and 3.42% gains in OA for the four HSI data sets, respectively.
In addition, compared with other deep models (i.e., HybridSN, SSCL3DNN, and 3DOC-SSAN), our ASLNN model can achieve similar performance gains. Based on the above experimental results and analysis, effective learning of the spectral correlations and fusion of the heterogeneous spatial–spectral features are the keys to improving the classification accuracy of land covers. Therefore, the proposed ASLNN model can extract more discriminative and robust spatial–spectral features and obtain the highest classification accuracy for HSIs, illustrating its rationality and effectiveness.
Moreover, for a more intuitive comparison, Figure 8, Figure 9, Figure 10 and Figure 11 illustrate the classification maps of the compared methods for the four HSI data sets, respectively. It can be observed that the classification maps yielded by our ASLNN model are the closest to the ground-truth maps of the four HSI data sets, where the classification quality of class 3, class 6, class 7, and class 12 in Figure 8, class 1, class 2, class 4, and class 15 in Figure 9, class 2, class 5, class 6, class 9, and class 11 in Figure 10, and class 1, class 2, class 5, and class 8 in Figure 11 is significantly improved. These observations are consistent with the results in Table 1, Table 2, Table 3 and Table 4, and they further support the effectiveness of our proposed ASLNN model.
3.4. Sensitivity Comparison of Different Training Samples
To further compare and analyze the sensitivity of all considered supervised methods to the number of training samples, Figure 12 reports the experimental results under different training sizes, where 10, 20, 30, and 40 samples are randomly selected from each class to build the training set for the Indian Pines, Houston, and MUUFL Gulfport data sets, respectively, while the other samples are used for testing. In particular, for class 7 and class 9 in the Indian Pines data set, the number of training samples is fixed to 10. For the WHU-Hi-LongKou data set, we use four ground-truth setups with 25, 50, 100, and 150 labeled training samples for each class (also officially provided), respectively. We can observe that as the number of training samples increases from 10 to 40 (or 25 to 150), the classification performance of all methods first rises and then stabilizes. In particular, the proposed ASLNN model achieves the best classification performance in all cases for the four HSI data sets, which also demonstrates its advantages.
3.5. Ablation Study
To adaptively learn the inherent attribute information of HSIs, an ASBlock and a GAFBlock are designed to construct the ASLNN model, which learns the spectral correlations adaptively and integrates the details and interpretability of the heterogeneous features, improving the classification accuracy of land covers in HSIs. To measure the contributions of the designed ASBlock and GAFBlock to the classification results of our ASLNN, a detailed ablation study is conducted, where the proposed ASLNN models with and without each component are abbreviated as proposed (with) and proposed (without) for convenience, respectively.
3.5.1. Effectiveness of the ASBlock
Inspired by the ResNeSt block in [44], to adaptively model the correlations between the adjacent spectral bands and between the nonadjacent spectral bands of HSIs, a novel ASBlock is built by utilizing the GConv and ConvLSTM3D layers. To demonstrate its contribution to the HSI classification performance of the whole model, our ASLNN model is compared with its variant (i.e., the proposed (without ASBlock)), where the ASBlock is replaced with the ResNeSt block; the experimental results are given in Table 5. It can be observed that, compared with this variant, the ASBlock brings 9.26%, 12.96%, 6.97%, and 3.01% gains in OA to the proposed ASLNN model for the four HSI data sets, respectively, verifying its feasibility and effectiveness in learning spectral correlations and extracting spatial–spectral features with strong expressive ability for HSI classification.
3.5.2. Structure Analysis of the GAFBlock
To adaptively align and fuse the heterogeneous spatial–spectral features (i.e., deep features and heterogeneous features) with low complexity, a simple but effective GAFBlock is constructed, and experiments are carried out to compare our ASLNN with its variant (i.e., the proposed (without GAFBlock)), where the GAFBlock is replaced with the concatenation operation for HSI classification. As shown in Table 6, compared with the concatenation operation, ASLNN with the GAFBlock obtains 1.35%, 3.12%, 4.75%, and 2.36% gains in OA for the four HSI data sets, respectively, which shows the effectiveness of the designed GAFBlock in fusing the heterogeneous spatial–spectral features. In particular, compared with the MAFB submodule in [6], the newly designed GAFBlock contains fewer network parameters and has lower computational complexity, which reduces the adverse influence of overfitting on the proposed ASLNN model, thus improving its optimization efficiency and classification performance for HSIs.
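To make the contrast between gated fusion and plain concatenation concrete, the sketch below implements a generic sigmoid gate over two feature matrices. It is not the GAFBlock itself, whose structure is defined in Section 2 and not reproduced here: the function name `gated_fusion`, the single linear gate with weights `w` and bias `b`, and the (N, C) feature layout are assumptions made for illustration only. The gate decides, per sample and per channel, how much of each modality is kept, and the output keeps C channels, whereas concatenation would produce 2C.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(deep, hand, w, b):
    """Fuse two (N, C) feature matrices with a per-channel gate.

    w has shape (2C, C) and b has shape (C,); both would be learned in practice.
    """
    # The gate is computed from both modalities jointly.
    g = sigmoid(np.concatenate([deep, hand], axis=1) @ w + b)
    # Convex combination: g weights the deep features, 1 - g the other modality.
    return g * deep + (1.0 - g) * hand

rng = np.random.default_rng(0)
deep = rng.standard_normal((4, 8))   # stand-in for deep features
hand = rng.standard_normal((4, 8))   # stand-in for 3D Gabor features
w = 0.1 * rng.standard_normal((16, 8))
b = np.zeros(8)
fused = gated_fusion(deep, hand, w, b)
print(fused.shape)
```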
3.6. Computational Costs
As presented in Table 7, Table 8, Table 9 and Table 10, we report the average training time, average inference time, and memory usage of all compared supervised methods on an NVIDIA RTX 3080 GPU. The results show that while our algorithm incurs relatively higher time costs due to its inherent complexity (especially the serial sequence processing design of ConvLSTM3D), it outperforms other methods in terms of classification accuracy.
3.7. Limitations and Future Directions
First, the model exhibits high computational complexity, with its intricate architecture leading to significant computational overhead and prolonged processing times. Second, real-time deployment on drones or embedded devices is currently infeasible because the memory requirement exceeds 2 GB and the inference latency is high (over 2000 ms). Third, because of factors such as sensor types and daily and seasonal variations, there are usually significant differences in data distribution and category distribution between test data and training data in actual scenarios, limiting the effectiveness of the proposed model in practice. Our future work will focus on addressing these challenges through lightweight optimization techniques, such as model quantization, pruning, and knowledge distillation [54]. In addition, to improve the generalization of the proposed model in practical scenarios, domain-adaptive classification methods for hyperspectral or multisource remote sensing images are also a focus of future research, such as cross-domain few-shot classification [55] and cross-domain end-to-end classification of HSIs or multisource remote sensing data [56].
4. Conclusions
In this article, a novel ASLNN model has been proposed for the spatial–spectral feature extraction and classification of HSIs. First, by integrating the GConv layer and the ConvLSTM3D layer, an effective ASBlock is designed to jointly capture the short- and long-term spectral correlations. It can be inserted as a plug-and-play feature extraction module into deep learning-based models or used as a fundamental network unit to construct a new backbone for adaptively modeling the local and global correlations, thus generating enhanced deep spatial–spectral features for HSI classification. Then, because heterogeneous features are rich in detail and interpretable, the 3D Gabor filter is applied to extract heterogeneous spatial–spectral features, and a GAFBlock is further designed to fuse the heterogeneous spatial–spectral features with low requirements in terms of parameters and computational complexity. In addition, some network parameters are set to fixed values, simplifying the whole model. Extensive comparison and ablation experiments on four commonly used HSI data sets (i.e., the Indian Pines, Houston, MUUFL Gulfport, and WHU-Hi-LongKou data sets) have been conducted to evaluate the proposed ASLNN model. Specifically, when 10, 10, 10, and 25 samples from each class are selected for training, ASLNN achieves the highest overall accuracy (OA) of 81.12%, 85.88%, 80.62%, and 97.97% on the four data sets, outperforming other methods with increases of more than 1.70%, 3.21%, 3.78%, and 2.70% in OA, respectively.