1. Introduction
Hyperspectral imaging (HSI) has emerged as a powerful technique for remote sensing and the analysis of the Earth’s surface [1,2]. By capturing and analyzing a large number of narrow and contiguous spectral bands, HSI data provide rich and detailed information about the composition and properties of observed objects [3,4]. The ability to differentiate between land cover types and detect subtle variations in materials has made HSI classification a crucial task in various fields, including agriculture [5], environmental monitoring [6], mineral exploration [7], and military reconnaissance [8]. As a result, HSI classification has become a highly active research topic [9,10,11,12,13].
Currently, several HSI classification methods based on traditional machine learning algorithms have been proposed, including Support Vector Machines (SVMs) [14,15] and Random Forest (RF) [16]. The k-Nearest Neighbors (k-NN) algorithm [17] is a non-parametric method that assumes spectrally similar pixels share the same class: it assigns an unlabeled pixel the most frequent class among its k nearest neighbors in the feature space. Linear Discriminant Analysis (LDA) [18] is a supervised dimensionality reduction and classification algorithm that seeks a linear transformation maximizing between-class separation while minimizing within-class scatter, yielding discriminative features for pixel classification. The Endmember Extraction and Classification Algorithm (EMAP) [19] combines endmember extraction and classification in hyperspectral image analysis: it extracts endmembers, i.e., pure spectral signatures, and classifies pixels via a linear mixing model as linear combinations of those endmembers, enabling the accurate characterization of the materials present in hyperspectral data.
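To make the k-NN rule described above concrete, the following minimal NumPy sketch classifies unlabeled pixels by the majority class of their k nearest training pixels in spectral feature space; the toy two-class spectra and all names here are illustrative, not taken from any cited implementation:

```python
import numpy as np

def knn_classify(train_feats, train_labels, query_feats, k=5):
    """Assign each query pixel the majority label of its k nearest
    training pixels (Euclidean distance in spectral feature space)."""
    preds = []
    for q in query_feats:
        dists = np.linalg.norm(train_feats - q, axis=1)
        nearest = train_labels[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

# Toy example: two spectrally well-separated classes of 10-band "spectra".
rng = np.random.default_rng(0)
class0 = rng.normal(0.2, 0.02, size=(20, 10))
class1 = rng.normal(0.8, 0.02, size=(20, 10))
X = np.vstack([class0, class1])
y = np.array([0] * 20 + [1] * 20)
queries = np.vstack([rng.normal(0.2, 0.02, (3, 10)),
                     rng.normal(0.8, 0.02, (3, 10))])
print(knn_classify(X, y, queries, k=5))  # → [0 0 0 1 1 1]
```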
Traditional machine learning methods for hyperspectral classification have limitations in feature extraction, handling high-dimensional data, and modeling nonlinear relationships [20]. In contrast, deep learning offers automatic feature learning, strong nonlinear modeling capabilities, compact data representation, and data augmentation for improved generalization [21]. These benefits make deep learning well suited to high-dimensional, nonlinear, and complex hyperspectral data, leading to enhanced classification accuracy and robustness.
Deep learning methods have also been applied to HSI classification tasks. Initially, researchers used only convolutional layers, as in 1D-CNN [22], 2D-CNN [23], and 3D-CNN [24]; subsequently, more complex and deeper networks were designed. He et al. [25] observed that HSI differs significantly from ordinary images because it combines 2D spatial and 1D spectral features, so existing deep neural networks cannot be applied directly to HSI classification. To address this, they proposed a Multiscale 3D Deep Convolutional Neural Network (M3D-CNN) that jointly learns multiscale 2D spatial features and 1D spectral features from HSI in an end-to-end manner. To combine the strengths of the two convolution types, Roy et al. [26] effectively integrated 3D-CNN with 2D-CNN. Noting the remarkable capabilities of Generative Adversarial Networks (GANs) across applications, Zhu et al. [27] explored GANs for HSI classification, designing one CNN to discriminate samples and another to generate synthetic input samples; their approach achieved superior classification accuracy compared with previous methods. Exploiting the sequential nature of hyperspectral pixels, Mou et al. [28] applied Recurrent Neural Networks (RNNs) to HSI classification and proposed a novel RNN model that analyzes HSI pixels as sequential data, demonstrating the significant potential of RNNs for this task. Because traditional CNNs capture only fixed receptive fields, it is difficult for them to extract features from objects with varying spatial distributions. To address this, Wan et al. [29] applied Graph Convolutional Networks (GCNs) to HSI classification, designing a multi-scale dynamic GCN (MDGCN) that updates the graph dynamically during the convolution process and leverages multiscale features in HSI.
With the introduction of attention mechanisms, Haut et al. [30] combined CNNs and Residual Networks (ResNets) with visual attention, which effectively helped identify the most representative parts of the data; their experiments showed that deep attention models are highly competitive. Sun et al. [31] found that interfering pixels weaken the discriminative power of the spatial–spectral features learned by CNN-based methods, and hence proposed a Spectral–Spatial Attention Network (SSAN) that captures discriminative spatial–spectral features from attention areas in HSI. To leverage the diverse spatial–spectral features in different regions of the training data, Hang et al. [32] proposed an attention-aided CNN consisting of two subnetworks that extract spatial and spectral features, respectively, each incorporating attention modules to build a discriminative network. To mitigate interference between spatial and spectral features during extraction, Ma et al. [33] designed a Double-Branch Multi-Attention network (DBMA) whose two branches focus on spatial and spectral features, respectively, thereby reducing mutual interference. Zhu et al. [34] subsequently found that treating all spectral bands equally restricts feature learning and harms classification performance, and proposed a Residual Spectral–Spatial Attention Network (RSSAN) that takes raw 3D cubes as input and employs spectral and spatial attention to suppress irrelevant components and emphasize relevant ones, achieving adaptive feature refinement.
Recently, with the introduction of the Vision Transformer [35] into image processing, which originated from the transformer model in natural language processing, increasingly efficient transformer structures have been designed [36]. To fully exploit the sequential properties of the spectral dimension of HSI, Hong et al. [37] proposed SpectralFormer, a classification network that learns spectral sequence information. Similarly, He et al. [38] designed a Spatial–Spectral Transformer framework to capture the sequential spectral relationships in HSI. Because CNNs have a limited ability to capture deep semantic features, Sun et al. [39] showed that transformer structures can effectively complement this drawback and proposed the Spectral–Spatial Feature Tokenization Transformer (SSFTT), which combines CNNs and transformers to extract abundant spectral–spatial features. Mei et al. [40] found that the features extracted by current transformer structures are excessively discretized and proposed a Group-Aware Hierarchical Transformer (GAHT) based on group perception; its hierarchical structure achieved a significant improvement in classification performance. Fang et al. [41] introduced a Multi-Attention Joint Representation with Lightweight Transformer (MAR-LWFormer) for scenarios with extremely limited samples, employing a three-branch structure to extract multi-scale features and demonstrating excellent classification performance. To utilize morphological features, Roy et al. [42] proposed morphFormer, a novel transformer that combines morphological convolutional operations with attention mechanisms.
Most current models can effectively extract spatial–spectral information from HSI. However, training on fixed-size sample cubes constrains a model’s ability to extract multi-scale features. Additionally, in practical applications, labeled samples are often scarce in HSI datasets [43]. It is therefore crucial to develop a network that can adequately extract spatial–spectral features from HSI even in scenarios with limited samples.
The proposed TNCCA model offers the following three main contributions:
Taking blocks of different sizes from the HSI, we employ a mixed-fusion multi-scale shallow spatial–spectral feature extraction module to process shallow features. This module primarily consists of two multi-scale convolutional neural networks designed for the two input sizes; each uses convolutional kernels of varying sizes to extract shallow feature information at different scales.
We designed an efficient transformer encoder in which 2D convolution and dilated convolution are applied to the tokens to obtain two sets of Q, K, and V carrying different scale information. This enables the transformer architecture with cross-attention not only to learn deeper feature information and promote the interaction of deep semantic information but also to effectively fuse feature information of different sizes from the two branches.
We designed an innovative dual-branch network specifically for classification tasks in small-sample scenarios. This network efficiently integrates a multi-scale CNN with a transformer encoder to fully exploit the multi-scale spatial–spectral features of HSI. We validated this network on three datasets, and the experimental results indicated that our proposed network was competitive compared to state-of-the-art methods.
2. Materials and Methods
In Figure 1, we illustrate an overview of the proposed TNCCA model, an efficient dual-branch deep learning network for HSI classification. The network consists of the following sub-modules: a data preprocessing module for HSI; a shallow feature extraction module that uses different fusion methods to combine multi-scale spatial–spectral features; a module that converts the shallow features into tokens, assigning different numbers of tokens to the different patch sizes; and a transformer module with CNN-enhanced cross-attention. Finally, a classifier head maps the input pixels to their corresponding classification labels.
In summary, the TNCCA model consists of the following five components: HSI data preprocessing, a dual-branch multi-scale shallow feature extraction module, a feature-maps-to-tokens conversion module, a transformer with a CNN-enhanced cross-attention module, and a classifier head.
2.1. HSI Data Preprocessing
This section describes the processing of the original HSI, where a and b represent the spatial dimensions and l represents the spectral dimension. The typically large number of spectral bands in HSI increases computational complexity and consumes significant computational resources. We therefore apply PCA to reduce the spectral dimensionality of the original image from l to r.
To obtain information at different scales, we extract two square patches of different sizes, centered at each pixel. These two patches are combined into one sample and fed into the network together. The samples generated for all pixels are placed into a collection, A, from which the training and test sets are randomly partitioned according to the sampling rate. Each group of training and testing samples is paired with its corresponding ground truth label, obtained from the set of ground truth labels.
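As a rough illustration of this preprocessing pipeline, the following NumPy sketch reduces the spectral dimension with PCA and cuts one pair of centered patches. The toy scene size, r = 30, and the patch sizes 13 and 7 follow the values used later in Algorithm 1; the reflective border padding is an assumption, as the paper does not specify its border handling:

```python
import numpy as np

def pca_reduce(cube, r):
    """Reduce the spectral dimension of an (a, b, l) HSI cube to r via PCA."""
    a, b, l = cube.shape
    X = cube.reshape(-1, l).astype(np.float64)
    X -= X.mean(axis=0)
    cov = X.T @ X / (X.shape[0] - 1)          # band covariance matrix
    vals, vecs = np.linalg.eigh(cov)
    top = vecs[:, np.argsort(vals)[::-1][:r]]  # r leading components
    return (X @ top).reshape(a, b, r)

def extract_patch_pair(cube, row, col, s1, s2):
    """Extract two square patches (s1 > s2) centered on one pixel,
    padding the image border by reflection (an assumption)."""
    h1, h2 = s1 // 2, s2 // 2
    padded = np.pad(cube, ((h1, h1), (h1, h1), (0, 0)), mode="reflect")
    big = padded[row:row + s1, col:col + s1, :]
    small = padded[row + h1 - h2:row + h1 + h2 + 1,
                   col + h1 - h2:col + h1 + h2 + 1, :]
    return big, small

rng = np.random.default_rng(0)
hsi = rng.random((30, 30, 100))           # toy 30×30 scene with 100 bands
reduced = pca_reduce(hsi, r=30)
p1, p2 = extract_patch_pair(reduced, 15, 15, s1=13, s2=7)
print(reduced.shape, p1.shape, p2.shape)  # (30, 30, 30) (13, 13, 30) (7, 7, 30)
```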
2.2. Dual-Branch Multi-Scale Shallow Feature Extraction Module
As shown in Figure 2, a group of cubes with different sizes is fed into the network, where they first pass through a 3D convolutional layer. In the first branch, the larger cube is processed with 8 convolutional kernels; in the second branch, the smaller cube is processed with 4 convolutional kernels. Padding is applied to maintain the original size of the cubes. The above process can be represented in the following equation:
where Conv3D and Conv2D represent 3D convolutional layers and 2D convolutional layers with different kernel sizes, respectively.
After the 3D convolutional layer, we extract shallow spatial features at different scales using multi-scale 2D convolutional layers, again with different numbers of kernels and different kernel sizes in each branch. In the first branch, we use three groups of 2D convolutional kernels of different sizes (32, 16, and 16 kernels, respectively), and the information from these three scales is fused through concatenation. In the second branch, smaller kernel sizes are used: 64 2D convolutional kernels, 64 2D dilated convolutional kernels with a dilation rate of 2, and a further 64 2D convolutional kernels; the information from these three scales is fused through element-wise addition.
Finally, we obtain two sets of 2D features, one per branch. This process can be represented in the following equations:
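The two fusion modes described above can be sketched with a single-channel NumPy convolution; the averaging kernels and the three illustrative kernel widths (1, 3, 5) are simplifications and do not reproduce the paper’s exact kernel counts or sizes:

```python
import numpy as np

def conv2d_same(x, kernel):
    """Minimal single-channel 2D convolution with zero 'same' padding."""
    k = kernel.shape[0]
    p = k // 2
    xp = np.pad(x, p)
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * kernel)
    return out

rng = np.random.default_rng(0)
feat = rng.random((13, 13))
# Three "scales": averaging kernels of width 1, 3, and 5 over the same input.
scales = [conv2d_same(feat, np.ones((k, k)) / k**2) for k in (1, 3, 5)]
# Branch 1 style: channel concatenation keeps every scale separate.
concat = np.stack(scales, axis=0)            # shape (3, 13, 13)
# Branch 2 style: element-wise addition merges the scales into one map.
added = scales[0] + scales[1] + scales[2]    # shape (13, 13)
print(concat.shape, added.shape)
```

Concatenation preserves each scale as its own channel (at the cost of width), while addition keeps the channel count fixed but entangles the scales; the two branches of TNCCA use one mode each.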
2.3. Feature-Maps-to-Tokens Conversion Module
After obtaining the multi-scale 2D feature information from the dual-branch shallow feature extraction module, in order to better adapt to the structure of the Transformer, these features need to be tokenized.
The flattened feature maps from the two branches can be represented in the following equation:
where (·) is the transpose function. Next, each flattened feature map is multiplied by its own learnable weight matrix via a 1 × 1 convolution; we use weight matrices of different shapes so that a different number of tokens is assigned to each branch. The weighted maps are then multiplied with the feature maps themselves to produce the feature tokens. The above process can be achieved using the following equation:
To accomplish the classification task, we also embed a learnable classification token initialized to all zeros. Then, positional information is embedded into the tokens to preserve their original spatial order. The tokens of the two branches can be obtained from the following equation:
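One common way to realize such a tokenizer (in the spirit of SSFTT-style feature tokenization) is sketched below; the projection shapes, the softmax pooling over spatial positions, and the random positional embedding are illustrative assumptions, not the paper’s exact formulation:

```python
import numpy as np

def to_tokens(feat, n_tokens, d_model, rng):
    """Flatten a (c, h, w) feature map, project it with a learnable matrix,
    pool it into n_tokens tokens, then prepend a zero class token and add
    a positional embedding."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w).T                  # (h*w, c) pixel vectors
    w_proj = rng.normal(0, 0.02, (c, d_model))       # learnable projection
    w_attn = rng.normal(0, 0.02, (h * w, n_tokens))  # learnable token weights
    # Softmax over spatial positions decides how pixels are pooled per token.
    attn = np.exp(w_attn) / np.exp(w_attn).sum(axis=0, keepdims=True)
    tokens = attn.T @ (flat @ w_proj)                # (n_tokens, d_model)
    cls = np.zeros((1, d_model))                     # learnable class token
    tokens = np.concatenate([cls, tokens], axis=0)
    tokens += rng.normal(0, 0.02, tokens.shape)      # positional embedding
    return tokens

rng = np.random.default_rng(0)
feat = rng.random((64, 13, 13))
tok = to_tokens(feat, n_tokens=4, d_model=32, rng=rng)
print(tok.shape)  # (5, 32): 4 feature tokens + 1 class token
```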
2.4. Transformer with CNN-Enhanced Cross-Attention Module
The transformer possesses powerful feature-mining capabilities, as it can capture long-range dependencies and acquire global contextual information. To further explore the deep feature information contained in the data and fully integrate the multi-scale features extracted by the two branches, we embed a cross-attention mechanism into the transformer structure.
As shown in Figure 3, we utilize different convolutional layers to obtain the attention mechanism’s Q, K, and V tensors from one of the outputs of the previous module. First, we apply a 2D convolutional layer with padding of 1; next, a 2D convolutional layer with padding of 2; and finally, a dilated convolutional layer with padding of 2 and a dilation rate of 2, yielding the Q, K, and V tensors, respectively.
Next, we apply similar multi-scale convolutions to the other branch’s output to obtain its Q, K, and V tensors: first a 2D convolutional layer with padding of 1, then a dilated convolutional layer with padding of 2 and a dilation rate of 2, and finally a 2D convolutional layer with padding of 2. Once these tensors have been obtained, we perform element-wise multiplication among them to obtain the deep features of the two branches after the attention mechanism. The process can be represented in the following formula:
where the scaling denominators are the dimensions of the corresponding K tensors. We obtain the deep features from the two branches and sum them pixel-wise. The summed features are then passed through a multi-layer perceptron block with a residual structure to obtain the final deep feature, which can be obtained using the following equation:
where MLP denotes the multi-layer perceptron and LN is the abbreviation for layer normalization. The MLP mainly comprises two linear layers with a Gaussian Error Linear Unit (GELU) activation function in between.
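The cross-attention and residual MLP steps can be sketched in plain NumPy as below. Note that the paper derives Q, K, and V through 2D and dilated convolutions, whereas here simple slices of the token matrices stand in for them, and the alignment of the two branches’ token counts is illustrative, so this is only a structural sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(tokens_a, tokens_b, d_k):
    """Scaled dot-product cross-attention: queries from one branch attend
    to keys/values from the other, fusing the two token streams."""
    q = tokens_a[:, :d_k]          # stand-ins for the conv-derived Q, K, V
    k = tokens_b[:, :d_k]
    v = tokens_b
    scores = softmax(q @ k.T / np.sqrt(d_k))
    return scores @ v

def mlp_block(x, w1, w2):
    """Two linear layers with GELU (tanh approximation) in between."""
    h = x @ w1
    gelu = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return gelu @ w2

rng = np.random.default_rng(0)
ta = rng.random((5, 32))   # tokens from branch 1
tb = rng.random((9, 32))   # tokens from branch 2
fused_a = cross_attention(ta, tb, d_k=16)
fused_b = cross_attention(tb, ta, d_k=16)
# Branch outputs are summed, then refined by a residual MLP block.
summed = fused_a + fused_b[:5]     # illustrative alignment of token counts
deep = summed + mlp_block(summed, rng.normal(0, 0.1, (32, 64)),
                          rng.normal(0, 0.1, (64, 32)))
print(fused_a.shape, deep.shape)   # (5, 32) (5, 32)
```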
2.5. Classifier Head
We extract the learnable classification token from the output tokens of the transformer encoder and pass it through a linear layer to obtain a one-dimensional vector of length c, where c represents the number of classes. The softmax function ensures that the activations of the output units sum to 1, and selecting the maximum value yields the class label for that pixel. The entire process can be represented in the following equation:
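A minimal sketch of this classifier head follows; the weights are random stand-ins, and nine classes are chosen only to match the Pavia University dataset:

```python
import numpy as np

def classify(cls_token, w, b):
    """Linear layer + softmax over c classes; argmax gives the label."""
    logits = cls_token @ w + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs, int(np.argmax(probs))

rng = np.random.default_rng(0)
cls_token = rng.random(32)           # class token from the transformer encoder
w = rng.normal(0, 0.1, (32, 9))      # 9 classes, e.g., Pavia University
b = np.zeros(9)
probs, label = classify(cls_token, w, b)
print(round(probs.sum(), 6), label)  # probabilities sum to 1
```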
The complete procedure of the TNCCA method, as proposed, is outlined in Algorithm 1.
Algorithm 1 Multi-scale Feature Transformer with CNN-Enhanced Cross-Attention Model

Input: Input HSI data and ground truth labels; the original data are reduced in spectral dimension to r = 30 using PCA. A set of small cubes with sizes 13 and 7 is then extracted. Subsequently, the training set of the model is randomly sampled at a sampling rate of 1%.
Output: Predicted labels for the test dataset.
1: Set the batch size of the training data to 64 and use the Adam optimizer with a learning rate of 5 × 10. Decay the learning rate by a factor of 0.9 every 50 steps. Set the total number of training epochs to 500.
2: After the dimensionality reduction of the original HSI using PCA, a pair of cubes centered on each pixel is extracted. Each extracted pair is placed into a collection, which is then divided into a training set and a testing set according to Table 1.
3: Create training and test data loaders. Each group of training and testing data obtains its corresponding ground truth labels.
4: for each training epoch do
5:  Extract the multi-scale shallow spatial–spectral features using the dual-branch, multi-scale shallow feature extraction module.
6:  Convert the feature maps into tokens and use them as inputs for the next module.
7:  Pass the tokens through the transformer encoder with cross-attention to obtain the deep semantic features.
8:  Extract the learnable classification token and feed it into the classification head to obtain the predicted class for the current pixel.
9: end for
10: Apply the trained model to the test dataset to generate the predicted labels.
Table 1.
Explanation of the division of training samples and test samples in the Houston2013 dataset, the Trento dataset, and the Pavia University dataset.
| NO. | Houston2013 Class | Training (1%) | Test | Trento Class | Training (1%) | Test | Pavia University Class | Training (1%) | Test |
|---|---|---|---|---|---|---|---|---|---|
| #1 | Healthy Grass | 13 | 1238 | Apple Trees | 40 | 3994 | Asphalt | 66 | 6565 |
| #2 | Stressed Grass | 13 | 1241 | Buildings | 29 | 2874 | Meadows | 186 | 18,463 |
| #3 | Synthetic Grass | 7 | 690 | Ground | 5 | 474 | Gravel | 21 | 2078 |
| #4 | Tree | 12 | 1232 | Woods | 91 | 9032 | Trees | 31 | 3033 |
| #5 | Soil | 12 | 1230 | Vineyard | 105 | 10,396 | Metal Sheets | 13 | 1332 |
| #6 | Water | 3 | 322 | Roads | 31 | 3143 | Bare Soil | 50 | 4979 |
| #7 | Residential | 13 | 1255 | | | | Bitumen | 13 | 1317 |
| #8 | Commercial | 12 | 1232 | | | | Bricks | 37 | 3645 |
| #9 | Road | 13 | 1239 | | | | Shadows | 9 | 938 |
| #10 | Highway | 12 | 1215 | | | | | | |
| #11 | Railway | 12 | 1223 | | | | | | |
| #12 | Parking Lot 1 | 12 | 1221 | | | | | | |
| #13 | Parking Lot 2 | 5 | 464 | | | | | | |
| #14 | Tennis Court | 4 | 424 | | | | | | |
| #15 | Running Track | 7 | 653 | | | | | | |
| | Total | 150 | 14,879 | Total | 301 | 29,913 | Total | 426 | 42,350 |
3. Results
3.1. Data Description
The proposed TNCCA model was tested on three widely used datasets. Below, we introduce these three datasets one by one.
Houston2013 dataset: The Houston2013 dataset was jointly provided by the research group at the University of Houston and the National Mapping Center of the United States. It contains a wide range of categories and has been widely used by researchers. The dataset consists of 144 bands and contains 15 different classification categories. Figure 4 displays the pseudocolored image and ground truth map of the Houston2013 dataset.
Trento dataset: The Trento dataset was captured in the southern region of Trento, Italy, using the Airborne Imaging Spectrometer for Applications (AISA) Eagle sensor. The dataset consists of 63 spectral bands and includes six different categories of ground objects. Figure 5a,b respectively display the pseudocolored image and ground truth map.
Pavia University dataset: The Pavia University dataset is an HSI collected in 2001 over Pavia University in Italy using the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. The image comprises 115 bands, and there are nine land cover classification categories in total. To reduce the interference of noise, we removed 12 noisy bands. Figure 6 displays the pseudocolored image and ground truth map of the dataset.
We present the division of training and test samples for the three datasets in Table 1, which includes the specific numbers for each category. For each category, we used 1% of the total number of samples as the training set.
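The per-class 1% random split can be sketched as follows; the `min_train` floor is an assumption added here to guarantee at least one training sample per class:

```python
import numpy as np

def per_class_split(labels, rate=0.01, min_train=1, seed=0):
    """Randomly sample `rate` of each class's labeled pixels for training;
    the remainder becomes the test set."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        n = max(min_train, int(round(len(idx) * rate)))
        train_idx.extend(idx[:n])
        test_idx.extend(idx[n:])
    return np.array(train_idx), np.array(test_idx)

rng = np.random.default_rng(0)
labels = rng.integers(0, 6, size=30000)    # toy label map with 6 classes
tr, te = per_class_split(labels, rate=0.01)
print(len(tr) + len(te) == labels.size, round(len(tr) / labels.size, 3))
```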
3.2. Parameter Analysis
In the proposed model, there is a set of hyperparameters, such as the batch size, the size of the first cubic patch, and the size of the second cubic patch. We conducted an experimental analysis of these parameters to ensure that their values were optimal; the results are shown in Figure 7, Figure 8 and Figure 9.
(1) Batch Size: We observed that the performance of the transformer architecture is highly sensitive to the batch size, with different sizes resulting in varying classification performance. We therefore evaluated a set of candidate batch sizes and experimentally determined the one that yielded the best performance for our proposed model.
(2) Patch Size: Since the cubic patch served as the input to the model, selecting a patch size that was too small could limit the model’s receptive field, while choosing a size that was too large could result in excessive data volume and increased computational complexity. Our proposed TNCCA selected two different sizes of cubic patches to extract multi-scale features, for which the size of the cubic patch in the first branch was slightly larger than that in the second branch. These two cubic patches served as inputs to the model, and their sizes significantly impacted the classification accuracy. Therefore, we conducted experiments on these two hyperparameters.
We first selected the patch size of the first branch from a set of candidate values, and the experimental results showed that the model achieved the best classification performance when its value was 13. Then, for the second branch, we selected from a further set of candidates. From Figure 7, Figure 8 and Figure 9, it can be observed that the model achieved the highest classification metrics when its value was 7.
3.3. Classification Results and Analysis
In this section, we compare the classification performance of the proposed model with eight advanced classification models: SVM [14], 1D-CNN [22], 3D-CNN [24], M3D-CNN [25], 3D-DLA [44], Hybrid [26], SSFTT [39], and morphFormer [42]. To preserve the original performance of the comparative models, we used the training strategies described in their respective papers. The number of training and testing samples for each model was the same as listed in Table 1, and random sampling was employed. To reproduce our experiments, the code can be downloaded from the following link: https://github.com/cupid6868/TNCCA.git (accessed on 25 March 2024).
(1) Quantitative results and analysis: We present the results on the Houston2013, Trento, and Pavia University datasets in Table 2, Table 3 and Table 4, with the best result for each metric highlighted. The comparative classification metrics include overall accuracy (OA), average accuracy (AA), the Kappa coefficient (κ), and class-wise accuracy. The data in the tables clearly indicate that the proposed TNCCA outperformed the other models on the experimental datasets. Taking the Houston2013 dataset as an example, TNCCA exhibited the best classification performance for classes such as ‘Synthetic Grass’, ‘Soil’, ‘Water’, ‘Commercial’, ‘Parking Lot 2’, ‘Tennis Court’, and ‘Running Track’. For classes like ‘Healthy Grass’, ‘Stressed Grass’, and ‘Parking Lot 1’, although our model’s performance was not the best, it still ranked among the top methods. In contrast, SVM and 1D-CNN showed extremely low classification performance for certain classes. This demonstrates that, in the context of small sample sizes, the proposed model effectively utilizes multi-scale feature information and fully exploits the spatial–spectral characteristics of HSI.
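The three metrics can be computed from a confusion matrix as in the sketch below; the toy label vectors are purely illustrative:

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Overall accuracy, average (per-class) accuracy, and Cohen's Kappa."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                            # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))       # mean per-class accuracy
    pe = (cm.sum(axis=1) @ cm.sum(axis=0)) / n**2    # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 0])
oa, aa, kappa = classification_metrics(y_true, y_pred, 3)
print(round(oa, 3), round(aa, 3), round(kappa, 3))
```

OA rewards the dominant classes, AA weights every class equally, and κ discounts agreement expected by chance, which is why all three are reported together.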
(2) Visual evaluation and analysis: We present the above experimental results as classification maps in Figure 10, Figure 11 and Figure 12. By comparing the spatial contours of the classification maps and the noise they contain, the superior classification performance of the proposed TNCCA over the other models can be clearly observed.
Among the classification maps, that of TNCCA exhibits the clearest spatial contours and contains the least noise, whereas the maps of the other models show more misclassifications and interfering noise. Taking the Houston2013 dataset as an example, the classification map of our proposed model closely resembles the ground truth map, while the maps of SVM, 1D-CNN, 3D-CNN, M3D-CNN, and 3D-DLA exhibit more misclassifications and noise. In the zoomed-in window, the high classification performance of our proposed model for classes such as ‘Parking Lot 2’, ‘Road’, and ‘Synthetic Grass’ can be clearly observed.
In conclusion, our proposed model outperformed the compared models and demonstrated the best classification performance, highlighting its capability to extract features effectively in small-sample scenarios.
3.4. Analysis of Inference Speed
To demonstrate the inference speed of the proposed TNCCA, we report the training and testing times of the model on the different datasets in Table 5. The data show that training is fast: the model completes 500 epochs in a very short period. To monitor model performance during training, we adopted the strategy of running a test after each epoch, which makes the total testing time significantly longer than the training time. Additionally, we employed a dynamic learning rate to accelerate convergence.
Among the three tested datasets, the Pavia University dataset, which has the largest spatial and spectral dimensions, took the longest: 1.26 min for training, i.e., only about 0.153 s per epoch. The training times for the other datasets were shorter. From this table, it is easy to conclude that our proposed model not only achieves high classification accuracy but also trains quickly, demonstrating high efficiency.
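The stepped decay used during training (×0.9 every 50 steps, per Algorithm 1) can be sketched as below; the base rate 1e-3 is only a placeholder, since the actual initial learning rate is the one specified in Algorithm 1:

```python
def stepped_lr(base_lr, step, decay=0.9, every=50):
    """Learning rate after `step` optimizer steps, dropped by ×`decay`
    every `every` steps, as in the schedule of Algorithm 1."""
    return base_lr * decay ** (step // every)

# Placeholder base rate for illustration only.
print(stepped_lr(1e-3, 0), stepped_lr(1e-3, 49), stepped_lr(1e-3, 50))
```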
3.5. Ablation Analysis
To validate the effectiveness of each module in the proposed model, we conducted ablation experiments on the Houston2013 dataset with four modules: the 3D convolutional layer (3D-Conv), the multi-scale 2D convolutional module (Ms2D-Conv), the feature map tokenization module (Tokenizer), and the transformer encoder module (TE). We evaluated five different combinations of these modules in terms of OA, AA, and Kappa. The results are listed in Table 6.
Specifically, we first kept only the 3D convolutional layer, and the performance was evidently extremely poor. Next, we removed the transformer encoder with the CNN-enhanced cross-attention mechanism, one of the main innovations of this paper; the OA, AA, and Kappa values of the model all decreased significantly compared with TNCCA. We then removed the 3D convolutional layer and replaced the multi-scale 2D convolutional module with a regular 2D convolutional layer, which again reduced both OA and AA relative to TNCCA. Removing only the 3D convolutional layer caused the loss of rich spectral information in the HSI and likewise lowered OA and AA. Finally, replacing only the multi-scale 2D convolutional module with a regular 2D convolutional layer also decreased OA and AA compared with TNCCA. These results clearly demonstrate the positive contribution of each of the four modules to the network’s classification accuracy.