1. Introduction
Hyperspectral image (HSI) classification technology is of paramount importance in remote sensing, as it captures and analyzes spectral information from visible light to the near-infrared spectrum, thereby enabling precise identification of terrestrial objects. With the advent of deep learning algorithms, the performance of HSI classification has improved remarkably, allowing researchers to delve deeply into the spatial and spectral characteristics embedded within spectral data and thus propelling the ongoing advancement of remote sensing technology.
Early research on the classification of HSIs focused on spectral feature extraction and widely used methods such as principal component analysis, the envelope removal method, and the spectral angle mapper algorithm. The extracted features were subsequently input into various supervised classifiers, such as logistic regression, naive Bayes, and artificial neural networks, to achieve effective classification. However, relying solely on spectral information for hyperspectral classification presents numerous challenges. For instance, the Hughes phenomenon indicates that the demand for training samples increases sharply with the number of bands. This leads to a scenario where classification accuracy first improves and then declines when samples are limited. Additionally, HSIs often contain mixed pixels, in which the spectral characteristics arise from the reflections of multiple objects, making it more likely that classification errors will occur if only spectral information is considered.
Researchers have begun to fuse spectral features with spatial features for HSI classification to overcome the limitations of single-dimensional information. Deep learning methods have gradually replaced traditional HSI classification methods in this context. Convolutional neural networks (CNN) in HSI classification have evolved from using spectral dimensions to spatial dimensions and then to space-spectrum associations. At first, 1D CNN [
1] was used to focus on extracting spectral features. However, spectral information alone is insufficient for the HSI classification task to achieve accurate results. As a result, 2D CNN [
2] was proposed to extract spatial information. Nevertheless, neither 1D nor 2D CNN takes full advantage of the three-dimensionality of HSI. Consequently, 3D CNN [
3] has been applied in HSI classification to achieve a comprehensive fusion of HSI’s spatial spectral features. During this period, Dosovitskiy et al. presented the vision Transformer (ViT) [
4] model, which effectively alleviates the receptive field limitation issues faced by traditional convolutional approaches through its multi-head attention mechanism. Since then, many methods combining ViTs and CNNs have appeared. SpectralFormer [
5] was the first to apply the Transformer to the HSI classification task. SSFTT [
6] integrated a backbone CNN and Transformer, with convolutional layers used to capture low-level features and the Transformer used to generate high-level features. GAHT [
7] presented a grouped pixel embedding module that restricts multi-head attention to the local spatial-spectral domain, allowing for HSI classification from a spatial-spectral perspective in a global-local manner. DBSSAN [
8] is a dual-branch spectral-spatial attention network in which the spatial branch fuses global and local spatial features through a spatial self-attention module, while the spectral branch employs a Transformer to extract spectral features. The extracted features are then fused and classified using a multi-layer perceptron. These methods not only inherit the global feature perception capabilities of the ViT model but also retain the strengths of CNNs in local feature fusion.
Accurate feature extraction can enhance the model’s performance and reduce computational complexity. Many studies have focused on improving the accuracy and diversity of spatial-spectral feature extraction. MASSFormer [
9] is a memory-augmented spectral-spatial Transformer that introduces a memory tokenization module to convert spectral-spatial features into memory tokens, preserving local context and significant features, thereby enhancing the model’s feature representation capability. The memory-augmented Transformer encoder achieves sufficient information mixing through an expanded multi-head self-attention mechanism, improving the performance of HSI classification. Wang et al. [
10] constructed a comprehensive feature extraction framework by combining global feature extraction, local feature extraction, and feature alignment strategies. This framework can fully utilize the spectral and spatial information of HSIs to extract more representative and discriminative features, providing strong support for classification tasks. M3FuNet [
11] is an unsupervised multi-feature fusion network that extracts spectral and spatial features through multi-scale supervector matrix correction and multi-scale random convolutional dispersion methods and achieves feature calibration through feature fusion and decision fusion, enhancing the performance of HSI classification. APFL [
12] is a semi-supervised adaptive pseudo-label feature learning model that improves the accuracy of HSI classification by utilizing unlabeled sample information through iterative multi-scale super-pixel segmentation and pseudo-label feature generation. Therefore, compared to methods that rely solely on CNNs to obtain spectral-spatial features, these hybrid methods can extract more comprehensive and richer spectral-spatial features.
While achieving good results, the hyperspectral classification methods mentioned above also have several challenges. One major issue is the rigidity of existing neural networks, which cannot adapt to the complexity and diversity of HSI data. When processing HSI data, neural networks must possess adaptable and flexible receptive fields to accommodate the diverse sizes and shapes of surface features [
13]. The spectral-spatial asymmetry inherent in HSIs must also be effectively managed. Furthermore, given the pronounced differences in spatial resolution, spectral coverage, and number of bands among HSI datasets, researchers must thoroughly consider these factors when designing neural networks, which makes the design process cumbersome and heavily dependent on the designer’s experience. NAS [
14] is an automated approach for creating neural network structures that reduces the complexity of the design cycle and its reliance on expertise compared to manually customizing the structure for each dataset.
Current NAS methods for HSI classification are categorized into global and cell-based search spaces [
15,
16]. Global search spaces involve constructing a directed acyclic graph to identify an optimal set of operators, but this often results in numerous candidate architectures, consuming considerable computational resources and time. In contrast, cell-based search spaces offer flexibility in managing the number of cells tailored to specific scenarios and come in two forms: one utilizing a single cell type that views hyperspectral data holistically and another using two separate cells for spectral and spatial data. While the former acknowledges the relationship between spectral and spatial dimensions, it underestimates the importance of spectral data in classification. The latter method emphasizes feature extraction but overlooks the interdependency of spectral and spatial information, which is crucial for distinguishing categories with similar spectral characteristics, especially in complex environments. Moreover, prevalent patch-based input strategies in existing deep learning models [
17] can lead to a loss of global contextual information, undermining the contribution of spatial data to classification tasks. Thus, there is a pressing need for methods that effectively integrate spectral and spatial dimensions to fully leverage the unique characteristics of hyperspectral data.
In this paper, we propose a new triple-unit hyperspectral NAS network named TUH-NAS to address the main issues present in cell-based search methods. TUH-NAS features an inner search space comprising a spectral processing unit (SPEU), a spatial processing unit (SPAU), and a feature fusion unit (FFU). Each unit has distinct functions, with SPEU focusing on deep spectral information mining, SPAU on spatial feature extraction, and FFU on merging information from the first two units to strengthen the link between spectral and spatial dimensions. Additionally, the model incorporates an HSI attention mechanism module (HSIAM) in the search space to increase sensitivity to crucial features, such as edge regions. To ensure effective training, we utilized a comprehensive loss function that facilitates balanced objectives for accurate classification while emphasizing edge significance. The FFU specifically addresses patch-based input patterns, enabling the model to concentrate on localized details and subtle spectral variations, ultimately improving accuracy in identifying edge land cover targets.
The primary contributions presented in our paper are as follows:
This paper proposes a triple-unit hyperspectral NAS, which includes three types of processing units: the spectral processing unit, the spatial processing unit, and the feature fusion unit. The entire search space can be divided into internal and external spaces.
We designed an attention mechanism module named HSIAM, specifically tailored for HSI data processing. This module integrates multi-scale feature fusion and enhanced channel technology to improve the spectral, spatial, and channel feature representation capabilities of the input feature maps. Additionally, we developed a triple fusion loss function that comprehensively considers both spectral and spatial information, capturing feature differences and similarities from multiple aspects, thereby enhancing classification accuracy.
The outcomes of experiments on the MUUFL, Houston2018, and XiongAn datasets indicate that our methodology improves the overall accuracy by approximately 3%, 5.6%, and 3.5% compared to the baseline methods, demonstrating that the proposed method significantly enhances the precision of HSI classification methods.
The remainder of this paper is structured as follows:
Section 2 offers an overview of related research.
Section 3 discusses our methodology in detail.
Section 4 describes the experiment setup and evaluates the results. Finally, in
Section 5, we summarize our findings.
3. Proposed Method
The main objective of the proposed TUH-NAS is to enhance the integration of spectral and spatial data, aiming to achieve greater accuracy in HSI classification tasks. In this section, we first present the overall framework of TUH-NAS, encompassing both its internal and external search architectures, and then introduce the network’s attention mechanism module and loss function.
3.1. The Overall Architecture of TUH-NAS
TUH-NAS provides a complex NAS model for HSI classification tasks. This model integrates a new hybrid search space to handle HSI data’s spectral and spatial information effectively. The process of NAS can be divided into two components: an internal search and an external search. The internal search strategy determines the internal topology of each working unit, while the external search identifies the type of working unit for each layer. The overall algorithm flowchart of the proposed method is shown in
Figure 1.
3.1.1. Internal Search Architecture
The internal search architecture of TUH-NAS is illustrated in
Figure 2a. Each computational unit comprises multiple nodes that receive two inputs from the current unit along with outputs from all preceding nodes. These nodes process the information via multiple pathways, each corresponding to a set of candidate operations. Each operation is accompanied by a weight that can be fine-tuned using a gradient descent algorithm. Ultimately, each node retains the top two most effective paths with the highest weights, thus establishing a fixed fundamental cell, as depicted in
Figure 2b. It is assumed that each path in the cell contains C candidate operations, and the output of node
Ni is

Ni = ∑x∈I(i) ∑c=1,…,C θc · ρc(x),

where I(i) denotes the set of inputs to node Ni, ρ denotes the different candidate operations, and θ represents their corresponding weights, which are derived through self-learning. The output of the work unit of type j in layer l is obtained by concatenating the outputs of its nodes, Concat(N1, N2, …, NN), where j indicates the unit type, l signifies the layer number, i ∈ {1, 2, …, N}, and N is the number of nodes within each fundamental unit.
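For illustration, the weighted mixing of candidate operations on a single path can be written as a small PyTorch module. The sketch below assumes a DARTS-style continuous relaxation with softmax-normalized weights; the name MixedOp and its structure are illustrative rather than the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    # Weighted sum of C candidate operations on one path (illustrative sketch).
    def __init__(self, candidates):
        super().__init__()
        self.ops = nn.ModuleList(candidates)                      # the C candidate operations ρ
        self.theta = nn.Parameter(torch.zeros(len(candidates)))   # architecture weights θ

    def forward(self, x):
        # The weights are normalized and learned by gradient descent together with the network.
        w = F.softmax(self.theta, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

Retaining only the two highest-weight incoming paths per node after the search then yields the fixed cell shown in Figure 2b.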
Each work unit contains nine dedicated candidate operations and two common operations, resulting in twenty-nine candidate operations within the entire architecture, more than in other NAS methods. When using NAS for HSI classification tasks, a more diverse set of candidate operations allows the network to achieve more complex feature extraction, benefiting the identification of land cover edges. The candidate operations included in each unit are shown in
Table 1.
The meanings of the various candidate operations are as follows:
econ_i−1: LReLU − Conv(i × 1 × 1) − BN.
esep_i−1: LReLU − Sep(i × 1 × 1) − BN.
acon_1−i: LReLU − Conv(1 × i × i) − BN.
asep_1−i: LReLU − Sep(1 × i × i) − BN.
con_i−j: LReLU − Conv(1 × j × j) − Conv(i × 1 × 1) − BN.
dilated_i−1: Dilated convolution with a kernel size of i × 1 × 1 and a dilation factor of 2.
dilated_1−i: Dilated convolution with a kernel size of 1 × i × i and a dilation factor of 2.
dilated_i−i: Dilated convolution with a kernel size of i × i × i and a dilation factor of 2.
skip_connect: f(x) = x.
none: f(x) = 0.
where LReLU denotes the Leaky ReLU activation function; BN refers to batch normalization; Conv signifies standard convolution; and Sep represents separable convolution.
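As an illustration of this naming scheme, the spectral and spatial convolution blocks can be assembled from standard 3D layers. The builders below for econ- and acon-style operations are hypothetical sketches rather than the exact implementation.

import torch.nn as nn

def econ(i, channels):
    # LReLU − Conv(i × 1 × 1) − BN: spectral convolution block (illustrative).
    return nn.Sequential(
        nn.LeakyReLU(inplace=True),
        nn.Conv3d(channels, channels, kernel_size=(i, 1, 1), padding=(i // 2, 0, 0), bias=False),
        nn.BatchNorm3d(channels),
    )

def acon(i, channels):
    # LReLU − Conv(1 × i × i) − BN: spatial convolution block (illustrative).
    return nn.Sequential(
        nn.LeakyReLU(inplace=True),
        nn.Conv3d(channels, channels, kernel_size=(1, i, i), padding=(0, i // 2, i // 2), bias=False),
        nn.BatchNorm3d(channels),
    )

The separable (esep/asep) and dilated variants follow the same pattern, replacing the standard convolution with separable or dilated 3D convolutions.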
3.1.2. External Search Network
In the TUH-NAS, each network layer comprises three working units: a spectral processing unit, a spatial processing unit, and a feature fusion unit. The external search network determines which kind of working unit will be selected for the final trained network. The external architecture of the search network is depicted in
Figure 3.
While searching the topological structure within the working units, we also compute the external outputs of these units. During the external architecture search process, the SPEU and SPAU of each network layer simultaneously extract features from the spectral and spatial dimensions of the input image, respectively. The outputs from the SPEU and SPAU are then used as input for the FFU to integrate the spectral and spatial data. An attention mechanism module, HSIAM, is connected after the FFU to focus on critical spectral-spatial regions. Thus, the output Ol for layer l generated by the HSIAM can be represented as:

Ol = HSIAM(FFU(SPEU(Ol−1), SPAU(Ol−1))),

where Ol−1 denotes the output of the previous layer (the input image for the first layer).
To quantify and regulate the contribution of each working unit in the final decision-making process, we assign a weight coefficient to each working unit. This coefficient reflects the importance of each unit’s output in the overall decision and provides a flexible mechanism for adjusting the network architecture based on task requirements. For each network layer, we select the working unit with the highest weight as the representative for that layer, which will participate in constructing the final optimized training network. By assigning different initial values to these weights, we can effectively intervene in the external search process, allowing for tailored architecture search schemes for various application scenarios.
Figure 4 depicts the architecture of the optimized training network.
At the end of the optimized training network, we connect a standard Transformer module that works in conjunction with the HSIAM in the network. This combination further enhances the ability to capture global contextual information, allowing the model to learn long-range dependencies between input features. The attention mechanism helps the network selectively amplify important information at different resolution levels, while the Transformer further enhances the interactions among features through self-attention, enabling the network to learn more complex spatial relationships. This combination is highly effective for HSI classification tasks.
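The data flow of one searched layer described above can be summarized by the following sketch, in which speu, spau, ffu, and hsiam are placeholder submodules standing for the corresponding units; this is an illustration rather than the exact implementation.

import torch.nn as nn

class SearchLayer(nn.Module):
    # One layer of the external search: SPEU and SPAU feed the FFU, followed by HSIAM.
    def __init__(self, speu, spau, ffu, hsiam):
        super().__init__()
        self.speu, self.spau, self.ffu, self.hsiam = speu, spau, ffu, hsiam

    def forward(self, x):
        spe = self.speu(x)              # spectral features
        spa = self.spau(x)              # spatial features
        fused = self.ffu(spe, spa)      # spectral-spatial fusion
        return self.hsiam(fused)        # attention over the fused features

During the external search, each working unit additionally carries a learnable weight, and the highest-weight unit of each layer is retained in the final trained network, as described above.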
3.2. Attention Mechanism Module
Adding attention modules to deep learning networks can help the network automatically concentrate on significant spectral and spatial features in images, suppress interfering factors, enhance the model’s understanding of spatial context, and effectively capture the edges and structures between ground objects.
We propose HSIAM, an attention mechanism module comprising spatial, spectral, and channel attention modules. HSIAM typically follows the FFU in the search network. Each processing unit handles comprehensive three-dimensional data, with all three units preserving the interrelationships between spatial and spectral information while focusing on different aspects. This approach optimizes the extraction of relational information between spatial and spectral data. The multi-scale spectral attention module captures spectral information at different resolutions, aiding in capturing details and edges. The multi-scale spatial attention mechanism emphasizes salient features and retains global context. To manage the high computational load of hyperspectral data, which contains numerous bands, we divide the spectral data into smaller blocks for batch processing and introduce channels. The channel attention module processes spatial-spectral data within a single channel. The three submodules of HSIAM effectively fuse different attention outputs through weighted distribution, enhancing feature representation and analysis.
Figure 5 illustrates the structure of HSIAM.
In the multi-scale spectral attention module, the input feature map is first subjected to adaptive average pooling to compress the spatial dimension data while preserving spectral information. Next, 3D convolutions with kernel sizes of 1 × 1 × 1, 3 × 3 × 3, and 5 × 5 × 5 are employed to extract local, medium-scale, and large-scale spectral features, respectively. The convolution results at different scales are concatenated in the channel dimension, followed by processing through a 1 × 1 × 1 3D convolution and a ReLU activation function to restore the original number of channels. Finally, normalization is performed by applying a Sigmoid activation function, and an element-wise multiplication with the original input feature map is carried out to obtain enhanced spectral features.
The multi-scale spatial attention module aims to strengthen the spatial information of the input feature map through convolutions of different scales. First, the average and maximum values of the input feature map are calculated along the channel dimension and concatenated. Then, multi-scale spatial feature extraction is performed by combining depth-wise separable convolutions (3 × 3 × 3 kernels) with standard 3D convolutions (3 × 3 × 3 and 5 × 5 × 5 kernels). The results of these multi-scale convolutions are summed and then normalized using a Sigmoid activation function, which is used for element-wise multiplication with the generated spectral feature map, resulting in enhanced spatial features. By combining spectral and spatial attention features, the module achieves complementarity between the two, enhancing the representational capability of features and enabling the model to highlight important features when processing complex inputs more effectively.
The enhanced channel attention module focuses on strengthening the channel information of the input feature map. This module first performs adaptive average pooling and max pooling to compute average and maximum values in the spatial dimension. Following this, a fully connected layer is utilized to reduce and subsequently restore the number of channels while also normalizing the processed features. Ultimately, the normalized feature map is multiplied element-wise with the original input feature map in order to obtain enhanced channel features.
The formula used to calculate the HSIAM is as follows:
Let x be the input feature map. The output of the multi-scale spectral attention module can be calculated using the following equations:
where conv1, conv3, and conv5 represent 3D convolutions with kernel sizes of 1 × 1 × 1, 3 × 3 × 3, and 5 × 5 × 5, respectively, and avg_pool signifies adaptive average pooling. ReLU is the activation function, and Conv3d refers to a 3D convolution with a kernel size of 1 × 1 × 1. Concat denotes concatenation along a specified dimension.
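A simplified PyTorch sketch of this multi-scale spectral attention is given below, assuming input tensors of shape (batch, channels, bands, height, width); the module name and layer configuration are illustrative rather than the exact implementation.

import torch
import torch.nn as nn

class MultiScaleSpectralAttention(nn.Module):
    # Spectral attention for (B, C, bands, H, W) tensors (illustrative sketch).
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))   # compress spatial dims, keep the bands
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=1)
        self.conv3 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv5 = nn.Conv3d(channels, channels, kernel_size=5, padding=2)
        self.fuse = nn.Sequential(
            nn.Conv3d(3 * channels, channels, kernel_size=1),   # restore the channel count
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        s = self.pool(x)
        multi = torch.cat([self.conv1(s), self.conv3(s), self.conv5(s)], dim=1)
        att = torch.sigmoid(self.fuse(multi))
        return x * att                                          # enhanced spectral features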
Next, the output calculation formula for the multi-scale spatial attention module is as follows:
where depthwise_conv signifies depthwise convolution, pointwise_conv denotes pointwise convolution, and max_out is the output of maximum pooling.
The formula for computing the output of the enhanced channel attention module is obtained as follows.
The final output signal of the HSIAM is obtained by weighting the outputs Fspe, Fspa, and Fch of the three attention submodules:

HSIAM(x) = a·Fspe + b·Fspa + c·Fch,
where a, b, and c represent the weights designated to the spectral attention submodule, spatial attention submodule, and channel attention submodule, respectively. By assigning different weights to the outputs of these three submodules, the HSIAM can better adapt to various tasks.
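For illustration, the enhanced channel attention and the weighted fusion of the three submodule outputs can be sketched as follows; the reduction ratio and the names are assumptions rather than the exact implementation, and the default fusion weights correspond to the values used in Section 4.2.

import torch
import torch.nn as nn

class EnhancedChannelAttention(nn.Module):
    # Channel attention from pooled spatial-spectral statistics (illustrative sketch).
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool3d(1)
        self.max_pool = nn.AdaptiveMaxPool3d(1)
        self.mlp = nn.Sequential(                 # reduce and then restore the channel count
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        avg = self.mlp(self.avg_pool(x).view(b, c))
        mx = self.mlp(self.max_pool(x).view(b, c))
        att = torch.sigmoid(avg + mx).view(b, c, 1, 1, 1)
        return x * att

def hsiam_fusion(f_spe, f_spa, f_ch, a=0.4, b=0.3, c=0.3):
    # Weighted combination of the spectral, spatial, and channel attention outputs.
    return a * f_spe + b * f_spa + c * f_ch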
3.3. Loss Function
In HSI classification, selecting a robust and effective loss function is crucial for extracting features from hyperspectral data. Numerous studies [
41,
42] have demonstrated the effectiveness of integrating different loss functions to address various HSI tasks. The cross-entropy loss (CE loss) [
43] function is often employed in HSI classification tasks. While it performs well in many cases, it also exhibits some notable drawbacks. The CE loss focuses on pixel-level classification accuracy, neglecting the spatial relationships between pixels; in HSIs, however, adjacent pixels often share similar spectral characteristics and categories, information that CE loss cannot exploit. Additionally, in hyperspectral data, some classes of samples might be easier to distinguish, while others are relatively difficult [
44,
45]. For example, identifying land cover types that occupy large contiguous areas is relatively easy, whereas recognizing elongated features such as sidewalks can be challenging. CE loss does not take this aspect into account. It treats all samples equally, which could result in the model optimizing ineffectively and failing to focus on harder-to-distinguish samples.
To overcome this issue, we propose a Triple Fusion Loss Function (TFLF) composed of CE loss, dice loss [
46], and focal loss [
47]. Dice loss emphasizes improving the prediction capability for small objects, particularly in scenarios with sample imbalance, and encourages the model to learn minority classes more effectively; it is widely used in segmentation tasks. Focal loss, on the other hand, down-weights easy samples and increases the weight of hard-to-classify samples, such as those at the edges of land cover. By integrating these different losses, the joint loss function can flexibly adjust the weight of each component based on the characteristics of the data, allowing the model to classify targets better and facilitating the delineation of land cover boundaries.
The TFLF is defined as follows:

LTFLF = α·LCE + β·LDice + γ·LFocal,

where α, β, and γ are hyperparameters that balance the contributions of each individual loss component.
CE loss: This metric is widely used in classification tasks, primarily measuring the difference between two probability distributions. Its formulation is as follows:

LCE = −∑i yi·log(pi),

where yi represents the true label and pi denotes the predicted probability for class i.
Dice loss: This loss function is employed in image segmentation tasks and is primarily designed to optimize the F1 score, measuring the similarity between predictions and true labels. It is particularly well suited to addressing class imbalance. The dice loss is defined as:

LDice = 1 − (2·∑i pi·yi + ϵ)/(∑i pi + ∑i yi + ϵ),

where ϵ is a small constant to prevent division by zero, pi is the predicted probability, and yi denotes the one-hot encoded true labels.
Focal loss: This loss function mitigates class imbalance by reducing the loss weight assigned to well-classified examples. Its formulation is:

LFocal = −∑i (1 − pi)^γ · yi·log(pi),

where γ is a focusing parameter that adjusts the decay rate of the loss weight for easily classified samples.
The TFLF effectively addresses various challenges in HSI classification by combining joint CE loss, dice loss, and focal loss.
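A minimal PyTorch sketch of such a combined loss is given below. The class name, the focusing parameter value of 2.0, and the global Dice formulation are assumptions rather than the exact implementation; the default weights 0.2, 0.5, and 0.3 follow the settings reported in Section 4.2.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleFusionLoss(nn.Module):
    # Weighted combination of CE, Dice, and focal losses (illustrative sketch).
    def __init__(self, alpha=0.2, beta=0.5, gamma_w=0.3, focusing=2.0, eps=1e-6):
        super().__init__()
        self.alpha, self.beta, self.gamma_w = alpha, beta, gamma_w
        self.focusing, self.eps = focusing, eps

    def forward(self, logits, target):
        # logits: (N, num_classes); target: (N,) integer class indices.
        ce = F.cross_entropy(logits, target)

        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.size(1)).float()
        inter = (probs * one_hot).sum()
        dice = 1.0 - (2.0 * inter + self.eps) / (probs.sum() + one_hot.sum() + self.eps)

        pt = (probs * one_hot).sum(dim=1)                     # probability of the true class
        focal = (-(1.0 - pt) ** self.focusing * torch.log(pt + self.eps)).mean()

        return self.alpha * ce + self.beta * dice + self.gamma_w * focal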
4. Experiments
We conducted experiments on a server equipped with an Intel Xeon Gold 6330 CPU 2.00 GHz, 512 GB of memory, and featuring an NVIDIA GeForce RTX 4090 graphics card with 24 GB of video memory (
https://www.autodl.com). The deep learning framework utilized was Pytorch 2.3.1. Our validation experiments evaluated the proposed method on three benchmark hyperspectral datasets: MUUFL, Houston2018, and XiongAn. We adopted a rigorous evaluation protocol, randomly selecting a fixed number of labeled pixels per class for training (20 and 30 pixels) and validation (10 pixels), and testing using the remaining pixels. The performance of our method was assessed using three standard metrics, namely, overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Kappa), providing a comprehensive evaluation of classification effectiveness.
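The three metrics can be computed directly from the confusion matrix; the following is a standard, illustrative implementation (not the code used in our experiments), where rows correspond to true classes and columns to predicted classes.

import numpy as np

def classification_metrics(conf):
    # OA, AA, and Kappa from a square confusion matrix.
    conf = np.asarray(conf, dtype=np.float64)
    total = conf.sum()
    oa = np.trace(conf) / total                                    # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)                   # per-class accuracy
    aa = per_class.mean()                                          # average accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2    # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa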
4.1. Dataset
To assess the practical efficacy of our proposed NAS algorithm, we carefully selected the three representative HSI datasets mentioned above for comparative experiments. These datasets vary by an order of magnitude in sample size, representing small, medium, and large datasets, thereby providing us with a unique perspective to examine the algorithm’s performance from multiple angles. The sample distribution information for these three HSI datasets is listed in
Table 2.
The MUUFL Gulfport dataset [
48,
49] is an open-source project developed by the GatorSense team at the University of Florida, primarily designed for research in HSI processing and target detection. Collected in November 2010 on the University of Southern Mississippi campus in Long Beach, Mississippi, this dataset incorporates data from various sensors, including hyperspectral and LiDAR data, with a ground sample distance of 1 m. The imagery encompasses a region of 325 by 337 pixels across 72 spectral bands; to mitigate noise interference, the first and last four bands were discarded, resulting in a final compilation of 64 spectral bands, with cropped images measuring 325 by 220 pixels. This dataset features 11 distinct categories, including trees, grasslands, roads, buildings, and various fabric panels, and is regarded as a quintessential small hyperspectral dataset due to its relatively modest size.
The Houston2018 dataset was meticulously gathered at the University of Houston and surrounding urban areas. These hyperspectral data span a spectral range of 380–1050 nm, encompassing 48 distinct bands with a spatial resolution of 1 m. The dataset includes 20 diverse land cover classes featuring roadways, sidewalks, crosswalks, highways, railways, and trains, all exhibiting elongated and narrow shapes that pose significant challenges for classification algorithms. It stands as a quintessential example of a medium-sized hyperspectral dataset.
The XiongAn dataset [
50] was collected using a full spectral band multimodal imaging spectrometer integrated into a high-resolution aviation system, which the Shanghai Institute of Technical Physics developed. The spectral range is 400–1000 nm, with 256 bands. The image dimensions are 1580 × 3750 pixels, with a spatial resolution of 0.5 m. The dataset contains 20 land cover types, primarily focused on economic crops. In this dataset, multiple types of economic crops are intermixed within the same area, overlapping each other, and some planting areas are relatively small. This presents a significant challenge for classification models. The XiongAn dataset is a typical large-scale dataset.
4.2. Implementation Details
We conducted two groups of experiments to validate the proposed algorithm’s efficacy. In the first group, we randomly selected 20 labeled pixels from each category to construct the training set, designated an additional 10 labeled pixels for the validation set, and used the remaining pixels as the test set. In the second group, the number of training samples was increased to 30 per category, while all other configurations remained consistent with the first group. More details on the distribution of samples are provided in
Table 3.
We constructed three separate search networks with a similar outline structure for the three different datasets. In the external structure, each search network comprises four layers of basic cells, each consisting of three distinct types of cells. Within the internal structure, each operational unit comprises three nodes. Given the storage capacity limitation of 24 GB of VRAM, we cropped patches as search samples at a spatial resolution of 25 × 25 for MUUFL and Houston2018, with a batch size set to 4. For the large-scale XiongAn dataset, 24 GB of VRAM was slightly insufficient, so the patch cropping dimensions were set to 18 × 18, and the batch size was 1. We employed the Adam optimizer for all three datasets to fine-tune the architectural parameters, with the learning rate and weight decay set at 0.001. Concurrently, we utilized the standard SGD optimizer to update the parameters of the search network, specifically configuring the momentum decay at 0.9 and the weight decay at 0.0003. At the same time, the learning rate was systematically reduced from 0.025 to 0.001. For the small MUUFL dataset, the first 16 epochs served as a warm-up phase, during which only the search network parameters were optimized. For the medium-sized Houston2018 dataset, we set 20 epochs for warm-up. For the large-scale XiongAn dataset, the warm-up phase was set to 30 epochs. After the warm-up phase, we updated each iteration’s architecture and search network parameters.
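For reference, the optimizer configuration described above can be sketched as follows; TinySearchNet, the parameter split by name, and the cosine annealing schedule are illustrative assumptions rather than the exact implementation.

import torch
import torch.nn as nn

class TinySearchNet(nn.Module):
    # Placeholder search network containing both ordinary weights and architecture weights.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(1, 8, kernel_size=3, padding=1)   # ordinary network weights
        self.theta = nn.Parameter(torch.zeros(5))               # architecture weights

model = TinySearchNet()
arch_params = [p for n, p in model.named_parameters() if "theta" in n]
weight_params = [p for n, p in model.named_parameters() if "theta" not in n]

# Adam for the architecture parameters (learning rate and weight decay 0.001).
arch_opt = torch.optim.Adam(arch_params, lr=0.001, weight_decay=0.001)
# SGD for the search-network weights (momentum 0.9, weight decay 0.0003),
# annealed from 0.025 down to 0.001; a cosine schedule is assumed here.
weight_opt = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9, weight_decay=0.0003)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(weight_opt, T_max=100, eta_min=0.001)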
We extracted patches with a spatial resolution measuring 32 × 32 pixels to optimize the network’s training, employing random cropping, flipping, and rotation as data augmentation methods. During the training phase, the batch size for all three datasets was set to 8. At this juncture, we selected the SGD optimizer, equipped with an adjustable learning rate strategy, with an initial learning rate established at 0.1 and decayed according to a multi-learning rate strategy with a power of 0.9. The network’s performance was validated every 100 iterations.
For this task, the weights a, b, and c of the HSIAM were set at 0.4, 0.3, and 0.3, respectively, while the weights α, β, and γ of the TFLF loss function were assigned values of 0.2, 0.5, and 0.3, respectively.
In the inference phase, we employed a sliding window technique to extract small segments, utilizing a stride equivalent to half the window size, which were then input into the trained model. Gradient computation was disabled, and the well-trained model was used to predict the class of each pixel within the dataset. We obtained class probabilities by applying the SoftMax function and selected the category with the highest probability as the designated pixel class.
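A simplified sketch of this inference procedure is given below, assuming the trained model returns dense per-pixel class logits for each patch; the function and its arguments are illustrative.

import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_predict(model, image, window=32, num_classes=20):
    # image: (1, C, bands, H, W); the stride is half the window size.
    _, _, _, H, W = image.shape
    stride = window // 2
    scores = torch.zeros(num_classes, H, W)
    counts = torch.zeros(1, H, W)
    for top in range(0, max(H - window, 0) + 1, stride):
        for left in range(0, max(W - window, 0) + 1, stride):
            patch = image[..., top:top + window, left:left + window]
            prob = F.softmax(model(patch), dim=1)[0]              # (num_classes, window, window)
            scores[:, top:top + window, left:left + window] += prob
            counts[:, top:top + window, left:left + window] += 1
    counts.clamp_(min=1.0)            # border pixels missed by every window keep a zero score
    return (scores / counts).argmax(dim=0)                        # per-pixel class map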
4.3. Comparative Analysis with Other Techniques
This section compares the proposed TUH-NAS with five other HSI classification methods: SpectralFormer [
5], SSFTT [
6], GAHT [
7], 3D-ANAS [
33], and Hyt-NAS [
34]. Among these, 3D-ANAS and Hyt-NAS are methods that utilize NAS architectures. All comparative methods use official codes. To ensure fairness, all methodologies employed a consistent training set strategy throughout the experiments: randomly selecting 20 and 30 labeled pixels per class for the training set, 10 labeled pixels for the validation set, and the rest for the test set.
Table 4,
Table 5,
Table 6,
Table 7,
Table 8 and
Table 9 summarize the results of the comparative experiments, and
Figure 6,
Figure 7 and
Figure 8 present the corresponding visual outcomes.
The performance of MUUFL is listed in
Table 4 and
Table 5.
Figure 6 illustrates a comparative analysis of the visual effects yielded by various methods. From the comparative results, we can draw the following conclusions:
The proposed TUH-NAS method generally outperforms other comparison methods in classification accuracy when the sample sizes are 20 and 30. For instance, when extracting 20 training samples per class, TUH-NAS achieves an OA of 87.65%, an AA of 88.92%, and a K of 83.90%. This indicates that TUH-NAS has stronger feature extraction and classification capabilities in HSI classification tasks, particularly with smaller sample sizes, where its advantages are more pronounced.
Analyzing the specific classification results in both tables, TUH-NAS demonstrates a more balanced performance across different classes, indicating that it is more robust in feature extraction and classification decision-making for all types of samples. This balance is crucial in practical applications as it minimizes overall performance degradation caused by poor classification of certain categories. In contrast, other classification methods may exhibit fluctuations in classification performance for some categories.
Figure 6 displays the visual results when utilizing 30 training samples for each class. A locally zoomed patch is placed on the right side of each result image to illustrate the differences. This locally zoomed patch mainly contains a building (in pink) and its shadow (in purple). From the locally zoomed patch, it can be observed that SpectralFormer misclassifies the entire building as a shadow, while SSFTT identifies parts of the building as a shadow. In GAHT, some building areas are identified as shadows and roads. In both 3D-ANAS and Hyt-NAS, some building areas are identified as roads. In the results from TUH-NAS, both the buildings and shadows are correctly identified, with well-defined edges and minimal influence from other classes. Overall, TUH-NAS performs exceptionally well in image detail processing, clearly delineating the edges of trees, grass, roads, and buildings and significantly reducing blurred and confused areas. This high clarity aids in the more accurate identification of ground cover. The classification results of TUH-NAS are generally more consistent, with no large-scale misclassifications or outliers.
The comparison results for Houston2018 can be seen in
Table 6 and
Table 7, as well as
Figure 7. The dataset includes more target categories, such as roads, sidewalks, crosswalks, major thoroughfares, highways, railways, and trains. These elongated, narrow targets are generally more challenging to identify and comprise roughly 30% of the total categories. In this dataset, the classification accuracy of all methods is generally low, with significant discrepancies in accuracy among the various approaches. As illustrated in
Table 6 and
Table 7, when the training samples per category are limited to 20, our TUH-NAS method demonstrates the best classification performance across the majority of categories, achieving OA, AA, and Kappa values of 71.51%, 84.52%, and 64.95%, respectively. When the training samples per category are increased to 30, both TUH-NAS and Hyt-NAS achieve optimal classification for eight categories each, with TUH-NAS attaining OA, AA, and Kappa values of 78.47%, 87.79%, and 73.08%, respectively. This indicates that our method has a distinct advantage under limited training sample conditions.
We cropped a small patch from the upper right corner of each result image for magnification, which includes non-residential buildings (dark purple), major thoroughfares (blue), highways (dark red), railways (purple-red), and trains (yellow). Among the methods, SpectralFormer and SSFTT had the poorest identification results, failing to fully recognize the large buildings. GAHT and 3D-ANAS were able to identify buildings to a basic extent but contained some noise. GAHT could hardly distinguish between trains and railways. The 3D-ANAS managed to identify trains and railways but largely failed to recognize major thoroughfares. Hyt-NAS performed well overall but did not identify major thoroughfares. Our proposed TUH-NAS had the best overall classification performance, capable of distinguishing between trains and railways, fully recognizing major thoroughfares and buildings, and exhibiting minimal noise. Overall, TUH-NAS was better at accurately identifying the gaps between aligned roads and at depicting road edges. The identification of paved parking lots and cars parked in those lots was also quite accurate. Hyt-NAS also performed well in this aspect, while other classification methods showed poorer results, especially in identifying parking lots in the upper half of the map. TUH-NAS demonstrated the best recognition performance among all methods, with the highest level of discernibility regarding the train on the right side of the image.
The comparative results for XiongAn can be seen in
Table 8 and
Table 9, as well as
Figure 8. The XiongAn dataset includes 20 target categories, primarily consisting of various trees and cultivated crops. The boundaries of the major land cover types are quite distinct, and areas of overlapping mixtures occupy a relatively small portion of the entire map. The localized zoomed-in images we extracted mainly include willow (blue), Chinese ash trees (olive green), and lawns (red). The results from SpectralFormer, SSFTT, and GAHT exhibit considerable noise and fail to delineate the boundaries of these three categories clearly. The 3D-ANAS method can identify the boundaries of Chinese ash trees and lawns reasonably well, though with some noise; however, it struggles significantly with the classification of willows. Hyt-NAS can fairly identify the boundaries of all three land cover types, but there are considerable errors in classifying the lawns. Our proposed method, TUH-NAS, accurately identifies all three land cover types with precise boundary delineation and minimal noise; its classification performance on the lawn class is the best among all methods. Overall, TUH-NAS and Hyt-NAS show relatively strong classification results, while the other methods exhibit considerable discrepancies in accuracy. In the left half of the map, Chinese scholar trees (purple), lawns (red), and willows (blue) show significant noise in methods other than TUH-NAS and Hyt-NAS, which fail to fully express the boundaries of each area. Hyt-NAS performs well, but there are evident misclassification regions within the Chinese scholar tree and lawn areas. In contrast, TUH-NAS provides a more accurate classification result, fully reflecting the boundary information of each classified land cover.
Table 8 and
Table 9 show that with both 20 and 30 training samples per category, our proposed TUH-NAS obtained superior classification results across the majority of categories. With 20 training samples per category, TUH-NAS’s OA, AA, and Kappa values are 84.36%, 91.11%, and 82.23%, respectively. With 30 training samples per category, the advantages of TUH-NAS become even more pronounced, achieving OA, AA, and Kappa values of 88.95%, 93.14%, and 87.39%, respectively.
4.4. Ablation Study
To gain a more profound understanding of the specific impact of each component of the TUH-NAS model on the performance of HSI classification, we embarked on a series of ablation experiments utilizing the Houston 2018 dataset. In these experiments, we selected 30 training samples for each category and meticulously analyzed the effects of individually removing the HSIAM and the TFLF loss function from the TUH-NAS network to assess their contributions to the overall classification efficacy and validate their effectiveness. We established four distinct group configurations. Groups G1, G2, and G3 primarily investigate the roles played by the three constituents of the loss function, while Group G4 focuses on the significance of the attention module, HSIAM.
G1: Removed the HSIAM and the TFLF loss function. The loss function used was CE loss.
G2: Removed the HSIAM and the TFLF loss function. The loss functions used were CE loss + dice loss.
G3: Removed the HSIAM. The loss function used was the complete TFLF loss function.
G4: Included the HSIAM; the loss function used was CE loss instead of TFLF.
Additionally, we introduced the results from the complete form of TUH-NAS as a reference. The contributions of different modules to the classification task are shown in
Table 10.
By comparing the OA, AA, and Kappa coefficients under different configurations, we reached the following conclusions. Simultaneously removing the attention mechanism module HSIAM and the TFLF loss function and replacing the latter with cross-entropy loss resulted in a significant decline in classification performance: the OA, AA, and Kappa values decreased by approximately 4.6%, 1.3%, and 5.1%, respectively. The three-unit mixed search architecture alone achieved a 1% performance improvement over Hyt-NAS. After adding the HSIAM to the mixed search architecture, there was a noticeable enhancement in classification performance, with OA, AA, and Kappa values increasing by about 3.4%, 0.2%, and 3.7%, respectively. This demonstrates that the HSIAM enhances the feature representation capability of the input feature maps across the spectral, spatial, and channel dimensions, significantly improving the model’s classification accuracy. Moreover, compared to using the cross-entropy loss function alone, introducing TFLF as the loss function further improves classification performance, validating the effectiveness of TFLF in handling HSI classification tasks, as it better addresses issues such as boundary recognition in land cover.
4.5. Architecture Analysis
The final architectures searched by TUH-NAS are shown in
Figure 9. Due to the different spectral and spatial resolutions and the distribution of ground objects across the three datasets, we searched for architectures separately on each dataset. Although these architectures have some unique features in different datasets, they share commonalities in terms of operation types and quantities, which reveal TUH-NAS’s preferences and advantages in dealing with HSI classification tasks. The three architectural diagrams show that the newly introduced dilated convolutions are extensively utilized. In the three datasets, the dilated convolution operation accounts for approximately 25% of the total operations. Additionally, the architecture diagrams for all three datasets reflect that operations related to spectral processing units dominate the final architectures. In the MUUFL and Houston2018 datasets, spectral processing operations account for over 60%, while in the XiongAn dataset, this proportion is close to 50%. These findings indicate that TUH-NAS tends to utilize operations that effectively handle spectral information in HSI classification tasks, while also integrating modern convolution operations and attention mechanisms. This combination demonstrates good performance and adaptability across different datasets.
4.6. Comparison of Sensitivity with Different Numbers of Training Samples
To verify the sensitivity to different training sample sizes, we designed the following experiment using the Houston2018 dataset as the experimental dataset, which contains 20 different categories. We chose the number of samples for each category to be 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 and tested the model training results in terms of OA, AA, and Kappa for each sample size. We limited the maximum number of training iterations to 100,000 to save time. Other settings remained consistent with the previous experimental scheme. The final experimental results are shown in
Table 11. We represented the data in
Table 11 using a line chart, as shown in
Figure 10.
From
Table 11 and
Figure 10, we can analyze the sensitivity of the experimental results to different sample sizes:
1. Overall Accuracy (OA): The OA increases consistently as the number of training samples per class increases from 10 to 100. This suggests that a larger training sample size improves the model’s overall performance on the classification task, likely because more data provides better representation and variability. There is a noticeable jump from 57.93% OA at 10 samples to 73.00% at 20 samples, indicating that even a modest increase in sample size can lead to significant gains in performance.
2. Average Accuracy (AA): The AA also shows a steady increase with the number of samples, starting at 79.49% with 10 samples and approaching 92.99% with 100 samples. This indicates a strong sensitivity of average class accuracy to the increase in training samples, which might reflect better differentiation between classes as more examples are available to learn from.
3. Kappa Index (K × 100): The Kappa index also increases as the sample size grows, from 50.84 at 10 samples to 82.32 at 100 samples. Kappa measures the agreement between predicted classifications and actual classes while accounting for chance agreement. The substantial increase in Kappa with sample size indicates improved model reliability and consistency in classifications.
Overall, the consistent improvements in OA, AA, and Kappa with increased sample sizes reflect the sensitivity of the classification model’s performance to the quantity of training data available. It suggests that larger sample sizes lead to better performance metrics in HSI classification tasks, likely due to enhanced model training and reduced overfitting. This analysis underscores the importance of adequate sample sizes in machine learning tasks to achieve robust model performance.
4.7. About Training Time and Number of Model Parameters
We have recorded the average execution time of six comparison methods on the same device across three datasets, as shown in
Table 12. We also obtained the number of trainable parameters in the actual models for six different methods when processing the three datasets, as shown in
Table 13. For NAS networks, the execution time is the sum of the time spent searching the network architecture and the time spent on model optimization training.
From the average execution times shown in
Table 12, there are significant differences in the running times of different deep learning models on the hyperspectral classification task. The three Transformer-based methods, SpectralFormer, SSFTT, and GAHT, demonstrate significantly lower average execution times across the three datasets compared to the three NAS-based methods. However, running time is not the only metric for assessing model performance. Despite the high efficiency of SSFTT and GAHT, they do not achieve the same classification accuracy as the NAS-based models. Among the three NAS-based methods, although TUH-NAS has the highest model complexity, its average execution time is not the highest, thanks to our optimization of the code and the use of mixed precision during the architecture search.
From
Table 13, we can observe clear differences in the model parameters of the three Transformer-based methods across different datasets, while the model parameters of the three NAS-based methods remain relatively stable. This is due to the varying architectural designs and module complexities. The parameter count of Transformer-based models is typically closely related to the number of layers, the number of neurons per layer, and the implementation of the self-attention mechanism. Different models may adopt different layer encoding mechanisms, numbers of attention heads, and other hyperparameters, leading to significant variations in parameter counts for the same task. In contrast, NAS-based methods identify optimal model structures through an automated NAS process, which tends to be more standardized and uniform. As a result, the generated models exhibit a more consistent number of parameters, and the searched and selected models usually strike a good balance between complexity and performance. This characteristic of architectural automation allows NAS models to maintain relatively consistent parameter scales when facing different datasets.
Overall, TUH-NAS presents itself as a viable model choice for HSI classification tasks, with a moderate and relatively stable number of parameters alongside good execution time performance. It is suitable for practical applications where resource consumption and classification performance need to be considered comprehensively.