1. Introduction
Hyperspectral images (HSIs) are generated by imaging spectrometers mounted on various space platforms, capturing spatial–spectral information [1]. The advancement of hyperspectral imaging technology has enabled sensors to capture hundreds of continuous spectral bands with nanometer-level resolution [2,3]. As a result, HSIs have found numerous applications in diverse fields, including environmental monitoring [4,5,6], hyperspectral anomaly detection [7], and hyperspectral image classification [8]. HSI classification, in particular, serves as a fundamental technique in many hyperspectral remote sensing applications, and it has proven to be invaluable in precision agriculture [9], geological exploration [10], and other domains [11,12].
The task of HSI classification has mainly been tackled by two schemes: one based on hand-crafted feature extraction and the other on learning-based feature extraction. In the early phase of HSI classification, conventional machine learning was the standard strategy for extracting spectral or spatial features. For instance, Yang and Qian [13] introduced a novel approach for hyperspectral image classification called multiscale joint collaborative representation with a locally adaptive dictionary, which constrains the adverse impact of uninformative pixels on HSI classification. Camps-Valls et al. [14] proposed and validated the effectiveness of composite kernels, a technique that combines multiple kernel functions to improve hyperspectral image classification performance. A succession of other hand-crafted approaches followed. One prominent method is the joint sparse model with discontinuity preserving relaxation [15], which preprocesses each pixel and calculates relevant statistical measures, aiming to elegantly integrate spatial context and spectral signatures. Similarly, sparse self-representation [16] addresses band selection through optimization-based sparse self-representation, optimizing for efficient feature selection. To improve classification accuracy, researchers have proposed fusing the correlation coefficient with sparse representation [17], aiming to harness the strengths of both methods. Additionally, multiscale superpixels and guided filters [18] have been explored for sparse-representation-based hyperspectral image classification, promising effective feature extraction and classification. Boltzmann-entropy-based unsupervised band selection [19] has also been investigated, targeting informative band selection to enhance classification performance.
However, the classification processes based on the abovementioned approaches are relatively cumbersome because they rely on manually extracted features. Furthermore, faced with the inherent high-dimensional complexity of HSIs, researchers find it hard to obtain ideal classification using these approaches alone, especially in challenging scenes [20]. Regrettably, the scarcity of labeled samples in the HSI field stands in sharp contrast to the richness of spectral data. This poses a challenge for learning good feature representations and makes methods prone to overfitting. In view of these problems, several schemes have been proposed to alleviate them, mainly including feature extraction [21,22,23], dimension reduction [24,25], and data augmentation [26].
Deep learning has emerged as a powerful method for feature extraction, enabling the identification of features from hyperspectral images (HSIs). Among various deep learning models, the convolutional neural network (CNN) stands out as one of the most widely applied models for addressing HSI classification challenges. CNNs have shown remarkable performance gains over conventional hand-designed features. CNN models are capable of processing spatial HSI patches as data inputs, leading to the development of progressive CNN-based methods that leverage both spectral and spatial features. For instance, Mei et al. [27] proposed a CNN model that adopts a pixel-wise approach and involves preprocessing each pixel by calculating the mean and standard deviation of the pixel neighborhood for each spectral band. On the other hand, Paoletti et al. [28] and Li et al. [29] introduced two distinct CNN models, one for extracting spatial features and the other for extracting spectral features. These models utilize a softmax classifier to achieve desirable classification results. While CNN-based technologies can effectively extract spatial and spectral features for HSI classification and other applications, they still encounter challenges in effectively utilizing information about spatial–spectral associations.
In contrast, technologies that learn spatial and spectral features simultaneously for HSI classification have been proposed. Yang et al. [30] proposed a multiscale wavelet 3D-CNN (MW-3D-CNN), which attaches great importance to the relational information among adjacent channels but substantially increases the model's computational complexity. Accordingly, Roy et al. [31] proposed a 3D–2D CNN feature hierarchy model for HSI classification: a few 3D-CNN layers extract spectral information across spectral bands, while 2D-CNN layers concentrate on spatial texture and context. Liu et al. [32] extended the CNN model by incorporating an attention mechanism to enhance feature extraction from HSIs. More recently, Zhong et al. [33] introduced the spectral–spatial residual network (SSRN), whose designed residual blocks use identity mapping to connect 3D convolutional layers. These innovative models and technologies address the limitations of previous approaches and effectively leverage spectral and spatial features to significantly improve HSI classification performance.
In recent times, Transformer architectures [34] have become a new paradigm in the field of natural language processing (NLP). The Transformer introduces the attention mechanism, which allows for interactions between different tokens, capturing long-range semantic dependencies. Inspired by this success, researchers have extended these methods to the field of computer vision (CV). Similar to the Transformer in NLP, the Vision Transformer (ViT) [35] is typically pretrained on unlabeled image or video streams and then fine-tuned on downstream tasks. In the domain of remote sensing imagery, more researchers are adopting ViT or its variants [36] as foundational models for their studies. Liu et al. [37] utilized a customized Swin Transformer to reduce computational complexity and obtained strong generalization. Ayas and Tunc-Gormus [38] introduced a novel spectral Swin Transformer (SpectralSWIN) network, which hierarchically fuses spatial–spectral features without using an attention mechanism and achieves significant improvements. Zhao et al. [39] proposed a spectral–spatial axial aggregation Transformer framework that performs multiscale feature extraction and fusion on the input data while utilizing spectral shift operations to ensure information aggregation and feature extraction across different spectral components.
With regard to HSI classification, many loss functions have been designed and have achieved impressive performance. The multiclass hinge (MCH) loss in the support vector machine (SVM), one of the traditional classification strategies, was the main solution to HSI classification in the early phase. Wang et al. [40] introduced a novel classification framework for hyperspectral images incorporating spatial, spectral, and hierarchical structure features. The approach integrates these three different and important types of information into the SVM classifier; by leveraging this joint integration, the proposed framework intends to enhance classification performance. Furthermore, cross-entropy loss, as a standard metric, is widely applied to different classification tasks in hyperspectral image recognition, such as [41,42]. Regrettably, the common and publicly available datasets Indian Pines (IP), Salinas (SA), and University of Pavia (UP), as well as the overwhelming majority of other hyperspectral imagery datasets, are class-imbalanced [43,44] (see Section 2.2). This fact poses great challenges for deep-learning-based HSI classification models regarding how to handle the class imbalance problem [45].
In this paper, we propose a deep learning approach that exploits spatial–spectral features for the HSI classification task. First, to extract more vital HSI features, including spatial and spectral information, we propose a novel convolutional neural network (FE-HybridSN) that attends to the correlation of adjacent bands in HSIs. Compared with a plain 3D-CNN model for HSI classification, the FE-HybridSN alleviates the burden of complex computation and yields better classification performance when the two models have a similar structural scale. Second, to cope with the class imbalance problem in HSI classification, we adopt the focal loss as the loss function; extensive experiments confirm that it delivers the expected classification gains. In summary, the main contributions of this paper are as follows:
We design a novel five-layer feature extractor, FE-HybridSN, that mines spatial context and spectral features of hyperspectral images;
We apply the focal loss as the loss function to alleviate the class imbalance problem in the HSI classification task;
We explore feature learning and classification of hyperspectral images through systematic experiments, inspiring new deep learning ideas for hyperspectral applications.
The paper is structured as follows. Section 2 outlines the challenges and presents our proposed approach to address them. In Section 3, we provide a detailed description of the experimental background, present the fundamental experimental configurations, and compare our method with other state-of-the-art approaches. Building upon the results presented in Section 3, Section 4 offers a comprehensive discussion to provide insights and interpretations. Finally, in Section 5, we summarize the findings of the paper and propose promising research directions for future exploration.
2. Methodology
We present a novel joint network called FE-HybridSN, which combines a hybrid 3D–2D CNN architecture with the focal loss to achieve hyperspectral image classification. The proposed approach is designed to capitalize on spectral–spatial feature maps and extract more abstract representations from hierarchical space, as depicted in Figure 1. By leveraging the hybrid 3D–2D CNN model, we aim to effectively mine valuable information from the hyperspectral data. Additionally, we utilize the focal loss to mitigate the adverse effects of class imbalance on hyperspectral image classification, as illustrated in Figure 1. The focal loss helps in handling challenging samples, thereby enhancing the overall performance of the classification process.
2.1. Proposed Model
Assume the spatial–spectral hyperspectral data cube is denoted by $\mathcal{X} \in \mathbb{R}^{H \times W \times B}$, where H, W, and B represent the height, width, and the number of spectral bands in the hyperspectral image (HSI), respectively. Each pixel in $\mathcal{X}$ is associated with a one-hot label vector $\mathbf{y} \in \mathbb{R}^{1 \times 1 \times C}$, where C signifies the number of land cover classes. However, a real-world challenge arises from the fact that high-dimensional hyperspectral pixels may exhibit mixtures of multiple land cover classes, leading to substantial spectral intraclass variability and significant interclass similarity.
To address this issue, we employ principal component analysis (PCA) to preprocess the original HSI data and eliminate redundant spectral information. PCA reduces the original B spectral bands to K bands while preserving the spatial dimensions, thus retaining the spatial information even after the PCA process. Consequently, we represent the processed spatial–spectral hyperspectral data cube as $\mathcal{X}_{\mathrm{PCA}} \in \mathbb{R}^{H \times W \times K}$. Here, H and W still represent the spatial height and width, and K indicates the number of spectral bands retained after PCA processing.
During the data preprocessing stage, besides reducing the dimensionality of the hyperspectral data using PCA, it is also necessary to segment the image into small, overlapping 3D blocks so that our deep learning method can be applied to each block. Each adjacent 3D block originates from the PCA-reduced hyperspectral data cube, and each is uniquely identified by its central spatial coordinates. We denote a 3D block as $\mathcal{P} \in \mathbb{R}^{s \times s \times K}$, where s represents the size of the spatial window and K denotes the depth of spectral bands in each block; notably, K also corresponds to the number of bands retained after PCA dimensionality reduction. Ultimately, the PCA-reduced hyperspectral data cube generates $(H - s + 1) \times (W - s + 1)$ 3D blocks (before any edge padding). Specifically, for a hyperspectral 3D block with central spatial coordinates $(\alpha, \beta)$, the corresponding spatial window covers the width range $[\alpha - (s-1)/2,\, \alpha + (s-1)/2]$ and the height range $[\beta - (s-1)/2,\, \beta + (s-1)/2]$.
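To make this preprocessing concrete, the following is a minimal sketch (not the authors' released code) of PCA reduction and overlapping 3D block extraction; the function names and the reflect-padding choice are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(cube: np.ndarray, num_components: int) -> np.ndarray:
    """Reduce an (H, W, B) cube to (H, W, K) with K = num_components."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b)                        # (H*W, B) pixel matrix
    reduced = PCA(n_components=num_components).fit_transform(flat)
    return reduced.reshape(h, w, num_components)

def extract_blocks(cube: np.ndarray, window_size: int):
    """Yield one s x s x K block centered on every spatial position."""
    margin = (window_size - 1) // 2
    # Pad spatially so border pixels also receive a full window.
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)),
                    mode="reflect")
    h, w, _ = cube.shape
    for row in range(h):
        for col in range(w):
            yield padded[row:row + window_size, col:col + window_size, :]
```

With reflect padding, every pixel yields a block, so the count grows from $(H - s + 1)(W - s + 1)$ to $H \times W$; either convention is compatible with the model described below.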
We propose a framework named FEHN-FL with a hierarchical convolutional structure for HSI classification. The CNN parameters are trained using supervised methods [46] with gradient descent optimization. Conventional 2D CNNs compute 2D discriminative feature maps by applying convolutions solely over the spatial dimensions, so they cannot explicitly model spectral information. In contrast, a 3D CNN slides its convolutional kernel along all three dimensions (height, width, and spectral) and interacts with both the spatial and spectral dimensions, allowing it to comprehensively capture spatial and spectral information from the high-dimensional hyperspectral data. FE-HybridSN, the backbone of the FEHN-FL, hierarchically combines three 3D convolutions (refer to Equation (2)), two 2D convolutions (refer to Equation (1)), and three fully connected layers, achieving a balanced integration of spectral and spatial information for more effective HSI classification.
In a 2D CNN, outputs are generated through convolution: the input image is convolved with 2D filters of predesigned size, also known as convolutional kernels. This operation involves element-wise multiplication between the filter's weights and the pixels in the input image, followed by summation to obtain the new output pixel value. The convolution slides the filter across the entire input image, computing the output value at each position. By utilizing distinct filters at different layers, a 2D CNN can effectively learn various features present in the image. These learned features are then combined to form higher-level representations, enabling the model to perform image classification and feature extraction tasks. The resulting features from the convolution are passed through an activation function, introducing nonlinearity into the model. In 2D convolution, the activation value of the jth feature map at spatial position $(x, y)$ in the ith layer is represented as $v_{i,j}^{x,y}$ and can be expressed by the following equation:
$$v_{i,j}^{x,y} = \phi\Bigg(b_{i,j} + \sum_{m=1}^{d_{i-1}} \sum_{\rho=-\gamma}^{\gamma} \sum_{\sigma=-\delta}^{\delta} w_{i,j,m}^{\sigma,\rho}\, v_{i-1,m}^{x+\sigma,\, y+\rho}\Bigg), \qquad (1)$$
where $\phi(\cdot)$ is the nonlinear activation function, $b_{i,j}$ is the bias parameter for the jth feature map of the ith layer, and $d_{i-1}$ indicates the number of feature maps in the $(i-1)$th layer. The size of the predesigned convolutional kernel is $(2\gamma + 1) \times (2\delta + 1)$. $w_{i,j,m}^{\sigma,\rho}$ corresponds to the weight parameter for the jth feature map of the ith layer.
According to the definition of three-dimensional convolution [47], we perform convolutional operations by applying 3D convolutional kernels to the hyperspectral images. In the FEHN-FL, the feature maps in a convolutional layer are generated by applying 3D convolutional kernels over discrete or consecutive spectral bands of the input layer, thus capturing spectral information. In 3D convolution, the activation value $v_{i,j}^{x,y,z}$ of the jth feature map at position $(x, y, z)$ in the ith layer is expressed as follows:
$$v_{i,j}^{x,y,z} = \phi\Bigg(b_{i,j} + \sum_{m=1}^{d_{i-1}} \sum_{\lambda=-\eta}^{\eta} \sum_{\rho=-\gamma}^{\gamma} \sum_{\sigma=-\delta}^{\delta} w_{i,j,m}^{\sigma,\rho,\lambda}\, v_{i-1,m}^{x+\sigma,\, y+\rho,\, z+\lambda}\Bigg), \qquad (2)$$
where $2\eta + 1$ is the depth of the kernel along the spectral dimension and the other parameters are the same as in Equation (1).
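The practical difference between Equations (1) and (2) is easiest to see from output shapes. The following PyTorch snippet is an illustrative sketch; the channel counts and window size are hypothetical and do not reflect the exact configuration of FE-HybridSN:

```python
import torch
import torch.nn as nn

# 2D convolution: the kernel slides only over the spatial axes, and the
# spectral bands are folded into the channel dimension, so interband
# structure is not modeled explicitly.
conv2d = nn.Conv2d(in_channels=30, out_channels=8, kernel_size=3)
x2d = torch.randn(1, 30, 25, 25)       # (batch, K bands as channels, H, W)
print(conv2d(x2d).shape)               # torch.Size([1, 8, 23, 23])

# 3D convolution: the kernel also slides along the spectral axis, so its
# receptive field spans neighboring bands (Equation (2)) and captures
# spectral-spatial correlations jointly.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(7, 3, 3))
x3d = torch.randn(1, 1, 30, 25, 25)    # (batch, 1, K bands, H, W)
print(conv3d(x3d).shape)               # torch.Size([1, 8, 24, 23, 23])
```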
In the FE-HybridSN, we design different 3D convolution kernels. We denote the structure of a 3D kernel as $K^{3\mathrm{D}}_{u}(L)$, where u and L, respectively, represent the layer index of the current kernel and the number of output channels of the current convolutional layer. Moreover, we denote the structure of a 2D kernel as $K^{2\mathrm{D}}_{d}(L)$, where d represents the layer index of the current kernel and L has the same meaning as for the 3D kernel. One of our designed kernel sequences is $\big(K^{3\mathrm{D}}_{1}(8),\, K^{3\mathrm{D}}_{2},\, K^{3\mathrm{D}}_{3},\, K^{2\mathrm{D}}_{4},\, K^{2\mathrm{D}}_{5}\big)$. Here, $K^{3\mathrm{D}}_{1}(8)$ means that the first 3D kernel produces eight output channels when a single channel from $\mathcal{X}_{\mathrm{PCA}}$ is the first input; the subsequent kernel parameters are interpreted in the same way, and the concrete kernel sizes are given in Table 1. The proposed model is comprehensively summarized in Table 1, presenting the layer index, output map dimensions, and the corresponding number of parameters.
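For illustration, the following is a minimal PyTorch sketch of a five-layer hybrid 3D–2D backbone in the spirit of FE-HybridSN. Apart from the eight output channels of the first 3D kernel stated above, the kernel sizes, channel counts, and dropout rate are assumptions chosen for demonstration, not the exact configuration of Table 1:

```python
import torch
import torch.nn as nn

class HybridBackboneSketch(nn.Module):
    """Three 3D convs -> two 2D convs -> three FC layers (illustrative sizes)."""
    def __init__(self, num_bands: int = 30, window: int = 25, num_classes: int = 16):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3)), nn.BatchNorm3d(8), nn.ReLU(),
            nn.Conv3d(8, 16, kernel_size=(5, 3, 3)), nn.BatchNorm3d(16), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=(3, 3, 3)), nn.BatchNorm3d(32), nn.ReLU(),
        )
        depth = num_bands - 7 - 5 - 3 + 3        # spectral depth after 3D stage
        side = window - 3 * 2                    # spatial size after 3D stage
        self.conv2d = nn.Sequential(
            nn.Conv2d(32 * depth, 64, kernel_size=3), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3), nn.BatchNorm2d(128), nn.ReLU(),
        )
        flat = 128 * (side - 4) ** 2
        self.classifier = nn.Sequential(
            nn.Linear(flat, 256), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                        # x: (batch, 1, K, s, s)
        x = self.conv3d(x)
        b, c, d, h, w = x.shape
        x = x.reshape(b, c * d, h, w)            # fold spectral depth into channels
        x = self.conv2d(x)
        return self.classifier(x.flatten(1))
```

The reshape between the 3D and 2D stages folds the remaining spectral depth into the channel dimension, which is what allows the 2D layers to concentrate on spatial texture after the 3D layers have aggregated interband information.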
In the proposed method, batch normalization (BN) layers are introduced as important elements, since they keep the distribution of the input data in every layer of the network relatively stable and accelerate the training of the entire framework. The formula of BN is as follows:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta, \qquad (3)$$
where $\mu$ is the mean of the inputs within a mini-batch, $\sigma^2$ is their variance, $\gamma$ and $\beta$ are learnable parameter vectors, and $\epsilon$ is a small constant for numerical stability. In addition, the nonlinear layer aims to add nonlinearity to the network; notably, the rectified linear unit (ReLU) [48] is adopted after each 3D/2D convolutional layer.
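As a quick sanity check of Equation (3), the toy snippet below reproduces PyTorch's batch normalization by computing the per-channel statistics manually; it assumes the default initialization $\gamma = 1$, $\beta = 0$:

```python
import torch
import torch.nn as nn

# Toy check of Equation (3): manual normalization matches nn.BatchNorm3d
# at its default initialization (gamma = 1, beta = 0) in training mode.
x = torch.randn(4, 8, 6, 5, 5)                  # (batch, channels, bands, H, W)
bn = nn.BatchNorm3d(8).train()
out = bn(x)

mu = x.mean(dim=(0, 2, 3, 4), keepdim=True)     # per-channel mean
var = x.var(dim=(0, 2, 3, 4), unbiased=False, keepdim=True)  # biased variance
x_hat = (x - mu) / torch.sqrt(var + bn.eps)
print(torch.allclose(out, x_hat, atol=1e-5))    # True
```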
2.2. Focal Loss
The class imbalance problem commonly occurs in tasks with a long-tailed data distribution, where a few classes dominate the data while most classes have very few samples. In traditional classification and visual recognition tasks, the training distribution can be balanced through manual intervention using resampling strategies, ensuring that the number of samples from different classes does not differ significantly. However, as the number of categories increases, maintaining balance across all categories becomes increasingly challenging, causing the required sample collection to grow exponentially.
In the case of HSI classification, resampling is neither a feasible nor a rational solution, yet leaving the imbalance unaddressed is equally problematic. Hence, we employ the focal loss as the loss function. Compared to traditional loss functions such as the multiclass hinge loss and cross-entropy loss, the focal loss provides better performance and tackles the class imbalance issue effectively.
2.2.1. Balanced Cross Entropy
Introducing weighting factors $\alpha \in [0, 1]$ for the positive class and $1 - \alpha$ for the negative class is a common approach to addressing class imbalance. In practice, $\alpha$ is often initialized as the inverse class frequency, i.e., the reciprocal of the ratio of positive-class samples to negative-class samples. The purpose is to assign larger weights to the classes with fewer samples during the training phase, aiming to balance the class distribution. We denote the $\alpha$-balanced cross-entropy (CE) loss as
$$\mathrm{CE}(p_t) = -\alpha_t \log(p_t), \qquad (4)$$
where $p_t$ is expressed as follows:
$$p_t = \begin{cases} p & \text{if } y = 1, \\ 1 - p & \text{otherwise}. \end{cases} \qquad (5)$$
Here, $\alpha_t$ is defined analogously to $p_t$, and $y = -1$ and $y = +1$ indicate the negative and positive class, respectively.
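A minimal sketch of the $\alpha$-balanced binary CE of Equations (4) and (5); for implementation convenience it encodes labels as $y \in \{0, 1\}$ instead of $\pm 1$, and the function name is hypothetical:

```python
import torch

def balanced_ce(p: torch.Tensor, y: torch.Tensor, alpha: float = 0.25) -> torch.Tensor:
    """Alpha-balanced CE; p holds predicted positive-class probabilities."""
    p_t = torch.where(y == 1, p, 1 - p)        # Equation (5)
    alpha_t = torch.full_like(p, 1 - alpha)    # weight for the negative class
    alpha_t[y == 1] = alpha                    # weight for the positive class
    return (-alpha_t * torch.log(p_t)).mean()  # Equation (4), batch-averaged
```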
2.2.2. Focal Loss Definition
Easily classified negatives comprise the majority of the loss and dominate the gradient [49]. While $\alpha_t$ balances the importance of positive/negative examples, it does not differentiate between easy and hard examples. The focal loss reshapes the balanced cross-entropy loss to down-weight easy examples and thus focus training on hard negatives. The focal loss is defined as follows:
$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t), \qquad (6)$$
where $\gamma \geq 0$ is a tunable focusing parameter. Formally, we use an $\alpha$-balanced variant of the focal loss as the loss function:
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t). \qquad (7)$$
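The following is a self-contained PyTorch sketch of the $\alpha$-balanced focal loss of Equation (7), written here for the multiclass HSI setting over softmax outputs; the class name and the defaults $\gamma = 2$ and $\alpha = 0.25$ follow the common practice of [49] and are assumptions rather than the paper's exact settings (in the multiclass case, a scalar $\alpha$ is a simplification of the per-class $\alpha_t$):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Multiclass alpha-balanced focal loss (Equation (7))."""
    def __init__(self, gamma: float = 2.0, alpha: float = 0.25):
        super().__init__()
        self.gamma, self.alpha = gamma, alpha

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # log(p_t): log-probability of each sample's true class.
        log_p = F.log_softmax(logits, dim=1)
        log_p_t = log_p.gather(1, target.unsqueeze(1)).squeeze(1)
        p_t = log_p_t.exp()
        # Easy samples (p_t near 1) are down-weighted by (1 - p_t) ** gamma.
        return (-self.alpha * (1.0 - p_t) ** self.gamma * log_p_t).mean()

# Example usage with a batch of 32 samples and 16 land cover classes.
criterion = FocalLoss()
loss = criterion(torch.randn(32, 16), torch.randint(0, 16, (32,)))
```

To see the reweighting numerically: with $\gamma = 2$, a well-classified sample with $p_t = 0.9$ keeps only $(0.1)^2 = 1\%$ of its cross-entropy loss, while a hard sample with $p_t = 0.1$ keeps $(0.9)^2 = 81\%$, which is precisely what focuses training on hard examples.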
4. Discussion
After analyzing the results of Experiment 1 (refer to Section 3.3.1), several key observations can be made. Firstly, the FE-HybridSN, combined with cross-entropy loss, demonstrates high classification accuracy. When training with 15% and 5% labeled samples using different spatial sizes on the corresponding IP and UP datasets, our method outperforms other state-of-the-art techniques in terms of classification accuracy. Specifically, utilizing the CE loss as the terminal loss function and following the basic experimental configurations, the FE-HybridSN consistently achieves superior pixel-level classification results compared to 2D-CNN [53], 3D-CNN [54], 3D2D-HybridSN [31], and M3D-CNN [55].
In addition, Table 4 and Table 5 highlight the significant impact of spatial size on HSI classification accuracy. Notably, the classification accuracies obtained with the two smaller patch sizes are consistently lower than those achieved with the largest patch size across all methods in Table 4. Similarly, Table 5 illustrates that increasing the patch size leads to improved classification accuracy for the FE-HybridSN framework utilizing cross-entropy loss on the UP dataset. This finding emphasizes the crucial role of spatial size in HSI classification and its ability to adjust the decision boundary. Specifically, a smaller spatial size leads to a greater loss of information, resulting in lower classification accuracy, as evidenced by the experimental results. Furthermore, Figure 2 provides a visual representation of the relationship between spatial size and classification accuracy, further supporting these conclusions.
Based on the results in Table 6 of Experiment 2 (refer to Section 3.3.2), it is evident that our proposed feature extractor, FE-HybridSN, achieves significantly better classification accuracy on both the SV and UP datasets. On one hand, across training sets with different proportions of labeled data, the classification accuracy decreases as the number of training samples decreases: with a small training set, the risk of overfitting increases, and the resulting classification decision boundaries may lack robust generalization ability, particularly when facing new samples such as test samples. On the other hand, at the same training ratio, the OA on the SV dataset is noticeably higher than the classification accuracy on the UP dataset, primarily because the SV dataset has a significantly larger training scale than the UP dataset.
In Experiment 3 (refer to Section 3.3.3), we considered the use of the focal loss as the terminal loss function. Table 7 contains a wealth of classification accuracy results for both the SV and UP datasets, making it highly valuable for analyzing the performance of various hyperspectral classification methods. Firstly, when considering the overall classification results, the SV dataset consistently outperforms the UP dataset for the same classification methods. This disparity can primarily be attributed to the larger scale of labeled training samples available in the SV dataset, which facilitates the construction of a more effective feature space. Similarly, when comparing different training ratios within the same dataset, the same trend persists. Secondly, it is evident that the 2D-CNN method, which solely focuses on spatial features and disregards spectral information, exhibits lower classification accuracy than the other methods. This observation underscores the necessity of leveraging spectral–spatial features for robust hyperspectral classification. Lastly, when comparing our proposed FEHN-FL framework with the other methods, we consistently achieve superior classification accuracy on both the UP and SV datasets, irrespective of the training ratio employed. This demonstrates the efficacy of the focal loss in addressing class imbalance within hyperspectral images and refining decision boundaries during classification. Furthermore, it is noteworthy that even without the focal loss as the final loss function, we are still able to attain commendable classification accuracy. For instance, when utilizing the FE-HybridSN+CE framework on the SV dataset with 15% training samples, we observe exceptional accuracy levels of OA (99.97%), AA (99.95%), and kappa (99.98%). To summarize Experiment 3, we provide a set of classification performance charts in Figure 3, Figure 4 and Figure 5, showcasing the results obtained on three distinct experimental datasets. Although Figure 5 shows that our proposed method did not improve classification accuracy on the SV dataset, this result was expected: the SV dataset inherently offers easily discriminable sample features and an ample number of labeled samples.
In the final experiment, we report the learnable parameter counts of the methods involved, which are closely related to time complexity. As shown in Table 8 and Table 9, although our method's learned parameter count increased to some extent compared with the other methods, its convergence speed in the training phase is similar to, or even faster than, that of the other methods.
5. Conclusions
This paper proposes a novel hierarchical deep neural network based on CNNs for efficient extraction of spectral–spatial features from high-dimensional hyperspectral images. In the proposed framework, the hierarchical convolutional structure reduces computational complexity while effectively capturing both channel and spatial texture information in the hyperspectral data. Specifically, the feature extractor FE-HybridSN consists of a three-layer 3D-CNN component dedicated to extracting spectral information across different channels, complemented by a two-layer 2D-CNN component focusing on spatial information within each channel. Compared with state-of-the-art methods, FE-HybridSN demonstrates competitive classification performance on widely used datasets such as IP, UP, and SV. Furthermore, we introduce the focal loss as the loss function, which effectively mitigates the problem of biased classification decision boundaries caused by long-tailed distributions.
In conclusion, our research contributes to the field of hyperspectral image analysis by proposing a comprehensive framework that combines feature extraction, dimensionality reduction, and classification techniques. The experimental results demonstrate the potential of these techniques in enhancing the accuracy and efficiency of hyperspectral image classification. Although our method provides efficient performance in HSI classification, there are still several unresolved challenges that may pose future limitations. Our ongoing research will focus on the following directions to address these challenges and further improve our approach:
Enhancing the model design to enable adaptive adjustment of decision boundaries based on different hyperspectral datasets. This will allow our method to better accommodate the unique characteristics and variations present in different datasets.
Exploring and integrating advanced data augmentation techniques to tackle the issue of limited sample sizes. By generating synthetic data and applying transformational operations, we can effectively expand the training dataset and improve the model’s generalization capability.
Investigating alternative strategies to mitigate the impact of spatial size during the convolutional process. This includes exploring methods such as multiscale feature extraction and attention mechanisms to capture both local and global spatial information more effectively.
By addressing these areas, we aim to overcome the current limitations and further enhance the robustness, adaptability, and overall performance of our approach in hyperspectral image classification.