1. Introduction
Hyperspectral imagery (HSI) has transformed remote sensing by combining spectroscopy with imaging technology. It captures spatial distribution information along with hundreds or even thousands of contiguous narrow spectral bands, offering a unique perspective on surface targets [1,2]. Owing to its ability to distinguish land-cover categories and different objects, HSI is widely used across many domains, including military defense, atmospheric science, urban planning, vegetation ecology, and environmental surveillance [3,4,5,6]. Despite these capabilities, HSI classification poses significant challenges, including interference from redundant spectral information, the limited availability of labeled samples, and high intra-class variability. Addressing these obstacles and advancing hyperspectral image classification techniques is key to realizing the full potential of HSI data analysis [7].
In HSI classification, traditional methods have relied predominantly on spectral information, often neglecting the effective integration of spatial data. Techniques such as band selection, sparse representation classifiers, and principal component analysis (PCA) [8] have been used to extract discriminative features, but they show notable limitations in feature extraction and robustness. PCA in particular, despite its widespread use, assumes linearity and orthogonality in the data distribution; these assumptions may not hold in complex real-world scenes, so subtle but relevant information crucial for accurate classification can be overlooked [9]. Similarly, traditional machine learning algorithms such as random forest [10] and the support vector machine (SVM) [11], while widely used, have limitations in handling non-linear data and complex decision boundaries. Random forest may struggle to capture intricate spectral–spatial relationships, while SVM performance may deteriorate in high-dimensional feature spaces. The advent of deep learning has spurred significant interest in HSI classification, particularly with convolutional neural networks (CNNs). Various CNN-based architectures have demonstrated remarkable performance by leveraging spectral and spatial information simultaneously [12]. Integrating these deep learning techniques is a promising avenue for improving classification accuracy in hyperspectral imagery analysis.
As noted above, the high dimensionality of HSI data makes dimensionality reduction a crucial preprocessing step for mitigating data redundancy and complexity. Two primary approaches exist in HSI classification: feature extraction and band selection. Feature extraction techniques such as kernel PCA (KPCA) [13], PCA, linear discriminant analysis (LDA) [14], and Fisher’s linear discriminant analysis (FLDA) [15] aim to optimize between-class separation, whether or not they leverage label information. Minimum noise fraction (MNF) [16], another commonly used technique, is effective in certain contexts but has limitations for HSI classification. MNF assumes that noise is uncorrelated across bands, which may not hold for HSI datasets characterized by complex spectral and spatial correlations, and it may not capture the subtle spectral variations crucial for accurate classification [17]. Band selection algorithms, in contrast, select informative band subsets directly from the original band space of HSIs, but they often struggle to capture the intricate spectral–spatial correlations inherent in HSI data. Popular feature selection methods such as chi-squared, select K best, and mutual information feature selection [18,19] have been applied, yet their efficacy for HSI classification may be limited for the same reason.
The study in [20] improved the conventional low-rank representation (LRR) method for HSI classification, presenting a locality-and-structure-regularized LRR approach that combines spectral and spatial data to analyze local pixel similarities. The authors of [21] presented a method for classifying HSI using spectral gradients. By utilizing spectral gradients in conjunction with the random forest approach, they efficiently captured both spatial information and spectral features. The spectral characteristics were then fused using multi-scale fusion so that support vector machines (SVMs) could be used for classification. Furthermore, the study in [22] introduced deep support vector machines (DSVMs) for HSI classification, outperforming other cutting-edge algorithms, including different versions of the conventional SVM. Spatial features also play a vital role in enhancing classification performance. To alleviate the spatial redundancy that frequently arises when using a regular 3D CNN with HSI, a 3D octave CNN was developed in [23].
In a separate investigation [
24], a novel technique incorporating 3D CNNs was introduced, integrating both spectral and spatial data to bolster classification accuracy. Similarly, in [
25], a model utilizing 3D CNNs was proposed for HSI classification, following a similar methodology. Initially, the HSI dataset is segmented into small, overlapping 3D patches and processed with a 3D kernel function across contiguous spectral bands to produce 3D feature maps. Subsequent research has seen a surge in studies employing either 2D CNNs or 3D CNNs for HSI analysis [
26]. However, utilizing 2D CNNs or 3D CNNs for HSI classification poses several challenges. While 2D CNN architectures excel in capturing spatial details, they often struggle to extract informative or distinguishing features from spectral dimensions. Conversely, although 3D CNNs are presumed to offer enhanced performance, they come with increased computational demands due to extensive 3D convolution operations [
27]. Deep 3D CNNs necessitate a larger dataset for training, yet publicly accessible HSI datasets provide limited samples. Additionally, numerous prevalent 3D CNN-based methods rely on stacked 3D convolutions, complicating the direct minimization of estimation loss using such nonlinear structures [
28]. To address these challenges, many researchers have proposed hybrid approaches. For instance, in [
29], a hybrid model termed HybridSN was introduced, merging 2D-CNN with 3D-CNN to effectively extract both spectral and spatial features from HSI data, leading to improved classification accuracy. In another study, the authors developed a wavelet-based 2D CNN named SpectralNET [
30] for extracting spectral–spatial features from HSI. They utilized a four-level wavelet decomposition with 2D convolutions by concatenating the upper levels with previously decomposed features and applying average pooling after convolution operations. However, due to the increased number of hidden layers during wavelet transformation, computational costs are slightly higher, and performance diminishes with fewer training samples.
In recent studies, researchers have recognized the benefits of multi-scale spatial features in enhancing the accuracy of semantic segmentation in regular RGB images [31]. Models such as PSPnet [32] and the Inception module have successfully fused features at different scales to capture detailed information and improve overall performance. Influenced by the dilated residual network (DRN) [33], the authors of [34] extended this notion by introducing spectral dilated convolutions (SDCs), aiming to enhance spectral coverage. In a different study, a novel Tri-CNN [
35] approach enabled the extraction of multi-scale spectral features, thereby enhancing the classification accuracy and performance of HSI analysis. It introduced a novel three-branch CNN architecture aimed at enhancing classification performance. However, it is worth noting some limitations. Firstly, the reliance on PCA-based dimensionality reduction may restrict the model’s ability to capture all relevant spectral information, potentially leading to the loss of discriminative features. Additionally, the three-branch convolution structure introduces complexity, as all branches are applied to the reduced data simultaneously. This approach may hinder effective feature extraction, as different segments of data could benefit from distinct processing strategies. Hence, exploring alternative methods wherein branches are applied to different data segments individually could potentially improve feature extraction and classification accuracy. By incorporating spectrally segmented dimensionality reduction techniques, the issue of optimization and reduced time costs associated with HSI analysis can be addressed. Partitioning the HSI data into subsets based on spectral characteristics allows for an effective reduction in high dimensionality while preserving informative features [
36]. This approach not only enhances the efficiency of subsequent classification processes but also improves the overall effectiveness of HSI analysis.
The proposed model introduces a comprehensive approach to enhance HSI classification, focusing on three key components aimed at improving accuracy and effectiveness. Firstly, we implement spectrally segmented dimensionality reduction through factor analysis (FA), leveraging the inherent strengths of FA for superior performance in handling the complex spectral characteristics of HSI data. Factor analysis excels in capturing the underlying structure of high-dimensional data, making it particularly well-suited for extracting informative features relevant to land cover classification. By segmenting the data and applying FA, we not only reduce dimensionality but also ensure the inclusion of top informative features from diverse segments, enhancing the model’s ability to capture crucial spectral nuances. Moving on to the segmentation-based multibranch CNNs, we recognize CNNs’ inherent capability as feature extractors. However, in scenarios with limited training samples, CNNs may struggle to extract features comprehensively from various perspectives. To address this limitation, we deploy a multibranch CNN architecture with distinct branches structured to extract spatial and spectral features from different data weights. This innovative approach enables the extraction of detailed features even with fewer training samples, enhancing the model’s robustness and adaptability to diverse HSI datasets. Lastly, the integration of support vector machines (SVMs) into the final classification stage leverages SVMs’ proficiency in creating non-linear decision boundaries, complementing CNNs’ feature extraction capabilities. SVMs are renowned for their robustness and effectiveness in handling complex data, making them an ideal choice for classification tasks. By combining CNNs with SVMs, we harness the strengths of both models, resulting in improved classification accuracy. 
Additionally, the incorporation of dropout as a regularization strategy mitigates overfitting and enhances the model’s generalization ability, ensuring reliable performance on novel instances.
Our proposed model, the spectrally segmented-enhanced neural network (SENN), represents a significant advancement in precision land cover classification. By synergistically combining spectrally segmented dimensionality reduction, segmentation-based multibranch CNNs, and CNN-SVM fusion, our model effectively addresses the inherent challenges of HSI data analysis. SENN offers a promising solution for diverse landscapes and scenarios, particularly in the context of agricultural land cover object classification, by retaining crucial features and integrating spectral–spatial data. Notably, the SENN’s primary innovation lies in its effective mitigation of dimensionality through spectrally segmented dimensionality reduction while maintaining crucial features, essential for handling the complexity of agricultural land cover object classification. Moreover, the fusion of CNNs and support vector machines (SVMs) enhances class differentiation, contributing to improved accuracy in classification tasks. Techniques such as dropout and early stopping are incorporated to alleviate overfitting issues, further enhancing the overall performance and robustness of our proposed model. The primary achievements of our study can be outlined as follows:
Spectrally segmented dimensionality reduction: The first objective is to partition the HSI bands into spectral segments and apply factor analysis (FA) to each segment. This reduces the data’s high dimensionality while preserving the most informative features of every segment, improving the efficiency and effectiveness of the subsequent classification process.
Parallel multi-resolution CNN model: The second objective is to implement a multi-resolution CNN model for classification, utilizing the extracted informative features. Utilizing the CNN architecture, this model will harness both spectral and spatial data inherent in hyperspectral imagery to acquire intricate characteristics, leading to precise classification outcomes for agricultural landscapes. By capitalizing on the hierarchical representations learned by the CNN model, the overall classification accuracy is expected to improve.
Final classification using SVM: CNNs excel at automatically extracting meaningful features from raw image data, enabling effective representation of complex information. In combination with SVMs, which create non-linear decision boundaries, class separation is enhanced, resulting in improved classification accuracy. The integration of CNNs and SVMs provides a flexible framework for image analysis tasks. Throughout the training process, a range of strategies is employed, encompassing techniques such as “dropout”. Regularization using dropout randomly deactivates neurons during training, helping the network learn more robust features and avoid overreliance on individual neurons.
2. Methodology
The integration of spectral and spatial data is achieved through a multi-resolution CNN, drawing inspiration from the simplified iteration of the conventional 3D CNN architecture [37]. Recent strides in deep learning emphasize the effectiveness of hybrid models that fuse 3D and 2D CNNs, demonstrating their efficacy in extracting intricate features from hyperspectral datasets. By incorporating both spectral and spatial dimensions, these models enhance network performance, enabling a more comprehensive analysis of high-dimensional hyperspectral images and addressing challenges within image processing.
2.1. Background of SENN
Most contemporary models designed for HSI classification adopt either a 2D-CNN structure [25] or a 3D-CNN framework [27]. While the 2D-CNN excels at capturing spatial information, it tends to overlook the valuable spectral detail inherent in HSI data [38]. The 3D-CNN, on the other hand, extracts spatial and spectral information simultaneously, but its heavier computation and greater need for training samples mean that it might not extract features effectively. To overcome these limitations, a multi-resolution feature fusion network is presented. The network combines a spectral feature extractor with a spatial feature extractor, strengthening the extraction of both kinds of features and thereby improving the overall feature representation and the precision of HSI classification.
To address the challenges associated with HSI classification, a combination of spectrally segmented dimensionality reduction and a multiple-layer CNN model is proposed. HSI data are represented as a high-dimensional cube denoted as $X \in \mathbb{R}^{P \times Q \times B}$, where P and Q are the spatial dimensions and B is the number of spectral bands. Each pixel in the hyperspectral cube is represented by a vector $x_i \in \mathbb{R}^{B}$, where i denotes the pixel index.
The first key aspect of SENN involves spectrally segmented dimensionality reduction techniques. These techniques partition the HSI data into subsets based on spectral characteristics, effectively reducing high dimensionality while preserving informative features. Let $R_1, R_2, \ldots, R_K$ denote the spectrally segmented subsets obtained from the original hyperspectral cube X. Each subset $R_k$ is represented by a reduced-dimensional array in $\mathbb{R}^{P \times Q \times R}$, where R is the reduced dimensionality obtained after the segmentation process. The dimensionality reduction is typically achieved using methods such as FA.
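For reference, the standard factor analysis model applied to the pixels of a single spectral segment can be written as below. This is the textbook formulation rather than a statement of the exact variant used here, and the symbols $B_k$ (number of bands in segment k), $r_k$ (number of retained factors), $\Lambda_k$, $\mu_k$, and $\Psi_k$ are notation introduced only for this illustration.

```latex
x_i^{(k)} = \mu_k + \Lambda_k f_i^{(k)} + \varepsilon_i^{(k)},
\qquad
f_i^{(k)} \sim \mathcal{N}(0, I_{r_k}),
\qquad
\varepsilon_i^{(k)} \sim \mathcal{N}(0, \Psi_k),
\qquad
\Psi_k = \operatorname{diag}(\psi_{k,1}, \ldots, \psi_{k,B_k})
```

Here $x_i^{(k)} \in \mathbb{R}^{B_k}$ holds the bands of pixel i that fall in segment k, $\Lambda_k \in \mathbb{R}^{B_k \times r_k}$ is the loading matrix with $r_k < B_k$, and the reduced representation of the pixel is the estimate of the latent factor vector $f_i^{(k)}$. Because $\Psi_k$ is diagonal, all correlation between bands inside a segment has to be explained by the shared factors. This is one reason for grouping correlated bands before the reduction.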
The second aspect of SENN involves the implementation of a multiple-layer CNN model for classification. The CNN model takes as input the reduced-dimensional subsets obtained from the spectrally segmented dimensionality reduction. The CNN structure comprises convolutional layers, identified as CONV3D and CONV2D, which extract spatial and spectral features. These layers apply filters with learnable parameters to capture spatial and spectral patterns. The output of the CNN model is a classification layer that produces the predicted class label for each pixel. The mathematical expressions associated with the SENN model include the operations performed in the convolution layers [26]. For example, a 3D convolution operation can be represented as:
$$Z_i = \varphi(W_i * X_i + b_i)$$
where $X_i$ is the input to the layer, $*$ denotes convolution, $Z_i$ represents the output feature maps after the convolution operation, $W_i$ denotes the learnable weights, $b_i$ is the bias term, and $\varphi$ represents the activation function, often exemplified by ReLU. Similarly, a 2D convolution operation applied to these feature maps could be formulated as follows [26]:
$$Z'_i = \varphi(W'_i * Z_i + b'_i)$$
where $Z'_i$ represents the output feature maps after the 2D convolution, $W'_i$ denotes the learnable weights, and $b'_i$ denotes the bias term. Additionally, pooling operations, such as max pooling or average pooling, can decrease the spatial size of the feature maps. This operation can be mathematically represented as [30]:
$$P(Z_i) = \operatorname{maxpool}(Z_i)$$
where $P(Z_i)$ represents the pooled feature maps obtained from $Z_i$ using the max pooling operation. Furthermore, fully connected layers in the CNN model can be represented as [39]:
$$A = \varphi(V F + c)$$
where A denotes the output activation, V represents the learnable weights, F represents the input features, c is the bias term, and $\varphi$ represents the activation function.
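The layer operations described above can be illustrated with a minimal NumPy sketch. It is not the SENN implementation: it handles a single 3D kernel with "valid" padding and stride 1, uses ReLU as the activation, and the helper names (`conv3d_single`, `max_pool_2x2`, `dense`) are hypothetical. As in most deep learning frameworks, the "convolution" is computed as a cross-correlation, i.e. the kernel is not flipped.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def relu(x):
    """Activation phi: element-wise max(x, 0)."""
    return np.maximum(x, 0.0)


def conv3d_single(x, w, b):
    """One 3D kernel, 'valid' padding, stride 1: phi(W * X + b)."""
    windows = sliding_window_view(x, w.shape)  # (d0, d1, d2, k0, k1, k2)
    return relu(np.einsum("abcijk,ijk->abc", windows, w) + b)


def max_pool_2x2(z):
    """Non-overlapping 2 x 2 max pooling over the first two (spatial) axes."""
    h, w = z.shape[0] // 2 * 2, z.shape[1] // 2 * 2
    z = z[:h, :w]
    return z.reshape(h // 2, 2, w // 2, 2, *z.shape[2:]).max(axis=(1, 3))


def dense(f, v, c):
    """Fully connected layer: phi(V F + c)."""
    return relu(v @ f + c)


# Toy input: a 6 x 6 patch with 7 bands, one 3 x 3 x 5 kernel of ones.
x = np.ones((6, 6, 7))
z = conv3d_single(x, np.ones((3, 3, 5)), b=-44.0)   # each window sums to 45
pooled = max_pool_2x2(z)
a = dense(pooled.ravel(), 0.5 * np.ones((2, 12)), np.array([-5.0, -7.0]))
print(z.shape, pooled.shape, a)
```

With "valid" padding each axis shrinks by the kernel size minus one, which is why the 6 × 6 × 7 input becomes 4 × 4 × 3 after the 3 × 3 × 5 kernel.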
The integration of mathematical expressions and equations within the SENN model provides a comprehensive understanding of the intricate operations and transformations conducted within each branch, offering insights into its functionality. By synergizing spectrally segmented dimensionality reduction with multibranch CNN architecture, the SENN model aims to elevate HSI classification accuracy by adeptly capturing both spectral and spatial features. Initially, the HSI data undergo segmentation based on spectral correlation, dividing it into distinct segments to enable more targeted analysis. Subsequently, the application of the dimensionality reduction technique FA to each segmented subset effectively reduces data complexity while preserving essential characteristics. This segmented approach ensures tailored processing for each subset, optimizing the extraction of informative features vital for accurate classification. In parallel, the SENN model incorporates multiple branches within the CNN architecture, each endowed with diverse convolution kernels and weights to comprehensively extract features. By diversifying the feature extraction process across multiple branches, the model can effectively capture a broader range of spectral and spatial characteristics inherent in the HSI data. Following the extraction of spectral–spatial features, the data are passed to the fully connected layers. Notably, a strategic decision is made to replace the last dense layer with an SVM classifier, enhancing the model’s capacity to discern subtle patterns and nuances within the HSI data. This integration is particularly effective as SVMs excel in creating non-linear decision boundaries, thereby complementing the feature extraction capabilities of the CNN architecture. The synergy between CNN and SVM fosters improved classification accuracy and performance, rendering the SENN model a valuable asset for HSI analysis in diverse applications.
2.2. SENN Model Description
The input HSI cube, with dimensions $P \times Q \times R$, is first spectrally segmented into several subgroups based on spectral band correlations. Creating three segments during this phase provides a balanced representation of spectral features, allowing effective feature extraction and classification of land cover objects in subsequent stages of the model. In the next step, factor analysis (FA) is applied to each subgroup, reducing the data to $P \times Q \times R_D$ with $R_D < R$. The output Y, with dimensions $P \times Q$, assigns each pixel one of the existing land cover classes.
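One way to picture the correlation-based segmentation is the sketch below. The text does not give the exact splitting rule, so the rule used here (cut the band axis at the weakest correlations between neighbouring bands) is an assumption made for illustration, as are the function name `segment_bands` and the synthetic data.

```python
import numpy as np


def segment_bands(cube, n_segments=3):
    """Split the band axis into contiguous groups of correlated bands.

    cube: array of shape (rows, cols, bands).
    Returns a list of (start, stop) band index pairs.
    Assumed rule: cut where adjacent bands are least correlated.
    """
    p, q, r = cube.shape
    pixels = cube.reshape(p * q, r)
    corr = np.corrcoef(pixels, rowvar=False)               # (bands, bands)
    adjacent = np.abs(corr[np.arange(r - 1), np.arange(1, r)])
    # The (n_segments - 1) weakest neighbour correlations become boundaries.
    cuts = np.sort(np.argsort(adjacent)[: n_segments - 1] + 1)
    edges = [0, *cuts.tolist(), r]
    return list(zip(edges[:-1], edges[1:]))


# Synthetic cube: 12 bands in three groups, each driven by its own latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
group_of_band = np.repeat([0, 1, 2], [4, 5, 3])
pixels = latent[:, group_of_band] + 0.05 * rng.normal(size=(100, 12))
cube = pixels.reshape(10, 10, 12)

segments = segment_bands(cube, n_segments=3)
print(segments)
```

Each `(start, stop)` pair can then be sliced out as `cube[:, :, start:stop]` and reduced on its own, so that every segment contributes features to the final representation.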
The spatial dimensions $P \times Q$ are preserved by FA, while the number of bands is reduced from $R$ to $R_D$ (where $R_D < R$). Employing FA in conjunction with spectral segmentation as a preliminary step in HSI preprocessing is highly advantageous. This approach allows FA to capture the variance intrinsic to correlated and overlapping spectral bands, thereby improving the model’s ability to distinguish similar instances. Conventional dimensionality reduction methods such as PCA or MNF, in contrast, do not directly address this objective for HSI [40]; at times they yield only an approximation of the essential factors that does not adequately discriminate among comparable instances. After the FA phase, overlapping 3D cube patches of size $W \times W \times R_D$ are extracted from the preprocessed HSI and fed into the deep learning model. The window size $W \times W$ is set to 19 × 19 for the Indian Pines dataset and 15 × 15 for both the Pavia University and Salinas Scene datasets; these values were chosen through grid search to optimize overall accuracy. The ground-truth label of each patch is the class of its central pixel. The SENN model architecture, depicted in Figure 1, features multiple 3D convolution layers followed by a 2D convolution layer to extract spatial–spectral features. Each block consists of three 3D convolution layers with 8, 16, and 32 filters, respectively, using kernel sizes of 3 × 3 × 5 for the first two layers and 3 × 3 × 1 for the third layer. By utilizing small convolution kernels, the model extracts features efficiently while minimizing computational cost [29,35]. Max pooling is applied after each convolution layer to prevent overfitting, and all convolution blocks are concatenated and flattened to convert each branch’s extracted features into one-dimensional vectors.
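The patch extraction step can be sketched as follows. This is a minimal illustration, not the authors' code: zero padding at the image border, the convention that label 0 means "unlabelled", and the function name `extract_patches` are all assumptions.

```python
import numpy as np


def extract_patches(cube, labels, window):
    """Return (patches, targets) for every labelled pixel.

    cube:   (rows, cols, bands) array after dimensionality reduction.
    labels: (rows, cols) integer map; 0 is assumed to mean "unlabelled".
    window: odd patch width W, giving W x W x bands patches.
    """
    margin = window // 2
    # Zero-pad the two spatial axes so border pixels also get full patches.
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)))
    rows, cols = np.nonzero(labels)
    patches = np.stack(
        [padded[r:r + window, c:c + window, :] for r, c in zip(rows, cols)]
    )
    targets = labels[rows, cols] - 1  # class of the central pixel, 0-based
    return patches, targets


# Toy example: 6 x 5 image with 4 bands and two labelled pixels.
cube = np.arange(6 * 5 * 4, dtype=float).reshape(6, 5, 4)
labels = np.zeros((6, 5), dtype=int)
labels[0, 0] = 2
labels[3, 2] = 1

patches, targets = extract_patches(cube, labels, window=3)
print(patches.shape, targets)
```

For the window sizes quoted above, `window` would be 19 or 15. Neighbouring patches overlap heavily, so the patch array takes roughly W × W times the memory of the labelled pixels themselves.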
To ensure the seamless flow of features throughout our model, we implemented the channel-wise concatenation of decomposed data within the fully connected dense layers. This strategic approach enables the effective integration and transmission of spectral–spatial features, ensuring that relevant information is retained and propagated throughout the network. By preserving crucial details, our model facilitates comprehensive analysis and classification of HSI data, allowing for the accurate interpretation of complex spatial and spectral characteristics. Moreover, the incorporation of two dropout layers serves as a crucial mechanism to mitigate the risk of overfitting, particularly in scenarios with a limited number of HSI samples. Through the random deactivation of neurons during training, dropout layers prevent the model from relying excessively on specific features or patterns, thereby enhancing its generalization ability and robustness to unseen data instances.
Furthermore, the strategic utilization of a support vector machine (SVM) as the final layer of our model aims to maximize classification accuracy and performance. SVMs are renowned for their robustness and effectiveness in handling complex classification tasks by identifying an optimal hyperplane that maximizes the margin between distinct classes. By integrating the SVM into our model, we leverage its discriminative capabilities to effectively manage intricate decision boundaries and capture non-linear correlations present within the HSI data. The utilization of the “squared_hinge” loss function further enhances classification accuracy by penalizing misclassifications based on their distance from the decision boundary, thereby promoting a more precise delineation of class boundaries and reducing classification errors.
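The quoted loss name matches the `squared_hinge` identifier used by Keras, although the framework is not stated here, so that correspondence is an assumption. Under the usual definition of that loss, with targets encoded as +1 for the true class and -1 otherwise, the computation can be sketched in NumPy as follows. The helper `to_signed_one_hot` is hypothetical.

```python
import numpy as np


def squared_hinge(y_true, y_pred):
    """Per-sample mean over classes of max(0, 1 - y_true * y_pred)^2.

    y_true: +1 for the correct class, -1 elsewhere.
    y_pred: raw (linear) class scores from the final layer.
    """
    margins = np.maximum(0.0, 1.0 - y_true * y_pred)
    return np.mean(margins ** 2, axis=-1)


def to_signed_one_hot(labels, n_classes):
    """Encode integer class labels as +1/-1 target vectors."""
    targets = -np.ones((len(labels), n_classes))
    targets[np.arange(len(labels)), labels] = 1.0
    return targets


y_true = to_signed_one_hot(np.array([0, 2]), n_classes=3)
y_pred = np.array([[0.5, -2.0, 0.0],    # true class 0, weak margin
                   [1.5, -1.0, 2.0]])   # true class 2, but class 0 scores high
loss = squared_hinge(y_true, y_pred)
print(loss)
```

Scores that already clear the margin contribute nothing, while violations are penalized quadratically, so a confidently wrong score (the 1.5 in the second row) dominates the loss.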
Moreover, the network structure and hyperparameters were designed and tuned for performance across all HSI datasets used in our research. The detailed specifications of the layers in each branch and level, including three 3D convolution layers and one 2D convolution layer, are provided in Table 1. The dimensions of the 3D convolution kernels are 8 × 3 × 3 × 7 × 1, 16 × 3 × 3 × 5 × 8, and 32 × 3 × 3 × 3 × 16; for example, the third 3D convolution uses 32 3D kernels of size 3 × 3 × 3 applied to all 16 3D input feature maps. The dimension of the 2D convolution kernel is 64 × 3 × 3 × 576, where 64 is the number of 2D kernels, 3 × 3 is the spatial dimension of each 2D kernel, and 576 is the number of 2D input feature maps. Through extensive experimentation and validation, our model has demonstrated high accuracy and reliability in classifying HSI data, even in complex and heterogeneous environments. This robust performance underscores the suitability of our proposed model for HSI classification.
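As a quick check on the kernel dimensions quoted above, the number of trainable parameters per layer follows directly from the kernel shape. The short sketch below assumes that every layer has one bias per filter, which is not stated explicitly; the layer names are labels chosen for the example.

```python
from math import prod

# Kernel shapes as quoted in the text: (filters, *kernel_dims, input_maps).
kernel_shapes = {
    "conv3d_1": (8, 3, 3, 7, 1),
    "conv3d_2": (16, 3, 3, 5, 8),
    "conv3d_3": (32, 3, 3, 3, 16),
    "conv2d": (64, 3, 3, 576),
}


def trainable_parameters(shape):
    """Weights plus one bias per filter (bias usage is an assumption)."""
    return prod(shape) + shape[0]


param_counts = {
    name: trainable_parameters(shape) for name, shape in kernel_shapes.items()
}
for name, count in param_counts.items():
    print(f"{name}: {count}")
```

The 2D layer dominates the convolutional parameter count because it sees 576 input maps. One reading consistent with that number is that the 32 3D feature maps, each with 18 remaining spectral slices, are reshaped into 32 × 18 = 576 2D channels. That factorization is an inference, not something stated in the text.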