3DVT: Hyperspectral Image Classification Using 3D Dilated Convolution and Mean Transformer

Su, Xinling; Shao, Jingbo

doi:10.3390/photonics12020146


by Xinling Su and Jingbo Shao *

School of Computer Science and Information Engineering, Harbin Normal University, Harbin 150025, China

* Author to whom correspondence should be addressed.

Photonics 2025, 12(2), 146; https://doi.org/10.3390/photonics12020146

Submission received: 21 January 2025 / Revised: 1 February 2025 / Accepted: 7 February 2025 / Published: 11 February 2025

(This article belongs to the Special Issue Advanced Fiber Laser Technology and Its Application)


Abstract

Hyperspectral imaging and laser technology both rely on different wavelengths of light to analyze the characteristics of materials, revealing their composition, state, or structure through precise spectral data. In hyperspectral image (HSI) classification tasks, the limited number of labeled samples and the lack of feature extraction diversity often lead to suboptimal classification performance. Furthermore, traditional convolutional neural networks (CNNs) primarily focus on local features in hyperspectral data, neglecting long-range dependencies and global context. To address these challenges, this paper proposes a novel model that combines CNNs with an average pooling Vision Transformer (ViT) for hyperspectral image classification. The model utilizes three-dimensional dilated convolution and two-dimensional convolution to extract multi-scale spatial–spectral features, while ViT is employed to capture global features and long-range dependencies in the hyperspectral data. Unlike the traditional ViT encoder, which uses linear projection, our model uses average pooling projection. This change enhances the extraction of local features and compensates for the ViT encoder’s limitations in local feature extraction. This hybrid approach effectively combines the local feature extraction strengths of CNNs with the long-range dependency handling capabilities of Transformers, significantly improving overall performance in hyperspectral image classification tasks. Additionally, the proposed method holds promise for the classification of fiber laser spectra, where high precision and spectral analysis are crucial for distinguishing between different fiber laser characteristics. Experimental results demonstrate that the CNN-Transformer model substantially improves classification accuracy on three benchmark hyperspectral datasets. The overall accuracies achieved on the three public datasets (IP, PU, and SV) were 99.35%, 99.31%, and 99.66%, respectively. These advancements offer potential benefits for applications such as high-performance optical fiber sensing, laser medicine, and environmental monitoring, where accurate spectral classification is essential for developing advanced systems.

Keywords:

laser technology; hyperspectral image; 3D dilated convolution; 2D convolution; transformer

1. Introduction

Hyperspectral images possess hundreds of contiguous narrow spectral bands, with each band containing substantial channel dimensions. This distinctive characteristic enables hyperspectral images to deliver rich spectral and spatial information, which is essential for a wide range of applications, including feature extraction [1], object detection [2], anomaly detection [3], and image classification [4,5]. In optics, hyperspectral imaging leverages its precise spectral resolution to analyze material properties, which has driven advancements in fields like environmental monitoring, agricultural analysis, and biomedical imaging.

On the other hand, laser-based technologies, including laser spectroscopy and laser-induced breakdown spectroscopy, utilize specific wavelengths of light to examine material composition and structure. By harnessing highly controlled light–matter interactions, these optical systems reveal intricate details about materials at both macro and micro levels. While hyperspectral imaging provides a broad and continuous spectrum for analysis, laser technologies often focus on discrete spectral lines, making the two methods highly complementary. Together, they exemplify the transformative potential of optical methods in advancing material characterization, from remote sensing to industrial quality control.

As a powerful tool, hyperspectral imaging has been widely applied in various fields such as remote sensing [6], agriculture [7,8], environmental monitoring [9], medical diagnostics [10,11], and national defense [12]. Unlike traditional imaging systems that capture images in only three bands (red, green, and blue), hyperspectral imaging systems can acquire images across hundreds or even thousands of narrow, continuous spectral bands, providing detailed spectral information for each pixel [13]. This unique capability allows hyperspectral imaging to capture fine spectral details and distinguish between surface materials with subtle spectral differences, thus playing a critical role in many applications [14].

Over the past few decades, significant contributions have been made by researchers in the field of hyperspectral image classification [15]. Numerous traditional methods have been proposed, including Sparse Representation (SR) [16], K-Nearest Neighbors (KNN) [17], Support Vector Machines (SVM) [18], Random Forest [19], and Extreme Learning Machines (ELM) [20]. While these traditional methods remain effective in certain situations, modern data science and machine learning applications often require more sophisticated algorithms to handle large-scale, high-dimensional, and noisy data environments [21]. These traditional methods are limited by their learning capacity and are often susceptible to noise, resulting in lower classification accuracy [22].

In recent years, deep learning methods have been widely applied in the field of image processing, achieving remarkable results [23,24,25]. Among these methods, CNNs have garnered significant attention due to their powerful feature extraction capabilities [26]. CNNs can automatically learn and extract deep features from images [27]. Researchers have designed various CNN architectures to enhance image classification accuracy [28]. Lee and Kwon proposed a 2D CNN method that efficiently uses spatial and spectral features for HSI classification [29]. Liu et al. introduced a method of directly inputting hyperspectral images into three-dimensional Convolutional Neural Networks (3D-CNN), which extract spatial and spectral features simultaneously [30]. Zhang et al. proposed a spectral partitioning residual network (SPRN) for hyperspectral image classification [31]. It uses a homogeneous pixel detection module to reduce the impact of interfering pixels and improve classification performance. Roy et al. proposed a hybrid spectral convolutional neural network (HybridSN) model for hyperspectral image classification [32]. This model combines the advantages of 3D convolutional neural networks and 2D convolutional neural networks. However, these networks tend to exhibit complex structures and a large number of model parameters. Zhu et al. proposed a deformable convolutional neural network model, called deformable HSI classification networks (DHCNs), for hyperspectral image classification [33]. This model adaptively adjusts convolutional sampling locations and replaces pooling layers with strided convolutions, enabling it to better capture complex spatial structures and improve classification performance. Zhao et al. proposed a kernel-guided deformable convolution with a double-window joint bilateral filter to better preserve spatial edge details [34]. Existing CNN-based models perform well in extracting local spectral and spatial information from hyperspectral images.
However, these methods, especially when using small convolution kernels, are often limited to capturing fine-grained features within a restricted receptive field [35]. To capture long-range contextual and higher-level features, deeper CNN architectures are typically required [35]. However, due to the limited availability of samples in hyperspectral datasets, deep models are prone to overfitting during training [36]. Afjal et al. introduced a novel 3D-2D CNN method with multi-branch feature fusion (MFFNet) for hyperspectral image classification [37]. This approach integrates segmented principal component analysis (Seg-PCA) to reduce spectral dimensions and enhance spectral–spatial feature extraction, offering improved classification performance. These networks are often quite complex, with a large number of parameters and considerable computational demands. Achieving optimal performance typically requires a substantial amount of training data. Shallow features typically capture basic geometric elements such as points, lines, and edges, providing essential spatial information. In contrast, deep features focus more on semantic information, often overlooking the finer details of spatial structure. As a result, overly complex or deep network architectures may not be the most suitable for hyperspectral image classification tasks. To address these challenges, we incorporate transformers in the classification of HSI data, offering a more efficient and streamlined solution.

With the advancement of attention mechanisms, the Transformer model has gained significant attention [38,39]. As a novel neural network architecture, the Transformer excels at handling sequential data, achieving remarkable success across various domains [40,41]. Following the emergence of ViT, image classification has made substantial progress [42]. ViT leverages the self-attention mechanism to model long-range dependencies and effectively capture global features, distinguishing it from traditional CNNs [43]. This approach is particularly well-suited for high-dimensional data and complex tasks, such as hyperspectral image classification, where global context is key to improving performance. Arshad et al. propose a hierarchical attention transformer model for hyperspectral image classification [44]. The model uses window-based self-attention to efficiently learn both local and global representations through dedicated tokens in each window. He et al. introduced a method called Hyperspectral Image Bidirectional Encoder Representations (HSI-BERT), which leverages the principles of the Transformer [45]. This method integrates multiple self-attention mechanism (SAM) layers within each unit, converting hyperspectral image pixels into sequences for model input. However, it overlooks the inherent local feature recognition capabilities of the HSI cube and struggles with the redundant spectral bands in hyperspectral images, making it difficult to effectively describe non-local features. Wu et al. proposed an innovative feature extraction and classification method called the Dilated Deep MPFormer Network (DDMN), which combines the advantages of CNN and ViT while further enhancing the utilization of spectral information [46]. Mei et al. proposed a model called the Group-Aware Hierarchical Transformer (GAHT) for hyperspectral image classification [47]. The model improves traditional vision transformers. 
It addresses the problem of feature over-dispersion in multi-head self-attention mechanisms. Zhang et al. proposed a new model called the Dilated Spectral–Spatial Gaussian Transformer Net (DSSGT) for hyperspectral image classification [48]. Sun et al. developed the Spectral–Spatial Feature Tokenization Transformer (SSFTT) to capture both spectral-spatial features and high-level semantic information [49]. This method consists of two main modules: a spectral–spatial convolution module for local feature extraction and a Transformer encoder module for global feature extraction. However, the simple concatenation approach of these modules prevents the model from fully leveraging their combined advantages, leading to issues with information loss or inefficient information transfer when processing image data. Yang et al. proposed an innovative hyperspectral image classification method called PD2C, which combines pyramid feature extraction with deformable-dilated convolution to overcome the limitations of traditional CNNs [50]. Chen et al. proposed GSPFormer, which enhances spectral consistency by constructing a global spectral projection space [51]. It also introduces a spatial aggregation strategy to fuse local spectral features, improving the performance of hyperspectral image classification. Roy et al. proposed a new model called MorphFormer, a morphological Transformer designed for hyperspectral image classification [52]. This model aims to enhance the performance of traditional Vision Transformers, which fail to fully leverage spatial–spectral features. Recently, Zhao et al. proposed a lightweight vision Transformer network called GSC-ViT to enhance hyperspectral image classification [53]. This model aims to address the issue of excessive parameters in traditional ViT models when processing hyperspectral data, as well as their inability to effectively capture local spatial and spectral features.

This paper proposes a hyperspectral image classification framework that integrates neural networks with ViT. The framework enhances classification performance by leveraging multi-scale spatial–spectral feature extraction and global context modeling. The global receptive field of ViT and the local feature extraction capability of Convolutional Neural Networks offer distinct advantages within their respective architectures [54]. By integrating ViT with CNNs, the resulting hybrid model can effectively leverage both comprehensive global context and robust local feature processing. This combination has been validated by previous research, demonstrating superior performance in tasks such as feature extraction and pattern recognition. The key contributions of this paper are summarized as follows:

(1): The proposed model integrates 2D convolutional neural networks with 3D dilated convolutions to capture comprehensive multi-scale spatial–spectral features from hyperspectral images. The 2D CNN extracts local spatial features, while the 3D dilated convolution expands the receptive field without increasing the number of convolutional kernel parameters. This approach effectively enhances the model’s capacity to capture detailed features, contributing to more refined feature extraction.
(2): The model combines the advantages of CNN for local feature extraction and ViT for capturing global features and long-range dependencies. By utilizing 2D convolutional embeddings followed by ViT encoders, the architecture capitalizes on both local spatial details and global contextual information. This combination is particularly advantageous for managing the complexities of hyperspectral data.
(3): In the ViT module, the model adopts an average pooling projection in place of the conventional linear projection. This adjustment enhances the model’s ability to extract local features, reduces computational complexity, and improves robustness in feature extraction by smoothing activation maps. Experimental results across multiple hyperspectral image datasets demonstrate that this design significantly improves classification accuracy.

The rest of the paper is organized as follows: Section 2 provides a comprehensive overview of the 3DVT network model framework, detailing its architectural design, key components, and their roles in hyperspectral image classification. Section 3 presents a detailed performance analysis through experimental comparisons, highlighting the model’s advantages over existing methods and discussing its limitations. Finally, Section 4 summarizes the main conclusions, emphasizing key findings and contributions, while suggesting potential directions for future research.

2. Proposed Method

This section introduces the proposed 3DVT model. Figure 1 illustrates the network structure of the 3DVT model.

2.1. PCA for Dimensionality Reduction

Reducing the dimensionality of hyperspectral images using principal component analysis (PCA) effectively minimizes redundancy in the dataset [55]. By extracting key features, PCA significantly lowers data dimensionality and enhances image contrast by preserving the principal components with the greatest variance. This makes hidden important features more prominent. Additionally, PCA reduces network training time and improves classification accuracy. PCA achieves dimensionality reduction by transforming high-dimensional data into a new coordinate system. This transformation ensures that the first principal component captures the maximum variance in the data, while subsequent components capture progressively less variance, with each being orthogonal to the others. Mathematically, PCA involves the decomposition of the covariance matrix of the dataset to compute eigenvalues and eigenvectors, which correspond to the variance magnitude and the direction of the principal components, respectively.

In the context of hyperspectral imaging, PCA is particularly advantageous due to the significant redundancy among spectral bands. By selecting a reduced number of principal components that preserve most of the variance, PCA effectively addresses the “curse of dimensionality”, reduces noise, and enhances computational efficiency for machine learning models. Additionally, the whitening process in PCA ensures that the principal components are decorrelated and scaled to unit variance, which is particularly beneficial for downstream tasks such as classification and clustering. By retaining only the most essential spectral information, PCA not only enhances the interpretability of hyperspectral data, but also improves the generalization ability of models across diverse datasets. As a result, PCA has become an indispensable preprocessing technique in hyperspectral image analysis workflows, particularly in resource-constrained scenarios.
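The covariance decomposition and whitening described above can be written out as a short worked form. The symbols used here ($X_c$, $n$, $B$, $k$, $V_k$, $\Lambda_k$, $Z$) are notation introduced for illustration only and are not taken from the paper; the number of retained components $k$ is left unspecified.

```latex
% X_c: n x B matrix of mean-centered pixel spectra (n pixels, B spectral bands)
C = \frac{1}{n-1} X_c^{\top} X_c, \qquad
C = V \Lambda V^{\top}, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_B \ge 0

% Keep the k leading eigenvectors V_k (B x k) and eigenvalues \Lambda_k (k x k)
Z = X_c V_k
\qquad \text{(projection onto the first } k \text{ principal components)}

% Whitening rescales each retained component to unit variance
Z_{\mathrm{white}} = X_c V_k \Lambda_k^{-1/2},
\qquad \frac{1}{n-1} Z_{\mathrm{white}}^{\top} Z_{\mathrm{white}} = I_k
```

The last identity holds whenever the $k$ retained eigenvalues are strictly positive, which is why whitening yields decorrelated, unit-variance components.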

2.2. The 3D Dilated Convolution Layer

In hyperspectral image classification tasks, hyperspectral images are inherently 3D. The 3D atrous convolution enhances the model’s ability to learn complex spatial structures within these images [56]. By appropriately choosing the dilation rate, the model can learn richer spatial–spectral features while maintaining low parameter and computational costs, thereby improving classification accuracy. Additionally, using 3D atrous convolution reduces the risk of overfitting due to high dimensionality and enhances the model’s generalization ability on unseen samples.

For a given 3D input feature map $F \in \mathbb{R}^{D \times H \times W}$, where $D$, $H$, and $W$ represent the number of spectral bands, height, and width, respectively, consider a 3D convolution kernel $K \in \mathbb{R}^{D' \times H' \times W'}$, where $D'$, $H'$, and $W'$ are the kernel sizes in the depth, height, and width directions. The dilation rate $l$ determines the spacing between adjacent elements in the convolution kernel: $l$ equals the number of skipped elements between two adjacent kernel elements plus one, so $l = 1$ gives an ordinary convolution. The output $O$ of the 3D atrous convolution can be calculated using the following formula:

$$O(x, y, z) = \sum_{d, h, w} F(x + l \cdot d,\; y + l \cdot h,\; z + l \cdot w) \cdot K(d, h, w) \tag{1}$$

where $x$, $y$, and $z$ are the indices in the depth, height, and width directions, respectively, and $(d, h, w)$ traverses all positions of the convolution kernel $K$. The actual receptive field $R$ of the convolution kernel can be calculated from the dilation rate $l$ and the kernel dimensions $(D', H', W')$ as follows:

$$R_D = l(D' - 1) + 1 \tag{2}$$

$$R_H = l(H' - 1) + 1 \tag{3}$$

$$R_W = l(W' - 1) + 1 \tag{4}$$

Applying batch normalization (BN) and the ReLU activation function after 3D atrous convolution can help the model converge more quickly, reduce the risk of overfitting, and enhance the model’s capacity to learn complex data. This ultimately improves the accuracy of hyperspectral image classification [57,58].
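As an illustration of Equations (1) to (4), the following is a minimal dependency-free sketch of a 3D dilated convolution. The function names `receptive_field` and `dilated_conv3d` are hypothetical, and the sketch assumes stride 1, no padding, a single channel, and the same dilation rate on all three axes; it is not the paper's implementation, which would normally use a tensor library.

```python
from itertools import product


def receptive_field(kernel_size, dilation):
    """Extent covered by one kernel axis, as in Equations (2)-(4)."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv3d(F, K, l):
    """3D dilated convolution of Equation (1), stride 1, no padding (assumed).

    F and K are nested lists indexed as [depth][height][width].
    """
    D, H, W = len(F), len(F[0]), len(F[0][0])
    Dk, Hk, Wk = len(K), len(K[0]), len(K[0][0])
    # Output size shrinks by the dilated kernel extent along each axis.
    out_d = D - receptive_field(Dk, l) + 1
    out_h = H - receptive_field(Hk, l) + 1
    out_w = W - receptive_field(Wk, l) + 1
    O = [[[0.0] * out_w for _ in range(out_h)] for _ in range(out_d)]
    for x, y, z in product(range(out_d), range(out_h), range(out_w)):
        O[x][y][z] = sum(
            F[x + l * d][y + l * h][z + l * w] * K[d][h][w]
            for d, h, w in product(range(Dk), range(Hk), range(Wk))
        )
    return O


# Toy cube whose value equals its depth index, and an all-ones 3 x 3 x 3 kernel.
cube = [[[float(d) for _ in range(5)] for _ in range(5)] for d in range(5)]
ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
out = dilated_conv3d(cube, ones, 2)
```

With $l = 2$, the 3 × 3 × 3 kernel spans 5 samples per axis while still holding only 27 weights, so the 5 × 5 × 5 toy cube yields a single output value.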

2.3. The 2D Convolution Layer

Hyperspectral image data are vast, making models susceptible to overfitting. By employing 3D atrous convolution, the model can learn richer features without a substantial increase in parameters, while 2D convolution focuses on extracting refined spatial features. Together, these techniques effectively mitigate overfitting [59]. Let F be the input feature map and k be the convolution kernel. The output G of the 2D convolution at position (i,j) can be expressed as

$$G[i, j] = \sum_{m=-a}^{a} \sum_{n=-b}^{b} F[i + m, j + n] \cdot k[m, n] \tag{5}$$

Here, m and n are the indices of the convolution kernel k, typically centered at the position G[i,j] in the output. The size of the convolution kernel is usually (2a + 1) × (2b + 1), for example, 3 × 3, 5 × 5, etc.
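Equation (5) can be sketched in a few lines of plain Python. The function name `conv2d_centered` is hypothetical, and the zero padding at the borders is an assumption made here so that the output keeps the input size; the equation itself does not specify border handling.

```python
def conv2d_centered(F, k):
    """Centered 2D convolution of Equation (5) on nested lists.

    Assumes an odd-sized kernel of shape (2a + 1) x (2b + 1) and treats
    positions outside F as zero (border handling is an assumption).
    """
    a, b = len(k) // 2, len(k[0]) // 2
    rows, cols = len(F), len(F[0])
    G = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for m in range(-a, a + 1):
                for n in range(-b, b + 1):
                    r, c = i + m, j + n
                    if 0 <= r < rows and 0 <= c < cols:
                        # k[m, n] with m in [-a, a] maps to list index m + a.
                        acc += F[r][c] * k[m + a][n + b]
            G[i][j] = acc
    return G
```

A kernel that is 1 at its center and 0 elsewhere returns the input unchanged, which is a quick sanity check on the index offsets.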

2.4. The Embedding Layer

By introducing a 2D convolutional embedding layer before the ViT encoder, we enhance the model’s ability to process image data. Firstly, increasing the number of feature maps enriches the feature representations passed to the ViT encoder, providing it with more dimensional information and improving the model’s capacity to capture image details. Secondly, the efficiency of the convolutional layer in extracting local features from images sets a high-quality information baseline for the ViT encoder, ensuring that critical information is captured from the initial stages. Finally, the spatial information retention characteristic of convolution operations allows the ViT encoder to more effectively understand and process spatial relationships within the images. This approach leverages the advantages of convolutional networks in feature extraction and combines them with the ViT encoder’s powerful ability to handle long-range dependencies, significantly improving the model’s efficiency and accuracy in handling complex image tasks. According to the rules of two-dimensional calculations, the 2D convolution formula is:

$$O(i, j) = \sum_{m=0}^{K-1} \sum_{n=0}^{L-1} I(i + m, j + n) \cdot F(m, n) \tag{6}$$

Here, the input matrix $I$ has dimensions $M \times N$, and the convolution kernel $F$ has dimensions $K \times L$. $O(i, j)$ is the value of the output matrix at position $(i, j)$, $F(m, n)$ is the value of the convolution kernel at position $(m, n)$, and $I(i + m, j + n)$ represents the value of the input matrix at position $(i + m, j + n)$.

2.5. The Transformer Encoder Layer

The Vision Transformer (ViT) is a model based on the Transformer architecture, designed to capture global information in images using a multi-head self-attention mechanism. A feature image input to the ViT is represented as $X \in \mathbb{R}^{S \times S \times L}$, where $S$ denotes the length and width of the image, and $L$ represents the number of channels.

Firstly, to be processed by the multi-head self-attention mechanism, $X$ needs to be split into multiple tokens. After splitting $X$ into tokens, its dimension becomes $\mathbb{R}^{S^2 \times L}$. These tokens are then evenly divided along the channel dimension into $h$ parts, with each part denoted as $\hat{X}_\beta \in \mathbb{R}^{S^2 \times (L/h)}$, where $h$ is the number of attention heads and $\beta = 1, 2, \dots, h$. Each $\hat{X}_\beta$ undergoes a linear transformation to obtain the Q (query), K (key), and V (value) matrices. This process can be described by the following equations:

$$Q_\beta = W_q \cdot \hat{X}_\beta \in \mathbb{R}^{S^2 \times (L/h)} \tag{7}$$

$$K_\beta = W_k \cdot \hat{X}_\beta \in \mathbb{R}^{S^2 \times (L/h)} \tag{8}$$

$$V_\beta = W_v \cdot \hat{X}_\beta \in \mathbb{R}^{S^2 \times (L/h)} \tag{9}$$

In this context, $W_q$, $W_k$, and $W_v \in \mathbb{R}^{S^2 \times S^2}$ are three learnable matrices. They correspond to the linear transformations of the queries (Q), keys (K), and values (V), respectively. They are essential for implementing the self-attention mechanism, enabling the model to capture different aspects of information across various “heads”. Through this method, the Vision Transformer can effectively process image information, utilizing the global context to enhance the representation of each image patch. Subsequently, the multi-head attention mechanism computes the attention matrix for each head as follows:

$$A_\beta = \mathrm{Attention}(Q_\beta, K_\beta, V_\beta) = \mathrm{Softmax}\left(\frac{Q_\beta \cdot K_\beta^{T}}{\sqrt{d}}\right) \cdot V_\beta \tag{10}$$

Here, $A_\beta \in \mathbb{R}^{S^2 \times (L/h)}$, and the $\mathrm{Softmax}(\cdot)$ function performs a softmax calculation on each row of the matrix. In our experiments, we set $d = L/h$.

Finally, the output of the multi-head self-attention mechanism can be expressed by the following equation:

$$\mathrm{Output} = \mathrm{Concat}(A_1, A_2, \dots, A_h) + H \in \mathbb{R}^{S^2 \times L} \tag{11}$$

The Concat function combines the attention matrices obtained from each head into a single matrix. Each head produces an attention matrix $A_\beta$ with dimensions $\mathbb{R}^{S^2 \times (L/h)}$. By concatenating these matrices along the channel dimension, we obtain a matrix with dimensions $\mathbb{R}^{S^2 \times L}$, which matches the dimensions of the tokenized input $X$. Next, a residual connection adds the input to the concatenated matrix to give the final output; the term $H$ in Equation (11) denotes the input $X$ in its tokenized $S^2 \times L$ form. This design helps alleviate the vanishing gradient problem in deep networks, enabling the deep attention model to learn more effectively.
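The multi-head computation of Equations (7) to (11) can be traced with a small dependency-free sketch. The helper names (`softmax`, `matmul`, `attention`, `multi_head_attention`) are hypothetical. Two assumptions are made here: the matrices `Wq`, `Wk`, and `Wv` are shared across heads, since the equations carry no head index on them, and the residual term is the tokenized input itself. A practical implementation would use a tensor library instead of nested lists.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def attention(Q, K, V):
    """Equation (10): row-wise softmax of Q K^T / sqrt(d), applied to V."""
    d = len(Q[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)


def multi_head_attention(X, h, Wq, Wk, Wv):
    """Equations (7)-(11) for tokens X of shape S^2 x L.

    Wq, Wk, Wv have shape S^2 x S^2 and are shared by all heads (assumed).
    """
    L = len(X[0])
    if L % h:
        raise ValueError("L must be divisible by the number of heads h")
    step = L // h
    heads = []
    for beta in range(h):
        # Slice the channel dimension to obtain the part handled by this head.
        X_beta = [row[beta * step:(beta + 1) * step] for row in X]
        heads.append(attention(matmul(Wq, X_beta), matmul(Wk, X_beta), matmul(Wv, X_beta)))
    # Concatenate the head outputs along the channel dimension.
    concat = [sum((head[i] for head in heads), []) for i in range(len(X))]
    # Residual connection with the tokenized input.
    return [[c + x for c, x in zip(c_row, x_row)] for c_row, x_row in zip(concat, X)]
```

When all tokens are identical and the three matrices are identity matrices, every attention row is uniform, so the output reduces to twice the input; this makes a convenient check of the shapes and of the residual step.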

In our proposed 3DVT network, instead of using linear projection like the traditional Vision Transformer (ViT), we employ an average pooling projection approach. The input feature image is defined as $X \in \mathbb{R}^{S \times S \times L}$. The first operation performed is the average pooling projection. We use three average pooling kernels (kernel_size = 3, stride = 1, padding = 1, channels = L) to project the feature image, thereby obtaining the queries (Q), keys (K), and values (V). The specific equations are as follows:

$$Q = \mathrm{AvgPool2D}_Q(X) \in \mathbb{R}^{S \times S \times L} \tag{12}$$

$$K = \mathrm{AvgPool2D}_K(X) \in \mathbb{R}^{S \times S \times L} \tag{13}$$

$$V = \mathrm{AvgPool2D}_V(X) \in \mathbb{R}^{S \times S \times L} \tag{14}$$

The ViT encoder structure with average pooling projection is illustrated in Figure 2. Here, $\mathrm{AvgPool2D}_Q$, $\mathrm{AvgPool2D}_K$, and $\mathrm{AvgPool2D}_V$ denote the three pooling operators of Equations (12)–(14). Because the pooling operates directly on the two-dimensional feature map, each query, key, and value entry aggregates a 3 × 3 spatial neighborhood rather than a single position. This not only increases the model’s sensitivity to the local structure of the input data but also enhances its ability to handle the characteristics of image data. By using this pooling projection approach, the 3DVT network aims to effectively combine the advantages of convolutional neural networks in local feature extraction with the Transformers’ ability to handle long-range dependencies, thereby improving the overall performance of the model. Next, the queries (Q), keys (K), and values (V) are split into tokens and evenly divided into $h$ equal parts to obtain $Q_\beta$, $K_\beta$, and $V_\beta \in \mathbb{R}^{S^2 \times (L/h)}$, where $\beta = 1, 2, \dots, h$. Finally, the attention mechanism computes the attention matrix for each head according to Equation (10), and the outputs from the multiple heads are integrated to obtain the final output.
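The projection step can be illustrated with a small sketch in plain Python. The function names `avg_pool2d`, `avg_pool_projection`, and `to_tokens` are hypothetical. The sketch assumes zero padding with the padded zeros counted in the divisor, which matches the default behavior of PyTorch's `nn.AvgPool2d`; whether the paper's implementation uses that default is not stated.

```python
def avg_pool2d(channel, k=3, pad=1):
    """k x k average pooling with stride 1 on one S x S channel.

    Zero padding is used and padded zeros count toward the divisor
    (assumed, mirroring count_include_pad=True).
    """
    S = len(channel)
    out = [[0.0] * S for _ in range(S)]
    for i in range(S):
        for j in range(S):
            total = 0.0
            for di in range(-pad, k - pad):
                for dj in range(-pad, k - pad):
                    r, c = i + di, j + dj
                    if 0 <= r < S and 0 <= c < S:
                        total += channel[r][c]
            out[i][j] = total / (k * k)
    return out


def avg_pool_projection(channels):
    """Apply the pooling per channel to obtain Q, K, and V (Equations (12)-(14))."""
    Q = [avg_pool2d(ch) for ch in channels]
    K = [avg_pool2d(ch) for ch in channels]
    V = [avg_pool2d(ch) for ch in channels]
    return Q, K, V


def to_tokens(channels):
    """Flatten L channels of size S x S into S^2 tokens of length L."""
    S = len(channels[0])
    return [[ch[i][j] for ch in channels] for i in range(S) for j in range(S)]
```

Plain average pooling has no weights, so with identical settings the three projections coincide in this sketch; an actual implementation may differ in ways the text does not describe. The tokens returned by `to_tokens` have the $S^2 \times L$ shape expected before the split into heads.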

3. Experiments

To validate the effectiveness of the newly proposed 3DVT network, we conducted experiments on three public hyperspectral datasets: Indian Pines (IP), University of Pavia (PU), and Salinas Valley (SV). We selected seven algorithms for comparison, including convolution-based and ViT-based approaches. During model training, we chose labeled samples according to three different schemes, comprising 1%, 3%, and 10% of the total samples. These samples constituted the training set, while the remaining labeled samples formed the test set, which was used to evaluate the model’s performance on new data.
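A train/test split of this kind might be sketched as follows. The function name `split_labeled` and the `seed` parameter are hypothetical, and the sketch samples uniformly at random over all labeled pixels; the text does not say whether sampling was done per class, so that choice is an assumption.

```python
import random


def split_labeled(indices, train_fraction, seed=0):
    """Randomly split labeled sample indices into training and test sets.

    Uniform sampling over all labeled pixels is an assumption; a per-class
    (stratified) split would sample inside each class instead.
    """
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_train = max(1, round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


# The three schemes described above: 1%, 3%, and 10% of the labeled samples.
splits = {frac: split_labeled(range(1000), frac) for frac in (0.01, 0.03, 0.10)}
```

Every labeled sample not drawn for training lands in the test set, so the two sets are disjoint and together cover all labeled pixels.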

In the Vision Transformer architecture, both the class token and mean pooling methods are key areas of research for refining global feature representations. The class token utilizes the self-attention mechanism to integrate comprehensive information from the image, while mean pooling simplifies this process by averaging the encoded features [60]. This study aimed to compare the performance of these two methods across different tasks to determine the more efficient feature extraction strategy for specific hyperspectral image analyses. Additionally, we examined the impact of principal component analysis (PCA) preprocessing on dimensionality reduction and noise suppression, evaluating its contribution to model accuracy and generalization capability. This investigation will provide empirical evidence for selecting deep learning model architectures and data preprocessing techniques.

In our study, we compared the proposed 3DVT model with state-of-the-art methods. The network was trained using the cross-entropy loss function and stochastic gradient descent [61]. The learning rate was set to 0.01. The experiments were conducted using the PyTorch framework, with hardware configurations consisting of an i7-13700K CPU, 32 GB DDR5 RAM, and an NVIDIA GeForce RTX 3080 GPU. We selected three common evaluation metrics, namely overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (Kappa), to assess network performance. Table 1 provides detailed descriptions of each dataset.

These three metrics capture different aspects of the model’s performance in hyperspectral image classification and are defined as follows.

Overall Accuracy (OA): OA reflects the proportion of correctly classified samples across the entire dataset and serves as a general measure of the model’s accuracy. It is expressed as

$$\mathrm{OA} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Samples}} \tag{15}$$

While OA provides a straightforward assessment of the overall performance, it may be biased in datasets with class imbalance, as it treats all samples equally regardless of their class.

Average Accuracy (AA): Unlike OA, AA evaluates the model’s balanced performance across all classes by averaging the individual class accuracies. This ensures that each class contributes equally, regardless of its size. It is defined as

$$\mathrm{AA} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Accuracy}_i \tag{16}$$

where $N$ represents the number of classes, and $\mathrm{Accuracy}_i$ is the accuracy for class $i$. This metric is particularly valuable in scenarios where underrepresented classes play a significant role, providing a more nuanced perspective on model performance.

Kappa Coefficient (Kappa): The Kappa coefficient quantifies the agreement between the model’s predictions and the ground truth, accounting for random chance. It is calculated as

$$\kappa = \frac{P_o - P_e}{1 - P_e} \tag{17}$$

where $P_o$ represents the observed agreement (equivalent to OA), and $P_e$ is the expected agreement due to chance. A Kappa value of 1 indicates perfect agreement, while a value of 0 suggests that the model performs no better than random guessing. This metric is particularly insightful when dealing with imbalanced datasets, as it adjusts for class distribution effects.

By combining these metrics, we achieve a comprehensive evaluation of the model’s performance, capturing its overall accuracy, balance across classes, and agreement quality beyond random chance. This multi-faceted approach ensures a robust assessment, particularly for datasets with diverse and imbalanced class distributions.
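The three metrics in Equations (15) to (17) can be computed from a confusion matrix, as in the following minimal sketch. The function names `confusion_matrix` and `classification_metrics` are hypothetical, and the labels in the example are made up purely to exercise the formulas; they are not results from the paper.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def classification_metrics(cm):
    """Return (OA, AA, Kappa) following Equations (15)-(17)."""
    n = sum(sum(row) for row in cm)
    oa = sum(cm[i][i] for i in range(len(cm))) / n
    # Per-class accuracy: correct predictions divided by samples of that class.
    per_class = [cm[i][i] / sum(cm[i]) for i in range(len(cm)) if sum(cm[i]) > 0]
    aa = sum(per_class) / len(per_class)
    # Expected chance agreement from the row and column marginals.
    col_totals = [sum(col) for col in zip(*cm)]
    pe = sum(sum(row) * col for row, col in zip(cm, col_totals)) / (n * n)
    # Undefined when pe == 1, i.e. all samples and predictions share one class.
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa


# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
oa, aa, kappa = classification_metrics(confusion_matrix(y_true, y_pred, 2))
```

In this toy case OA is 5/6, AA is 0.875, and Kappa is 2/3, which shows how the three numbers can diverge when the classes have different sizes.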

3.1. Ablation Study

Table 2 summarizes the findings of an ablation study aimed at elucidating the impact of different architectural components on the classification accuracy of the 3DVT model. This study presents the performance results of the model under four different scenarios, each utilizing various combinations of three key elements: the Convolutional Neural Network segment (CNN Segment), Convolutional Embedding (Conv Embedding), and Average Pooling Projection.

We rigorously evaluated the effectiveness of each scenario using three statistical metrics: Overall Accuracy (OA), Average Accuracy (AA), and the Kappa Coefficient (Kappa). Overall Accuracy measures the model’s general accuracy, Average Accuracy assesses the balanced performance of the model across different classes, and the Kappa Coefficient evaluates the degree to which the model’s predictions exceed random agreement.

In Case 1, the Convolutional Embedding and Average Pooling Projection were activated, but the Convolutional Neural Network segment was not used. The final OA was lower than that of Case 4 on all three datasets (for example, 98.77% versus 99.36% on IP). This case served as a control, indicating that the Convolutional Neural Network segment, which focuses on local features in hyperspectral data, enhances the network model’s performance.

In Case 2, the Convolutional Neural Network segment and Average Pooling Projection were activated, but the Convolutional Embedding was excluded. The Convolutional Embedding, by increasing the number of feature maps, expands the feature representation space passed to the ViT encoder, providing it with more informational dimensions and thereby enhancing the model’s ability to capture image details. In Case 4, with the addition of Convolutional Embedding, the model’s performance improved significantly.

Case 3 activated the Convolutional Neural Network and Convolutional Embedding but did not use Average Pooling Projection. Compared to Case 4, which included Average Pooling Projection, the performance declined. This is because Average Pooling Projection excels at extracting local features, compensating for the ViT encoder’s shortcomings in local feature extraction, thereby improving performance.

Case 4 integrated all components and provided the highest performance metrics. In the IP dataset, the OA was 99.36 ± 0.05%, the AA was 98.77 ± 0.41%, and the Kappa was 99.26 ± 0.06%. This indicates that the combination of the Convolutional Neural Network, Convolutional Embedding, and Average Pooling Projection fosters the most effective interaction between components, significantly enhancing classification performance.

3.2. Contributions of PCA, Class Token, and Mean Pooling to the 3DVT

Table 3 showcases the contributions of PCA, Class Token, and Mean Pooling to the 3DVT model. PCA enhances the dataset’s expressive capability by reducing dimensionality to eliminate noise and redundancy. The cases using PCA (Case 1, OA of 99.28%, and Case 4, OA of 98.80%) achieved higher accuracy than those not using PCA (Case 2, OA of 98.40%, and Case 3, OA of 96.98%). Comparing the roles of Class Token and Mean Pooling, we found that, with PCA applied, Mean Pooling (Case 1) improved both the overall accuracy and the average accuracy over the Class Token (Case 4): OA rose from 98.80% to 99.28% and AA from 97.03% to 98.95%. Without PCA, the Mean Pooling variant (Case 3, OA of 96.98%) scored below the Class Token variant (Case 2, OA of 98.40%), so the gain from Mean Pooling was observed only in combination with PCA. Within the full pipeline, the Class Token as a global information representative does not appear to bring a performance improvement, which might suggest that in the 3DVT model, global feature representation can be more effectively captured by other means.
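The PCA step discussed here can be sketched as a projection of every pixel spectrum onto the leading spectral principal components. The NumPy example below is an illustration under stated assumptions, not the authors' preprocessing code: the function name `pca_reduce`, the toy cube size, and the choice of five components are all hypothetical, since the number of retained components is not given in this section.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project an (H, W, B) hyperspectral cube onto its top spectral components."""
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(np.float64)    # one row per pixel spectrum
    pixels = pixels - pixels.mean(axis=0)              # center each band
    cov = np.cov(pixels, rowvar=False)                 # (B, B) band covariance
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # indices of the largest ones
    reduced = pixels @ eigvecs[:, order]               # (H*W, n_components)
    explained = eigvals[order].sum() / eigvals.sum()   # retained variance fraction
    return reduced.reshape(h, w, n_components), explained

rng = np.random.default_rng(0)
cube = rng.normal(size=(8, 8, 20))   # toy stand-in for a scene such as 145 x 145 x 200
reduced, explained = pca_reduce(cube, n_components=5)
```

The spatial layout is unchanged; only the spectral axis shrinks, which is why the later convolution and Transformer stages see fewer, less redundant channels.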

Considering all factors, Mean Pooling, in the absence of a Class Token and combined with PCA, as seen in Case 1, yields the best performance of the 3DVT model, particularly in overall classification accuracy. This improvement may be attributed to Mean Pooling’s more efficient global representation capability when integrating feature information. Although the Class Token is an effective global information representative in typical Transformer models, its role in the 3DVT model appears to be less pronounced compared to Mean Pooling. Therefore, we conclude that, within the full 3DVT pipeline, Mean Pooling has an advantage over the Class Token, especially in terms of overall classification accuracy.
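The difference between the two readout strategies can be made concrete with a small sketch. It assumes the common ViT convention in which the class token sits at index 0 of the token sequence, and it omits the encoder itself, so the tensors are random placeholders rather than real features. The function names and sizes are hypothetical.

```python
import numpy as np

def class_token_readout(tokens):
    """tokens: (N + 1, D) encoder output with the class token at index 0."""
    return tokens[0]

def mean_pool_readout(tokens):
    """tokens: (N, D) encoder output without a class token; average over tokens."""
    return tokens.mean(axis=0)

rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(16, 64))   # 16 tokens of width 64 (toy sizes)
cls = np.zeros((1, 64))                    # placeholder for a learned class token
with_cls = np.concatenate([cls, patch_tokens], axis=0)

feat_cls = class_token_readout(with_cls)      # feature fed to the classifier head
feat_mean = mean_pool_readout(patch_tokens)   # same width, built from every token
```

With Mean Pooling every token contributes directly to the classification feature, whereas the class-token readout depends on how well one extra token has aggregated the others through attention.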

Table 4 presents the classification results of 3DVT across various hyperspectral datasets.

3.3. Comparative Analysis of Experiments

Extensive experiments were conducted using the proposed 3DVT model, and the results were compared with both traditional and state-of-the-art models to evaluate its performance. The comparison included classical CNN-based methods such as SPRN [31] and HybridSN [32]. Additionally, we included cutting-edge Transformer-based techniques like GAHT [47], MorphFormer [52], GSPFormer [51], and GSC-ViT [53]. Table 5, Table 6, and Table 7 present the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (κ) for all compared methods on the three HSI datasets. The Kappa coefficient (κ) measures the agreement between the predicted and true labels [62]. A higher Kappa coefficient generally indicates better model performance.

In a comparative study of the HSI classification capabilities of multiple networks, the proposed 3DVT model outperformed the other algorithms across the IP, SV, and PU datasets. The first two networks in each table are CNN-based models, while the next four are Transformer-based models. The tables show that the CNN-based models performed poorly on the SV dataset. This is because the SV dataset is large, with few labeled pixels that are widely distributed. CNNs focus on local features in hyperspectral data and tend to ignore global features, leading to poor performance on the SV dataset. Conversely, the CNN-based HybridSN was competitive with the Transformer-based models on the IP dataset because the IP dataset is smaller, with concentrated labeled pixels. On the PU dataset, the strongest Transformer-based models (MorphFormer and GSC-ViT) outperformed the CNN models, as Transformers excel at capturing global features and long-range dependencies in hyperspectral data, thereby providing a more comprehensive contextual understanding.

3DVT addresses the shortcomings of traditional deep learning models in handling high-dimensional data and information transfer by combining Convolutional Neural Networks with Transformer architectures for feature extraction. This unique integration not only enhances the model’s ability to capture spatial–spectral information, but also significantly improves its classification performance on high-dimensional features in remote sensing images. Consequently, 3DVT provides a novel approach to tackling the challenges of deep learning in processing high-dimensional remote sensing data.

To investigate the relationship between the number of encoders and classification accuracy, we measured the overall accuracy (OA) and training time at different depths. Table 8 presents the OA for various numbers of encoders. On the IP dataset, the highest OA was achieved with six encoders; on the PU dataset, with seven encoders; and on the SV dataset, with four encoders. Overall, having more encoders did not necessarily result in better performance; each dataset had its own optimal number of encoders. As shown in Table 9, the training time increased with the number of encoders. Therefore, selecting the appropriate number of encoders is crucial for balancing accuracy and training time.

Table 10 shows the impact of different patch sizes on overall accuracy (OA). On the IP dataset, the highest OA was achieved with a patch size of 11 × 11; on the PU dataset, with a patch size of 13 × 13; and on the SV dataset, with a patch size of 17 × 17. Beyond these sizes, the OA no longer improved, while, as shown in Table 11, the training time kept increasing with the patch size. Therefore, blindly increasing the patch size is not advisable. Selecting an appropriate patch size is crucial for balancing training time and classification accuracy.
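The growth in training time with patch size follows from how patch-based HSI classifiers build their inputs: each labeled pixel is represented by the square neighbourhood around it, so the number of pixels per sample grows quadratically with the patch side. The sketch below illustrates this with NumPy. The zero padding at the image border and the function name `extract_patch` are assumptions for illustration; the padding scheme is not described in this section.

```python
import numpy as np

def extract_patch(cube, row, col, patch_size):
    """Return the (patch_size, patch_size, B) neighbourhood centred on (row, col)."""
    if patch_size % 2 == 0:
        raise ValueError("patch size is assumed odd so the pixel is centred")
    m = patch_size // 2
    # Assumed border handling: pad the two spatial axes with zeros.
    padded = np.pad(cube, ((m, m), (m, m), (0, 0)), mode="constant")
    return padded[row:row + patch_size, col:col + patch_size, :]

cube = np.arange(6 * 6 * 3, dtype=np.float64).reshape(6, 6, 3)   # toy cube
patch = extract_patch(cube, row=0, col=0, patch_size=5)

# Pixels per sample for the patch sizes listed in Table 10.
pixels_per_patch = {p: p * p for p in (9, 11, 13, 15, 17, 19)}
```

Going from 9 × 9 to 19 × 19 raises the per-sample input from 81 to 361 pixels, which is consistent with the rising training times reported in Table 11.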

Table 12 presents the training time of different methods across the three hyperspectral datasets. 3DVT did not exhibit an advantage in terms of training time, requiring 50.78 s for the IP dataset, 78.30 s for the PU dataset, and 52.69 s for the SV dataset. It trained faster than HybridSN on all three datasets but slower than lightweight models such as MorphFormer and GSC-ViT. This indicates that although 3DVT has higher computational costs than the lightweight models, its training time remains within a reasonable range.

The increased training time of 3DVT primarily stems from its complex architectural design, which incorporates 3D dilated convolutions and Vision Transformer (ViT) encoders. These components are crucial for achieving superior classification performance. Future work will focus on optimizing the model’s structure to further improve training efficiency while maintaining its excellent classification accuracy.

4. Conclusions

This paper presents the 3DVT network model for hyperspectral image (HSI) classification, which integrates 3D dilated convolutions, 2D convolutions, and Vision Transformer (ViT) encoders. By employing average pooling projection instead of the conventional linear projection in ViT, the model enhances local feature extraction. The convolutional components excel at capturing local spatial relationships, while the self-attention mechanism in ViT focuses on global dependencies, creating a complementary synergy that delivers superior classification performance.

Experimental results across multiple datasets demonstrate that 3DVT achieves higher accuracy compared to existing methods. Although the model has slightly longer training times than some lightweight approaches, it remains a practical solution for scenarios where high accuracy is essential. Furthermore, the model shows promise for applications beyond HSI classification, such as fiber laser spectrum analysis.

Future work will prioritize optimizing training efficiency, reducing training time, and improving generalization across diverse datasets and real-world scenarios. These efforts aim to broaden the model’s applicability and adaptability to meet the demands of various practical settings.

Author Contributions

Writing—original draft preparation, X.S.; writing—review and editing, J.S.; Conceptualization, X.S. and J.S.; methodology, X.S.; software, X.S.; validation, J.S.; investigation, X.S.; resources, J.S.; supervision, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Harbin Normal University Postgraduate Innovation Project (HSDSSCX2023-7); Heilongjiang Provincial Natural Fund Joint Guidance Project (LH2019F027).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study can be downloaded from Hyperspectral Remote Sensing Scenes—Grupo de Inteligencia Computacional (GIC).

Acknowledgments

We acknowledge the assistance of OpenAI’s GPT for language refinement in this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yu, C.; Zhu, Y.; Song, M.; Wang, Y.; Zhang, Q. Unseen feature extraction: Spatial mapping expansion with spectral compression network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
2. He, X.; Tang, C.; Liu, X.; Zhang, W.; Sun, K.; Xu, J. Object detection in hyperspectral image via unified spectral-spatial feature aggregation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13.
3. Guo, T.; He, L.; Luo, F.; Gong, X.; Li, Y.; Zhang, L. Anomaly detection of hyperspectral image with hierarchical antinoise mutual-incoherence-induced low-rank representation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13.
4. Akewar, M.; Chandak, M. Hyperspectral Imaging Algorithms and Applications: A Review. Authorea Preprints 2023.
5. Datta, D.; Mallick, P.K.; Bhoi, A.K.; Ijaz, M.F.; Shafi, J.; Choi, J. Hyperspectral image classification: Potentials, challenges, and future directions. Comput. Intell. Neurosci. 2022, 2022, 3854635.
6. Zhao, C.; Qin, B.; Feng, S.; Zhu, W.; Sun, W.; Li, W.; Jia, X. Hyperspectral image classification with multi-attention transformer and adaptive superpixel segmentation-based active learning. IEEE Trans. Image Process. 2023, 32, 3606–3621.
7. Guerri, M.F.; Distante, C.; Spagnolo, P.; Bougourzi, F.; Taleb-Ahmed, A. Deep learning techniques for hyperspectral image analysis in agriculture: A review. ISPRS Open J. Photogramm. Remote Sens. 2024, 12, 100062.
8. Rajabi, R.; Zehtabian, A.; Singh, K.D.; Tabatabaeenejad, A.; Ghamisi, P.; Homayouni, S. Hyperspectral imaging in environmental monitoring and analysis. Front. Environ. Sci. 2024, 11, 1353447.
9. Lu, B.; Dao, P.D.; Liu, J.; He, Y.; Shang, J. Recent advances of hyperspectral imaging technology and applications in agriculture. Remote Sens. 2020, 12, 2659.
10. Yoon, J. Hyperspectral imaging for clinical applications. BioChip J. 2022, 16, 1–12.
11. Karim, S.; Qadir, A.; Farooq, U.; Shakir, M.; Laghari, A.A. Hyperspectral imaging: A review and trends towards medical imaging. Curr. Med. Imaging 2023, 19, 417–427.
12. Nisha, A.; Anitha, A. Current advances in hyperspectral remote sensing in urban planning. In Proceedings of the 2022 Third International Conference on Intelligent Computing Instrumentation and Control Technologies (ICICICT), Kannur, India, 11–12 August 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 94–98.
13. Sawant, S.S.; Prabukumar, M. A survey of band selection techniques for hyperspectral image classification. J. Spectr. Imaging 2020, 9, 1–18.
14. Ahmad, M.; Shabbir, S.; Roy, S.K.; Hong, D.; Wu, X.; Yao, J.; Khan, A.M.; Mazzara, M.; Distefano, S.; Chanussot, J. Hyperspectral image classification—Traditional to deep models: A survey for future prospects. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 15, 968–999.
15. Wambugu, N.; Chen, Y.; Xiao, Z.; Tan, K.; Wei, M.; Liu, X.; Li, J. Hyperspectral image classification on insufficient-sample and feature learning using deep neural networks: A review. Int. J. Appl. Earth Obs. Geoinf. 2021, 105, 102603.
16. Chen, Y.; Nasrabadi, N.M.; Tran, T.D. Hyperspectral image classification using dictionary-based sparse representation. IEEE Trans. Geosci. Remote Sens. 2011, 49, 3973–3985.
17. Ma, L.; Crawford, M.M.; Tian, J. Local manifold learning-based k-nearest-neighbor for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2010, 48, 4099–4109.
18. Melgani, F.; Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790.
19. Zhang, Y.; Cao, G.; Li, X.; Wang, B. Cascaded random forest for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 1082–1094.
20. Zhang, Y.; Cao, G.; Li, X. Multiview-based random rotation ensemble pruning for hyperspectral image classification. IEEE Trans. Instrum. Meas. 2020, 70, 1–14.
21. Poulinakis, K.; Drikakis, D.; Kokkinakis, I.W.; Spottswood, S.M. Machine-learning methods on noisy and sparse data. Mathematics 2023, 11, 236.
22. Imani, M.; Ghassemian, H. An overview on spectral and spatial information fusion for hyperspectral image classification: Current trends and challenges. Inf. Fusion 2020, 59, 59–83.
23. Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3523–3542.
24. Kumar, R.; Kumbharkar, P.; Vanam, S.; Sharma, S. Medical images classification using deep learning: A survey. Multimed. Tools Appl. 2024, 83, 19683–19728.
25. Sajitha, P.; Andrushia, A.D.; Anand, N.; Naser, M. A review on machine learning and deep learning image-based plant disease classification for industrial farming systems. J. Ind. Inf. Integr. 2024, 38, 100572.
26. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019.
27. Jogin, M.; Madhulika, M.S.; Divya, G.D.; Meghana, R.K.; Apoorva, S. Feature extraction using convolution neural networks (CNN) and deep learning. In Proceedings of the 2018 3rd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), Bengaluru, India, 18–19 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 2319–2323.
28. Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020, 53, 5455–5516.
29. Lee, H.; Kwon, H. Contextual deep CNN based hyperspectral classification. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3322–3325.
30. Liu, X.; Sun, Q.; Meng, Y.; Wang, C.; Fu, M. Feature extraction and classification of hyperspectral image based on 3D-convolution neural network. In Proceedings of the 2018 IEEE 7th Data Driven Control and Learning Systems Conference (DDCLS), Enshi, China, 25–27 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 918–922.
31. Zhang, X.; Shang, S.; Tang, X.; Feng, J.; Jiao, L. Spectral partitioning residual network with spatial attention mechanism for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14.
32. Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2019, 17, 277–281.
33. Zhu, J.; Fang, L.; Ghamisi, P. Deformable convolutional neural networks for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1254–1258.
34. Zhao, C.; Zhu, W.; Feng, S. Hyperspectral image classification based on kernel-guided deformable convolution and double-window joint bilateral filter. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
35. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the Computer Vision—ECCV 2014, 13th European Conference, Proceedings, Part I 13, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 818–833.
36. Bejani, M.M.; Ghatee, M. A systematic review on overfitting control in shallow and deep neural networks. Artif. Intell. Rev. 2021, 54, 6391–6438.
37. Afjal, M.I.; Mondal, M.N.I.; Mamun, M.A. Effective hyperspectral image classification based on segmented PCA and 3D-2D CNN leveraging multibranch feature fusion. J. Spat. Sci. 2024, 1–28.
38. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
39. Dubey, S.R.; Singh, S.K. Transformer-based generative adversarial networks in computer vision: A comprehensive survey. IEEE Trans. Artif. Intell. 2024, 5, 4851–4867.
40. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11.
41. Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; Loy, C.C. Transformer-based visual segmentation: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10138–10163.
42. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110.
43. Khan, A.; Rauf, Z.; Sohail, A.; Khan, A.R.; Asif, H.; Asif, A.; Farooq, U. A survey of the vision transformers and their CNN-transformer based variants. Artif. Intell. Rev. 2023, 56 (Suppl. S3), 2917–2970.
44. Arshad, T.; Zhang, J. Hierarchical attention transformer for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5.
45. He, J.; Zhao, L.; Yang, H.; Zhang, M.; Li, W. HSI-BERT: Hyperspectral image classification using the bidirectional encoder representation from transformers. IEEE Trans. Geosci. Remote Sens. 2019, 58, 165–178.
46. Wu, Q.; He, M.; Huang, W.; Zhu, F. Dilated Deep MPFormer Network for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5.
47. Mei, S.; Song, C.; Ma, M.; Xu, F. Hyperspectral image classification using group-aware hierarchical transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14.
48. Zhang, Z.; Wang, S.; Zhang, W. Dilated spectral–spatial Gaussian Transformer net for hyperspectral image classification. Remote Sens. 2024, 16, 287.
49. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14.
50. Yang, J.; Li, A.; Qian, J.; Qin, J.; Wang, L. A Hyperspectral Image Classification Method Based on Pyramid Feature Extraction with Deformable-Dilated Convolution. IEEE Geosci. Remote Sens. Lett. 2023, 21, 1–5.
51. Chen, D.; Zhang, J.; Guo, Q.; Wang, L. Hyperspectral Image Classification Based on Global Spectral Projection and Space Aggregation. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
52. Roy, S.K.; Deria, A.; Shah, C.; Haut, J.M.; Du, Q.; Plaza, A. Spectral–spatial morphological attention transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15.
53. Zhao, Z.; Xu, X.; Li, S.; Plaza, A. Hyperspectral Image Classification Using Groupwise Separable Convolutional Vision Transformer Network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17.
54. Soudy, A.H.; Sayed, O.; Tag-Elser, H.; Ragab, R.; Mohsen, S.; Mostafa, T.; Abohany, A.A.; Slim, S.O. Deepfake detection using convolutional vision transformers and convolutional neural networks. Neural Comput. Appl. 2024, 36, 19759–19775.
55. Gewers, F.L.; Ferreira, G.R.; Arruda, H.F.; Silva, F.N.; Comin, C.H.; Amancio, D.R.; Costa, L.D. Principal component analysis: A natural approach to data exploration. ACM Comput. Surv. (CSUR) 2021, 54, 1–34.
56. Schmidt, C.; Athar, A.; Mahadevan, S.; Leibe, B. D2conv3d: Dynamic dilated convolutions for object segmentation in videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022.
57. Burkholz, R. Batch Normalization is Sufficient for Universal Function Approximation in CNNs. 2024. Available online: https://openreview.net/forum?id=wOSYMHfENq (accessed on 1 January 2025).
58. Xu, Y.; Zhang, H. Convergence of deep ReLU networks. Neurocomputing 2024, 571, 127174.
59. Li, H.; Rajbahadur, G.K.; Lin, D.; Bezemer, C.-P.; Jiang, Z.M. Keeping Deep Learning Models in Check: A History-Based Approach to Mitigate Overfitting. IEEE Access 2024, 12, 1.
60. Gholamalinezhad, H.; Khosravi, H. Pooling methods in deep neural networks, a review. arXiv 2020, arXiv:2009.07485.
61. Mao, A.; Mohri, M.; Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023.
62. McHugh, M.L. Interrater reliability: The kappa statistic. Biochem. Med. 2012, 22, 276–282.

Figure 1. Overall framework of the 3DVT network model.

Figure 2. ViT encoder with average pooling projection.

Table 1. Overview of the IP, SV, and PU datasets.

Dataset	Data Size	Classes	Samples	Percentage of Training Samples
IP	145 × 145 × 200	16	10,249	10%
SV	512 × 217 × 224	16	54,129	1%
PU	610 × 340 × 103	9	42,776	3%

Table 2. Contributions of the core components to the 3DVT.

Case	CNN Segment	Conv Embedding	AvgPool Projection	Metric	IP	SV	PU
1	×	✓	✓	OA	98.77 ± 0.06	99.06 ± 0.99	98.99 ± 0.03
1	×	✓	✓	AA	98.11 ± 0.22	99.20 ± 0.34	98.30 ± 0.07
1	×	✓	✓	Kappa	98.33 ± 0.21	98.46 ± 0.30	98.66 ± 0.12
2	✓	×	✓	OA	98.88 ± 0.13	99.01 ± 0.17	99.20 ± 0.11
2	✓	×	✓	AA	97.98 ± 0.10	98.88 ± 0.29	98.65 ± 0.22
2	✓	×	✓	Kappa	98.50 ± 0.02	98.99 ± 0.14	99.03 ± 0.07
3	✓	✓	×	OA	98.49 ± 0.05	99.10 ± 0.99	99.11 ± 0.06
3	✓	✓	×	AA	98.02 ± 0.24	98.77 ± 0.63	99.01 ± 0.09
3	✓	✓	×	Kappa	97.88 ± 0.35	99.20 ± 0.17	99.05 ± 0.10
4	✓	✓	✓	OA	99.36 ± 0.05	99.64 ± 0.10	99.27 ± 0.12
4	✓	✓	✓	AA	98.77 ± 0.41	99.70 ± 0.05	98.53 ± 0.19
4	✓	✓	✓	Kappa	99.26 ± 0.06	99.60 ± 0.11	99.04 ± 0.16

Table 3. Contributions of PCA, Class Token, and Mean Pooling to the 3DVT.

Case	PCA	Class Token	Mean Pooling	OA	AA	Kappa
1	✓	×	✓	99.28 ± 0.06	98.95 ± 0.36	99.21 ± 0.09
2	×	✓	×	98.40 ± 0.23	97.03 ± 0.76	98.00 ± 0.12
3	×	×	✓	96.98 ± 0.68	95.10 ± 0.99	97.07 ± 0.50
4	✓	✓	×	98.80 ± 0.09	97.03 ± 0.76	98.80 ± 0.12

Table 4. 3DVT classification results.

Indian Pines			University of Pavia			Salinas Valley
(OA: 99.41)			(OA: 99.41)			(OA: 99.77)
Color	Land-cover type	Samples	Color	Land-cover type	Samples	Color	Land-cover type	Samples
	Alfalfa	46		Asphalt	6631		Brocoli green weeds 1	2009
	Corn notill	1428		Meadows	18,649		Brocoli green weeds 2	3726
	Corn mintill	830		Gravel	2099		Fallow	1976
	Corn	237		Trees	3064		Fallow rough plow	1394
	Grass pasture	483		Painted metal sheet	1345		Fallow smooth	2678
	Grass trees	730		Bare Soil	5029		Stubble	3959
	Grass pasture mowed	28		Bitumen	1330		Celery	3579
	Hay windrowed	478		Self Blocking Bricks	3682		Grapes untrained	11,271
	Oats	20		Shadows	947		Soil vineyard develop	6203
	Soybean notill	972					Corn senesced green weeds	3278
	Soybean mintill	2455					Lettuce romaine 4wk	1068
	Soybean clean	593					Lettuce romaine 5wk	1927
	Wheat	205					Lettuce romaine 6wk	916
	Woods	1265					Lettuce romaine 7wk	1070
	Buildings Grass Trees Drives	386					Vinyard untrained	7268
	Stone Steel Towers	93					Vinyard vertical trellis	1807
	Total pixels	21,025		Total pixels	207,400		Total pixels	111,104

Table 5. Classification results of different methods for the Indian Pines dataset.

Type	CNN-Based	CNN-Based	Transformer-Based	Transformer-Based	Transformer-Based	Transformer-Based	Proposed
Metric	SPRN (2021)	HybridSN (2019)	GAHT (2022)	MorphFormer (2023)	GSPFormer (2023)	GSC-ViT (2024)	3DVT
OA (%)	95.55 ± 0.50	97.90 ± 0.17	97.16 ± 0.16	97.84 ± 0.61	98.45 ± 0.48	98.73 ± 0.18	99.35 ± 0.04
AA (%)	94.04 ± 1.26	97.98 ± 0.49	96.33 ± 0.72	94.12 ± 2.17	97.92 ± 0.32	98.09 ± 0.44	98.76 ± 0.51
κ × 100	94.95 ± 0.57	97.61 ± 0.20	96.76 ± 0.11	97.49 ± 0.62	98.19 ± 0.47	98.63 ± 0.24	99.27 ± 0.06

Table 6. Classification results of different methods for the University of Pavia dataset.

Type	CNN-Based	CNN-Based	Transformer-Based	Transformer-Based	Transformer-Based	Transformer-Based	Proposed
Metric	SPRN (2021)	HybridSN (2019)	GAHT (2022)	MorphFormer (2023)	GSPFormer (2023)	GSC-ViT (2024)	3DVT
OA (%)	95.46 ± 0.70	97.40 ± 0.48	97.13 ± 2.24	98.31 ± 0.40	96.91 ± 0.39	98.91 ± 0.65	99.31 ± 0.07
AA (%)	92.47 ± 0.83	95.97 ± 0.77	95.90 ± 2.26	98.25 ± 0.49	96.79 ± 0.42	98.59 ± 0.92	98.61 ± 0.11
κ × 100	93.90 ± 0.97	96.51 ± 0.61	96.17 ± 2.82	97.80 ± 0.56	95.88 ± 0.43	98.57 ± 0.80	99.09 ± 0.10

Table 7. Classification results of different methods for the Salinas dataset.

Type	CNN-Based	CNN-Based	Transformer-Based	Transformer-Based	Transformer-Based	Transformer-Based	Proposed
Metric	SPRN (2021)	HybridSN (2019)	GAHT (2022)	MorphFormer (2023)	GSPFormer (2023)	GSC-ViT (2024)	3DVT
OA (%)	93.57 ± 1.80	94.92 ± 1.13	96.83 ± 0.25	95.92 ± 0.47	95.82 ± 0.48	96.13 ± 1.15	99.66 ± 0.09
AA (%)	94.12 ± 1.65	97.91 ± 0.72	98.33 ± 0.15	98.01 ± 0.31	97.54 ± 0.40	97.42 ± 0.90	99.73 ± 0.04
κ × 100	92.81 ± 2.09	94.31 ± 1.21	96.50 ± 0.28	95.41 ± 0.52	95.34 ± 0.56	95.70 ± 1.31	99.62 ± 0.10

Table 8. Overall Accuracy (OA) with different numbers of encoders.

Dataset	2	3	4	5	6	7	8	9	10
IP	99.17 ± 0.06	99.20 ± 0.08	99.23 ± 0.04	99.30 ± 0.07	99.36 ± 0.05	99.35 ± 0.06	99.35 ± 0.03	99.30 ± 0.05	99.32 ± 0.07
PU	99.19 ± 0.03	99.20 ± 0.06	99.21 ± 0.10	99.20 ± 0.05	99.25 ± 0.07	99.27 ± 0.12	99.26 ± 0.06	99.23 ± 0.06	99.24 ± 0.08
SV	99.49 ± 0.04	99.52 ± 0.07	99.64 ± 0.10	99.62 ± 0.09	99.58 ± 0.02	99.59 ± 0.07	99.50 ± 0.06	99.40 ± 0.02	99.46 ± 0.03

Table 9. Training Times (s) for different numbers of encoders.

Dataset	2	3	4	5	6	7	8	9	10
IP	37.42	39.96	42.88	45.10	50.73	51.55	52.26	54.28	55.38
PU	50.28	56.62	62.88	67.42	69.88	78.27	84.09	99.20	102.58
SV	40.06	44.19	52.31	53.02	53.78	54.39	56.57	59.48	62.60

Table 10. Accuracy with different patch sizes.

Patch Size	9 × 9	11 × 11	13 × 13	15 × 15	17 × 17	19 × 19
IP	99.16 ± 0.07	99.36 ± 0.05	99.30 ± 0.05	99.31 ± 0.06	99.29 ± 0.03	99.22 ± 0.05
PU	99.09 ± 0.03	99.15 ± 0.02	99.27 ± 0.12	99.25 ± 0.07	99.22 ± 0.05	99.16 ± 0.08
SV	97.63 ± 0.08	98.38 ± 0.04	98.93 ± 0.05	99.39 ± 0.04	99.64 ± 0.10	99.60 ± 0.05

Table 11. Training Times (s) for different patch sizes.

Patch Size	9 × 9	11 × 11	13 × 13	15 × 15	17 × 17	19 × 19
IP	38.36	50.73	70.78	102.65	127.38	165.97
PU	35.29	52.20	78.27	102.58	140.26	186.59
SV	22.56	31.11	39.55	47.08	52.31	84.30

Table 12. Computational cost on the IP, PU, and SV datasets.

Dataset	Metric	SPRN	HybridSN	GAHT	MorphFormer	GSPFormer	GSC-ViT	3DVT
IP	Training Time (s)	26.78	95.88	66.48	18.44	41.27	29.54	50.78
PU	Training Time (s)	31.56	82.11	54.77	17.66	32.46	16.21	78.30
SV	Training Time (s)	22.59	79.64	51.66	15.24	30.11	14.22	52.69

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).