Advancing Hyperspectral Image Analysis with CTNet: An Approach with the Fusion of Spatial and Spectral Features

Hyperspectral image classification remains challenging despite its potential due to the high dimensionality of the data and its limited spatial resolution. To address the issues of limited data samples and low spatial resolution, this paper presents CTNet (convolutional transformer network), a two-module model for enhancing spatial and spectral features. In the first module, a virtual RGB image is created from the HSI data to improve the spatial features using a ResNeXt model pre-trained on natural images, whereas in the second module, PCA (principal component analysis) is applied to reduce the dimensions of the HSI data, after which spectral features are improved using an EAVT (enhanced attention-based vision transformer). The EAVT contains a multiscale enhanced attention mechanism that captures long-range correlations among the spectral features. Furthermore, a joint module fusing the spatial and spectral features is designed to generate an enhanced feature vector. Through comprehensive experiments, we demonstrate the superiority of the proposed approach over state-of-the-art methods, obtaining AA (average accuracy) values of 97.87%, 97.46%, 98.25%, and 84.46% on the PU, PUC, SV, and Houston13 datasets, respectively.


Introduction
Hyperspectral imaging captures highly detailed spectral information across numerous narrow bands. In contrast to traditional imaging systems that record data in only a few broad spectral channels (e.g., RGB), hyperspectral sensors can acquire data in hundreds or thousands of contiguous narrow bands [1]. This vast amount of spectral information enables a wide range of applications, including agriculture, environmental monitoring, mineral exploration, urban planning, and military surveillance [2]. Hyperspectral imaging can capture the unique spectral signature of materials, surfaces, and objects. Each pixel in a hyperspectral image contains a spectral curve representing the reflectance or emissivity of the corresponding material at different wavelengths. Analyzing these spectral curves empowers researchers and practitioners to gain valuable insights into the composition and characteristics of the observed scene, enabling the identification of specific materials, vegetation species, mineral deposits, and pollution levels, among others. However, the effective utilization of hyperspectral data remains a formidable challenge. The primary obstacle arises from the high dimensionality of hyperspectral datasets, where each pixel contains a vast number of spectral bands [3]. This substantial increase in data dimensions poses difficulties for traditional image processing and classification.
Furthermore, the challenges imposed by high dimensionality and limited spatial resolution make conventional classification methods less effective. In recent years, significant progress has been made in machine learning (ML) and deep learning (DL). However, ML-based methods require handcrafted features for training, so their performance is often suboptimal [4]. In contrast, convolutional neural networks (CNNs) have demonstrated remarkable feature extraction, pattern recognition, and image-classification capabilities. Several CNN-based methods have utilized spatial features for land cover classification, and several research studies have utilized joint spectral and spatial features to improve classification [5]. Vision transformers (ViTs) have recently been proposed to provide long-range dependency on spatial and spectral features for the classification of land objects [6].
Hyperspectral image classification categorizes pixels or regions within a hyperspectral image into predefined classes or land cover categories. Supervised machine learning methods such as the support vector machine (SVM) [7] and random forest (RF) [8] were widely used in the early stages of hyperspectral image analysis, exploiting texture and color features of the land covers. These methods rely on spectral signatures to discriminate between different classes of land cover. However, hyperspectral data are characterized by high dimensionality, as each pixel contains many spectral bands. Researchers have thus explored various techniques to address these challenges and enhance classification accuracy. Feature-extraction methods, such as principal component analysis (PCA) [9] and minimum noise fraction (MNF) [10], have been utilized to reduce data dimensionality while preserving relevant information. Additionally, dimensionality reduction algorithms like non-negative matrix factorization (NMF) [11] and t-distributed stochastic neighbor embedding (t-SNE) [12] have been employed to enhance the separability of different classes in reduced feature spaces.
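To make the PCA step concrete, the following is a minimal sketch of SVD-based PCA applied to a hyperspectral cube, reducing B spectral bands to D principal components. The function name, the toy data, and the use of plain numpy (rather than any particular library the authors used) are illustrative assumptions.

```python
import numpy as np

def pca_reduce_bands(hsi: np.ndarray, n_components: int) -> np.ndarray:
    """Reduce the spectral dimension of an M x N x B hypercube to
    M x N x D with a plain SVD-based PCA (a sketch only; the exact
    PCA settings used in the paper are not specified)."""
    M, N, B = hsi.shape
    X = hsi.reshape(-1, B).astype(np.float64)   # one spectrum per row
    X -= X.mean(axis=0)                         # center each band
    # Right singular vectors give the principal spectral directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Y = X @ Vt[:n_components].T                 # project onto top-D components
    return Y.reshape(M, N, n_components)

# Toy example: a 10 x 10 scene with 50 bands reduced to 5 components.
cube = np.random.default_rng(0).normal(size=(10, 10, 50))
reduced = pca_reduce_bands(cube, 5)
print(reduced.shape)  # (10, 10, 5)
```

The spatial layout is preserved; only the spectral axis is compressed, which is what allows the reduced cube to feed pixel-wise spectral tokens to a downstream classifier.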
Recently, DL techniques have gained popularity in hyperspectral image classification. Deep learning models, particularly convolutional neural networks (CNNs), have demonstrated exceptional capabilities in automatically learning hierarchical and discriminative features from raw data. The CNN-based approach is extensively employed in many image-related applications because of its inherent local connectivity and translational invariance properties. In the context of HSI, CNNs are typically constructed by considering both spatial and spectral dimensions. Several studies [13] have employed a 2-D CNN approach to concurrently extract spatial and spectral information to classify hyperspectral images (HSIs). Moreover, a previous study [14] used a three-dimensional convolutional neural network (3-D CNN) to extract spectral details for land cover classification. The study in [15] proposes a novel approach called the spectral-spatial residual network (SSRN). This method combines continuous spectral and spatial residual blocks to extract feature maps from hyperspectral images. The primary objective of this approach is to mitigate the issue of gradient disappearance commonly observed in deep neural networks. An end-to-end residual spectral-spatial attention network (RSSAN) was suggested for hyperspectral image classification by Zhu et al. [16].
Classification accuracy is improved by combining the spatial and spectral attention modules. Xing et al. [17] presented DIKS, a novel deep network with a self-expressive property and irregular convolutional kernels, developed primarily for hyperspectral image classification. The Multilevel Feature Network and Spectral-Spatial Attention Model (MFNSAM) is a method presented in [18] that integrates a CNN with an attention mechanism; the MFNSAM comprises a multilayer convolutional neural network (CNN) and a spectral-spatial attention module. Fusion techniques play a pivotal role in remote sensing applications when integrating data from several sources to acquire complementary and comprehensive information about the scene under observation. In hyperspectral image analysis, information from hyperspectral images is fused with data from other sources, such as multispectral images, auxiliary RGB images, or LiDAR data [19]. Improving classification accuracy, enhancing spatial features, and providing a more comprehensive picture of the scene are all goals of merging disparate data types. Hyperspectral image classification presents several unique obstacles, and various fusion methods have been investigated to address them [20]. These methods include pixel-level fusion, feature-level fusion, and decision-level fusion. Li et al. [21] enhanced the spatial and spectral properties by merging them at four scales using a lightweight residual-based deep CNN model. In a similar study, Wang et al. [22] improved the spatial information of HSI with a multispectral image through cross-modality information extracted by a multi-hierarchical cross transformer (MCT).
Pixel-level fusion involves merging individual pixels from multiple data sources to create a new image that integrates spectral and spatial information [23]. This technique is beneficial when the spatial resolution of hyperspectral images is lower than that of ancillary data sources, as it allows for enhanced spatial detail in the final fused image [24]. Feature-level fusion, on the other hand, involves extracting features from different data sources and combining them to create a new feature representation that captures complementary information from both sources. Feature-level fusion can preserve the spectral and spatial characteristics of the original data sources while reducing data redundancy and increasing classification accuracy [25]. A summary of some recent methods for hyperspectral classification is given in Table 1. In short, ML-based methods fail to achieve high performance on HSI datasets due to their dependency on handcrafted features. The CNN approach improved performance but lacks correlation with long-range features. ViTs further improved the long-range dependency of the spatial and spectral features, but at increased computational cost. In a hyperspectral image, spatial resolution is low and spectral resolution is high due to the continuous narrow spectral bands, whereas in an RGB image, spatial resolution is high and spectral resolution is low. When classifying land covers in hyperspectral data, spatial and spectral features both play crucial roles. Our primary motivation was therefore to improve the spatial and spectral resolution, so we designed a two-module model, CTNet. In the first module, we generate a synthetic RGB image from the HSI data using a spectral weighting technique and utilize a pre-trained ResNeXt model to improve the spatial features. In the second module, we first reduce the dimensions of the HSI data using PCA, since processing many bands requires high cost and time; an enhanced attention-based transformer model is then utilized to improve the spectral features and provide long-range dependency. Finally, the spatial and spectral features are fused to classify the land covers.
The significant contributions of the method are as follows.

1. We demonstrate the effectiveness of improving spatial features through synthetic RGB images using a pre-trained ResNeXt to classify the land covers.
2. We develop and optimize a multiscale attention module of the transformer block to provide long-range dependency of the spectral features.
3. We designed a fusion module to generate enhanced spatial and spectral features obtained through the convolution and transformer modules.
4. We conducted extensive experiments to evaluate the performance of the proposed method on four benchmark datasets.
The rest of the paper is arranged as follows. In Section 2, the proposed model architecture for HSI classification is described. In Section 3, quantitative and visual results on different datasets are illustrated. In Section 4, we discuss the results. Finally, in Section 5, we present the conclusion, limitations, and future scope of the proposed method.

Materials and Methods
In the proposed study, we designed a dual-block convolution and transformer-based model. The transformer block extracts spectral features, and the convolution block enhances the spatial features using virtual RGB images. The detailed architecture of the proposed model is shown in Figure 1.

Enhanced Attention-Based Vision Transformer (EAVT)
Suppose the hypercube of the hyperspectral image (HSI) is I ∈ R^(M×N×B), where M and N denote the width and height, and B denotes the total number of bands. Each pixel inside image I encompasses both spatial and spectral characteristics. Its one-hot encoding is represented by a vector H = {h_1, h_2, ..., h_C}, where C is the number of land cover classes. In HSI, the numerous continuous bands offer significant spectral information; however, this large number of bands also leads to higher computational cost and redundancy. Principal component analysis (PCA) is therefore applied to the B bands. After PCA, the number of retained bands is denoted as D, and the data are represented as Y ∈ R^(M×N×D). The pixel-wise spectral input is defined as Y_spec = {y_1, y_2, y_3, ..., y_D} ∈ R^(1×D). After that, the spectral bands are converted to tokens, and positional encoding is performed as follows.
Y′_spec = [Y_CLS; Y_band] + E_pos

where N = the number of band tokens in the token sequence T, Y_CLS ∈ R^(1×D) is the class token, Y_band ∈ R^(N×D) contains the band tokens, E_pos is the positional embedding, and Y′_spec ∈ R^((1+N)×D) is the output generated after positional encoding. The attention weight A^k_j of the j-th input with neighbor size k and relative positional bias B(i, j) is calculated as follows:

A^k_j = [Q_j K^T_(σ1(j)) + B(j, σ1(j)), Q_j K^T_(σ2(j)) + B(j, σ2(j)), ..., Q_j K^T_(σk(j)) + B(j, σk(j))]   (1)

In Equation (1), the k nearest neighbors of the j-th input are denoted by σ(k). Query (Q) and Key (K) are the linearly projected token vectors. After that, the linear projection of the neighbors is calculated using Equation (2):

V^k_j = [V^T_(σ1(j)), V^T_(σ2(j)), ..., V^T_(σk(j))]^T   (2)

where V^k_j is a matrix containing the linearly projected values of the k nearest neighbors of the j-th input. Finally, the attention for the j-th token with neighbor size k is defined using Equation (4):

Attn_k(j) = softmax(A^k_j / √d) V^k_j   (4)

where √d is the scaling factor. The attention obtained using Equation (4) is computed for every pixel in the feature map. The detailed architectures designed for the attention module of the classical transformer and the proposed one are shown in Figure 2.

Synthetic RGB Image Formation
Let H be the hyperspectral image cube with dimensions M × N × P, where M and N are the spatial dimensions (height and width) and P is the number of hyperspectral bands. We denote the intensity of the image at spatial position (i, j) in the k-th band by H_ijk. Further, a weight matrix W for the RGB channels with dimension 3 × P is defined, in which the rows represent the red, green, and blue channels and the columns represent the weight of each hyperspectral band in producing that channel. We apply weights to each hyperspectral band to enhance the quality and relevance of the derived data. Band weighting also optimizes computational resources for accurately identifying constituent materials and tailors data processing to specific applications, refining the hyperspectral data and making land cover classification more accurate and efficient. We applied band weighting over each channel c and the P spectral bands and calculated the intensity of each spectral band as follows.
I^(c,K)_ij = W_cK · H_ijK

where I^(c,K)_ij = intensity of spectral band K for channel c at spatial position (i, j), H_ijK = intensity of the hyperspectral image at position (i, j) in the K-th band, and W_cK = weight of the K-th spectral band for the c-th RGB channel.

After this, we populated each channel for the spatial dimension at position (i, j) as follows.
RGB_I(i, j, c) = Σ_(K=1..P) I^(c,K)_ij

For each channel (R, G, B) of the synthetic image, the minimum and maximum intensity values are then calculated to ensure all channels have values within the same range.
In hyperspectral images, different bands might have been captured under slightly different illumination conditions. Normalization mitigates these differences, ensuring that the brightness and contrast are consistent across bands. The normalization operation is performed as follows.
I_N(i, j, c) = (RGB_I(i, j, c) − minval_c) / (maxval_c − minval_c)

where, for the red channel (c = 1), I_N(i, j, 1) = normalized intensity of the pixel at spatial position (i, j), RGB_I(i, j, 1) = intensity of the pixel at spatial position (i, j) in the red channel before normalization, and minval_R and maxval_R are the minimum and maximum intensities of the red channel. Analogously, I_N(i, j, 2) and I_N(i, j, 3) are the normalized intensities of the blue and green channels, computed from RGB_I(i, j, 2) and RGB_I(i, j, 3) with the corresponding channel minima and maxima (minval_B, maxval_B and minval_G, maxval_G).
After normalization, we rounded each pixel value to the nearest integer in each channel, and finally, the image was constructed as follows.
The synthesized RGB image is passed to the pre-trained ResNeXt for spatial feature extraction. Algorithm 1, used to generate the synthetic RGB image, is shown below.

Algorithm 1: Steps to generate synthetic RGB image
Input: Hyperspectral image cube H with dimensions M × N × P and weight matrix W. (1) For each channel (Red, Green, Blue) and each spectral band K ∈ {1, ..., P}, calculate the intensity of the spectral band as follows.
where I^(c,K)_ij = intensity of the K-th spectral band for channel c at spatial position (i, j), H_ijK = intensity of the hyperspectral image at position (i, j) in the K-th band, and W_cK = weight of the K-th spectral band for the c-th RGB channel.
(2) Calculate the intensity of the R, G, and B channels for the synthetic image using Equations (6), (7), and (8), respectively.
(5) Round each pixel value to the nearest integer in each channel as follows.
where I_N(i, j, c) = normalized intensity value of the pixel at position (i, j) in channel c, rounded to the nearest integer. (6) Construct the final RGB image using the normalized and rounded values in each channel as follows.
RGB_final(i, j, c) = I_N(i, j, c), where RGB_final(i, j, c) = pixel value in the final RGB image at position (i, j) in channel c and I_N(i, j, c) = normalized and rounded intensity value of the pixel at position (i, j) in channel c. Output: RGB image.
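Algorithm 1 can be sketched end to end as below. The 0-255 target range for the normalized channels and the uniform demo weights are assumptions (the paper states only normalization and rounding); the function name is hypothetical.

```python
import numpy as np

def synthetic_rgb(H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Sketch of Algorithm 1: weight the P bands of an M x N x P cube into
    three channels, min-max normalize each channel (scaled to 0-255 here,
    an assumption), and round to integers."""
    assert W.shape[0] == 3 and W.shape[1] == H.shape[2]
    # Steps 1-2: per-channel weighted sum over the spectral bands.
    rgb = np.einsum('ijk,ck->ijc', H.astype(np.float64), W)
    # Normalize each channel independently with its own min/max.
    out = np.empty_like(rgb)
    for c in range(3):
        lo, hi = rgb[..., c].min(), rgb[..., c].max()
        out[..., c] = 255.0 * (rgb[..., c] - lo) / (hi - lo + 1e-12)
    # Steps 5-6: round to the nearest integer and assemble the final image.
    return np.rint(out).astype(np.uint8)

H = np.random.default_rng(2).uniform(size=(8, 8, 30))
W = np.full((3, 30), 1.0 / 30)   # uniform band weights, for the demo only
img = synthetic_rgb(H, W)
print(img.shape, img.dtype)  # (8, 8, 3) uint8
```

Per-channel normalization is what equalizes brightness and contrast across the three synthesized channels before the image is handed to the pre-trained network.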

Enhanced Spatial Features Using Virtual RGB Images
The amount of labeled hyperspectral image data is limited, and significant differences in imaging settings, spectral bands, and ground objects make hyperspectral data unsuitable for direct training with models built for natural images. A CNN can classify HSI by predicting the land cover of each pixel. In the proposed study, we utilized the three-channel synthetic RGB image to enhance spatial features using a ResNeXt model pre-trained on natural images for pixel-level classification. A residual block is mathematically defined as follows.
y = F(x) + x

where F represents the residual mapping, x is the input, and y is the output. In ResNeXt, the input is split into several branches, each processed separately and subsequently merged. For a specific layer, the split-and-merge function is expressed as follows.
y = x + Σ_(i=1..C) T_i(x)

where C is the cardinality and T_i represents each branch's transformation function. The residual block is shown in Figure 3, and each branch's transformation is represented as follows.
where T(x, W) is the output of the convolution layers for input x, "conv" refers to the convolutional process, "BN" indicates batch normalization, "ReLU" is the rectified linear activation, and W_1 and W_2 are convolutional weights. After that, a global average pooling and a fully connected layer are added to classify the land covers. The detailed architecture of the ResNeXt model is shown in Figure 4.
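The split-transform-merge computation y = x + Σ T_i(x) can be illustrated with a toy numpy version, where each branch T_i is reduced to a single linear map with ReLU (a stand-in for the conv-BN-ReLU branches; the real ResNeXt uses grouped convolutions). All names here are illustrative.

```python
import numpy as np

def resnext_block(x, branch_weights):
    """Split-transform-merge residual block: y = x + sum_i T_i(x),
    with cardinality C = len(branch_weights)."""
    def branch(x, W):
        return np.maximum(0.0, x @ W)      # ReLU(x W): one branch T_i
    return x + sum(branch(x, W) for W in branch_weights)

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 16))               # a batch of 4 feature vectors
C = 8                                      # cardinality: 8 parallel branches
weights = [rng.normal(scale=0.1, size=(16, 16)) for _ in range(C)]
y = resnext_block(x, weights)
print(y.shape)  # (4, 16)
```

The identity shortcut (the leading `x +`) is what mitigates gradient vanishing; the C parallel branches add representational capacity without increasing depth.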

Spectral-Spatial Feature Fusion for HSI Classification
We observed that the spatial features obtained through the FCN and the spectral features extracted by the transformer block differed in the range and distribution of their values. Therefore, the features were normalized before being merged. Suppose T_out ∈ R^(W×H×D_s) is the spectral feature obtained from the transformer with D_s bands and C_out ∈ R^(W×H×D_spa) is the spatial feature obtained from the pre-trained FCN with dimension D_spa. Here, W × H is the size of the feature maps that are fused to generate an enhanced feature vector. Before the fusion process, the spectral features are normalized, and the spatial features are normalized in the same way. After normalization, we concatenate the spectral and spatial features to generate the enhanced feature vector F_e.
Finally, the enhanced feature vector F_e is passed to the softmax layer for the classification of the land covers. The loss of the model on a dataset with N training samples is calculated as follows:

L = −(1/N) Σ_(j=1..N) Σ_(p=1..P) I_jp log(V_jp)

where P = total number of land cover categories, I_jp = indicator function, which takes the value 1 if the j-th sample belongs to category p and 0 otherwise, and V_jp = predicted probability that the j-th sample belongs to the p-th class. Algorithm 2 for the proposed method is shown below.

Experimental Results and Discussion
In this section, we present the quantitative and visual results obtained on the four datasets.

Datasets Description
In this section, we describe the datasets used to evaluate the proposed method for hyperspectral image classification. Four benchmark hyperspectral datasets, PU (Pavia University), PUC (Pavia University Centre), SV (Salinas Valley), and Houston13, were selected for this study. The PU dataset covers an area of Pavia University, Italy, and was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. The spatial dimensions of the dataset are 610 × 340 pixels, each pixel represents a ground area of 1.3 m × 1.3 m (the spatial resolution of the dataset), and it has nine land cover classes. The PUC dataset has spatial dimensions of 1096 × 715 pixels, larger than the Pavia University dataset's 610 × 340 pixels. It contains 102 bands and nine land cover classes.
The SV data were captured using the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor and have spatial dimensions of 512 × 217 pixels. The dataset contains 224 contiguous spectral bands, covering a range from 0.4 µm to 2.5 µm, and has 16 land cover classes. The Houston13 dataset contains both hyperspectral and LiDAR data from an urban area in Houston, Texas, USA. It has 144 spectral bands in the 380 nm to 1050 nm region and has been calibrated to at-sensor spectral radiance units. The spatial dimensions of the dataset are 349 × 1905 pixels, with a spatial resolution of 2.5 m. A detailed description of the datasets is given in Table 2, and a color map of the land covers is shown in Figure 5.

Performance Metrics
Standard performance metrics for classification tasks were employed to comprehensively evaluate the proposed approach's performance and compare it with baseline methods. The following metrics were utilized. Overall Accuracy (OA): overall accuracy represents the ratio of correctly identified instances to the total number of instances.
OA = T / N

where N = total number of testing samples and T = sum of the diagonal entries of the confusion matrix CM. Average Accuracy (AA): average accuracy is the mean of the accuracies obtained for each individual class.
AA = (1/N) Σ_(i=1..N) CA_i

where N is the number of classes and CA_i represents the class-specific accuracy.
Kappa score (KS): kappa measures the observed agreement between two classifiers compared to the agreement that would be expected purely by chance. This metric can be used to evaluate the reliability and consistency of a classifier on a categorical problem. Kappa is calculated using the following formula:

κ = (P_o − P_e) / (1 − P_e)

where P_o is the proportion of instances where the two classifiers agree and P_e is the proportion of instances where the two classifiers would agree by chance.
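The three metrics can be computed directly from a confusion matrix; the following is a small self-contained example (the 3-class matrix is made up for illustration).

```python
import numpy as np

def oa_aa_kappa(cm: np.ndarray):
    """Compute Overall Accuracy, Average Accuracy, and Cohen's kappa
    from a confusion matrix CM (rows: true class, cols: predicted)."""
    n = cm.sum()
    oa = np.trace(cm) / n                          # correct / total
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))     # mean per-class accuracy
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n**2  # chance agreement P_e
    kappa = (oa - pe) / (1 - pe)                   # (P_o - P_e) / (1 - P_e)
    return oa, aa, kappa

cm = np.array([[50,  2,  0],
               [ 3, 40,  5],
               [ 1,  4, 45]])
oa, aa, kappa = oa_aa_kappa(cm)
print(round(oa, 3), round(aa, 3), round(kappa, 3))  # 0.9 0.898 0.85
```

Note that OA weights classes by their sample counts, whereas AA treats all classes equally, which is why AA is the more informative metric on the imbalanced land-cover datasets used here.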

Experimental Setup
We

Comparative Analysis with Baseline Methods
In this subsection, a comprehensive comparative analysis was performed to evaluate the effectiveness of the proposed approach against traditional classification methods and state-of-the-art deep-learning-based methods.

Quantitative Results
We evaluated the performance of several methods under the same experimental environment. For the PU dataset, 95% of the samples were used for training and 5% for validation. The 2DCNN is a five-layer sequential convolutional neural network with three convolutional layers, two max pooling layers, and one fully connected layer. The 3DCNN, on the other hand, has three convolutional layers to extract spectral features.
Further, BTA-Net is an attention-based model designed using 1D and 2D convolutional layers to extract spatial features. HybridSN utilizes 2D and 3D CNN layers to improve performance using spatial and spectral features, while UML applies multiscale depth-wise 1D and 3D convolutional layers to join spatial and spectral features. SiT and 3DSwinT provide long-range dependency on the spatial and spectral features using ViTs to improve land cover classification accuracy. The performance measures of the proposed CTNet and the other methods are shown in Table 3, which shows that HybridSN achieved the highest classification accuracy of 97.53% for Painted metal sheets, whereas UML classifies Bare Soil with an accuracy of 96.42%. The transformer-based model SiT obtained 96.53% accuracy for the Trees class. The proposed CTNet achieved the highest classification accuracies of 98.65%, 95.37%, 94.17%, 98.76%, and 97.58% for the Asphalt, Gravel, Bitumen, Self-Blocking Bricks, and Shadows classes, respectively.
For the nine land covers in the PUC dataset, 7456 samples are available, fewer than in the PU dataset. To avoid overfitting, we trained all models on 90% of the samples and validated on the remaining 10%; the other experimental settings were the same as for the PU dataset. The performance for each class and the OA, AA, and Kappa values are shown in Table 4, which shows that BTA-Net achieved an accuracy of 98.12% for the Self-Blocking Bricks land cover. UML achieved the highest classification accuracy, 97.82%, for Asphalt, whereas SiT obtained 98.84% accuracy for the Tiles class and 3DSwinT obtained 97.57% for the Meadows class. The SV dataset contains 54,129 samples, divided into 95% for training and 5% for validation. In Table 5, we can see that 2DCNN underperforms in several classes, whereas 3DCNN improves the performance in the Lettuce_romaine_7wk and Vinyard_vertical_trellis classes. Moreover, BTA-Net has the highest classification accuracy for the Lettuce_romaine_6wk class, and HybridSN discriminates the Fallow_rough_plow and Stubble land covers with the highest quantitative values. Further, UML shows improved results in several classes, and the SiT method achieves more than 95% classification accuracy. 3DSwinT and the proposed CTNet achieve similar performance in several classes; however, CTNet dominates in classification accuracy where there are fewer samples. The Houston13 dataset contains very few samples for each class; therefore, we split the dataset into 90% for training and 10% for validation. The quantitative results of 2DCNN and 3DCNN are lower in several classes, as shown in Table 6, while BTA-Net and HybridSN improve the performance. UML achieved the highest classification accuracy for the Trees class and 3DSwinT for Non-residential buildings. Moreover, CTNet's classification accuracy is highest in five land covers.

Visual Results Analysis
In Figures 6-9, we present visual classification maps for the PU, PUC, SV, and Houston13 datasets. Specifically, Figure 6 reveals that the 2DCNN-based land cover classification map is not consistently accurate with respect to the ground truth (GT) across various classes. This discrepancy is particularly evident in the Asphalt, Bitumen, Self-Blocking Bricks, and Shadows classes. In contrast, 3DCNN offers enhanced visual maps for several classes, displaying superior object visualization and producing a map almost identical to the GT for the Painted metal sheets class. BTA-Net's representation of the Meadows class outperforms the other techniques, while HybridSN's depiction of the Asphalt class closely aligns with the GT. The UML method leverages global feature attention to refine its land cover classification map, and CTNet's visualizations closely match the GT in the Trees, Bare Soil, Bitumen, and Shadows classes.
Figure 7 further shows that the classification maps of 2DCNN, 3DCNN, and BTA-Net for the Water and Shadows land covers appear noisy. In contrast, HybridSN provides a superior representation for the Tiles class. The UML method provides better visuals in the Asphalt class. Meanwhile, SiT and 3DSwinT improved the visual maps of the Trees and Meadows classes. Furthermore, our proposed approach's classification maps align closely with the GT across multiple classes: Water, Trees, Bitumen, Shadows, and Bare Soil.
In Figure 8, the 2DCNN-based classification map does not align closely with the GT across various classes, with discrepancies noticeable in several specific classes. On the other hand, the 3DCNN method offers more accurate visual representations in multiple classes. Compared to the GT, the BTA-Net technique showcases superior object detail, which is especially evident in its near-perfect depiction of the Lettuce_romaine_6wk class. Similarly, HybridSN's representation of the Fallow_rough_plow and Stubble classes closely mirrors the GT. The UML method refines its portrayal using global feature attention, especially in the Brocoli_green_weeds_2 and Lettuce_romaine_4wk classes. At the same time, SiT showed better visual maps for the Fallow_smooth and Corn_senesced_green_weeds classes. The proposed CTNet improved the visual maps, aligning with the GT across the Alfalfa, Corn-mintill, Hay-windrowed, Oats, Soybean-clean, Woods, Buildings-Grass-Trees-Drives, and Stone-Steel-Towers classes.
In Figure 9, we can observe that the classification maps from 2DCNN and 3DCNN appear noisy. The BTA-Net offers a refined visualization, particularly for the Non-residential buildings class. The HybridSN excels in representing the Grass healthy class compared to other techniques. UML and 3DSwinT yield superior visualizations for the Trees and Non-residential buildings classes, respectively. Additionally, our proposed approach's classification maps closely match the GT across various classes.

Discussion
We evaluated the CTNet on the PU, PUC, SV, and Houston13 datasets and achieved better quantitative and visual results compared to its counterparts 2DCNN, 3DCNN, BTA-Net, HybridSN, UML, SiT, and 3DSwinT, as discussed in Section 3. The proposed model enhances spatial features using a virtual RGB image and ResNeXt. Further, we enhanced the spectral features using an enhanced attention-based vision transformer (EAVT). ViTs originated as advanced natural language processing (NLP) techniques that represent pairwise interactions among tokens and capture long-range correlations [35]. Transformer-based techniques have been effectively implemented in computer vision applications, and pre-trained transformers now serve as robust multipurpose backbones. To implement the classical ViT, we split the input image I ∈ R^(H×W×D) into patches P ∈ R^(n×(p_H×p_W×D)) of size p_H × p_W. The ViT encoder uses alternating multi-head self-attention (MSA) and feed-forward (FF) blocks with layer normalization (LN) to encode the input and generate the embedded data z. The quadratic complexity of the attention mechanism in the number of input tokens is the primary impediment to implementing ViT on high-dimensional data. Several studies have reduced the complexity of self-attention, applying it selectively instead of pairwise among all tokens to increase the effectiveness of transformers for large numbers of tokens. Our EAVT is inspired by the Swin and convolutional self-attention mechanisms, which enhance the spectral features.
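The tokenization and the quadratic attention cost described above can be sketched as follows; the spatial size, spectral depth, and patch size are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Sketch of classical ViT tokenization on a hyperspectral cube.
H, W, D = 144, 144, 30   # spatial size and spectral depth after PCA (assumed)
p = 16                    # patch height/width (assumed)

# Split the cube into non-overlapping p x p patches and flatten each
# patch into one token of length p * p * D.
patches = (
    np.zeros((H, W, D))
    .reshape(H // p, p, W // p, p, D)
    .transpose(0, 2, 1, 3, 4)
    .reshape(-1, p * p * D)
)
n_tokens = patches.shape[0]   # n = (H / p) * (W / p)
attn_cost = n_tokens ** 2     # pairwise self-attention is quadratic in n

print(n_tokens, attn_cost)    # 81 tokens -> 6561 pairwise interactions
```

Doubling the patch side length cuts the token count by four and the pairwise attention cost by sixteen, which is why reduced-complexity attention is attractive for high-dimensional inputs.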

Patch Size Effect on Model Performance
Vision transformers depend on the patch size, i.e., the length and width of the non-overlapping patches created from the input images. The transformer receives tokens that are linearly projected from these patches. Figure 10 demonstrates that the CTNet performance is lower for 9 × 9 and 11 × 11 patch sizes, while the highest classification accuracy is attained for 15 × 15 patches. Beyond this, growing the patch size decreases the classification accuracy.

Training Loss of the Proposed Model
We calculated the training loss of the CTNet using the method described in Equation (14) for the PU, PUC, SV, and Houston13 datasets, shown in Figure 11. In Figure 11a, the training loss of the proposed method is initially high; after 30 epochs, it approaches zero. On the PUC dataset, the training loss reaches a value close to zero after 25 epochs, whereas on the SV dataset it reaches a value close to zero after 75 epochs. Furthermore, on the Houston13 dataset, the training loss remains relatively high due to the small size of the dataset.
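Equation (14) is not reproduced in this section; as a hedged sketch, the snippet below assumes a standard categorical cross-entropy training loss averaged over the batch, which produces decaying curves of the kind shown in Figure 11.

```python
import numpy as np

# Assumed loss: categorical cross-entropy averaged over the batch.
def cross_entropy(probs, labels, eps=1e-12):
    """probs: (N, C) softmax outputs; labels: (N,) integer class ids."""
    n = labels.shape[0]
    return float(-np.mean(np.log(probs[np.arange(n), labels] + eps)))

# Toy check: confident, correct predictions drive the loss toward zero,
# mirroring the training curves as the model converges.
confident = np.array([[0.99, 0.005, 0.005], [0.01, 0.98, 0.01]])
uncertain = np.full((2, 3), 1.0 / 3.0)
labels = np.array([0, 1])
assert cross_entropy(confident, labels) < cross_entropy(uncertain, labels)
```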

Computation of the Training and Validation Time
In Table 7, we compare the training and validation times of various methods, including 2DCNN [24], 3DCNN [36], and BTA-Net [37]. CNN-based methods require a large volume of data for training [38,39]. In addition, we also compared HybridSN [40], UML [41], SiT [42], 3DSwinT [43], and CTNet. The CTNet demonstrates relatively faster performance than the other methods, excluding 2DCNN. This indicates that our approach can reduce computation time and enhance classification efficiency. The high training and validation times for SiT and 3DSwinT are attributed to their deeper network layers, requiring extensive computational cycles per iteration. However, CTNet takes slightly longer than 2DCNN due to utilizing a ResNeXt for spatial feature extraction.
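A minimal sketch of how per-model wall-clock times of the kind reported in Table 7 can be measured; `train_step` is a placeholder workload, not the actual CTNet training routine.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def train_step():  # placeholder for one training pass
    return sum(i * i for i in range(10_000))

result, train_s = timed(train_step)
print(f"training time: {train_s:.4f} s")
```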
Further, we plotted a bar plot for the computation time comparison on the PU, PUC, SV, and Houston13 datasets, shown in Figure 12. We can notice that the training time of all the models on the SV training dataset is relatively high, whereas for the Houston13 dataset, the training and test times are the lowest.

Effects of Training Samples (%) on OA Accuracy
A general assumption for CNN models is that they require a large volume of training data for better classification performance [44]. We plotted the training sample percentage against the OA curve for the PU, PUC, SV, and Houston13 datasets, shown in Figure 13. The OA is lower in all the datasets for a small percentage of training samples; as we increased the samples, the OA also increased. The highest OA at 90% training data was obtained on the PUC dataset due to the large number of samples present in each class. The lowest OA of 84% at 90% training was obtained on the Houston13 dataset due to the fewer samples in each land cover.
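The varying-train-fraction experiment behind Figure 13 can be sketched as a per-class (stratified) split, so every land cover keeps the same training percentage; the class sizes below are illustrative, not the datasets' actual counts.

```python
import numpy as np

def stratified_split(labels, train_frac, seed=0):
    """Split indices per class so each class keeps train_frac for training."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        k = int(round(train_frac * idx.size))
        train_idx.extend(idx[:k])
        test_idx.extend(idx[k:])
    return np.array(train_idx), np.array(test_idx)

labels = np.repeat([0, 1, 2], [600, 300, 100])   # toy class sizes
tr, te = stratified_split(labels, train_frac=0.9)
assert tr.size + te.size == labels.size
```

Sweeping `train_frac` over, e.g., 0.1 to 0.9 and retraining at each fraction yields the OA-versus-training-percentage curves of Figure 13.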

Bar Plot Based Comparison
We experimentally evaluated the performance of 2DCNN [24], 3DCNN [36], BTA-Net [40], HybridSN [37], UML [42], SiT [43], 3DSwinT [41], and the proposed CTNet on the PU, PUC, SV, and Houston13 datasets. All the methods were evaluated under the same experimental conditions for a fair comparison. We plotted these models' AA, OA, and Kappa scores, as shown in Figure 14. In Figure 14a, we notice that the OA of classical CNN-based methods is relatively low compared to transformers. The 3DSwinT obtained the second-highest OA of 95.68%, whereas 2DCNN achieved the lowest OA of 86.75% on the PU dataset. On the PUC dataset, shown in Figure 14b, UML, SiT, and 3DSwinT obtained OA values that were very close to each other. Meanwhile, the proposed CTNet showed superior performance compared to the other methods. In Figure 14c, we can observe that the AA of SiT is very close to that of the proposed CTNet, while the lowest AA can be noticed for the 2DCNN and 3DCNN methods. In addition, the Kappa values of HybridSN and 3DSwinT are very close to each other. On the Houston13 dataset, shown in Figure 14d, the OA of the classical CNN and transformer-based methods is below 90% due to there being fewer samples in each land cover. The OA values of the 2DCNN and 3DCNN are close to each other. The transformer-based methods SiT and 3DSwinT obtained AA values of 68.7% and 70.05%, respectively. Meanwhile, the proposed CTNet achieved an AA value of 83.58%.

Conclusions
In the proposed study, the fusion of spectral and spatial information has resulted in a remarkable improvement in classification accuracy, surpassing traditional methods and even outperforming deep learning models that do not incorporate RGB data. Integrating RGB and hyperspectral data allows for a more comprehensive characterization of the observed scene, enabling effective discrimination between land cover classes with distinct spectral and spatial patterns. Further, high-dimensional spatial features are extracted by a pre-trained ResNeXt to improve the spatial features, which takes less computation time thanks to the pre-trained model. In addition, the enhanced attention-based transformer network extracts spectral features to provide long-range dependencies of the features. Furthermore, the fusion of spatial and spectral features enhanced the classification performance. We experimentally evaluated the CTNet on the four standard datasets PU, PUC, SV, and Houston13; the average accuracies are 96.83%, 96.72%, 96.74%, and 83.58%, respectively. Moreover, the visual maps of the CTNet on these datasets are closer to the GT. The proposed approach can be utilized in agricultural remote sensing to monitor crop health and measure crop stress, and it can also be used for the classification of different types of crops. Furthermore, an automated system can be designed for the diagnosis of different types of crop diseases. In environmental remote sensing, it can be used to monitor land cover changes, vegetation dynamics, and ecosystem health, and for biodiversity assessment by mapping habitats, identifying biodiversity hotspots, and monitoring changes in species distribution.
The major limitations of the proposed method include the need for accurately aligned RGB data with high spatial resolution; misalignment can disrupt the fusion process and affect classification accuracy. Furthermore, a pre-trained model is required to improve the spatial features. The fusion process improved the classification performance, but noise in the data can lead to potential misclassification. Additionally, the approach's success may be contingent upon the availability of labeled data for training and diverse datasets to achieve optimal performance. In future studies, we will include auxiliary data from radar as well as other transfer learning and domain adaptation methods. Further, the interpretability of hyperspectral images with explainable AI, as well as ensemble learning techniques with real-time applications, can be explored.

Figure 1. The proposed CTNet architecture for the classification of the land covers.

Figure 2. The architecture of the self-attention (a) and enhanced attention (b) block.


Figure 3. The residual block of the model.


Figure 4. Architecture of the ResNeXt for spatial feature extraction.

where T̃_d denotes the mean of the spectral features, σ_d the standard deviation, and F^T_ij the normalized spectral features.

Algorithm 2: The proposed method's algorithm
INPUT: Hyperspectral image I ∈ R^(H×W×D) and ground truth label X ∈ R^(H×W).
1. Apply PCA, set the dimension D = 30, and pass the result to the transformer block.
2. Generate an RGB image from I ∈ R^(H×W×D) using spectral weighting.
3. For i = 1 to 200, do
   (a) Train the ResNeXt using the synthesized image.
   (b) Apply spectral linear projection to generate Q, K, and V and pass them to the EAVT.
   (c) Train the EAVT.
   end
4. Apply Equations (19) and (20) to generate the enhanced features.
5. Test the model for classification of the land covers.
6. Plot the training loss curve.
OUTPUT: Classified labels of the test dataset (I ∈ R^(H×W×C)).

Figure 5. The ground truth map with the class label colors of the PU, PUC, SV, and Houston13 datasets, shown in (a), (b), (c), and (d), respectively.

All models were experimented on a Dell Precision 7920 Workstation with the following configuration: Intel Xeon Gold 5222 3.8 GHz processor (Intel Corporation, Santa Clara, CA, USA); Kingston 128 GB DDR4 2933 RAM, 1 TB 7200 RPM SATA HDD, and 500 GB SSD (Kingston Technology Company, Fountain Valley, CA, USA); Nvidia Quadro RTX 4000 8 GB graphics card (Nvidia Corporation, Santa Clara, CA, USA); 24-inch Dell TFT monitor, Dell USB mouse, and Dell KB216 wired keyboard (Dell, Round Rock, TX, USA); Microsoft Windows 10 operating system (Microsoft Corporation, Redmond, WA, USA); Python 3.8 (Python Software Foundation, Wilmington, DE, USA); and the TensorFlow 2.0 open-source machine learning framework (Google, Menlo Park, CA, USA). The Adam optimizer with an initial learning rate of 0.001 accelerates the training process, and each model is trained for 200 epochs with a batch size of 128.

Figure 6. Visual map of land covers using (a) 2DCNN, (b) 3DCNN, (c) BTA-Net, (d) HybridSN, (e) UML, (f) SiT, (g) 3DSwinT, and (h) CTNet on the PU dataset.

Figure 10. Illustration of the effect of patch size on the OA, AA, and Kappa for the PU, PUC, SV, and Houston13 datasets, shown in (a), (b), (c), and (d), respectively.


Figure 11. Illustration of the training loss on the PU, PUC, SV, and Houston13 datasets, shown in (a), (b), (c), and (d), respectively.

Figure 12. Bar-plot-based comparison of computation time on different datasets.

Figure 13. Effect of the training sample percentage on OA.


Table 1. Summary of the recent methods for HSI classification.

Table 2. Details of the samples in each land cover with their ground truth and color map.

Table 3. Quantitative performance comparison on the PU dataset (in %).

Table 4. Quantitative performance comparison on the PUC dataset (in %).

Table 5. Quantitative performance comparison on the SV dataset (in %).

Table 7. Comparison of training and validation times of various methods.