CESA-MCFormer: An Efficient Transformer Network for Hyperspectral Image Classification by Eliminating Redundant Information

Hyperspectral image (HSI) classification is a highly challenging task, particularly in fields such as crop yield prediction and agricultural infrastructure detection. These applications often involve complex scenes containing soil, vegetation, water bodies, and urban structures, encompassing a variety of surface features. In HSI, the strong correlation between adjacent bands leads to redundancy in spectral information, while using image patches as the basic unit of classification causes redundancy in spatial information. To extract key information more effectively from this massive redundancy for classification, we propose the CESA-MCFormer model, which builds upon the transformer architecture by introducing a Center Enhanced Spatial Attention (CESA) module and Morphological Convolution (MC). The CESA module combines hard coding and soft coding to provide the model with prior spatial information before spatial features are mixed, introducing comprehensive spatial information. MC employs a series of learnable pooling operations that not only extract key details in both the spatial and spectral dimensions but also effectively merge this information. By integrating CESA and MC, CESA-MCFormer adopts a "Selection-Extraction" feature processing strategy, enabling precise classification with minimal samples and without relying on dimensionality reduction techniques such as PCA. To evaluate our method thoroughly, we conducted extensive experiments on the IP, UP, and Chikusei datasets, comparing it with the latest advanced approaches. The results show that CESA-MCFormer achieves outstanding performance on all three test datasets, with Kappa coefficients of 96.38%, 98.24%, and 99.53%, respectively.


Introduction
With the continuous advancement of spectral imaging technology, hyperspectral data have achieved significant improvements in both spatial and spectral resolution. Compared to multispectral and RGB images, hyperspectral images (HSI) possess narrower bandwidths and a greater number of bands, allowing them to provide more detailed and continuous spectral information [1][2][3]. As a result, HSI have demonstrated tremendous potential in various earth observation fields [4] such as precision agriculture [5,6], urban planning [7,8], environmental management [9][10][11], and target detection [12][13][14][15]. Consequently, research on HSI classification has progressed rapidly.
While traditional HSI classification methods, such as nearest neighbor [16], Bayesian estimation [17], multinomial logistic regression [18,19], and Support Vector Machines (SVM) [20][21][22][23], have their merits in certain scenarios, they often have limitations in data representation and fitting capability, struggling to produce satisfactory classification results on more complex datasets. In contrast, in recent years, methods underpinned by deep learning have, thanks to their outstanding feature extraction capabilities, gradually become the focus of research in HSI classification.
Convolutional Neural Networks (CNNs) dominate the field of deep learning and are capable of accumulating in-depth spatial features through layered convolution. As such, CNNs have been extensively applied to and researched in HSI classification [24,25]. Notably, Roy and colleagues [26] introduced a model named HybridSN. This model initially employs a 3D-CNN to extract spatial-spectral features from spectral bands that have undergone PCA dimensionality reduction. Subsequently, it uses a 2D-CNN to delve deeper into more abstract spatial feature hierarchies. Compared to a pure 3D-CNN, this hybrid approach simplifies the model architecture while effectively merging spatial and spectral information. Building on this, subsequent researchers have incorporated one-dimensional convolution based on central pixels to compensate for spectral information that might be lost after PCA reduction. Examples of this approach include the Cubic-CNN model proposed by J. Wang et al. [27] and the JigsawHSI model introduced by Moraga and others [28].
The Vision Transformer (ViT) model [29], which evolved from the natural language processing (NLP) domain, has also increasingly become a focal point in the field of deep learning. The ViT model segments images into fixed-size patches and leverages embedding techniques to obtain a broader receptive field. Furthermore, with the help of multi-head attention mechanisms, it adeptly captures the dependencies between different patches, thereby achieving higher processing efficiency and remarkable image recognition performance. Consequently, numerous studies have been dedicated to exploring the application of this model in HSI classification. For instance, a research team proposed the Spatial-Spectral Transformer (SST) model in [30]. They utilized VGGNet [31], from which several convolutional layers were removed, as a feature extractor to capture spatial characteristics from hyperspectral images. Subsequently, they employed the DenseTransformer to discern relationships between spectral sequences and used a multi-layer perceptron for the final classification task. Qing et al. introduced SATNet in [32], which effectively captures spectral continuity by adding position encoding vectors and learnable embedding vectors. Meanwhile, Hong and colleagues presented the SpectralFormer (SF) model in [33]. This model adopts the Group-wise Spectral Embedding (GSE) module to encode adjacent spectra, ensuring spectral information continuity, and utilizes the Cross-layer Adaptive Fusion (CAF) technique to minimize information loss during hierarchical transmission. X. He and their team introduced the SSFTT network in [34]. This model significantly simplifies the SST structure and incorporates Gaussian-weighted feature tagging for feature transformation, thus reducing computational complexity while enhancing classification performance. In recent studies, researchers have continued to explore more lightweight and effective methods for feature fusion and extraction based on the transformer architecture. For instance, Xuming Zhang and others proposed the CLMSA and PLMSA modules [35], while Shichao Zhang and colleagues introduced the ELS2T [36].
Due to the high correlation between adjacent bands in HSI, there is a significant amount of redundant information within HSI. Commonly, to mitigate the impact of this redundancy, the methods mentioned above [26,30,31] preprocess HSI using Principal Component Analysis (PCA). However, omitting PCA leads to a significant decrease in their prediction accuracy, highlighting these models' deficiency in extracting key spectral information. As an unlearnable dimensionality reduction technique, PCA is often irreversible and can cause information loss, such as the loss of spectral continuity [37]. Models reliant on PCA may thus produce suboptimal results. In transfer learning or few-shot image classification tasks, HSI must be fed into the model with a large number of channels to preserve as much original information as possible. This extensive channel input elevates the demands on feature extractors, which must efficiently process and extract key information from the numerous channels. Moreover, different datasets might require dimensionality reduction to different extents, making the selection of appropriate dimensions for each dataset a time-consuming operation. Therefore, we propose the CESA-MCFormer, which effectively extracts key information from HSI under conditions of limited samples and numerous channels, achieving higher classification accuracy in downstream tasks without relying on PCA for dimensionality reduction. To achieve this, we incorporate attention mechanisms and mathematical morphology.
Attention mechanisms have been extensively applied in various domains of machine learning and artificial intelligence. Hu et al. [38] introduced a "channel attention module" in their SE network structure to capture inter-channel dependencies. Woo et al. [39] proposed CBAM, which combines channel and spatial attention, adaptively learning weights in both dimensions to enhance the network's expressive power and robustness. Meanwhile, Zhong et al. [40] presented a deep convolutional neural network model integrating a "global attention mechanism" and a "local attention mechanism" in sequence to capture both global and local contextual information. Inspired by these advancements, researchers began incorporating spatial attention into HSI classification. Several studies [41][42][43][44] combine spectral and spatial attention mechanisms, enabling adaptive selection of key features within HSI. However, in HSI classification, a common practice is to segment the HSI into small patches and classify each patch based on its center pixel, which makes the information provided by the center pixel crucial; these methods do not sufficiently consider its importance. Recent studies have recognized this, such as those cited in [45,46], which employed the Central Attention Module (CAM). This module determines feature weights by analyzing the correlation of each pixel with the center pixel. However, considering the phenomena of "same material, different spectra" and "different materials, same spectra" in HSI, relying solely on similarity to the center pixel for weight allocation might overlook important spatial information provided by other pixels. Therefore, effectively weighting the center pixel while taking global spatial information into account remains a challenge.
Mathematical Morphology (MM) primarily focuses on studying the characteristics of object morphology, processing and describing object shapes and structures using mathematical tools such as set theory, topology, and functional analysis [47]. In previous HSI classification tasks, researchers often utilized attribute profiles (APs) and extended morphological profiles (EPs) to extract spatial features more effectively [48][49][50][51]. However, this approach typically requires many structuring elements (SEs), which are non-trainable and thus unable to effectively capture dynamic feature changes. To overcome these limitations, Roy et al. proposed the Morphological Transformer (morphFormer) in [52], combining trainable MM operations with transformers, thereby enhancing the interaction between HSI features and the CLS token through learnable pooling operations. However, this method involves simultaneous dilation and erosion of spatio-spectral features, where each SE introduces a significant number of parameters. This not only risks losing fine-grained feature information during feature selection but also leads to model overfitting and reduced robustness, especially in scenarios with limited data. Hence, there is substantial room for improvement in the application of MM to HSI classification.
The core contributions of this study are as follows: The rest of the paper is organized as follows. In Section 2, we provide an overview of the CESA-MCFormer's overall framework and detail our proposed CESA and MC modules. Section 3 describes the experimental datasets, the results under various parameter settings, and an analysis of the model parameters. Finally, Section 4 concludes with our research findings.

Methodology
The architecture of CESA-MCFormer is illustrated in Figure 1. For an HSI patch of size c × h × w, spectral continuity information is initially extracted through a 3D-2D Conv Block [52], and the channel dimensionality is transformed to 64. Subsequently, the HSI feature of size 64 × h × w is fed into the Emb Block to mix spatial and spectral features, generating a 64 × 64 feature matrix. Then, a learnable CLS token, initialized to zero, is introduced for feature aggregation, along with a learnable matrix of size 65 × 64, also initialized to zero, for spatial-spectral position encoding. After combining the feature map with the position encoding, it is passed through multiple iterations of the Transformer Encoder for deep feature extraction, and the extracted features are then input into the Classifier Head for downstream classification tasks. Next, we provide a detailed introduction to the Emb Block and Transformer Encoder.

Emb Block
Given that the HSI patches input into the model are generally small (a spatial size of 11 × 11 is adopted in this study), we introduce the Emb Block to directly mix and encode global spatial features. This approach equips the model with a global receptive field before deep feature extraction, as illustrated in Figure 2. Since the information provided by the central pixel of the HSI patch is crucial, CESA is first used to weight information at different positions, aiding the model in actively eliminating redundant information. Then, we introduce a learnable weight matrix Wa ∈ R^(64×64), initialized using Xavier normal initialization and composed of 64 scoring vectors. By calculating the dot product between the HSI features and each scoring vector, we score the features of each pixel. The scores are then transformed into mixing weights using the softmax function. Another learnable weight matrix Wb ∈ R^(64×64), initialized in the same manner, is introduced to remap the HSI features of each pixel through matrix multiplication. Finally, by multiplying the two matrices, we mix the spatial features according to the mixing weights to obtain the final feature encoding matrix. The overall architecture of CESA is illustrated in Figure 3. To comprehensively consider both global information and the importance of the central pixel, we designed two modules: Soft CESA and Hard CESA. Hard CESA, a non-learnable module, statically assigns higher weights to pixels closer to the center. Soft CESA, conversely, is a learnable module that uses global information as a reference, enabling the model to adaptively select more important spatial information. This design aims to effectively integrate both global and local information, enhancing the overall performance of the model.
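The mixing step described above can be sketched at the shape level in NumPy. Here the random Wa and Wb stand in for the Xavier-initialized learnable matrices, and the CESA weighting step is omitted; this is an illustrative sketch of the scoring-softmax-remap pipeline, not the authors' implementation.

```python
# Shape-level sketch of the Emb Block mixing step (CESA weighting omitted).
# Wa holds 64 scoring vectors; Wb remaps per-pixel features. Both are
# random stand-ins here for the learnable, Xavier-initialized matrices.
import numpy as np

rng = np.random.default_rng(0)
h = w = 11
feat = rng.standard_normal((64, h, w))        # HSI feature after CESA

pix = feat.reshape(64, h * w).T               # (h*w) x 64, one row per pixel
Wa = rng.standard_normal((64, 64))
Wb = rng.standard_normal((64, 64))

scores = pix @ Wa                             # (h*w) x 64: 64 scores per pixel
# softmax over the pixel axis turns each score column into mixing weights
e = np.exp(scores - scores.max(axis=0))
mix_w = e / e.sum(axis=0)                     # each column sums to 1 over pixels

remapped = pix @ Wb                           # (h*w) x 64 remapped features
emb = mix_w.T @ remapped                      # 64 x 64 feature encoding matrix
```

Each of the 64 rows of `emb` is a mixture of all spatial positions, weighted by one scoring vector's softmax distribution, which matches the 64 × 64 feature matrix described in the architecture overview.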
Specifically, CESA takes an HSI or its feature map (F_in) as input. Hard CESA and Soft CESA calculate and output the hard probabilistic diversity map (M_h) and the soft probabilistic diversity map (M_s), respectively. The M_h and M_s maps are added together and then expanded along the channel dimension to match the size of F_in before being element-wise multiplied with F_in. Finally, an optional simple convolutional module is used to adjust the dimensions of the output feature (F_out). The implementation details of both Hard CESA and Soft CESA are presented in the following sections.

Hard CESA
The output M_h of Hard CESA depends only on the size of F_in and the hyperparameter K. For a pixel q in F_in, its position coordinates are defined as (x, y), and its spectral features are denoted by p = [p_1, p_2, ..., p_c] ∈ R^(1×c). We define q_c as the center pixel of the patch, with coordinates (x_c, y_c). The distance d between q_c and q is defined in Equation (1), and the weight q_w of pixel q is defined in Equation (2), where h is the side length of F_in. The hyperparameter K (K ∈ [0.5, 1)) controls the importance gap between the center and edge pixels: as K increases, the weight of the center pixels grows and the weight of the edge pixels shrinks. When K = 0.5, all pixels in the patch have equal weights, so Hard CESA has no effect.
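Since Equations (1) and (2) are not reproduced here, the following sketch implements one plausible weighting consistent with the stated properties: the distance to the center pixel falls off linearly, K = 0.5 yields uniform weights, and larger K emphasizes the center. The linear falloff form is our assumption, not the paper's exact formula.

```python
# Hedged sketch of a Hard CESA weight map. The linear falloff below is an
# assumed form for Equations (1)-(2); it merely satisfies the stated
# properties: K = 0.5 gives uniform weights, larger K favors the center.
import numpy as np

def hard_cesa(h: int, w: int, K: float) -> np.ndarray:
    yc, xc = (h - 1) / 2, (w - 1) / 2
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.sqrt((xs - xc) ** 2 + (ys - yc) ** 2)   # distance to center (Eq. (1))
    # Assumed weight (Eq. (2)): interpolate from K at the center to 1-K
    # at the farthest corner.
    return K - (2 * K - 1) * d / d.max()

M_h = hard_cesa(11, 11, K=0.8)
```

With K = 0.8 the center pixel of an 11 × 11 patch receives weight 0.8 and the corners 0.2, while K = 0.5 reproduces the "no effect" uniform case described above.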

Soft CESA
As shown in Figure 4, Soft CESA processes F_in into three feature maps, F_1, F_2, and F_3. F_1 and F_2 represent the overall features of F_in, while F_3 introduces the feature of the center pixel.
Specifically, for a pixel q in F_in, its position coordinates are defined as (x, y), and its spectral features are denoted by p = [p_1, p_2, ..., p_c] ∈ R^(1×c). The value of F_1 at position (x, y), denoted as m_1(x, y), can be calculated as follows: The value of F_2 at position (x, y), denoted as m_2(x, y), can be calculated as follows: To effectively extract the central overall feature in Soft CESA, we introduce a central weight vector r = [r_1, r_2, ..., r_c] ∈ R^c to weight F_in. Therefore, the value of F_3 at position (x, y) can be represented as follows: We extract the spectral features of the central pixel and its eight neighboring pixels, flatten them into a one-dimensional vector, and use this as the central feature vector v_c ∈ R^(9c). We introduce a matrix A_b ∈ R^(c×9c) composed of c learnable spectral feature encoding vectors and a vector l_b ∈ R^c comprised of c bias terms to weight and sum the spectral bands at each position. The corresponding r is calculated as follows: Finally, we concatenate F_1, F_2, and F_3 along the channel dimension to form the final feature matrix F. After passing through a convolutional layer with a kernel size of 3 × 3, a softmax activation function is applied to produce the final soft probabilistic diversity map M_s. It can be observed that the entire CESA module uses only c × (9c + 1) + (9 × 3 + 1) learnable parameters, and the parameter c can be flexibly adjusted through the preceding Conv Block. This means that the computational cost of CESA is very low, allowing it to be easily embedded into other models without significantly increasing the complexity of the original model.
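A hedged NumPy sketch of Soft CESA follows. The paper's equations for F_1 and F_2 are not reproduced here, so we assume channel-wise mean and max pooling (CBAM-style spatial attention); the form r = A_b v_c + l_b follows the description above, and the 3 × 3 convolution is replaced by a simple 1 × 1 mixing stand-in. All weights are random placeholders.

```python
# Hedged sketch of Soft CESA. F1/F2 as channel-wise mean/max pooling is
# our assumption; r = A_b @ v_c + l_b follows the text. Weights are random
# stand-ins for the learnable parameters.
import numpy as np

rng = np.random.default_rng(0)
c, h, w = 8, 11, 11
F_in = rng.standard_normal((c, h, w))

F1 = F_in.mean(axis=0)                       # assumed overall feature (mean)
F2 = F_in.max(axis=0)                        # assumed overall feature (max)

# central feature vector: center pixel plus 8 neighbors, flattened to 9c
yc = xc = h // 2
v_c = F_in[:, yc - 1:yc + 2, xc - 1:xc + 2].reshape(-1)
A_b = rng.standard_normal((c, 9 * c)) / np.sqrt(9 * c)
l_b = np.zeros(c)
r = A_b @ v_c + l_b                          # central weight vector, length c

F3 = np.tensordot(r, F_in, axes=1)           # r-weighted channel sum -> h x w

F = np.stack([F1, F2, F3])                   # 3 x h x w concatenation
# stand-in for the 3x3 conv: a 1x1 mixing of the three maps
conv_w = rng.standard_normal(3)
logits = np.tensordot(conv_w, F, axes=1).reshape(-1)
M_s = np.exp(logits - logits.max())
M_s = (M_s / M_s.sum()).reshape(h, w)        # soft probabilistic diversity map
```

The softmax over all spatial positions makes M_s a probability map over the patch, which is then added to M_h and broadcast over channels as described in the CESA overview.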

Transformer Encoder
The primary function of the Transformer Encoder module is to extract deep spatial-spectral features through multiple iterations. As shown in Figure 5, in each iteration, HSI features are first processed through Spectral Morph and Spatial Morph for feature selection and extraction, followed by an interaction with the CLS token through Cross Attention, aggregating the spatial-spectral features into the CLS token. To capture multi-dimensional features, we employ a multi-head attention mechanism in Cross Attention [29]. The input CLS token and HSI features are uniformly divided into eight parts along the spectral feature dimension, each with a feature length of eight. For each segmented feature, the CLS token serves as the query q ∈ R^(1×8), and the matrix formed by concatenating the CLS token and HSI features is used as the key and value k, v ∈ R^(65×8). The calculation method for Cross Attention is as follows: In this process, w_q, w_k, and w_v ∈ R^(8×8) are all learnable parameters, while l is the feature length, set to eight in this study. After obtaining all eight groups of X_attn, they are reassembled along the spectral feature dimension. Then, they are processed through a linear layer followed by a dropout layer, resulting in the updated CLS token n* ∈ R^(1×64). This is then added to the input CLS token n to produce the CLS token passed to the next iteration.
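The Cross Attention step above can be sketched in NumPy as follows. The random w_q, w_k, w_v matrices stand in for the learnable parameters, and the linear and dropout layers are omitted, so this is a shape-level sketch rather than the authors' implementation.

```python
# Sketch of Cross Attention: the CLS token queries the 65-token sequence
# (CLS + 64 HSI tokens) in 8 heads of width 8. Random weights stand in
# for the learnable w_q, w_k, w_v; the linear/dropout layers are omitted.
import numpy as np

rng = np.random.default_rng(0)
cls = rng.standard_normal((1, 64))            # input CLS token n
hsi = rng.standard_normal((64, 64))           # HSI feature tokens
seq = np.concatenate([cls, hsi])              # 65 x 64 source of keys/values

l = 8                                          # per-head feature length
heads = []
for i in range(8):
    sl = slice(i * l, (i + 1) * l)
    w_q, w_k, w_v = (rng.standard_normal((l, l)) for _ in range(3))
    q = cls[:, sl] @ w_q                       # 1 x 8 query
    k = seq[:, sl] @ w_k                       # 65 x 8 keys
    v = seq[:, sl] @ w_v                       # 65 x 8 values
    a = q @ k.T / np.sqrt(l)                   # 1 x 65 attention logits
    a = np.exp(a - a.max()); a /= a.sum()      # softmax over the 65 tokens
    heads.append(a @ v)                        # 1 x 8 attended features

x_attn = np.concatenate(heads, axis=1)         # reassembled: 1 x 64
n_new = cls + x_attn                           # residual update of CLS token
```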
Inspired by morphFormer [52], we have also incorporated a Spectral Morph Block and a Spatial Morph Block into our model. The overall architecture of these two modules is identical, as shown in Figure 6. Both modules process HSI features through erosion and dilation modules. After processing, the Spectral Morph Block utilizes a 1 × 1 convolution layer (corresponding to the blue Conv block in Figure 6) to extract deeper channel information, while the Spatial Morph Block uses a 3 × 3 convolution layer to aggregate more channel information. The Morphological Convolution (MC) we propose is represented by the erosion and dilation modules in Figure 6. Next, we elaborate on how MC is implemented. MC's primary function is to eliminate redundant data during feature extraction, ensuring that as the depth of the encoder increases, the HSI feature retains only pivotal information. To accomplish this, we apply multiple learnable Structuring Elements (SEs) to the HSI feature for morphological convolution. Through dilation, we select maximum values from adjacent features, emphasizing boundary details. In contrast, erosion allows us to identify the minimum values, effectively attenuating minor details. Additionally, directly employing full spatio-spectral SEs might inflate the parameter count, posing overfitting risks. To mitigate this, we separate the spectral and spatial SEs, significantly reducing parameters and thereby boosting the model's resilience.
Specifically, when using SEs with a spatial size of k × k, to maintain the consistency of the input and output dimensions of the module, we first reshape the spatial dimension of the HSI feature into two dimensions and then pad its boundaries, resulting in the feature matrix H ∈ R^((8+(k−1))×(8+(k−1))×64). Next, by adopting a sliding window with a stride of 1, we segment H into 64 sub-blocks of size k × k × 64, referred to as X_patch. Subsequently, we further decompose X_patch in both the spatial and spectral directions. Spatially, X_patch is divided into k × k vectors of dimension 64, denoted as {X_a1, X_a2, ..., X_ak×k}. Spectrally, X_patch is parsed into 64 vectors of dimension k × k, represented as {X_b1, X_b2, ..., X_b64}. We then introduce multiple SE groups, where each group consists of a spatial vector of length k × k and a spectral vector of length 64. For simplicity, we name one group of SEs W, with its spectral vector labeled W_a and its spatial vector W_b. For any given X_patch and W, the dilation operation of the morphological convolution is shown in Figure 7.
First, we add each segmented feature vector to the corresponding W_a or W_b at the respective positions, then take the maximum value to obtain h_dil ∈ R^1. Then, we introduce two learnable vectors h_a ∈ R^(k×k) and h_b ∈ R^64, along with two learnable bias terms β_a and β_b. We concatenate the results from the previous step into two one-dimensional vectors, which are dot-multiplied with h_a and h_b, respectively, and added to β_a and β_b, resulting in g_dil ∈ R^1. Finally, we concatenate the two obtained feature values to form the convolution result of that X_patch under the specified W_a and W_b, referred to as f_dil ∈ R^2. In actual experiments, 16 groups of W were used in the dilation block; therefore, after computing all the W with every X_patch, the final HSI feature size obtained through the dilation module is 32 × 64. Similarly, for any X_patch and W, the following formula describes the erosion operation f_ero(X_patch, W) in the morphological convolution: g_ero(X_a, W_a) = concat(h_ero(X_a1, W_a), h_ero(X_a2, W_a), ..., h_ero(X_ak×k, W_a)) · h_a + β_a (15). Overall, we process the 64 X_patch using 32 sets of SEs: 16 sets are responsible for the dilation operation, while the other 16 sets handle the erosion operation. This results in two 32 × 64 feature matrices. After spatial-spectral separation, the required parameter count for the SEs is reduced from 2 × 32 × k × k × 64 to 2 × 2 × 16 × (k × k + 64 + 1). Additionally, MC operates similarly to traditional convolutional layers, allowing it to directly replace convolutional layers in models. This attribute endows MC with significant versatility and adaptability.
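The dilation half of MC, for one X_patch and one SE group, can be sketched as follows. Variable names mirror the text (W_a spectral, W_b spatial, h_a, h_b, β_a, β_b); the values are random stand-ins for the learnable parameters, so this illustrates the add-then-max structure rather than reproducing the trained operation.

```python
# Hedged sketch of the MC dilation operation for one X_patch and one SE
# group. W_a (spectral, length 64) and W_b (spatial, length k*k) are the
# separated structuring elements; erosion would use min instead of max.
import numpy as np

rng = np.random.default_rng(0)
k = 3
Xp = rng.standard_normal((k * k, 64))          # one X_patch: k*k positions x 64 channels

W_a = rng.standard_normal(64)                  # spectral structuring element
W_b = rng.standard_normal(k * k)               # spatial structuring element
h_a = rng.standard_normal(k * k); beta_a = 0.1 # learnable pooling weights/bias
h_b = rng.standard_normal(64);   beta_b = 0.1

# h_dil: add the SE and take the max (dilation) along each decomposition
h_spat = (Xp + W_a).max(axis=1)                # k*k values, one per spatial position
h_spec = (Xp + W_b[:, None]).max(axis=0)       # 64 values, one per channel

g_a = h_spat @ h_a + beta_a                    # learnable pooling over positions
g_b = h_spec @ h_b + beta_b                    # learnable pooling over channels
f_dil = np.array([g_a, g_b])                   # 2 features per (X_patch, SE group)
```

Running the 16 dilation groups over all 64 X_patch sub-blocks stacks these length-2 outputs into the 32 × 64 feature matrix described above; the erosion branch is identical with `min` in place of `max`.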

Dataset Description
To validate the effectiveness of the CESA-MCFormer feature extractor, we tested its performance in two types of classification tasks. Specifically, for the semantic segmentation task, we used the Indian Pines (IP), Pavia University (UP), and Chikusei datasets, while for the few-shot learning (FSL) task, the datasets included the IP, UP, Chikusei, Botswana, KSC, and Salinas Valley datasets. Detailed information about these datasets is presented in Table 1.

Semantic Segmentation Task
In the semantic segmentation task, we randomly selected 1% of the pixels from the UP and Chikusei datasets as the training set, with the remaining pixels as the test set. Given that the Oats class in the IP dataset has only 20 pixels, we randomly extracted 5% of the pixels from the IP dataset as the training set and used the rest as the test set. In constructing the training set, we did not employ any augmentation methods, nor did we use any dimensionality reduction techniques on the datasets. The specific number of samples for each category in each dataset is shown in Tables 2-4.

Given the extensive category requirements for few-shot learning (FSL) training, our study utilized six datasets, with Chikusei, Botswana, KSC, and Salinas Valley used for model pretraining, and the IP and UP datasets for testing. To ensure consistency in input data sizes across all datasets in the FSL experiments, we standardized the spectral dimension of each dataset to 100 using BS-Nets [53].
For the datasets involved in pretraining, we selected classes with over 250 samples, randomly allocating 50 samples to the support set and 200 to the query set for each class.Specifically, Chikusei contributed 17 classes, Botswana 8, KSC 9, and Salinas Valley 16, totaling 50 distinct training classes.After pretraining, we randomly selected 10 samples from each class in the IP and UP datasets for model fine-tuning and testing.
In our study, during pretraining for the IP experiments, each task randomly selected 16 out of the 50 available classes, following a 16-way, 10-shot scheme. For the UP experiments, each training iteration randomly chose nine classes, also using a 10-shot approach. For each class, the support set consisted of 10 samples randomly selected out of 50, while the query set used all 200 samples. Moreover, no form of data augmentation was used to expand the datasets in either the pretraining or the fine-tuning stage.

Configuration
All experiments were designed and conducted using PyTorch on an Ubuntu 18.04 x64 machine with a 13th Gen Intel(R) Core(TM) i5-13600KF CPU, 32 GB RAM, and an NVIDIA GeForce RTX 4080 16 GB GPU.

Training Details
In our semantic segmentation task, we classify directly using the CLS token connected to a fully connected layer, as depicted in the Classifier Head block in Figure 1. The Adam optimizer is used with a learning rate of 0.001, and CrossEntropy Loss serves as the loss criterion. For models such as HybridSN [26], ViT [29], SF [33], and SSFTT [34], we maintain a batch size of 64. For morphFormer [52] and our CESA-MCFormer, the batch size is set at 32. Table 5 displays the floating point operations (FLOPs) and the number of parameters for the various models.
For the FSL task, we incorporate two sets of trainable weights, summing HSI features weighted along the spatial and spectral directions to create two feature vectors of length 64 each. These are concatenated with the CLS token, forming a final vector of length 192. We average the features of the 10 support samples from each class to represent class prototypes, with classification based on the distances between query set features and these prototypes. For the convolutional network (HybridSN), we alter the final linear layer's output to a 192-length feature vector for uniformity in feature output length across models. The SGD optimizer is employed, with the learning rate set at 0.00001 and weight decay at 0.0005.
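The prototype-based classification described above can be sketched as follows. The random features stand in for extractor outputs, and nearest-prototype assignment by Euclidean distance is one common choice; the paper does not specify its distance function here, so treat that as an assumption.

```python
# Sketch of prototype-based FSL classification: class prototypes are the
# mean of support features; each query is assigned the nearest prototype.
# Random vectors stand in for the 192-dim extractor outputs; Euclidean
# distance is an assumed choice of metric.
import numpy as np

rng = np.random.default_rng(0)
n_way, n_shot, dim = 9, 10, 192
support = rng.standard_normal((n_way, n_shot, dim))   # support-set features
queries = rng.standard_normal((20, dim))              # query-set features

prototypes = support.mean(axis=1)                     # n_way x 192 class prototypes
# distance from every query to every prototype, then nearest-prototype label
d = np.linalg.norm(queries[:, None, :] - prototypes[None], axis=-1)
pred = d.argmin(axis=1)
```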

Evaluation Indicators
We used four quantitative evaluation metrics, overall accuracy (OA), average accuracy (AA), the kappa coefficient (κ), and class-specific accuracy, to quantitatively analyze the effectiveness of CESA-MCFormer. The higher the values of these metrics, the better the classification performance.
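These metrics have standard definitions, which can be computed from a confusion matrix as in the following minimal sketch:

```python
# Minimal implementations of OA, AA, and the kappa coefficient from a
# confusion matrix; class-specific accuracy is the per-class recall.
import numpy as np

def metrics(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                                   # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)                # class-specific accuracy
    aa = per_class.mean()                                   # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

oa, aa, kappa = metrics([0, 0, 1, 1, 2, 2], [0, 0, 1, 0, 2, 2], 3)
```

For the tiny example above, five of six predictions are correct, so OA = AA = 5/6 and κ = 0.75 after correcting for chance agreement.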

Classification Results
To verify the advanced nature of our CESA-MCFormer model, we conducted comparative experiments with several recently proposed models. These include HybridSN [26], ViT [29], SF [33], and SSFTT [34], which originally required PCA for dimensionality reduction in their respective papers. Conversely, morphFormer [52] does not require such reduction. In our experiments, we used a patch size of 11 × 11 as the input for all models, with K set to 0.8 in Hard CESA.
Tables 6-8 display the classification results of the various models without PCA dimensionality reduction, while Figures 8 and 9 present the visualization results on the IP and UP datasets. The experimental data demonstrate that CESA-MCFormer exhibits superior performance across all datasets, highlighting its exceptional feature extraction capability in the presence of abundant redundant information. Moreover, as observed from Figures 8 and 9, the combination of the Emb Block and MC to eliminate redundancy notably enhances the model's accuracy in classifying complex pixels, especially in edge areas and in categories with limited samples. This further underscores the outstanding performance of CESA-MCFormer.
Tables 9 and 10 present the classification results of HybridSN, SpectralFormer, and SSFTT on the IP and UP datasets after applying PCA dimensionality reduction. Specifically, the dimensionality of the IP dataset was reduced to 30, while that of the UP dataset was reduced to 15. The results indicate that the performance of HybridSN, SpectralFormer, and SSFTT improved significantly after PCA reduction. However, their accuracy still falls short of our CESA-MCFormer model.

Ablation Experiment
To better understand the roles of CESA and MC within the model, we conducted several ablation studies on the IP dataset. In these experiments, we continued to adopt the hyperparameter settings from the previous section. Given the similar overall architecture of CESA-MCFormer and morphFormer, and the outstanding performance of morphFormer compared to other models, we chose morphFormer as our baseline and built upon it by adding modules.
Table 11 demonstrates the influence of MC and CESA on the final classification performance of the model. It is evident from the table that both modules significantly enhance the OA. Furthermore, when both modules are used in conjunction, there is an additional improvement in accuracy. However, in contrast to the pronounced improvement in AA brought about by CESA, the contribution of MC is relatively limited. This can be primarily attributed to the lower classification accuracy for classes with fewer samples. In scenarios with limited sample sizes, prior information becomes particularly crucial. Without the spatial prior information provided by CESA, relying solely on MC to process hyperspectral features that encapsulate comprehensive spatial information proves to be more challenging.
To further validate the superiority of CESA, we replaced it with the traditional Spatial Attention block (SA) and with CAM in the CESA-MCFormer model and conducted comparative experiments. The results, presented in Table 12, demonstrate that CESA achieved the highest classification accuracy. This is attributed to CESA's combination of Hard and Soft components, which not only incorporate prior information but also ensure the learnability of the entire module. Thus, CESA effectively amalgamates the advantages of SA and CAM, leading to improved performance.

Few-Shot Learning Task Experimental Results
To assess the generality and effectiveness of CESA-MCFormer with extremely limited samples, we conducted FSL experiments on the IP and UP datasets, using only 10 samples per class. The experimental results, shown in Tables 13 and 14, indicate that CESA-MCFormer achieved the best performance on both datasets.

Conclusions
This paper presents the CESA-MCFormer feature extractor, which boosts feature extraction capabilities with a "selection-extraction" strategy, enabling effective image feature extraction without reliance on PCA. CESA enables the model to incorporate spatial prior knowledge guided by an attention mechanism while maintaining the learnability of the module. The MC module introduces learnable pooling operations that effectively filter key information during deep feature extraction. Additionally, CESA-MCFormer adapts to various classification tasks by modifying the classifier head, and both the CESA and MC modules can be flexibly integrated into other models to improve their feature extraction performance. Comparisons with other models in semantic segmentation and FSL tasks confirm the versatility and effectiveness of CESA-MCFormer, and ablation studies of CESA and MC attest to the efficacy of these two components.
The CESA-MCFormer has demonstrated exceptional versatility in surface object classification tasks.In future research, we intend to further explore the model's application to subsurface exploration tasks (such as soil composition analysis) and are committed to further optimizing its performance.

Figure 2 .
Figure 2. Overall architecture of the Emb Block. The symbol "T" represents the transpose of a matrix, "•" represents element-wise multiplication of matrices, and "×" denotes matrix multiplication.

Figure 6 .
Figure 6. Overall architecture of the Spectral Morph Block and Spatial Morph Block.

Figure 8 .
Figure 8. Visualization results on the IP dataset.

Figure 9 .
Figure 9. Visualization results on the UP dataset.

Table 2 .
Detailed information on the training and testing data samples for each class in the IP dataset.

Table 3 .
Detailed information on the training and testing data samples for each class in the UP dataset.

Table 4 .
Detailed information on the training and testing data samples for each class in the Chikusei dataset.

Table 5 .
The floating point operations (FLOPs) and the number of parameters for various models. "CESA-MCFormer *" refers to the CESA-MCFormer model without the CESA module.

Table 6 .
Classification accuracy of various models on the IP dataset (without PCA).

Table 7 .
Classification accuracy of various models on the UP dataset (without PCA).

Table 8 .
Classification accuracy of various models on the Chikusei dataset (without PCA).

Table 9 .
Classification accuracy of various models on the IP dataset (with PCA). "CESA-MCFormer *" refers to the CESA-MCFormer model without PCA.

Table 10 .
Classification accuracy of various models on the UP dataset (with PCA). "CESA-MCFormer *" refers to the CESA-MCFormer model without PCA.

Table 11 .
Ablation study results for CESA and MC. "CESA" stands for replacing the Emb Block, and "MC" stands for replacing the Transformer Encoder.

Table 12 .
Comparative experiments of the traditional Spatial Attention block (SA), CAM, and CESA.

Table 13 .
Classification accuracy of various models on the IP dataset.

Table 14 .
Classification accuracy of various models on the UP dataset.