Article

MDFFN: Multi-Scale Dual-Aggregated Feature Fusion Network for Hyperspectral Image Classification

by Ge Song 1, Xiaoqi Luo 1,*, Yuqiao Deng 2, Fei Zhao 3, Xiaofei Yang 4, Jiaxin Chen 1 and Jinjie Chen 1

1 College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
2 School of Statistics and Mathematics, Guangdong University of Finance and Economics, Guangzhou 510120, China
3 College of Humanities and Law, South China Agricultural University, Guangzhou 510642, China
4 School of Electronics and Communication Engineering, Guangzhou University, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(7), 1477; https://doi.org/10.3390/electronics14071477
Submission received: 3 March 2025 / Revised: 27 March 2025 / Accepted: 1 April 2025 / Published: 7 April 2025
(This article belongs to the Section Artificial Intelligence)

Abstract: Employing the multi-scale strategy in hyperspectral image (HSI) classification enables the exploration of complex land-cover structures with diverse shapes. However, existing multi-scale methods still have limitations in fine feature extraction and deep feature fusion, which hinder further improvement of classification performance. In this paper, we propose a multi-scale dual-aggregated feature fusion network (MDFFN) for both balanced and imbalanced environments. The network comprises two core modules: a multi-scale convolutional information embedding (MCIE) module and a dual aggregated cross-attention (DACA) module. The proposed MCIE module introduces a multi-scale pooling operation to aggregate local features, which efficiently highlights discriminative spectral–spatial information and, in particular, learns key features of small target samples in the imbalanced environment. Furthermore, the proposed DACA module employs a cross-scale interaction strategy to realize the deep fusion of multi-scale features and designs a dual aggregation mechanism to mitigate the loss of information, which facilitates further spatial–spectral feature enhancement. The experimental results demonstrate that the proposed method outperforms state-of-the-art methods on three classical HSI datasets, proving the superiority of the proposed MDFFN.

1. Introduction

Hyperspectral images (HSIs) have hundreds of continuous and narrow spectral bands, which provide rich spectral and spatial information. Currently, HSIs are widely used in agriculture [1], environmental engineering [2], geological exploration [3], and other fields. With the continuous development of hyperspectral imaging technology in terms of spatial and spectral resolution, hyperspectral image processing has emerged as a research hotspot within the industry, including HSI classification [4] and HSI target detection [5]. Among these, HSI classification has attracted much attention. It assigns each pixel to a different ground object category, helping to understand the semantics of remote sensing images. However, realizing high-precision classification for HSI remains a challenging task due to problems such as the curse of dimensionality, mixed pixels, spectral variability, and class imbalance [6].
In the research on HSI classification, compared with traditional machine learning-based methods [7,8,9], deep learning (DL) can automatically learn complex feature representations from the original data, effectively exploiting spatial and spectral information from the HSI to improve classification accuracy. Many deep learning-based methods have been applied to HSI classification, such as the stacked autoencoder (SAE) [10], deep belief network (DBN) [11], convolutional neural network (CNN) [12], recurrent neural network (RNN) [13], and generative adversarial network (GAN) [14]. Among these models, CNN has become one of the mainstream methods for HSI classification due to its strong local feature extraction ability. Nevertheless, CNN-based methods typically rely on fixed-size receptive fields to aggregate information, ignoring the modeling of long-range dependencies among the input elements. Recently, transformer-based methods have shown advantages in dealing with tasks requiring global dependencies due to the powerful multi-head self-attention (MHSA) mechanism. They show excellent performance on HSI classification and can better cope with complex classification tasks.
Despite these advancements, existing deep learning methods still have limitations. Among traditional architectures, CNN-based methods are usually implemented as a simple cascade of several convolutions with a single-scale convolution kernel [15]. While transformer-based methods can capture global information by establishing correlations between tokens, they have not sufficiently explored the multi-scale characteristics of the spatial–spectral information inherent in HSI (e.g., ViT [16], DeepViT [17], etc.). Although the work in [18] adopts a multi-scale feature fusion strategy to improve performance, its computational cost is high and it may lead to the attenuation of important relevant features. For existing CNN–transformer hybrid methods, although numerous studies have employed multi-scale strategies to explore the complex land-cover structures with diverse shapes in HSI, deficiencies still exist in fine feature extraction and deep feature fusion. These limitations hinder further improvements in classification performance. In terms of feature extraction, even though existing studies have attempted to incorporate various CNN modules [19,20] to improve the extraction of local information, the direct flattening and linear projection operations of transformers still destroy the local spatial–spectral and positional information, causing the loss of vital features in classification. Therefore, improving the multi-scale feature extraction module to sufficiently extract the discriminative features of HSI remains an urgent problem. In terms of feature fusion, the works in [21,22] employ serial or parallel fusion strategies, ignoring the differences between features and failing to fully establish interactions among them. The work in [23] still does not avoid the loss of information during the alternation of attention mechanisms. Although the work in [24] achieves effective feature fusion, it requires the application of self-attention to each branch before aggregation. Therefore, it is still worth exploring how to better design the feature fusion module to improve HSI classification performance. For imbalanced data, while both sampling strategies and data augmentation have been proven to be effective solutions, network architecture optimization remains a crucial direction as well. Existing studies have demonstrated that employing multi-scale feature extraction and information fusion modules can highlight discriminative features [25], helping to cope with imbalanced HSI classification [6]. However, it is still challenging to achieve high-precision HSI classification with imbalanced samples. For example, although the work in [26] can reduce the misclassification of small target samples and discrete samples in imbalanced datasets by using a multi-scale feature extraction module, it still cannot avoid the effect of redundant features.
In summary, this paper addresses the following three key issues.
  • In hyperspectral image classification tasks, transformer-based methods offer advantages in establishing global dependencies. However, they cannot extract fine-grained local features as effectively as CNN-based methods. Therefore, it is necessary to explore an architecture that can capture critical spatial–spectral features both locally and globally, thereby improving classification performance.
  • Existing multi-scale CNN–transformer hybrid methods are capable of capturing more discriminative features. However, their embedding methods tend to distort local spatial–spectral and positional information, resulting in the loss of valuable information during the classification process. This loss makes it challenging to distinguish cases of “same objects with different spectra” and “different objects with the same spectrum”. Thus, exploring new multi-scale feature extraction modules that can highlight discriminative information is crucial for addressing the class imbalance issue in HSI classification.
  • Existing multi-scale approaches typically rely on attention mechanisms for feature fusion [18,23]. Nevertheless, the interaction between different features is still insufficient, which limits the effective exploration of spatial and spectral diversity in complex environments. Consequently, we should design a more flexible multi-scale feature fusion method.
Based on the above-mentioned analysis, we propose the multi-scale dual-aggregated feature fusion network (MDFFN) for HSI classification. In MDFFN, we design a multi-scale convolutional information embedding (MCIE) module to extract multi-scale spatial–spectral features that generate more representative information. In addition, we propose the dual aggregated cross-attention (DACA) module by improving cross attention [18], which can sufficiently realize the fusion and interaction of different features. The main contributions are summarized as follows:
  • In order to fully extract discriminative features locally and globally, we propose a new multi-scale network for HSI classification called MDFFN. MDFFN combines the respective representational capabilities of CNN and transformer, offering significant advantages in multi-scale feature extraction and feature fusion, showing superior classification performance.
  • To address the issue of critical information loss in patch embedding of existing multi-scale methods, the MCIE module is designed to extract multi-scale spatial–spectral features under different receptive fields of HSI. The module employs a multi-scale pooling operation to aggregate local features, smoothing the data and suppressing noise while retaining more spatial and spectral information, thus enabling the network to achieve satisfactory performance even with imbalanced datasets.
  • To adequately fuse multi-scale features, we designed the DACA module to realize the interaction of features at different scales to capture the complementary and relevant information between the features, which can alleviate the problem of information loss caused by the feature fusion process.
  • We conducted extensive experiments on three benchmark HSI datasets. The results reveal that the MDFFN outperforms the state-of-the-art methods in classification performance, particularly when dealing with imbalanced datasets, leading to substantial improvements in classification accuracy.
The rest of the paper is organized as follows. In Section 2, we review related works including deep learning for HSI classification, multi-scale CNN, and multi-scale transformer. Section 3 introduces the details of the proposed method. Section 4 discusses the datasets, experimental setup, and experimental results and analysis. Finally, the paper is summarized in Section 5.

2. Related Work

2.1. Deep Learning for HSI Classification

With the rapid development of deep learning in computer vision, CNN has been widely used in HSI classification. Early research mainly utilized spectral features for classification. Hu et al. [27] employed 1D-CNN for discriminative spectral feature extraction. Similarly, the spatial information in hyperspectral images provides an important basis for land-cover classification. Hamouda et al. [28] employed a smart feature extraction method to compress the spectral information and used 2D-CNN to extract deep features. In addition, some existing architectures, including VGGNet [29] and ResNet [30], have been applied to HSI for deep spatial feature extraction to achieve high classification accuracy. Nevertheless, considering spatial or spectral information alone is inadequate for addressing the challenge of classification in intricate scenarios. For the extraction of joint spatial–spectral features, Hu et al. [31] applied PCA to reduce the data dimensionality and combined 1D-CNN and 2D-CNN to extract the joint features. Considering the advantages of 3D-CNN, Zhang et al. [32] designed a 3D-CNN based on stacked blocks to extract hidden spatial–spectral features and utilize discriminative information for classification. While using 3D-CNN improves performance, the significant increase in the number of parameters incurs additional computational costs. To avoid the model complexity caused by using 3D-CNN exclusively, Roy et al. [33] proposed a hybrid spectral CNN (HybridSN), combining 3D-CNN to extract joint spatial–spectral features with 2D-CNN to learn abstract spatial representations. Although CNN can extract fine spatial–spectral features, the extracted features usually contain noise interference, especially with small samples. Moreover, due to the overwhelming reliance on fixed-size receptive fields to aggregate information, CNN-based methods typically neglect modeling the long-range dependencies among input elements.
Recently, the transformer has demonstrated advantages in capturing global information due to its powerful self-attention mechanism. Dosovitskiy et al. [16] applied the transformer to visual tasks and introduced the vision transformer (ViT), which has shown promising results in various benchmark image classification studies. This has opened up a new direction for research in image classification using transformer-based methods. Regarding the high dimensionality of hyperspectral images, Hong et al. [34] proposed SpectralFormer, which can learn the local spectral information from the neighboring bands of HSI, demonstrating superiority over the traditional transformers. However, the computational load of the self-attention mechanism in the transformer scales quadratically with the length of the sequence. To mitigate attention collapse, Zhou et al. [17] proposed DeepViT and designed the re-attention to regenerate the attention maps to enhance their diversity at different layers with negligible computational cost. Traditional transformers model the contextual dependencies of input patches, but they struggle to capture local information with rich spatial patterns and geometric structures. In recent years, many approaches have been developed that combine the characteristics of CNN and transformer to further improve classification accuracy. In the field of HSI classification, Sun et al. [19] addressed the limitation of CNN in obtaining deep semantic features by combining transformer and proposed the spectral–spatial feature tokenization transformer (SSFTT) to capture spectral–spatial features and high-level semantic information. Yang et al. [35] proposed the hyperspectral image transformer (HiT) network, designing the convolution permutator module utilizing depthwise convolution operations to encode feature representations along the height, width, and spectral dimensions. Although transformer-based methods have shown excellent performance in HSI classification tasks, most of the above approaches focus on extracting spatial–spectral features from a single scale, overlooking the rich multi-scale feature information inherent in the data.

2.2. Multi-Scale CNN

In vision tasks, adopting multi-scale representations to extract features at different granularities can lead to more discriminative features. Due to the rich contextual information in multi-scale structures, CNN-based multi-scale methods show advantages in HSI classification. He et al. [36] designed a multi-scale 3D deep convolutional neural network (M3D-DCNN) that could jointly learn 2D multi-scale spatial features and 1D spectral features from HSI data in an end-to-end manner. To mitigate the issue of the large number of parameters in traditional 3D-CNNs, Xu et al. [37] designed multi-scale convolutions to extract contextual features of different scales from HSI and employed an Octave 3D-CNN to decompose the mixed feature maps by frequency, reducing spatial redundancy and enlarging the receptive field.
In previous studies, CNNs were typically stacked serially to extract deeper features. However, this approach fails to fully exploit the multi-scale information inherent in HSI. Additionally, due to the class imbalance in HSI datasets, deep models are prone to overfitting, which severely impacts classification accuracy. In response to the overfitting problems caused by class imbalance, Chen et al. [38] incorporated multiple multi-scale strategies, enhancing the information captured in each layer and furnishing rich features for the fitted samples. From the perspective of network architecture optimization, Wang et al. [6] utilized three convolutional kernels of different sizes to design a multi-scale spectral residual self-attention, which can fully extract high-dimensional and intricate spectral information from HSIs, even with limited labeled samples and imbalanced distributions. However, improving the feature extraction capability of the network to achieve high-accuracy classification remains a significant challenge in HSI classification, particularly in cases of class imbalance. Li et al. [39] introduced depthwise separable convolution into the 3DCNN-AM network, which reduced time consumption while maintaining comparable classification accuracy, but it only performed well on imbalanced data. Wang et al. [26] proposed a multi-scale residual spectral–spatial feature extraction module that mitigated the information loss in the feature stream and reduced the misclassification of small target samples and discrete samples in imbalanced datasets, but this stacking of 3D-CNN blocks may not only produce redundant features but also incur a high computational cost. While multi-scale CNN-based methods are effective for extracting multi-scale spatial and spectral information, the limitations of CNN in capturing global dependencies highlight the need for further exploration of how to integrate multi-scale CNNs with transformers.

2.3. Multi-Scale Transformer

Due to the excellent performance of multi-scale CNN-based methods, researchers have begun to explore the application of multi-scale strategies within transformer models. Liu et al. [40] proposed a hierarchical swin transformer to capture multi-scale features and utilized shifted windows to effectively capture global information. Chen et al. [18] designed a dual-branch transformer named CrossViT, to combine patches of different image sizes for generating more robust features in image classification tasks. Their proposed cross-attention mechanism effectively facilitates feature fusion, which reduces the originally quadratic computational complexity to linear, significantly lowering the computational cost of the network. Recently, various multi-scale transformer models have also been applied to HSI classification. He et al. [41] proposed the cross-spectral vision transformer (CSiT) to extract pixel-wise multi-scale features, designing a multi-scale spectral embedding module to enhance local details between neighboring spectral bands. The hybrid architecture combining transformer and CNN has attracted widespread attention in building lightweight, high-performance models. To learn intrinsic shape information, Roy et al. [20] introduced morphological convolution operations into the transformer, presenting morphFormer, which utilized convolution kernels of two different sizes and HetConv2D to extract multi-scale information. However, its embedding methods tend to destroy the local spatial–spectral and positional information, causing the loss of valuable information in classification. Therefore, improving the multi-scale feature extraction module to sufficiently extract the discriminative features of HSI is crucial.
Studies have shown that feature fusion strategies can significantly enhance network performance [25,42]. In the current research on multi-scale feature fusion, Qiao et al. [22] designed a multi-scale neighborhood preserving transformer (MSNAT) model that leverages different local window sizes through parallel weighting to extract multi-scale spatial information. However, this parallel fusion approach overlooks the inherent differences among features and fails to adequately establish their correlations. Additionally, Chen et al. [18] proposed cross-attention for effective multi-scale feature fusion, yet it requires applying self-attention to each branch before executing cross-attention for aggregation, thereby consuming additional computational resources. Furthermore, Wang et al. [23] introduced a transformer with a CNN-enhanced cross-attention mechanism to explore multi-scale CNN-enhanced features by swapping the values of the two branches for fusion. Nevertheless, it still fails to avoid the problem of information loss caused by the alternation of attention mechanisms. Based on the above studies, it can be observed that the interaction between features of different scales can effectively explore the spatial and spectral diversity in complex environments and significantly impacts the final classification performance. Therefore, it remains challenging to design feature fusion modules that improve HSI classification performance.

3. Materials and Methods

Existing multi-scale methods have limitations in fine feature extraction and deep feature fusion, restricting further improvement in classification performance [18,20], especially on imbalanced datasets [26]. To address the aforementioned issues, we present the multi-scale dual-aggregated feature fusion network (MDFFN) from the perspective of feature optimization. It mainly consists of the multi-scale convolutional information embedding (MCIE) module and the dual aggregated cross-attention (DACA) module. The proposed methodology is described in detail below.

3.1. The Structure of the Multi-Scale Dual-Aggregated Feature Fusion Network

The MDFFN integrates the advantages of CNN and transformer to realize more comprehensive local and global feature extraction, which leverages the rich spatial–spectral information in HSI and enhances the performance and robustness of the model. The framework of the proposed MDFFN is shown in Figure 1 and primarily consists of two key modules: the MCIE module and the DACA module.
To obtain comprehensive information efficiently, our MDFFN adopts a dual-branch structure: (1) S-branch: a small-scale branch with more patch tokens ($N_S$), obtained by embedding with a fine-grained patch size ($P_S$). (2) L-branch: a large-scale branch with fewer patch tokens ($N_L$), obtained by embedding with a coarse-grained patch size ($P_L$).
The overall process is summarized as follows:
Initially, the proposed method eliminates the redundant spectral bands of the HSI by applying principal component analysis (PCA) [43]. Numerous studies have confirmed the effectiveness of PCA for dimensionality reduction in HSI [19,44]. It primarily eliminates the correlation between spectral bands via the K-L transform, identifying important spectral features based on the contribution of each principal component. After dimensionality reduction, HSI patches divided by spatial size are input to the MCIE module to extract multi-scale shallow features (see more details about the MCIE module in Section 3.2). This step can be formally defined as follows: for the given input $x_{\mathrm{in}} = \{x_1, x_2, \ldots, x_m\} \in \mathbb{R}^{C \times D \times H \times W}$, the result after PCA processing is denoted as $x_{\mathrm{pca}} \in \mathbb{R}^{C \times D' \times H \times W}$, where $H$, $W$, $D$, and $C$ represent the height, width, depth, and number of channels of the image, and $D'$ represents the spectral dimension after PCA. The specific formulations are shown in Equations (1) and (2).
$x_{\mathrm{patch}}^{S} = \mathrm{MCIE}_{S}(\mathrm{PCA}(x_{\mathrm{in}})) = \{s_1, s_2, \ldots, s_j\}$  (1)
$x_{\mathrm{patch}}^{L} = \mathrm{MCIE}_{L}(\mathrm{PCA}(x_{\mathrm{in}})) = \{l_1, l_2, \ldots, l_k\}$  (2)
where $\mathrm{MCIE}_{S}(\cdot)$ and $\mathrm{MCIE}_{L}(\cdot)$ represent the embedding operations for the small-scale branch and the large-scale branch, respectively; $x_{\mathrm{patch}}^{S} \in \mathbb{R}^{N_S \times C_S}$ and $x_{\mathrm{patch}}^{L} \in \mathbb{R}^{N_L \times C_L}$ represent the patch tokens; $N_S$ and $N_L$ are the numbers of patch tokens; and $C_S$ and $C_L$ are the embedding dimensions.
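For illustration, the following is a minimal sketch of the PCA preprocessing step described above, using scikit-learn; the input cube shape and the number of retained components are illustrative assumptions rather than the exact settings of this paper.

```python
# Minimal sketch of the PCA-based spectral reduction applied before the MCIE module.
# The array shapes and the number of retained components are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

def reduce_spectral_bands(hsi_cube: np.ndarray, n_components: int = 30) -> np.ndarray:
    """Reduce an (H, W, B) hyperspectral cube to (H, W, n_components) with PCA."""
    H, W, B = hsi_cube.shape
    flat = hsi_cube.reshape(-1, B)                  # one spectral vector per pixel
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(H, W, n_components)

# Example: a random stand-in for the 145 x 145 x 200 Indian Pines cube.
x_pca = reduce_spectral_bands(np.random.rand(145, 145, 200), n_components=30)
print(x_pca.shape)  # (145, 145, 30)
```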
Subsequently, the CLS token, a learnable embedding used for classification, is concatenated with the patch tokens, producing sequences of $j+1$ and $k+1$ tokens, respectively. Trainable position embeddings are added to each token to preserve the positional information of the HSI tokens. This process is described by Equations (3) and (4).
$x^{S} = [\,x_{\mathrm{cls}}^{S};\, x_{\mathrm{patch}}^{S}\,] + x_{\mathrm{pos}}^{S}$  (3)
$x^{L} = [\,x_{\mathrm{cls}}^{L};\, x_{\mathrm{patch}}^{L}\,] + x_{\mathrm{pos}}^{L}$  (4)
where $x_{\mathrm{cls}}^{S} \in \mathbb{R}^{1 \times C_S}$ and $x_{\mathrm{cls}}^{L} \in \mathbb{R}^{1 \times C_L}$ are the CLS tokens of the two branches, and $x_{\mathrm{pos}}^{S} \in \mathbb{R}^{(1+N_S) \times C_S}$ and $x_{\mathrm{pos}}^{L} \in \mathbb{R}^{(1+N_L) \times C_L}$ represent the position embeddings.
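The token preparation of Equations (3) and (4) can be sketched for a single branch as follows; this is a minimal PyTorch illustration in which the token count and embedding dimension are assumed values.

```python
# Minimal PyTorch sketch of Equations (3) and (4) for a single branch:
# prepend a learnable CLS token and add learnable position embeddings.
# Token count and embedding dimension are illustrative assumptions.
import torch
import torch.nn as nn

class TokenPreparation(nn.Module):
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim)
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)      # (batch, 1, dim)
        x = torch.cat([cls, patch_tokens], dim=1)   # concatenate the CLS token
        return x + self.pos_embed                   # add position embeddings

tokens = TokenPreparation(num_patches=25, dim=512)(torch.randn(4, 25, 512))
print(tokens.shape)  # torch.Size([4, 26, 512])
```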
Then, the multi-scale transformer encoders are employed to fuse the dual-branch information deeply. Each transformer encoder comprises a DACA module and a feed-forward network (FFN). In this structure, the DACA module effectively integrates multi-scale information, enhancing feature representation and discrimination capabilities (See more details about the DACA module in Section 3.3). The above step is described in the following equations:
The S-branch:
$y_d^{S} = x_{d-1}^{S} + \mathrm{DACA}\big(\mathrm{LN}(x_{d-1}^{S}),\, x_{d-1}^{L}\big)$  (5)
$x_d^{S} = y_d^{S} + \mathrm{FFN}\big(\mathrm{LN}(y_d^{S})\big)$  (6)
The L-branch:
$y_d^{L} = x_{d-1}^{L} + \mathrm{DACA}\big(\mathrm{LN}(x_{d-1}^{L}),\, x_{d}^{S}\big)$  (7)
$x_d^{L} = y_d^{L} + \mathrm{FFN}\big(\mathrm{LN}(y_d^{L})\big)$  (8)
where $\mathrm{DACA}(\cdot)$ performs feature fusion using the attention mechanism, and $d \in (0, D]$ indexes the $d$th multi-scale transformer encoder.
Finally, the CLS tokens from the two branches are concatenated and fed into the multilayer perceptron (MLP) layer to exploit the resulting discriminative spatial–spectral features for pixel-level classification.
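The dual-branch encoding of Equations (5)-(8) and the final classification head can be summarized with the following schematic PyTorch sketch. The DACA modules are passed in as callables (a trivial stand-in is used in the example; a fuller DACA sketch is given in Section 3.3), and the feed-forward network, hidden width, and dimensions are assumptions made purely for illustration.

```python
# Schematic PyTorch sketch of one multi-scale transformer encoder layer following
# Equations (5)-(8), plus the CLS-concatenation classification head. The DACA
# modules are injected as callables; a trivial stand-in is used below. All
# dimensions and the FFN design are illustrative assumptions.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

class MultiScaleEncoderLayer(nn.Module):
    def __init__(self, dim: int, daca_s: nn.Module, daca_l: nn.Module):
        super().__init__()
        self.norm_s, self.norm_l = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.daca_s, self.daca_l = daca_s, daca_l
        self.ffn_s = nn.Sequential(nn.LayerNorm(dim), FeedForward(dim, 4 * dim))
        self.ffn_l = nn.Sequential(nn.LayerNorm(dim), FeedForward(dim, 4 * dim))

    def forward(self, x_s, x_l):
        y_s = x_s + self.daca_s(self.norm_s(x_s), x_l)   # Eq. (5)
        x_s = y_s + self.ffn_s(y_s)                      # Eq. (6)
        y_l = x_l + self.daca_l(self.norm_l(x_l), x_s)   # Eq. (7)
        x_l = y_l + self.ffn_l(y_l)                      # Eq. (8)
        return x_s, x_l

class ClassificationHead(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.LayerNorm(2 * dim), nn.Linear(2 * dim, num_classes))

    def forward(self, x_s, x_l):
        cls = torch.cat([x_s[:, 0], x_l[:, 0]], dim=-1)  # concatenate the two CLS tokens
        return self.mlp(cls)

class _IdentityDACA(nn.Module):
    # Placeholder standing in for the DACA module (see Section 3.3).
    def forward(self, x_own, x_other):
        return x_own

layer = MultiScaleEncoderLayer(512, _IdentityDACA(), _IdentityDACA())
head = ClassificationHead(512, num_classes=16)
x_s, x_l = torch.randn(4, 26, 512), torch.randn(4, 10, 512)
logits = head(*layer(x_s, x_l))
print(logits.shape)  # torch.Size([4, 16])
```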

3.2. Multi-Scale Convolutional Information Embedding

To address the issue of key information loss in existing multi-scale methods, which limits classification performance on imbalanced datasets, we propose a simple and effective multi-scale convolutional information embedding module, as shown in Figure 2. This module is developed with a dual-branch structure, consisting of two 3D convolutional blocks to capture local features and an embedding operation for feature representation. The 3D convolutional blocks are designed based on different patch sizes, enabling the effective extraction of multi-scale features. The specific descriptions of the MCIE module are given below.
1. Three-Dimensional Convolutional Blocks
We construct two branches with different patch sizes ($P_S$ and $P_L$) corresponding to the small and large scales. This design not only captures contextual information over a larger range but also facilitates the extraction of fine details. To better extract multi-scale local features, we employ two 3D convolutional blocks. Each block contains a convolutional layer, a batch normalization (BN) layer, a rectified linear unit (ReLU) layer, and a pooling layer. Specifically, we use eight convolutional kernels of size $P_S \times P_S \times P_S$ and eight of size $P_L \times P_L \times P_L$, covering each patch to enable sufficient extraction of local information at different scales. The BN layer and ReLU layer can effectively mitigate the vanishing gradient problem and accelerate model convergence. The above is specified in Equation (9). Additionally, consistent with the principle of existing transformer-based methods, for an input image of a specific spatial size, we partition it into uniform patches based on the designated patch size. Therefore, we employ a multi-scale pooling operation, i.e., we aggregate local features with pooling sizes of $1 \times P_S \times P_S$ and $1 \times P_L \times P_L$. This approach replaces direct flattening and prevents the loss of key information. The multi-scale pooling operation not only smooths the data but also retains more spatial and spectral information while reducing the number of parameters, which helps with learning the key features in small target samples, as shown in Equation (10).
$x_{\mathrm{conv}} = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{Conv}_{P \times P \times P}(x_{\mathrm{pca}}))\big), \quad x_{\mathrm{conv}} \in \mathbb{R}^{C' \times D' \times H \times W}$  (9)
$x_{\mathrm{avg}} = \mathrm{AvgPool}_{1 \times P \times P}(x_{\mathrm{conv}}), \quad x_{\mathrm{avg}} \in \mathbb{R}^{C' \times D' \times H' \times W'}$  (10)
where $P$ represents the patch size, with $P_S$ denoting the small patch size for the S-branch and $P_L$ denoting the large patch size for the L-branch. After pooling, $H' = H / P$ and $W' = W / P$.
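A minimal PyTorch sketch of one MCIE convolutional branch following Equations (9) and (10) is given below; the number of output channels, the padding choice, and the input sizes are illustrative assumptions.

```python
# Minimal PyTorch sketch of one MCIE branch following Equations (9) and (10):
# a 3D convolution (BN + ReLU) followed by average pooling of size 1 x P x P,
# which aggregates each P x P spatial patch instead of flattening it directly.
# Channel count (8), padding, and input sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MCIEBranch(nn.Module):
    def __init__(self, in_channels: int, patch_size: int, out_channels: int = 8):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_channels, out_channels,
                      kernel_size=patch_size, padding=patch_size // 2),   # Eq. (9)
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )
        # 1 x P x P pooling keeps the spectral depth and reduces H, W by a factor of P.
        self.pool = nn.AvgPool3d(kernel_size=(1, patch_size, patch_size))  # Eq. (10)

    def forward(self, x):
        # x: (batch, channels, depth, height, width)
        return self.pool(self.block(x))

x_pca = torch.randn(4, 1, 30, 15, 15)       # e.g., 30 PCA bands, 15 x 15 spatial patch
s_branch = MCIEBranch(in_channels=1, patch_size=3)
print(s_branch(x_pca).shape)                # torch.Size([4, 8, 30, 5, 5])
```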
2. Embedding Operation
The embedding operation is applied to process the feature vectors for further deep processing, as shown in Equation (11). Specifically, we utilize a rearrange operation to adjust the tensor dimensions, as presented in Equation (12). Layer normalization is employed for standardization, and a linear layer maps the dimensions, as indicated in Equation (13). To address the potential overfitting caused by class imbalance, we use dropout to enhance the model's generalization ability.
$x_{\mathrm{patch}} = F_{\mathrm{embedding}}(x_{\mathrm{avg}})$  (11)
$\mathrm{Rearrange}: C' \times D' \times H' \times W' \rightarrow (H' \times W') \times (C' \times D') = N \times E$  (12)
$\mathrm{Linear}: N \times E \rightarrow N \times F$  (13)
where $F_{\mathrm{embedding}}(\cdot)$ represents the embedding operation, $N$ denotes the number of patch tokens in each branch, $E$ refers to the dimensionality of the sequence after the rearrange operation, and $F$ is a custom-defined dimension.
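The embedding operation of Equations (11)-(13) can be sketched as follows, assuming the einops library for the rearrange step; the target dimension F and the dropout rate are illustrative choices.

```python
# Minimal sketch of the embedding operation of Equations (11)-(13): rearrange the
# pooled feature map into a token sequence, normalize, project to dimension F, and
# apply dropout. The einops library is assumed; F and the dropout rate are
# illustrative choices.
import torch
import torch.nn as nn
from einops.layers.torch import Rearrange

def make_embedding(channels: int, depth: int, out_dim: int, dropout: float = 0.1) -> nn.Sequential:
    token_dim = channels * depth                    # E = C' x D'
    return nn.Sequential(
        Rearrange('b c d h w -> b (h w) (c d)'),    # Eq. (12): N = H' x W', E = C' x D'
        nn.LayerNorm(token_dim),
        nn.Linear(token_dim, out_dim),              # Eq. (13): N x E -> N x F
        nn.Dropout(dropout),
    )

x_avg = torch.randn(4, 8, 30, 5, 5)                 # output of the MCIE pooling stage
x_patch = make_embedding(channels=8, depth=30, out_dim=512)(x_avg)
print(x_patch.shape)                                # torch.Size([4, 25, 512])
```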
Compared to traditional CNN–transformer hybrid methods [19,45], the proposed MCIE module innovatively introduces a multi-scale pooling operation before embedding, which differs from standard attention-weighting paradigms. Instead of directly flattening the input, the module employs average pooling of different sizes to retain more effective spatial and spectral information. Specifically, it not only smooths the data and suppresses noise, enhancing the stability of subsequent spatial–spectral features, but also aggregates local features to mitigate the loss of spatial–spectral and positional information in the embedding process, generating more representative semantic features. As a result, our module enables the network to extract discriminative information even for classes with fewer samples, effectively reducing the misclassification of small target samples in imbalanced datasets.
In summary, the MCIE module can effectively extract local features and help mine more representative information from different scales. It addresses the issue in [16], where a single convolution might overlook important details, enhancing the ability of the network to handle complex scenes.

3.3. Dual Aggregated Cross-Attention

An effective feature fusion strategy is crucial for learning multi-scale feature representations. However, existing attention-based feature fusion methods have not yet achieved sufficient cross-scale information fusion [18,23]. To further enhance the feature fusion capability of the model, we design the dual aggregated cross-attention (DACA) module. This module adopts a cross-scale interaction strategy to fully achieve information perception between features of different scales, while employing a dual aggregation mechanism to effectively mitigate the information loss caused by fusion. The DACA module promotes spatial–spectral feature enhancement and provides high-level semantic information for the entire architecture. Figure 3 illustrates the DACA module for the S-branch.
The DACA module uses a cross-scale interaction strategy, i.e., it exchanges the K (key) and V (value) of different scales to achieve deep information fusion. Meanwhile, it adopts a dual aggregation mechanism, i.e., it utilizes a 1D convolutional layer to adjust the sequence dimension of the dot product of Q (query) and K (key) for aggregation, mitigating the information loss that occurs during the alternation of attention mechanisms. Given the input sequences $x^{S}$ and $x^{L}$ (see Equations (3) and (4)), the computation for the S-branch can be expressed as follows:
$q^{S} = x^{S} W_q^{S}, \quad k^{S} = x^{S} W_k^{S}$  (14)
$k^{L} = x^{L} W_k^{L}, \quad v^{L} = x^{L} W_v^{L}$  (15)
$A^{S} = \mathrm{softmax}\!\left(\dfrac{F_{1D}\big(q^{S} (k^{S})^{T}\big) + q^{S} (k^{L})^{T}}{\sqrt{C / h}}\right)$  (16)
$y^{S} = A^{S} v^{L}$  (17)
where $W_q^{S}, W_k^{S}, W_k^{L}, W_v^{L} \in \mathbb{R}^{F \times F/h}$ are the learnable parameters; $q^{S}, k^{S} \in \mathbb{R}^{(1+N_S) \times F/h}$ refer to the Q and K of the S-branch; and $k^{L}, v^{L} \in \mathbb{R}^{(1+N_L) \times F/h}$ represent the K and V of the L-branch. $F$ and $h$ represent the embedding dimension and the number of heads, respectively, and $F_{1D}(\cdot)$ is the 1D convolution operation. Given $q^{S}(k^{L})^{T} \in \mathbb{R}^{(1+N_S) \times (1+N_L)}$ and $q^{S}(k^{S})^{T} \in \mathbb{R}^{(1+N_S) \times (1+N_S)}$, processing $q^{S}(k^{S})^{T}$ with $F_{1D}(\cdot)$ yields $F_{1D}(q^{S}(k^{S})^{T}) \in \mathbb{R}^{(1+N_S) \times (1+N_L)}$ to achieve aggregation. The L-branch follows the same process.
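A schematic, single-head PyTorch sketch of the S-branch computation in Equations (14)-(17) is shown below. The Conv1d used here is one assumed realization of $F_{1D}(\cdot)$, and the token counts and head dimension are illustrative.

```python
# Schematic single-head PyTorch sketch of the S-branch DACA computation in
# Equations (14)-(17). The Conv1d that maps the (1+N_S) x (1+N_S) dot-product map
# to (1+N_S) x (1+N_L) is one assumed realization of F_1D; token counts and the
# head dimension are illustrative.
import math
import torch
import torch.nn as nn

class DACAHead(nn.Module):
    def __init__(self, dim: int, head_dim: int, n_s: int, n_l: int):
        super().__init__()
        self.w_q_s = nn.Linear(dim, head_dim, bias=False)   # Eq. (14)
        self.w_k_s = nn.Linear(dim, head_dim, bias=False)
        self.w_k_l = nn.Linear(dim, head_dim, bias=False)   # Eq. (15)
        self.w_v_l = nn.Linear(dim, head_dim, bias=False)
        # F_1D: adjusts the key axis of q_S k_S^T from (1+N_S) to (1+N_L).
        self.f_1d = nn.Conv1d(n_s + 1, n_l + 1, kernel_size=1)
        self.scale = math.sqrt(head_dim)

    def forward(self, x_s, x_l):
        q_s, k_s = self.w_q_s(x_s), self.w_k_s(x_s)
        k_l, v_l = self.w_k_l(x_l), self.w_v_l(x_l)
        self_sim = q_s @ k_s.transpose(-2, -1)                        # (B, 1+N_S, 1+N_S)
        # Map the key dimension with F_1D so it can be added to the cross-scale scores.
        mapped = self.f_1d(self_sim.transpose(1, 2)).transpose(1, 2)  # (B, 1+N_S, 1+N_L)
        cross = q_s @ k_l.transpose(-2, -1)                           # (B, 1+N_S, 1+N_L)
        attn = torch.softmax((mapped + cross) / self.scale, dim=-1)   # Eq. (16)
        return attn @ v_l                                             # Eq. (17)

x_s, x_l = torch.randn(4, 26, 512), torch.randn(4, 10, 512)
y_s = DACAHead(dim=512, head_dim=64, n_s=25, n_l=9)(x_s, x_l)
print(y_s.shape)  # torch.Size([4, 26, 64])
```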
Our proposed DACA provides a more flexible feature fusion strategy than cross-attention. Cross-attention leverages the class token of one branch to exchange information with the patch tokens of another branch and then maps the information back to the original branch. This fusion strategy facilitates the integration of information from different scales. However, it is essential to ensure that the class token has already learned abstract information from its own branch through MHSA before interacting. Meanwhile, this strategy may overlook some important features that exhibit weak self-correlation but are prominent in cross-scale information. Notably, the DACA module eliminates the need for prior attention-aware processing. Instead, it directly interacts with all tokens from different branches, integrating information learned from other branches into the original branch. In addition, the dual aggregation mechanism is utilized to mitigate information loss. Therefore, our strategy can enrich the token representations and ensure a deeper fusion of features at different scales.
The proposed DACA module is able to fully leverage the strong complementarity and correlation between features at different scales, improving its capability to understand and model the inherent structure of the data.

4. Experimental Results and Discussion

In this section, we first introduce three widely used datasets for HSI classification and show the training details for the experiments conducted on these datasets, and then we set up five experiments to validate the performance of the proposed MDFFN. For a fair comparison, all experiments are conducted in the same environment.

4.1. Dataset Description

To evaluate the performance of the proposed method, we select three public hyperspectral datasets, namely the Indian Pines, Pavia University, and Houston 2013 datasets. In the experiments, all datasets are divided such that 10% of samples are selected for training and 90% for testing. The detailed information for each of these datasets is provided below.
1. Indian Pines Dataset
The Indian Pines (IP) dataset, gathered by the AVIRIS sensor developed by NASA’s Jet Propulsion Laboratory in Pasadena, CA, USA, is an important hyperspectral remote sensing image resource. It consists of 145 × 145 pixels and 220 spectral bands, with 16 vegetation classes, covering the wavelength range of 0.4–2.5 μm. After removing 20 noisy bands, 200 spectral bands are retained for training. The details of the IP dataset are given in Table 1. It is worth mentioning that this dataset has limited samples and an imbalanced distribution of various classes in quantity. For instance, the eleventh class contains 2455 samples, while the ninth class contains only 20 samples (with a ratio of 1:122). This imbalanced dataset may lead to poor model generalization and potentially cause overfitting, posing a challenge for classification.
2. Pavia University Dataset
The Pavia University (PU) dataset was acquired by the ROSIS sensor over Pavia in Northern Italy. The sensor was developed jointly by Dornier Satellite Systems (DSS, former MBB), GKSS Research Centre Geesthacht, and the German Aerospace Center (DLR). This dataset includes 610 × 340 pixels and 115 spectral bands, covering the wavelength range of 0.43–0.86 μm. The dataset contains nine land-cover classes. After removing 12 noisy bands, 103 spectral bands are retained for training. The details of the PU dataset are shown in Table 2. The distribution of the land-cover sample numbers is also imbalanced, with the ninth class having 947 samples and the second class having 18,649 samples (with a ratio of 1:19). Nevertheless, compared to the IP dataset, the sample numbers in the PU dataset are more abundant.
3. Houston 2013 Dataset
The Houston 2013 dataset was acquired with the ITRES CASI-1500 sensor, developed by ITRES Research Limited in Calgary, AB, Canada, capturing remote sensing images of Houston and its surrounding rural areas. It consists of 349 × 1905 pixels and 144 spectral bands. The wavelength range is 0.38–1.05 μm. The dataset includes 15 challenging land-cover classes, and the details are shown in Table 3.
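The 10%/90% division described at the beginning of this subsection can be realized with a per-class (stratified) random split; the sketch below is one assumed implementation using scikit-learn, with an illustrative random seed and a stand-in label map.

```python
# Minimal sketch of a 10% / 90% per-class (stratified) train/test split over the
# labeled pixels, assuming labels is a 1D array of class ids (0 marks unlabeled
# pixels in the ground-truth maps). The random seed is an illustrative assumption.
import numpy as np
from sklearn.model_selection import train_test_split

def split_labeled_pixels(labels: np.ndarray, train_ratio: float = 0.1, seed: int = 0):
    labeled_idx = np.flatnonzero(labels > 0)           # keep only labeled pixels
    train_idx, test_idx = train_test_split(
        labeled_idx,
        train_size=train_ratio,
        stratify=labels[labeled_idx],                  # preserve class proportions
        random_state=seed,
    )
    return train_idx, test_idx

# Example with a random stand-in label map (16 classes plus the unlabeled class 0).
gt = np.random.randint(0, 17, size=145 * 145)
train_idx, test_idx = split_labeled_pixels(gt)
print(len(train_idx), len(test_idx))
```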

4.2. Experimental Setup

  • Training details: For a fair comparison, all experiments are conducted on an Intel® Xeon® Platinum 8358P CPU @ 2.60 GHz and an NVIDIA GeForce RTX 3090 GPU (24 GB) (NVIDIA Corporation, Santa Clara, CA, USA), using the PyTorch 1.11 framework with Python 3.8. To minimize experimental error, all results are reported as the mean and standard deviation of 10 independent runs. For model training, we utilize the Adam optimizer to learn the weights, setting the batch size to 100 and the learning rate to $1 \times 10^{-4}$. To ensure the model is sufficiently trained and performing at its best, we run the model for 100 epochs.
  • Evaluation metrics: We employ three widely used metrics to evaluate the classification performance of different models: overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa); a minimal computation sketch is given after this list. Additionally, to facilitate a more intuitive comparison among the models, we visualize the classification results for qualitative analysis.
  • Experimental details: In order to more rationally validate the performance of the proposed MDFFN method, we set up five detailed experiments.
    (a) Comparison with state-of-the-art methods: we compare the proposed MDFFN model with representative baseline methods and state-of-the-art backbone methods to validate its performance.
    (b) Visual evaluation: the experimental results are visualized to provide a clear and intuitive demonstration of the progressiveness of the model.
    (c) Efficiency analysis: we quantitatively evaluate the computational efficiency of the proposed method by comparing parameter counts, FLOPs, training time, and testing time with state-of-the-art methods.
    (d) Ablation experiments: ablation experiments are conducted to verify the effectiveness of the MCIE and DACA modules in the network.
    (e) Parameter sensitivity experiments: parameter sensitivity experiments are set up to evaluate the effects of different combinations of S-branch and L-branch sizes, different depths of the multi-scale transformer, and different proportions of training samples on the performance of MDFFN.
  • Comparison with state-of-the-art backbone networks: To validate the performance of the proposed MDFFN, we selected several representative baseline methods and state-of-the-art backbone methods, including CNN-based methods (i.e., 2D-CNN [46], 3D-CNN [46], HybridSN [33], and M3D-DCNN [36]), transformer-based methods (i.e., ViT [16], CrossViT [18], DeepViT [17], and SpectralFormer [34]), as well as CNN–transformer hybrid-based methods (i.e., SSFTT [19], morphFormer [20], and MSNAT [22]). In order to make a reasonable comparison, the experiments basically maintain consistent parameter settings. Regarding parameters, we set the spatial size of morphFormer and MSNAT to 11 × 11, and the other methods to 15 × 15. For the patch size, ViT and DeepViT are 3, while for the multi-scale methods CrossViT and MDFFN, the small branch is set to 3 and the large branch to 5. The depth of ViT, CrossViT, DeepViT, and MDFFN is set to 3, and the dimension is set to 512. Otherwise, all other parameters not listed are set to their default values. More details can be found in the original papers.
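As referenced in the evaluation-metrics item above, the three metrics can be computed from predicted and true labels as in the following minimal sketch using scikit-learn; the label arrays are illustrative stand-ins.

```python
# Minimal sketch of the three evaluation metrics (OA, AA, Kappa) computed from
# predicted and true labels with scikit-learn; the label arrays are illustrative.
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def classification_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()                    # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))      # mean of per-class accuracies
    kappa = cohen_kappa_score(y_true, y_pred)
    return oa, aa, kappa

y_true = np.random.randint(0, 16, size=1000)
y_pred = np.random.randint(0, 16, size=1000)
print(classification_metrics(y_true, y_pred))
```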

4.3. Results and Analysis

4.3.1. Comparison with State-of-the-Art Methods

The quantitative analysis results on the three datasets are displayed in Table 4, Table 5 and Table 6, showing that the proposed MDFFN demonstrates considerable advantages compared to the state-of-the-art backbone methods.
1. The performance of the overall method
Observation of Table 4, Table 5 and Table 6 reveals that our MDFFN obtains the best OA, AA, and Kappa with relatively low standard deviation on different datasets compared to existing methods.
Compared to CNN-based methods, ours achieves an OA of 99.96% ± 0.02 and 99.42% ± 0.07 on the PU and Houston 2013 datasets, respectively, and an AA of 98.05% ± 0.32 on the IP dataset. In contrast, the OA of 2D-CNN and 3D-CNN is 99.34% ± 0.09 and 99.39% ± 0.12 on the PU dataset, while HybridSN and M3D-DCNN achieve 98.41% ± 0.30 and 96.62% ± 0.55 on the Houston 2013 dataset. In Table 4, it can be seen that our method is 22.42% and 29.95% higher than HybridSN and M3D-DCNN, respectively. The possible reason is that the IP dataset has a more concentrated distribution of same-class pixels compared to the other datasets, and methods with smaller convolution kernels have limited receptive fields, ignoring important global information in the feature maps.
Compared to transformer-based methods, the proposed MDFFN also obtains the best performance, with its OA on the PU dataset surpassing ViT, CrossViT, DeepViT, and SpectralFormer by 0.62%, 0.58%, 0.45%, and 0.52%, respectively. This is due to the fact that the transformer-based methods are still deficient in capturing local information, although they have advantages in handling long-range dependencies.
Compared to other CNN–transformer hybrid methods, SSFTT (99.69% ± 0.04), morphFormer (99.65% ± 0.06), and MSNAT (99.80% ± 0.14) show relatively better performance on the PU dataset. These approaches effectively combine the benefits of CNN in local feature extraction with the advantages of the transformer in dealing with long-range dependencies, enhancing the ability of the network to handle complex scenes. Notably, our MDFFN achieves more favorable results, with the OA improved by 0.27%, 0.31%, and 0.16%, and Kappa improved by 0.36%, 0.41%, and 0.22%, respectively. The possible reason is that SSFTT and morphFormer directly flatten the features obtained from convolution, resulting in the loss of some key information, while MSNAT simply fuses multi-scale features by parallel summation, which overlooks the differences among the various features. For our approach, combining the MCIE module with the DACA module shows superiority in fine feature extraction and deep feature fusion. Consequently, MDFFN can adequately exploit the abundant spatial–spectral features in HSI, enhancing the model's performance and robustness.
2. The performance in multi-scale feature extraction and fusion
On the Houston 2013 dataset, the OA is 99.42% ± 0.07 and the Kappa is 99.37% ± 0.08 for MDFFN, whose OA outperforms the single-scale methods such as ViT (98.13% ± 0.12), DeepViT (98.55% ± 0.20), and SSFTT (98.19% ± 0.27). While single-scale approaches typically capture limited information, our method employs the multi-scale strategy to enable a more comprehensive extraction of spatial and spectral features in HSI. Meanwhile, MDFFN outperforms the multi-scale methods such as CrossViT, morphFormer, and MSNAT by 0.97%, 1.1%, and 1.3%, respectively, in terms of OA on the Houston 2013 dataset, and outperforms the best-performing MSNAT by 0.16% on the PU dataset. The above results demonstrate the potential of MDFFN in multi-scale local feature extraction. Unlike the better-performing CrossViT that directly uses a flattening operation, MDFFN utilizes the MCIE module to retain more effective spatial and spectral information, highlighting discriminative features. (See more explanations about the performance of the MCIE module in Section 4.3.4 for Ablation Experiments).
Observing Table 4, compared with the methods using feature fusion, MDFFN improves the OA by 2.84% and the Kappa by 3.24% over SpectralFormer, and its AA shows improvements of 6.49% and 1.55% over the attention-based CrossViT and morphFormer, respectively, on the IP dataset. Similarly, our method is superior to the above methods across all metrics on the other datasets. This demonstrates that MDFFN exhibits superior feature fusion capability, which facilitates further spatial–spectral feature enhancement by utilizing the DACA module for deep multi-scale feature fusion. (See more explanations about the performance of the DACA module in Section 4.3.4 for Ablation Experiments).
In conclusion, our proposed MDFFN utilizes the MCIE module to extract multi-scale spatial–spectral features and adopts the DACA module to fully realize the information perception between different features, which can better solve the multi-scale challenges in different land-cover types.
3. The performance on imbalanced datasets
The issue of class imbalance poses a challenge for HSI classification. Examples include the seventh class (grass—pasture-mowed) and the ninth class (oats) in the IP dataset, as well as the fifth (painted metal sheets), seventh (bitumen), and ninth (shadows) classes in the PU dataset, all of which contain relatively few samples.
Observing Table 4, SSFTT (36.11% ± 14.96), MSNAT (48.89% ± 19.85), and other well-performing methods still underperform in the accuracy of the ninth class in the IP dataset. Similarly, for the accuracy of the seventh class, DeepViT achieves only 64.40% ± 12.58 and SSFTT 78.40% ± 28.41. However, morphFormer achieves relatively better performance on the seventh class (100.00% ± 0.00) and the ninth class (70.56% ± 16.11), likely due to the introduction of mathematical morphology into the model, which enables it to learn more intrinsic shape information. It is worth noting that our MDFFN obtains the best classification results on the seventh (100.00% ± 0.00) and ninth (87.78% ± 4.16) classes, improving by 21.6% and 51.67% compared to SSFTT, and by 5.6% and 38.89% compared to MSNAT, respectively. In terms of overall performance, the AA of MDFFN is 98.05% ± 0.32, an increase of 1.55% over the second-best performing model, morphFormer. This is precisely because our method enables effective fine multi-scale feature extraction, which better addresses the classification of small target samples. In addition, on the PU dataset (see Table 5), MDFFN reaches 100% accuracy on the fifth, seventh, and ninth classes. The above results confirm that our approach can comprehensively extract discriminative features from HSI, enabling the network to achieve satisfactory performance even with imbalanced data, thus ensuring an overall improvement in accuracy.
4. The visualization of confusion matrices on imbalanced datasets
To further visualize the contributions of the various methods for different classes in imbalanced datasets, Figure 4 and Figure 5 present the confusion matrix visualizations of different methods on the IP and PU datasets. Due to the significant disparities in sample numbers between the classes in these imbalanced datasets, we normalize the confusion matrices to more clearly evaluate the classification performance. Observing Figure 4 and Figure 5, we find that the main diagonal of MDFFN's confusion matrix is brighter and more consistent than those of the other methods, indicating that our approach achieves satisfactory performance across classes, particularly in minority classes. For minority classes, Figure 4 shows that the color blocks for the seventh and ninth classes in MDFFN are more concentrated than those of the other methods, signifying fewer misclassifications. Similarly, Figure 5 reveals that the color for the ninth class in MDFFN is brighter than that in HybridSN, M3D-DCNN, and SSFTT, suggesting higher accuracy. Additionally, in other classes, such as the 1st and 16th classes in the IP dataset and the 3rd class in the PU dataset, our method also outperforms the others. This indicates that the proposed MDFFN not only improves the accuracy of small-sample classes but also ensures the stability of other classes, verifying its superiority in processing imbalanced datasets.

4.3.2. Visual Evaluation

To visually demonstrate the classification results, Figure 6, Figure 7 and Figure 8 present the performance in the form of classification maps of all compared methods on the IP, PU, and Houston 2013 datasets. From visual classification maps, the results indicate that MDFFN achieves the maps closest to the ground truth, with fewer misclassified pixels and cleaner boundaries, delivering the best classification performance.
Specifically, for the IP dataset, where the pixels of the same class are relatively concentrated, our method obtains the results closest to the ground truth compared to other methods. As shown in Figure 6, the “Soybean—clean” class contains more noise and is frequently misclassified as the “Corn—no till” class, especially in CNN-based methods (See Figure 6c–f). In contrast, MDFFN shows the most consistent color distribution. The PU and Houston 2013 datasets have more isolated pixels and small regions, which could pose a challenge for the classifier. Remarkably, MDFFN maintains the best overall consistency in various regions compared to other models. This advantage is particularly evident in critical regions like “Self-blocking bricks” in PU and “Parking lot 2” in Houston 2013. The visual comparisons further highlight the effectiveness of the proposed MDFFN. Our methodology combines the multi-scale feature extraction capability of the MCIE module with the cross-scale deep fusion mechanism of the DACA module. This design demonstrates the significant advantages of the MDFFN in fine feature extraction and deep feature fusion, enabling effective characterization of land-cover types with varying spatial shapes and sizes, thereby improving classification accuracy.
However, Figure 6 also reveals that MDFFN still exhibits some misclassifications between adjacent pixels in small regions, such as at the boundary between the “Soybean—min till” and “Corn—no till” classes in the IP dataset. Although our method achieves a clearer division in this region compared to others, there remains room for improvement. This may be due to the similar spectral–spatial features between different classes in the transitional zone of the small regions. Future work could explore strategies such as morphological operations for boundary prior constraints to enhance the ability of the model to detect subtle boundaries, thereby further improving its discriminative performance in complex scenarios.

4.3.3. Efficiency Analysis

To comprehensively evaluate the efficiency of different methods in hyperspectral image classification, Table 7 lists the number of parameters, the floating point operations (FLOPs), the training time, and the testing time of each method on the IP datasets. From the results, the proposed MDFFN exhibits higher parameters and FLOPs compared to other methods, and its speed efficiency is less than satisfactory, which represents a limitation of our approach. This is due to the complexity of the MDFFN model design based on the improved multi-scale CrossViT. In this design, the transformer requires multiple stacks of self-attention modules to learn features, affecting the model’s efficiency. However, previous comparative experiments and visual evaluations have validated the superiority of MDFFN in classification performance (see Section 4.3.1 and Section 4.3.2). The outstanding classification performance of our approach combined with its effective handling of imbalanced data compensates for the weakness of efficiency. Overall, we consider the balance between performance and efficiency in MDFFN acceptable. In addition, as shown in Table 7, the parameters and FLOPs of MDFFN are improved by nearly three times compared with its counterpart CrossViT. The training time is lower than CrossViT and SpectralFormer, and close to morphFormer. This further demonstrates that MDFFN could achieve excellent classification performance at an appropriate computational cost, offering a feasible solution for HSI classification tasks.

4.3.4. Ablation Experiments

We further conduct ablation experiments on the three datasets to explore the effect of the MCIE and DACA modules in the proposed MDFFN. The experimental configurations are as follows:
(a) The CrossViT model with multi-scale feature fusion is selected as the basic architecture.
(b) To validate the effectiveness of the multi-scale convolutional information embedding module in MDFFN, the experiment uses the linear projection embedding module from CrossViT as a comparison.
(c) To validate the effectiveness of the dual aggregated cross-attention module in MDFFN, the experiment adopts the cross-attention module from CrossViT as a comparison.
1. The performance for different combinations of the modules
We evaluated the performance of the modules by considering four different combinations with quantitative comparisons tabulated in Table 8. Case 1 represents the configuration without the MCIE and DACA modules, Case 2 uses the DACA module, Case 3 uses the MCIE module, and Case 4 incorporates both the MCIE and DACA modules. To further explore the contributions of each module, Figure 9 shows the classification accuracy of classes for different cases on the imbalanced dataset.
The MCIE module significantly improves the classification performance of the network, as demonstrated by comparing Case 1 and Case 3 in Table 8. Specifically, on the IP, PU, and Houston 2013 datasets, the OA increases by 7.03%, 1.32%, and 2.53%, respectively, while the AA improves by 7.05%, 1.97%, and 3.38%. Furthermore, the Kappa shows an enhancement of 8.05%, 1.76%, and 2.74%. These results indicate that the MCIE module can adequately capture important spatial–spectral features at different scales to highlight discriminative information for improved classification accuracy. Additionally, Figure 9 shows that for small-sample classes such as the seventh and ninth classes, Case 3 achieves the fastest improvement in accuracy. For classes with a relatively ample number of samples, such as the 2nd and 12th classes, Case 3 also exhibits significant improvement. This validates the superiority of the MCIE module in handling imbalanced samples, as it can preserve more effective spatial and spectral information, enabling the extraction of key features even in classes with fewer samples.
Similarly, the contribution of the DACA module is reflected in the improvement in classification performance on the different datasets. As shown by Case 1 and Case 2 in Table 8, the OA is improved by 3.58%, 0.57%, and 1.59%, the AA by 3.31%, 0.87%, and 1.74%, and the Kappa by 4.10%, 0.76%, and 1.72% on the IP, PU, and Houston 2013 datasets, respectively, which validates that the DACA module is more effective in achieving deep feature perception than employing cross-attention. This may be because cross-attention leads to the weakening of some important features that are weakly autocorrelated within a single branch but significant in cross-scale information. In contrast, our method effectively mitigates this issue by employing a cross-scale interaction strategy and a dual aggregation mechanism. Additionally, as shown in Figure 9, Case 2 achieves significant improvements in classes such as the 7th and 15th classes. This proves the effectiveness of the DACA module, highlighting its ability to fully facilitate deep fusion between multi-scale features and provide a strong feature representation capability.
According to Table 8, it can be observed that the combination of two modules (Case 4) achieves the highest OA, AA, and Kappa on all datasets, with the OA of 98.85% ± 0.09, 99.96% ± 0.02, and 99.42% ± 0.07. Figure 9 also clearly demonstrates that Case 4 achieves the highest accuracy across all classes, which further validates the superiority of the proposed MDFFN. This is precisely because our method fully leverages the rich spatial–spectral information in HSI, enabling a more comprehensive extraction of local and global features. By employing the MCIE module to effectively extract multi-scale fine features, and combining it with the DACA module to enhance the synergy between different features, the shallow feature extraction from the MCIE module is elevated to deep feature fusion, resulting in superior performance.
2. The visualization of feature maps of each module
To further demonstrate the feature extraction capabilities of each module, we use t-distributed stochastic neighbor embedding (t-SNE) [47] to visualize the separability of the features extracted from the PU dataset for the different cases, as shown in Figure 10. From Figure 10, it can be seen that feature clustering is achieved in all four cases. Specifically, Case 1 has obvious overlap between different classes, and the points of the same class are more scattered, failing to form a good clustering effect. Comparing Case 1 and Case 3, Case 3 can clearly separate the classes, with almost no mixing between the "Trees" and "Self-blocking bricks" classes, which proves the superior feature extraction ability of the MCIE module. Furthermore, comparing Case 3 and Case 4, the feature clusters in Case 4 are more complete and compact. It better separates the "Asphalt" and "Self-blocking bricks" classes, and the distributions of the "Trees" and "Shadows" classes are more uniform. This indicates that incorporating the DACA module further enhances the features, making it easier for the network to extract features conducive to classification and reducing interference from neighboring classes.
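A visualization of this kind can be reproduced along the lines of the following minimal sketch, in which the feature array, label array, and perplexity are illustrative assumptions rather than the exact settings used for Figure 10.

```python
# Minimal sketch of a t-SNE feature-separability visualization, assuming features
# is an (n_samples, feature_dim) array extracted from the model and labels holds
# the corresponding class ids; the perplexity is an illustrative choice.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, labels: np.ndarray, perplexity: float = 30.0):
    embedded = TSNE(n_components=2, perplexity=perplexity, init='pca',
                    random_state=0).fit_transform(features)
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=3, cmap='tab10')
    plt.title('t-SNE of learned features')
    plt.show()

plot_tsne(np.random.rand(500, 512), np.random.randint(0, 9, size=500))
```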

4.3.5. Parameter Sensitivity Experiments

To assess the model comprehensively, we conduct parameter sensitivity experiments to analyze the effects of different settings on HSI classification, including the combination of S-branch and L-branch patch sizes, the depth of the multi-scale transformer, and the proportion of training samples. The detailed results are presented in Table 9 and Table 10 and Figure 11 and Figure 12.
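Organizationally, this kind of sensitivity study is a set of one-factor-at-a-time sweeps around a base configuration. The skeleton below shows one way to script such sweeps; the `train_and_evaluate` function is a placeholder, the base configuration is only inferred from the tables that follow, and every name here is an assumption rather than the authors' actual experiment code.

```python
def train_and_evaluate(spatial, p_s, p_l, depth, ratio):
    """Placeholder: build the model with these settings, train it, return OA (%)."""
    return 0.0

# Apparent base settings inferred from the tables below (assumption).
default = dict(spatial=15, p_s=3, p_l=5, depth=3, ratio=0.10)

# 1) Patch-size combinations at matched spatial sizes.
for spatial, p_s, p_l in [(10, 2, 5), (12, 2, 6), (15, 3, 5),
                          (18, 3, 6), (20, 4, 5), (24, 4, 6)]:
    oa = train_and_evaluate(spatial, p_s, p_l, default["depth"], default["ratio"])
    print(f"patches ({p_s},{p_l}) @ spatial {spatial}: OA={oa:.2f}")

# 2) Multi-scale transformer depth.
for depth in [1, 3, 5, 7, 9]:
    oa = train_and_evaluate(default["spatial"], default["p_s"], default["p_l"],
                            depth, default["ratio"])
    print(f"depth {depth}: OA={oa:.2f}")

# 3) Training-sample proportion (IP/Houston 2013 range shown).
for ratio in [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]:
    oa = train_and_evaluate(default["spatial"], default["p_s"], default["p_l"],
                            default["depth"], ratio)
    print(f"train ratio {ratio:.0%}: OA={oa:.2f}")
```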
1. Combinations of multi-scale patch sizes
Different scale sizes affect the classification performance of the model. Table 9 shows the classification results of the MDFFN with different scale combinations on the IP dataset. In the experiment, we set the spatial sizes to 10, 12, 15, 18, 20, and 24, selecting 2, 3, and 4 as the patch sizes for the S-branch and 5 and 6 for the L-branch. The results in Table 9 reveal that the OA, AA, and Kappa first increase and then decrease as the spatial size increases, with best values of 99.04% ± 0.12, 98.34% ± 0.42, and 98.90% ± 0.14. The reason is that although larger image blocks provide more spatial information, they also make the extraction of spatial–spectral features more complex. We also find that as the patch size increases, both the parameters and FLOPs increase, consuming more resources and potentially affecting model performance (e.g., the combinations of (4, 5) and (4, 6)). Moreover, the combination of (3, 6) for the S-branch and L-branch achieves the best performance: it extracts more diverse and richer features while avoiding excessive repetitive and redundant information. Therefore, model performance and efficiency should be weighed together, and the appropriate combination should be selected according to the specific application scenario. For real-time processing tasks, combinations with higher computational efficiency should be prioritized; conversely, for tasks requiring high accuracy, a higher computational cost can be accepted to obtain better performance, such as with the combination of (3, 6).
2. Number of multi-scale transformer layers
Depth is defined as the number of multi-scale transformer layers applied. We investigate the impact of different depths of the multi-scale transformer in MDFFN on the classification results. Generally, increasing the depth can capture more complex features and long-range dependencies, but an excessively deep architecture may attenuate dependency information and increase the computational burden. In the experiment, we set the depth to 1, 3, 5, 7, and 9. According to Table 10, on the IP dataset, as the number of layers increases from 1 to 5, the classification performance improves owing to the greater feature extraction capability of the model and the larger number of learnable parameters; specifically, the OA improves by 0.63%, from 98.28% ± 0.15 to 98.91% ± 0.04. However, further increasing the number of stacked layers does not bring an additional boost: when the depth increases from 5 to 9, the classification performance declines. This is likely because, with a small proportion of training samples, stacking many transformer layers introduces a large number of redundant parameters and increases model complexity, which leads to overfitting and ultimately degrades performance.
3. Proportion of training samples
To assess the generalization ability of the model, we conduct experiments with varying proportions of training samples. Taking into account the sample size of each dataset and the performance in the previous experiments, we set the training sample ratio for the IP and Houston 2013 datasets, which have relatively fewer samples, to range from 5% to 30% (with a 5% interval); for the PU dataset, the ratio is set from 3% to 13% (with a 2% interval). This more comprehensively reflects the impact of training data variations on model performance (a sketch of such a proportion-controlled split is given below). As shown in Figure 11, MDFFN achieves the highest OA of 96.36% and 98.78% at the 5% training sample ratio on the IP and Houston 2013 datasets, respectively, and the highest OA of 99.66% at the 3% ratio on the PU dataset. These results demonstrate that, compared with the other methods, our approach performs best under small-sample conditions. As the proportion of training samples increases, the performance of each method gradually improves and tends to stabilize, with MDFFN showing the most stable trend across the datasets. For example, over the 3–5% range on the PU dataset, MDFFN increases by only 0.18%, while M3D-DCNN increases by 2.44% and CrossViT by 1.67%, which demonstrates the model's stability. MDFFN also consistently outperforms the other methods across the different training sample ratios, confirming its superior classification performance and generalization ability.
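One common way to realize such proportion-controlled splits while keeping every class represented is a stratified random split; the snippet below uses scikit-learn on synthetic stand-in data and is offered only as an assumed illustration of the idea, not as the sampling code used in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labelled pixels: 'spectra' are feature vectors, 'labels' are class ids.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(10249, 200))    # e.g., IP has 10,249 labelled pixels
labels = rng.integers(0, 16, size=10249)   # 16 land-cover classes

for ratio in (0.05, 0.10, 0.20, 0.30):
    X_train, X_test, y_train, y_test = train_test_split(
        spectra, labels, train_size=ratio, stratify=labels, random_state=0)
    # Per-class training counts stay proportional to the class sizes.
    print(f"{ratio:.0%} training split -> {len(y_train)} training pixels")
```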
In addition, to further explore the performance of the proposed MDFFN in an imbalanced environment, we reduce the training sample ratio to 5% and choose different types of methods for comparison, namely 3D-CNN, CrossViT, and morphFormer; these methods outperformed the other methods of the same type on the minority classes in the previous experiments (with a 10% training sample ratio). According to Figure 12, the proposed MDFFN still performs best on the minority classes, particularly the seventh class, where its accuracy is significantly higher than that of 3D-CNN and CrossViT. This further validates the robust generalization capability of MDFFN on imbalanced data.
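The minority-class accuracies reported in Figure 12 are simply the per-class recall values obtained from a confusion matrix; the short sketch below shows the computation on synthetic predictions, with class indices chosen to mirror the IP classes highlighted above (everything in it is a placeholder).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic ground truth and predictions for a 16-class problem (IP-style labels).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 16, size=5000)
y_pred = np.where(rng.random(5000) < 0.9, y_true, rng.integers(0, 16, size=5000))

cm = confusion_matrix(y_true, y_pred, labels=np.arange(16))
per_class_acc = cm.diagonal() / cm.sum(axis=1)   # recall of each class

# Report minority classes of interest (zero-based indices for the 7th and 9th classes).
for cls in (6, 8):
    print(f"class {cls + 1}: accuracy = {per_class_acc[cls]:.3f}")
```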

5. Conclusions

In this paper, to effectively achieve fine feature extraction and deep feature fusion for HSI, we propose the multi-scale dual-aggregated feature fusion network (MDFFN) for HSI classification. Under the multi-scale framework, MDFFN combines the advantages of CNNs in local feature extraction and transformers in global feature extraction, resulting in outstanding classification performance. In particular, we design the MCIE module, which employs a multi-scale pooling operation to aggregate local features and highlight discriminative information. We also propose the DACA module, which uses a cross-scale interaction strategy to ensure the deep fusion of different features and adopts a dual aggregation mechanism to mitigate information loss, further enhancing the spatial–spectral features and providing high-level semantic information for the entire architecture. Extensive experiments on three HSI datasets demonstrate that the proposed MDFFN achieves more competitive classification performance than other state-of-the-art methods and can even learn critical features from small target samples, effectively reducing misclassification on imbalanced datasets. These results highlight the strong feature representation capability and superior generalization performance of the proposed method.
Building on this research, two directions of future work are worth pursuing. The first is exploring more lightweight and efficient architectures that further reduce model complexity while maintaining classification performance, thereby improving the model's applicability in complex real-world scenarios. The second is integrating multimodal data to improve classification accuracy and robustness, thus providing technical support for more reliable classification tasks.

Author Contributions

Conceptualization: G.S. and X.L.; methodology: G.S., X.L. and F.Z.; resources: Y.D. and X.Y.; software: X.L. and J.C. (Jinjie Chen); supervision: Y.D. and X.Y.; validation: J.C. (Jinjie Chen) and J.C. (Jiaxin Chen); visualization: J.C. (Jiaxin Chen); writing—original draft: G.S. and X.L.; writing—review and editing: G.S., X.L. and F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China grant number 62002122, the Major Program of Guangdong Provincial Department of Agriculture and Rural Affairs grant number 202405040306058, the Guangdong Provincial Education Science Planning Project grant number 2024GXJK371, the Guangdong Provincial Education Science Planning Project grant number 2023GXJK295, the Guangdong Provincial Philosophy and Social Science Planning Project grant number GD24XTS02, the South China Agricultural University Curriculum Civics Demonstration Project grant number kcsz2023091, and the Undergraduate Teaching Quality and Teaching Reform Engineering Project of South China Agricultural University grant number ZLGC202429.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors are grateful to the editor and reviewers for their constructive comments, which have significantly improved this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MDFFN: Multi-scale dual-aggregated feature fusion network
MCIE: Multi-scale convolutional information embedding
DACA: Dual aggregated cross-attention
HSI: Hyperspectral image
DL: Deep learning
2D-CNN: Two-dimensional convolutional neural networks
3D-CNN: Three-dimensional convolutional neural networks
HybridSN: Hybrid spectral CNN
M3D-DCNN: Multi-scale 3D deep convolutional neural network
MHSA: Multi-head self-attention
ViT: Vision transformer
SSFTT: Spectral–spatial feature tokenization transformer
MSNAT: Multi-scale neighborhood attention transformer
PCA: Principal component analysis
FFN: Feed-forward network
MLP: Multilayer perceptron
IP: Indian Pines
PU: Pavia University
T-SNE: T-distributed stochastic neighbor embedding

References

  1. Khan, A.; Vibhute, A.D.; Mali, S.; Patil, C.H. A Systematic Review on Hyperspectral Imaging Technology with a Machine and Deep Learning Methodology for Agricultural Applications. Ecol. Inform. 2022, 69, 101678. [Google Scholar] [CrossRef]
  2. Alboody, A.; Vandenbroucke, N.; Porebski, A.; Sawan, R.; Viudes, F.; Doyen, P.; Amara, R. A New Remote Hyperspectral Imaging System Embedded on an Unmanned Aquatic Drone for the Detection and Identification of Floating Plastic Litter Using Machine Learning. Remote Sens. 2023, 15, 3455. [Google Scholar] [CrossRef]
  3. Sousa, F.J.; Sousa, D.J. Hyperspectral Reconnaissance: Joint Characterization of the Spectral Mixture Residual Delineates Geologic Unit Boundaries in the White Mountains, CA. Remote Sens. 2022, 14, 4914. [Google Scholar] [CrossRef]
  4. Zhu, W.; Sun, X.; Zhang, Q. DCG-Net: Enhanced Hyperspectral Image Classification with Dual-Branch Convolutional Neural Network and Graph Convolutional Neural Network Integration. Electronics 2024, 13, 3271. [Google Scholar] [CrossRef]
  5. Rao, W.; Gao, L.; Qu, Y.; Sun, X.; Zhang, B.; Chanussot, J. Siamese Transformer Network for Hyperspectral Image Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5526419. [Google Scholar] [CrossRef]
  6. Zhang, H.; Liu, H.; Yang, R.; Wang, W.; Luo, Q.; Tu, C. Hyperspectral Image Classification Based on Double-Branch Multi-Scale Dual-Attention Network. Remote Sens. 2024, 16, 2051. [Google Scholar] [CrossRef]
  7. Liu, G.; Wang, L.; Liu, D.; Fei, L.; Yang, J. Hyperspectral Image Classification Based on Non-Parallel Support Vector Machine. Remote Sens. 2022, 14, 2447. [Google Scholar] [CrossRef]
  8. Yuan, S.; Sun, Y.; He, W.; Gu, Q.; Xu, S.; Mao, Z.; Tu, S. MSLM-RF: A Spatial Feature Enhanced Random Forest for On-Board Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5534717. [Google Scholar] [CrossRef]
  9. Wang, X. Hyperspectral Image Classification Powered by Khatri-Rao Decomposition-Based Multinomial Logistic Regression. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5530015. [Google Scholar] [CrossRef]
  10. Lv, N.; Han, Z.; Chen, C.; Feng, Y.; Su, T.; Goudos, S.; Wan, S. Encoding Spectral-Spatial Features for Hyperspectral Image Classification in the Satellite Internet of Things System. Remote Sens. 2021, 13, 3561. [Google Scholar] [CrossRef]
  11. Chen, C.; Ma, Y.; Ren, G. Hyperspectral Classification Using Deep Belief Networks Based on Conjugate Gradient Update and Pixel-Centric Spectral Block Features. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4060–4069. [Google Scholar] [CrossRef]
  12. Yu, C.; Han, R.; Song, M.; Liu, C.; Chang, C.-I. Feedback Attention-Based Dense CNN for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5501916. [Google Scholar] [CrossRef]
  13. Mei, S.; Li, X.; Liu, X.; Cai, H.; Du, Q. Hyperspectral Image Classification Using Attention-Based Bidirectional Long Short-Term Memory Network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5509612. [Google Scholar] [CrossRef]
  14. Wang, J.; Guo, S.; Huang, R.; Li, L.; Zhang, X.; Jiao, L. Dual-Channel Capsule Generation Adversarial Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5501016. [Google Scholar] [CrossRef]
  15. Ye, Z.; Li, C.; Liu, Q.; Bai, L.; Fowler, J. Computationally Lightweight Hyperspectral Image Classification Using a Multiscale Depthwise Convolutional Network with Channel Attention. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929v2. [Google Scholar]
  17. Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Lian, X.; Jiang, Z.; Hou, Q.; Feng, J. DeepViT: Towards Deeper Vision Transformer. arXiv 2021, arXiv:2103.11886. [Google Scholar]
  18. Chen, C.-F.; Fan, Q.; Panda, R. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. 2021. Available online: https://openaccess.thecvf.com/content/ICCV2021/papers/Chen_CrossViT_Cross-Attention_Multi-Scale_Vision_Transformer_for_Image_Classification_ICCV_2021_paper.pdf (accessed on 7 September 2024).
  19. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522214. [Google Scholar] [CrossRef]
  20. Roy, S.K.; Deria, A.; Shah, C.; Haut, J.M.; Du, Q.; Plaza, A. Spectral–Spatial Morphological Attention Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5503615. [Google Scholar] [CrossRef]
  21. Shu, Z.; Wang, Y.; Yu, Z. Dual Attention Transformer Network for Hyperspectral Image Classification. Eng. Appl. Artif. Intell. 2024, 127, 107351. [Google Scholar] [CrossRef]
  22. Qiao, X.; Roy, S.K.; Huang, W. Multiscale Neighborhood Attention Transformer with Optimized Spatial Pattern for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5523815. [Google Scholar] [CrossRef]
  23. Wang, X.; Sun, L.; Lu, C.; Li, B. A Novel Transformer Network with a CNN-Enhanced Cross-Attention Mechanism for Hyperspectral Image Classification. Remote Sens. 2024, 16, 1180. [Google Scholar] [CrossRef]
  24. Wang, W.; Liu, L.; Zhang, T.; Shen, J.; Wang, J.; Li, J. Hyper-ES2T: Efficient Spatial–Spectral Transformer for the Classification of Hyperspectral Remote Sensing Images. Int. J. Appl. Earth Obs. Geoinf. 2022, 113, 103005. [Google Scholar] [CrossRef]
  25. Gong, H.; Li, Q.; Li, C.; Dai, H.; He, Z.; Wang, W.; Li, H.; Han, F.; Tuniyazi, A.; Mu, T. Multiscale Information Fusion for Hyperspectral Image Classification Based on Hybrid 2D-3D CNN. Remote Sens. 2021, 13, 2268. [Google Scholar] [CrossRef]
  26. Wang, A.; Zhang, K.; Wu, H.; Iwahori, Y.; Chen, H. Multi-Scale Residual Spectral–Spatial Attention Combined with Improved Transformer for Hyperspectral Image Classification. Electronics 2024, 13, 1061. [Google Scholar] [CrossRef]
  27. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep Convolutional Neural Networks for Hyperspectral Image Classification. J. Sens. 2015, 2015, 258619. [Google Scholar] [CrossRef]
  28. Hamouda, M.; Ettabaa, K.S.; Bouhlel, M.S. Smart Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks. IET Image Process. 2020, 14, 1999–2005. [Google Scholar] [CrossRef]
  29. Fei, X.; Wu, S.; Miao, J.; Wang, G.; Sun, L. Lightweight-VGG: A Fast Deep Learning Architecture Based on Dimensionality Reduction and Nonlinear Enhancement for Hyperspectral Image Classification. Remote Sens. 2024, 16, 259. [Google Scholar] [CrossRef]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  31. Hu, T.; Yuan, J.; Wang, X.; Yan, C.; Ju, X. Spectral-Spatial Features Extraction of Hyperspectral Remote Sensing Oil Spill Imagery Based on Convolutional Neural Networks. IEEE Access 2022, 10, 127969–127983. [Google Scholar] [CrossRef]
  32. Zhang, X.; Guo, Y.; Zhang, X. Hyperspectral Image Classification Based on Optimized Convolutional Neural Networks with 3D Stacked Blocks. Earth Sci. Inf. 2022, 15, 383–395. [Google Scholar] [CrossRef]
  33. Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. HybridSN: Exploring 3-D–2-D CNN Feature Hierarchy for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2020, 17, 277–281. [Google Scholar] [CrossRef]
  34. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5518615. [Google Scholar] [CrossRef]
  35. Yang, X.; Cao, W.; Lu, Y.; Zhou, Y. Hyperspectral Image Transformer Classification Networks. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5528715. [Google Scholar] [CrossRef]
  36. He, M.; Li, B.; Chen, H. Multi-Scale 3D Deep Convolutional Neural Network for Hyperspectral Image Classification. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3904–3908. [Google Scholar]
  37. Xu, Q.; Xiao, Y.; Wang, D.; Luo, B. CSA-MSO3DCNN: Multiscale Octave 3D CNN with Channel and Spatial Attention for Hyperspectral Image Classification. Remote Sens. 2020, 12, 188. [Google Scholar] [CrossRef]
  38. Chen, Y.; Wang, X.; Zhang, J.; Shang, X.; Hu, Y.; Zhang, S.; Wang, J. A New Dual-Branch Embedded Multivariate Attention Network for Hyperspectral Remote Sensing Classification. Remote Sens. 2024, 16, 2029. [Google Scholar] [CrossRef]
  39. Li, W.; Chen, H.; Liu, Q.; Liu, H.; Wang, Y.; Gui, G. Attention Mechanism and Depthwise Separable Convolution Aided 3DCNN for Hyperspectral Remote Sensing Image Classification. Remote Sens. 2022, 14, 2215. [Google Scholar] [CrossRef]
  40. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. 2021. Available online: https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.pdf (accessed on 10 September 2024).
  41. He, W.; Huang, W.; Liao, S.; Xu, Z.; Yan, J. CSiT: A Multi-Scale Vision Transformer for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 9266–9277. [Google Scholar] [CrossRef]
  42. Chen, H.; Zendehdel, N.; Leu, M.C.; Yin, Z. Fine-Grained Activity Classification in Assembly Based on Multi-Visual Modalities. J. Intell. Manuf. 2024, 35, 2215–2233. [Google Scholar] [CrossRef]
  43. Chen, H.; Miao, F.; Chen, Y.; Xiong, Y.; Chen, T. A Hyperspectral Image Classification Method Using Multifeature Vectors and Optimized KELM. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2781–2795. [Google Scholar] [CrossRef]
  44. Ma, Y.; Wang, S.; Du, W.; Cheng, X. An Improved 3D-2D Convolutional Neural Network Based on Feature Optimization for Hyperspectral Image Classification. IEEE Access 2023, 11, 28263–28279. [Google Scholar] [CrossRef]
  45. Gu, Q.; Luan, H.; Huang, K.; Sun, Y. Hyperspectral Image Classification Using Multi-Scale Lightweight Transformer. Electronics 2024, 13, 949. [Google Scholar] [CrossRef]
  46. Yang, X.; Ye, Y.; Li, X.; Lau, R.Y.K.; Zhang, X.; Huang, X. Hyperspectral Image Classification with Deep Learning Models. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5408–5423. [Google Scholar] [CrossRef]
  47. van der Maaten, L.; Hinton, G. Visualizing Data Using T-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. Overview of the MDFFN for HSI classification. MDFFN comprises a multi-scale convolutional information embedding (MCIE) module and a stack of D multi-scale transformer encoders. Each multi-scale transformer encoder employs two different branches (i.e., S-branch and L-branch) to process image tokens of different sizes (P_S and P_L, P_S < P_L) and uses the dual aggregated cross-attention (DACA) module to achieve efficient fusion. In the figure, * indicates the CLS tokens.
Figure 2. Multi-scale convolutional information embedding module.
Figure 3. Dual aggregated cross-attention module for S-branch.
Figure 4. Visualization of confusion matrices for different methods on the IP dataset: (a) 2D-CNN, (b) 3D-CNN, (c) HybridSN, (d) M3D-DCNN, (e) ViT, (f) CrossViT, (g) DeepViT, (h) SpectralFormer, (i) SSFTT, (j) morphFormer, (k) MSNAT, (l) MDFFN.
Figure 5. Visualization of confusion matrices for different methods on the PU dataset: (a) 2D-CNN, (b) 3D-CNN, (c) HybridSN, (d) M3D-DCNN, (e) ViT, (f) CrossViT, (g) DeepViT, (h) SpectralFormer, (i) SSFTT, (j) morphFormer, (k) MSNAT, (l) MDFFN.
Figure 6. Visualizations of classification results on IP dataset. (a) Input image. (b) Ground truth. (c) 2D-CNN. (d) 3D-CNN. (e) HybridSN. (f) M3D-DCNN. (g) ViT. (h) CrossViT. (i) DeepViT. (j) SpectralFormer. (k) SSFTT. (l) morphFormer. (m) MSNAT. (n) MDFFN.
Figure 7. Visualizations of classification results on PU dataset. (a) Input image. (b) Ground truth. (c) 2D-CNN. (d) 3D-CNN. (e) HybridSN. (f) M3D-DCNN. (g) ViT. (h) CrossViT. (i) DeepViT. (j) SpectralFormer. (k) SSFTT. (l) morphFormer. (m) MSNAT. (n) MDFFN.
Figure 8. Visualizations of classification results on Houston 2013 dataset. (a) Input image. (b) Ground truth. (c) 2D-CNN. (d) 3D-CNN. (e) HybridSN. (f) M3D-DCNN. (g) ViT. (h) CrossViT. (i) DeepViT. (j) SpectralFormer. (k) SSFTT. (l) morphFormer. (m) MSNAT. (n) MDFFN.
Figure 9. Classification accuracy of classes for different cases on the IP dataset.
Figure 10. Visualization of feature separability for different cases on the PU dataset. (a) Case 1. (b) Case 2. (c) Case 3. (d) Case 4.
Figure 11. The performance of the MDFFN in terms of OA (%) with different training samples on three HSI datasets. (a) Indian Pines. (b) Pavia University. (c) Houston 2013.
Figure 12. Classification accuracy of minority classes on the IP dataset with a 5% training sample ratio.
Table 1. Land-cover classes and the sample number of training and testing on Indian Pines dataset.

| No. | Class | Training | Testing |
| --- | --- | --- | --- |
| 1 | Alfalfa | 5 | 41 |
| 2 | Corn—no till | 143 | 1285 |
| 3 | Corn—min till | 83 | 747 |
| 4 | Corn | 24 | 213 |
| 5 | Grass—pasture | 48 | 435 |
| 6 | Grass—trees | 73 | 657 |
| 7 | Grass—pasture–mowed | 3 | 25 |
| 8 | Hay—windrowed | 48 | 430 |
| 9 | Oats | 2 | 18 |
| 10 | Soybean—no till | 97 | 875 |
| 11 | Soybean—min till | 245 | 2210 |
| 12 | Soybean—clean | 59 | 534 |
| 13 | Wheat | 20 | 185 |
| 14 | Woods | 126 | 1139 |
| 15 | Building—grass-trees-drives | 39 | 347 |
| 16 | Stone—steel-towers | 9 | 84 |
| Total | | 1024 | 9225 |
Table 2. Land-cover classes and the sample numbers for training and testing in Pavia University dataset.

| No. | Class | Training | Testing |
| --- | --- | --- | --- |
| 1 | Asphalt | 663 | 5968 |
| 2 | Meadows | 1865 | 16,784 |
| 3 | Gravel | 210 | 1889 |
| 4 | Trees | 306 | 2758 |
| 5 | Painted metal sheets | 134 | 1211 |
| 6 | Bare soil | 503 | 4526 |
| 7 | Bitumen | 133 | 1197 |
| 8 | Self-blocking bricks | 368 | 3314 |
| 9 | Shadows | 95 | 852 |
| Total | | 4277 | 38,499 |
Table 3. Land-cover classes and the sample numbers of training and testing on Houston 2013 dataset.

| No. | Class | Training | Testing |
| --- | --- | --- | --- |
| 1 | Healthy grass | 125 | 1126 |
| 2 | Stressed grass | 125 | 1129 |
| 3 | Synthetic grass | 70 | 627 |
| 4 | Tree | 124 | 1120 |
| 5 | Soil | 124 | 1118 |
| 6 | Water | 33 | 292 |
| 7 | Residential | 127 | 1141 |
| 8 | Commercial | 124 | 1120 |
| 9 | Road | 125 | 1127 |
| 10 | Highway | 123 | 1104 |
| 11 | Railway | 123 | 1112 |
| 12 | Parking lot 1 | 123 | 1110 |
| 13 | Parking lot 2 | 47 | 422 |
| 14 | Tennis court | 43 | 385 |
| 15 | Running track | 66 | 594 |
| Total | | 1502 | 13,527 |
Table 4. Comparison results on the IP dataset. (CNN-based methods: 2D-CNN, 3D-CNN, HybridSN, M3D-DCNN; transformer-based methods: ViT, CrossViT, DeepViT, SpectralFormer; CNN-transformer hybrid-based methods: SSFTT, morphFormer, MSNAT, MDFFN.)

| Class | 2D-CNN | 3D-CNN | HybridSN | M3D-DCNN | ViT | CrossViT | DeepViT | SpectralFormer | SSFTT | morphFormer | MSNAT | MDFFN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 42.68 ± 7.50 | 79.76 ± 8.93 | 24.39 ± 10.12 | 32.68 ± 12.20 | 80.73 ± 9.66 | 70.24 ± 9.80 | 89.02 ± 2.73 | 73.41 ± 4.15 | 94.15 ± 5.89 | 96.59 ± 5.79 | 71.46 ± 12.91 | 93.90 ± 2.50 |
| 2 | 90.05 ± 0.96 | 91.05 ± 1.15 | 90.29 ± 4.47 | 81.73 ± 2.24 | 87.22 ± 0.92 | 85.75 ± 0.83 | 87.99 ± 1.01 | 91.23 ± 0.47 | 95.14 ± 0.39 | 94.44 ± 0.63 | 93.81 ± 2.11 | 95.82 ± 0.21 |
| 3 | 95.03 ± 1.42 | 93.98 ± 1.53 | 92.37 ± 5.99 | 65.92 ± 9.88 | 92.77 ± 1.36 | 93.64 ± 1.29 | 92.78 ± 1.49 | 96.12 ± 0.49 | 99.59 ± 0.40 | 98.18 ± 0.64 | 99.09 ± 0.91 | 99.91 ± 0.19 |
| 4 | 78.64 ± 3.06 | 75.73 ± 8.70 | 63.66 ± 15.22 | 30.19 ± 10.62 | 77.28 ± 5.88 | 89.01 ± 2.79 | 90.05 ± 2.39 | 91.69 ± 2.75 | 100.00 ± 0.00 | 99.20 ± 1.09 | 95.73 ± 3.16 | 99.20 ± 0.37 |
| 5 | 95.29 ± 1.43 | 98.09 ± 1.31 | 94.76 ± 1.35 | 91.08 ± 1.93 | 96.14 ± 0.85 | 96.16 ± 0.73 | 96.90 ± 0.34 | 96.94 ± 0.50 | 99.52 ± 0.30 | 99.15 ± 0.81 | 95.24 ± 1.10 | 99.38 ± 0.40 |
| 6 | 98.40 ± 0.51 | 97.93 ± 0.49 | 97.78 ± 3.40 | 96.41 ± 1.68 | 99.24 ± 0.34 | 98.58 ± 0.38 | 98.25 ± 0.64 | 99.07 ± 0.38 | 99.73 ± 0.40 | 99.36 ± 0.59 | 97.78 ± 1.19 | 99.01 ± 0.27 |
| 7 | 34.40 ± 11.48 | 97.20 ± 5.08 | 40.00 ± 24.98 | 30.40 ± 19.03 | 97.60 ± 2.65 | 92.00 ± 7.16 | 64.40 ± 12.58 | 95.20 ± 5.00 | 78.40 ± 28.41 | 100.00 ± 0.00 | 94.40 ± 6.97 | 100.00 ± 0.00 |
| 8 | 99.93 ± 0.15 | 99.51 ± 0.49 | 99.91 ± 0.15 | 99.74 ± 0.28 | 99.98 ± 0.07 | 100.00 ± 0.00 | 100.00 ± 0.00 | 99.88 ± 0.21 | 100.00 ± 0.00 | 99.98 ± 0.07 | 99.93 ± 0.15 | 99.95 ± 0.09 |
| 9 | 13.33 ± 3.69 | 58.89 ± 16.14 | 34.44 ± 10.48 | 28.89 ± 24.70 | 60.00 ± 12.37 | 80.00 ± 11.97 | 39.44 ± 9.44 | 50.00 ± 8.61 | 36.11 ± 14.96 | 70.56 ± 16.11 | 48.89 ± 19.85 | 87.78 ± 4.16 |
| 10 | 93.25 ± 1.22 | 94.39 ± 2.00 | 94.19 ± 2.11 | 81.82 ± 3.28 | 94.69 ± 0.85 | 91.99 ± 1.00 | 92.77 ± 1.33 | 94.59 ± 0.71 | 98.75 ± 0.63 | 98.56 ± 0.84 | 97.25 ± 1.24 | 99.01 ± 0.40 |
| 11 | 98.11 ± 0.34 | 97.38 ± 0.69 | 97.97 ± 0.86 | 93.92 ± 1.12 | 96.58 ± 1.17 | 96.36 ± 0.68 | 97.03 ± 0.42 | 97.69 ± 0.36 | 99.47 ± 0.21 | 99.06 ± 0.40 | 98.05 ± 1.06 | 99.47 ± 0.17 |
| 12 | 79.48 ± 3.38 | 84.61 ± 3.61 | 70.86 ± 18.74 | 59.10 ± 10.20 | 79.74 ± 3.48 | 77.96 ± 2.07 | 83.41 ± 2.86 | 93.65 ± 1.37 | 97.98 ± 0.61 | 95.34 ± 1.24 | 96.67 ± 1.40 | 97.72 ± 0.38 |
| 13 | 98.32 ± 1.04 | 98.38 ± 1.57 | 93.57 ± 10.55 | 94.97 ± 3.71 | 100.00 ± 0.00 | 99.73 ± 0.27 | 97.51 ± 1.80 | 97.73 ± 0.99 | 99.62 ± 0.49 | 98.65 ± 0.97 | 99.24 ± 0.69 | 100.00 ± 0.00 |
| 14 | 99.71 ± 0.17 | 99.41 ± 0.70 | 99.65 ± 0.29 | 97.39 ± 1.14 | 99.27 ± 0.27 | 98.72 ± 0.23 | 98.36 ± 0.36 | 99.30 ± 0.41 | 99.97 ± 0.04 | 99.95 ± 0.07 | 99.51 ± 0.26 | 99.99 ± 0.03 |
| 15 | 95.68 ± 2.99 | 95.53 ± 2.07 | 55.62 ± 22.05 | 97.39 ± 1.14 | 93.20 ± 1.69 | 97.03 ± 0.78 | 93.11 ± 1.52 | 94.24 ± 1.91 | 99.08 ± 1.11 | 98.50 ± 0.82 | 95.79 ± 2.05 | 99.65 ± 0.28 |
| 16 | 75.00 ± 2.77 | 84.40 ± 4.75 | 60.60 ± 33.20 | 53.45 ± 17.40 | 96.07 ± 3.11 | 97.86 ± 1.39 | 82.62 ± 6.63 | 96.43 ± 2.71 | 88.33 ± 3.40 | 96.55 ± 1.95 | 91.67 ± 6.00 | 97.98 ± 0.76 |
| OA (%) | 94.04 ± 0.34 | 94.80 ± 0.78 | 91.23 ± 3.39 | 83.57 ± 1.86 | 93.82 ± 0.41 | 93.57 ± 0.23 | 93.95 ± 0.36 | 96.01 ± 0.14 | 98.52 ± 0.15 | 98.15 ± 0.21 | 97.10 ± 0.81 | 98.85 ± 0.09 |
| AA (%) | 80.46 ± 1.03 | 90.39 ± 1.76 | 75.63 ± 7.21 | 68.10 ± 4.25 | 90.66 ± 1.38 | 91.56 ± 1.02 | 87.73 ± 1.19 | 91.70 ± 0.88 | 92.86 ± 2.33 | 96.50 ± 1.15 | 92.16 ± 2.22 | 98.05 ± 0.32 |
| Kappa (%) | 93.18 ± 0.39 | 94.06 ± 0.90 | 89.95 ± 3.91 | 81.03 ± 2.18 | 92.94 ± 0.46 | 92.65 ± 0.26 | 93.09 ± 0.42 | 95.45 ± 0.16 | 98.32 ± 0.17 | 97.89 ± 0.23 | 96.70 ± 0.92 | 98.69 ± 0.10 |
The bold entities indicate the highest value.
Table 5. Comparison results on the PU dataset. (CNN-based methods: 2D-CNN, 3D-CNN, HybridSN, M3D-DCNN; transformer-based methods: ViT, CrossViT, DeepViT, SpectralFormer; CNN-transformer hybrid-based methods: SSFTT, morphFormer, MSNAT, MDFFN.)

| Class | 2D-CNN | 3D-CNN | HybridSN | M3D-DCNN | ViT | CrossViT | DeepViT | SpectralFormer | SSFTT | morphFormer | MSNAT | MDFFN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 99.52 ± 0.11 | 99.52 ± 0.18 | 99.48 ± 0.68 | 99.14 ± 0.43 | 99.38 ± 0.19 | 99.40 ± 0.19 | 99.79 ± 0.08 | 99.63 ± 0.15 | 99.94 ± 0.05 | 99.91 ± 0.05 | 99.89 ± 0.14 | 99.99 ± 0.02 |
| 2 | 100.00 ± 0.00 | 99.98 ± 0.01 | 100.00 ± 0.01 | 99.95 ± 0.03 | 99.98 ± 0.01 | 99.99 ± 0.00 | 99.98 ± 0.01 | 99.98 ± 0.01 | 99.99 ± 0.01 | 99.99 ± 0.01 | 99.99 ± 0.01 | 100.00 ± 0.00 |
| 3 | 95.80 ± 0.88 | 95.98 ± 0.98 | 97.20 ± 1.00 | 91.20 ± 0.93 | 93.01 ± 1.29 | 95.85 ± 0.52 | 95.64 ± 0.94 | 95.99 ± 0.87 | 99.49 ± 0.16 | 98.54 ± 0.48 | 98.34 ± 1.33 | 99.70 ± 0.27 |
| 4 | 98.86 ± 0.13 | 98.85 ± 0.43 | 98.76 ± 1.12 | 98.13 ± 0.46 | 99.65 ± 0.14 | 98.98 ± 0.18 | 98.42 ± 0.40 | 99.19 ± 0.19 | 98.61 ± 0.24 | 98.60 ± 0.34 | 99.72 ± 0.13 | 99.85 ± 0.06 |
| 5 | 99.90 ± 0.09 | 99.84 ± 0.15 | 99.70 ± 0.63 | 99.93 ± 0.20 | 100.00 ± 0.00 | 100.00 ± 0.00 | 99.88 ± 0.12 | 99.98 ± 0.05 | 99.97 ± 0.08 | 99.97 ± 0.07 | 99.99 ± 0.02 | 100.00 ± 0.00 |
| 6 | 100.00 ± 0.00 | 99.98 ± 0.03 | 99.99 ± 0.01 | 99.75 ± 0.26 | 99.59 ± 0.16 | 99.83 ± 0.08 | 99.96 ± 0.04 | 99.85 ± 0.08 | 100.00 ± 0.00 | 100.00 ± 0.00 | 99.99 ± 0.04 | 100.00 ± 0.00 |
| 7 | 99.74 ± 0.28 | 99.82 ± 0.26 | 98.65 ± 3.49 | 98.72 ± 0.79 | 99.97 ± 0.05 | 99.94 ± 0.07 | 99.84 ± 0.26 | 99.87 ± 0.23 | 100.00 ± 0.00 | 100.00 ± 0.00 | 99.76 ± 0.34 | 100.00 ± 0.00 |
| 8 | 97.36 ± 0.72 | 98.22 ± 0.96 | 96.40 ± 2.63 | 95.00 ± 0.78 | 98.37 ± 0.30 | 97.40 ± 0.43 | 98.96 ± 0.28 | 97.91 ± 0.50 | 98.77 ± 0.45 | 98.82 ± 0.21 | 99.19 ± 0.46 | 99.86 ± 0.10 |
| 9 | 97.52 ± 0.65 | 96.10 ± 1.42 | 94.42 ± 4.89 | 95.31 ± 1.52 | 100.00 ± 0.00 | 99.84 ± 0.16 | 99.31 ± 0.60 | 98.37 ± 0.61 | 97.23 ± 0.32 | 97.56 ± 0.49 | 99.82 ± 0.39 | 100.00 ± 0.00 |
| OA (%) | 99.34 ± 0.09 | 99.39 ± 0.12 | 99.21 ± 0.48 | 98.67 ± 0.17 | 99.34 ± 0.07 | 99.38 ± 0.04 | 99.51 ± 0.06 | 99.44 ± 0.06 | 99.69 ± 0.04 | 99.65 ± 0.06 | 99.80 ± 0.14 | 99.96 ± 0.02 |
| AA (%) | 98.74 ± 0.18 | 98.70 ± 0.27 | 98.29 ± 1.13 | 97.46 ± 0.33 | 98.88 ± 0.13 | 99.02 ± 0.06 | 99.09 ± 0.11 | 98.97 ± 0.13 | 99.33 ± 0.04 | 99.26 ± 0.12 | 99.63 ± 0.29 | 99.93 ± 0.04 |
| Kappa (%) | 99.13 ± 0.12 | 99.19 ± 0.16 | 98.95 ± 0.63 | 98.24 ± 0.22 | 99.12 ± 0.09 | 99.18 ± 0.05 | 99.36 ± 0.07 | 99.26 ± 0.07 | 99.59 ± 0.05 | 99.54 ± 0.07 | 99.73 ± 0.19 | 99.95 ± 0.02 |
The bold entities indicate the highest value.
Table 6. Comparison results on the Houston 2013 dataset. (CNN-based methods: 2D-CNN, 3D-CNN, HybridSN, M3D-DCNN; transformer-based methods: ViT, CrossViT, DeepViT, SpectralFormer; CNN-transformer hybrid-based methods: SSFTT, morphFormer, MSNAT, MDFFN.)

| Class | 2D-CNN | 3D-CNN | HybridSN | M3D-DCNN | ViT | CrossViT | DeepViT | SpectralFormer | SSFTT | morphFormer | MSNAT | MDFFN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 95.97 ± 2.17 | 98.39 ± 0.68 | 99.25 ± 0.36 | 96.70 ± 1.06 | 97.08 ± 0.76 | 98.05 ± 1.20 | 98.53 ± 0.94 | 99.77 ± 0.24 | 96.58 ± 1.86 | 98.57 ± 0.64 | 97.80 ± 1.02 | 99.25 ± 0.37 |
| 2 | 99.39 ± 0.42 | 99.42 ± 0.62 | 99.77 ± 0.16 | 99.62 ± 0.20 | 99.76 ± 0.06 | 99.86 ± 0.06 | 99.81 ± 0.29 | 99.89 ± 0.07 | 99.19 ± 0.52 | 99.17 ± 0.34 | 99.65 ± 0.16 | 99.89 ± 0.04 |
| 3 | 99.79 ± 0.07 | 99.98 ± 0.05 | 99.98 ± 0.05 | 99.94 ± 0.08 | 100.00 ± 0.00 | 99.92 ± 0.08 | 99.82 ± 0.13 | 100.00 ± 0.00 | 99.62 ± 0.11 | 99.78 ± 0.15 | 99.98 ± 0.05 | 99.87 ± 0.06 |
| 4 | 97.79 ± 0.75 | 96.42 ± 1.51 | 97.59 ± 1.24 | 96.88 ± 1.69 | 99.04 ± 0.44 | 98.88 ± 0.49 | 97.89 ± 0.84 | 98.56 ± 0.50 | 97.27 ± 1.00 | 96.81 ± 1.45 | 99.50 ± 0.86 | 99.29 ± 0.55 |
| 5 | 99.95 ± 0.08 | 99.79 ± 0.33 | 100.00 ± 0.00 | 99.73 ± 0.40 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 99.91 ± 0.27 | 100.00 ± 0.00 | 99.99 ± 0.03 | 99.99 ± 0.03 | 100.00 ± 0.00 |
| 6 | 90.65 ± 3.35 | 96.61 ± 1.73 | 98.29 ± 2.16 | 93.15 ± 2.56 | 99.97 ± 0.10 | 99.93 ± 0.21 | 99.42 ± 0.49 | 99.76 ± 0.43 | 99.76 ± 0.41 | 99.90 ± 0.31 | 99.01 ± 1.42 | 100.00 ± 0.00 |
| 7 | 93.81 ± 1.92 | 95.07 ± 0.98 | 94.77 ± 1.61 | 93.19 ± 1.96 | 97.43 ± 0.56 | 96.93 ± 0.59 | 96.26 ± 0.57 | 96.61 ± 0.61 | 96.03 ± 1.20 | 95.03 ± 0.84 | 97.03 ± 1.04 | 98.20 ± 0.44 |
| 8 | 90.45 ± 3.96 | 95.01 ± 1.54 | 97.66 ± 0.82 | 94.10 ± 0.99 | 97.45 ± 0.67 | 97.62 ± 0.46 | 98.02 ± 0.40 | 97.75 ± 0.60 | 95.37 ± 1.25 | 96.95 ± 1.17 | 95.49 ± 1.27 | 98.18 ± 0.74 |
| 9 | 92.83 ± 2.37 | 95.97 ± 2.31 | 97.08 ± 1.26 | 93.01 ± 1.96 | 95.47 ± 0.94 | 97.24 ± 0.76 | 97.95 ± 0.99 | 97.83 ± 0.75 | 97.97 ± 1.65 | 97.66 ± 1.44 | 94.08 ± 2.22 | 98.97 ± 0.39 |
| 10 | 99.36 ± 0.62 | 99.48 ± 0.44 | 99.68 ± 0.27 | 98.31 ± 1.74 | 99.57 ± 0.29 | 99.66 ± 0.22 | 99.83 ± 0.14 | 99.66 ± 0.28 | 99.81 ± 0.41 | 99.77 ± 0.24 | 99.81 ± 0.17 | 100.00 ± 0.00 |
| 11 | 96.91 ± 0.77 | 99.10 ± 0.54 | 99.81 ± 0.19 | 98.71 ± 0.87 | 99.90 ± 0.14 | 99.99 ± 0.03 | 99.98 ± 0.04 | 99.54 ± 0.29 | 99.73 ± 0.39 | 99.18 ± 0.53 | 97.37 ± 2.59 | 100.00 ± 0.00 |
| 12 | 97.95 ± 1.51 | 98.77 ± 0.65 | 99.39 ± 0.18 | 98.70 ± 0.30 | 98.75 ± 0.34 | 99.04 ± 0.13 | 98.86 ± 0.23 | 99.19 ± 0.13 | 99.50 ± 0.14 | 99.45 ± 0.44 | 99.15 ± 0.28 | 99.61 ± 0.04 |
| 13 | 85.92 ± 2.74 | 88.15 ± 2.92 | 90.50 ± 4.56 | 79.27 ± 5.44 | 81.59 ± 2.80 | 84.79 ± 3.04 | 89.27 ± 1.89 | 94.41 ± 1.06 | 92.37 ± 0.83 | 93.06 ± 2.91 | 95.05 ± 1.34 | 99.10 ± 0.49 |
| 14 | 99.51 ± 0.34 | 99.22 ± 0.94 | 99.97 ± 0.08 | 99.95 ± 0.16 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 99.82 ± 0.17 | 100.00 ± 0.00 | 100.00 ± 0.00 | 98.99 ± 1.04 | 100.00 ± 0.00 |
| 15 | 100.00 ± 0.00 | 100.00 ± 0.00 | 99.98 ± 0.05 | 100.00 ± 0.00 | 100.00 ± 0.00 | 99.83 ± 0.18 | 99.48 ± 0.24 | 100.00 ± 0.00 | 100.00 ± 0.00 | 99.95 ± 0.11 | 99.95 ± 0.11 | 100.00 ± 0.00 |
| OA (%) | 96.38 ± 0.54 | 97.66 ± 0.40 | 98.41 ± 0.30 | 96.62 ± 0.55 | 98.13 ± 0.12 | 98.45 ± 0.19 | 98.55 ± 0.20 | 98.88 ± 0.12 | 98.19 ± 0.27 | 98.32 ± 0.35 | 98.12 ± 0.34 | 99.42 ± 0.07 |
| AA (%) | 96.02 ± 0.54 | 97.43 ± 0.37 | 98.25 ± 0.50 | 96.08 ± 0.65 | 97.73 ± 0.19 | 98.12 ± 0.26 | 98.34 ± 0.22 | 98.85 ± 0.11 | 98.21 ± 0.22 | 98.35 ± 0.39 | 98.19 ± 0.29 | 99.49 ± 0.07 |
| Kappa (%) | 96.08 ± 0.59 | 97.47 ± 0.44 | 98.28 ± 0.33 | 96.34 ± 0.60 | 97.98 ± 0.12 | 98.33 ± 0.20 | 98.43 ± 0.22 | 98.79 ± 0.13 | 98.05 ± 0.29 | 98.18 ± 0.38 | 97.97 ± 0.37 | 99.37 ± 0.08 |
The bold entities indicate the highest value.
Table 7. Comparison results of efficiency on the IP dataset.

| Methods | Params (MB) | FLOPs (GB) | Training Time (s) | Testing Time (s) |
| --- | --- | --- | --- | --- |
| 2D-CNN | 0.39 | 0.02 | 10.04 | 0.62 |
| 3D-CNN | 0.26 | 0.04 | 17.36 | 0.90 |
| HybridSN | 1.19 | 0.11 | 14.46 | 0.88 |
| M3D-DCNN | 0.21 | 0.03 | 16.17 | 0.78 |
| ViT | 9.60 | 0.50 | 27.59 | 1.19 |
| CrossViT | 69.87 | 2.29 | 93.68 | 4.07 |
| DeepViT | 9.60 | 0.50 | 34.04 | 1.34 |
| SpectralFormer | 0.14 | 0.01 | 114.71 | 16.17 |
| SSFTT | 0.15 | 0.03 | 12.80 | 0.76 |
| morphFormer | 0.06 | 0.01 | 52.57 | 2.63 |
| MSNAT | 0.12 | 0.02 | 35.75 | 0.99 |
| MDFFN | 21.15 | 0.78 | 57.15 | 1.93 |
Table 8. Ablation experiments of the MCIE and DACA module in MDFFN on IP, PU, and Houston 2013 datasets.

| Case | Dataset | MCIE | DACA | OA (%) | AA (%) | Kappa (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Indian Pines | × | × | 90.49 ± 0.69 | 88.39 ± 1.06 | 89.12 ± 0.79 |
| 2 | Indian Pines | × | √ | 94.07 ± 0.26 | 91.70 ± 0.75 | 93.22 ± 0.29 |
| 3 | Indian Pines | √ | × | 97.52 ± 0.35 | 95.44 ± 0.97 | 97.17 ± 0.40 |
| 4 | Indian Pines | √ | √ | 98.85 ± 0.09 | 98.05 ± 0.32 | 98.69 ± 0.10 |
| 1 | Pavia University | × | × | 98.57 ± 0.12 | 97.85 ± 0.23 | 98.10 ± 0.16 |
| 2 | Pavia University | × | √ | 99.14 ± 0.09 | 98.72 ± 0.10 | 98.86 ± 0.12 |
| 3 | Pavia University | √ | × | 99.89 ± 0.03 | 99.82 ± 0.06 | 99.86 ± 0.04 |
| 4 | Pavia University | √ | √ | 99.96 ± 0.02 | 99.93 ± 0.04 | 99.95 ± 0.02 |
| 1 | Houston 2013 | × | × | 96.37 ± 0.29 | 95.69 ± 0.39 | 96.07 ± 0.31 |
| 2 | Houston 2013 | × | √ | 97.96 ± 0.10 | 97.43 ± 0.12 | 97.79 ± 0.11 |
| 3 | Houston 2013 | √ | × | 98.90 ± 0.15 | 99.07 ± 0.14 | 98.81 ± 0.16 |
| 4 | Houston 2013 | √ | √ | 99.42 ± 0.07 | 99.49 ± 0.07 | 99.37 ± 0.08 |
The bold entities indicate the highest value. ×: does not exist, √: exists.
Table 9. The performance of the MDFFN with different sizes of S-branch and L-branch on the IP dataset.

| Spatial Size | S-Branch (P_S) | L-Branch (P_L) | OA (%) | AA (%) | Kappa (%) | FLOPs (G) | Params (M) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 2 | 5 | 98.02 ± 0.16 | 95.87 ± 0.74 | 97.74 ± 0.18 | 0.66 | 20.92 |
| 12 | 2 | 6 | 98.51 ± 0.10 | 96.08 ± 0.95 | 98.29 ± 0.12 | 0.90 | 21.03 |
| 15 | 3 | 5 | 98.85 ± 0.09 | 98.05 ± 0.32 | 98.69 ± 0.10 | 0.78 | 21.15 |
| 18 | 3 | 6 | 99.04 ± 0.12 | 98.22 ± 0.73 | 98.90 ± 0.14 | 1.05 | 21.39 |
| 20 | 4 | 5 | 98.96 ± 0.12 | 98.34 ± 0.42 | 98.81 ± 0.13 | 0.96 | 21.51 |
| 24 | 4 | 6 | 98.80 ± 0.04 | 97.71 ± 0.53 | 98.63 ± 0.05 | 1.27 | 21.96 |
The bold entities indicate the highest value.
Table 10. The performance of the MDFFN with different depths of multi-scale transformer on the IP dataset.

| Depth | OA (%) | AA (%) | Kappa (%) | FLOPs (G) | Params (M) |
| --- | --- | --- | --- | --- | --- |
| 1 | 98.28 ± 0.15 | 97.05 ± 0.44 | 98.04 ± 0.17 | 0.28 | 7.23 |
| 3 | 98.85 ± 0.09 | 98.05 ± 0.32 | 98.69 ± 0.10 | 0.78 | 21.15 |
| 5 | 98.91 ± 0.04 | 98.34 ± 0.58 | 98.76 ± 0.05 | 1.28 | 35.07 |
| 7 | 98.85 ± 0.06 | 97.40 ± 0.45 | 98.68 ± 0.07 | 1.78 | 48.98 |
| 9 | 98.84 ± 0.11 | 97.58 ± 0.92 | 98.68 ± 0.13 | 2.28 | 62.90 |
The bold entities indicate the highest value.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
