Article

FSFF-Net: A Frequency-Domain Feature and Spatial-Domain Feature Fusion Network for Hyperspectral Image Classification

1 School of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 The Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
3 School of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(11), 2234; https://doi.org/10.3390/electronics14112234
Submission received: 14 April 2025 / Revised: 23 May 2025 / Accepted: 28 May 2025 / Published: 30 May 2025
(This article belongs to the Special Issue Innovation and Technology of Computer Vision)

Abstract

In hyperspectral image (HSI) classification, each pixel is assigned to a specific land cover type, which is critical for applications in environmental monitoring, agriculture, and urban planning. Convolutional neural networks (CNNs) and Transformers have become widely adopted due to their exceptional feature extraction capabilities. However, the local receptive field of CNNs limits their ability to capture global context, while Transformers, though effective in modeling long-range dependencies, introduce considerable computational overhead. To address these challenges, we propose a frequency-domain and spatial-domain feature fusion network (FSFF-Net) for HSI classification, which reduces computational complexity while capturing global features. The FSFF-Net consists of a parallel encoder structure formed by a frequency-domain transformer (FDformer) and depthwise convolution. The FDformer replaces the self-attention mechanism in traditional vision Transformers with a three-step process: a two-dimensional discrete Fourier transform (2D-DFT), an adaptive filter, and a two-dimensional inverse discrete Fourier transform (2D-IDFT). The 2D-DFT and 2D-IDFT convert images between the spatial and frequency domains. The adaptive filter adaptively retains important frequency components, removes redundant components, and assigns weights to the remaining frequency components. This module not only reduces computational overhead by decreasing the number of parameters, but also mitigates the limitations of CNNs by capturing complementary frequency-domain features that enhance the spatial-domain features for improved classification. In parallel, depthwise convolution is employed to capture spatial-domain features. The network then integrates the frequency-domain features from the FDformer and the spatial-domain features from the depthwise convolution through a feature fusion module. The experimental results demonstrate that our method is efficient and robust for HSI classification, achieving overall accuracies of 98.03%, 99.57%, 97.05%, and 98.40% on the Indian Pines, Pavia University, Salinas, and Houston 2013 datasets, respectively.


1. Introduction

With the swift growth of aerospace technology and advances in sensor technology, a large number of high-quality hyperspectral images can now be captured [1,2]. Hyperspectral images (HSIs) are a distinct category of remote sensing data that capture an extensive range of spectral information, usually from near-UV to near-IR wavelengths. They reveal the condition of the ecological environment and signs of human influence. Extracting the knowledge embedded in these images and efficiently isolating relevant information has become a central goal in hyperspectral image analysis. HSI is rich in information and can be applied to various practical scenarios [3], including precision farming [4], biomedical diagnostics [5], mineral detection [6], food quality monitoring [7], military surveillance [8], and many other fields. To make the most of the capabilities of HSI data, many data processing methods have been investigated, including denoising [9,10], unmixing [11], object detection [12], and classification [13,14,15]. Among these tasks, hyperspectral image classification plays a crucial role.
Initial hyperspectral image (HSI) classification approaches relied mainly on manually crafted image features. These techniques required experts with deep domain knowledge and practical experience to design diverse image features, such as color, shape, texture, and spectral details, which carried valuable information for target classification. Among the most notable handcrafted features are color histograms, texture descriptors, histograms of oriented gradients (HOGs), and the scale-invariant feature transform (SIFT). However, these methods have significant drawbacks. On the one hand, they exploit only spectral or spatial information in isolation and struggle with the high-dimensional redundancy of hundreds of bands, which easily leads to the loss of key information and an increased computational burden. On the other hand, feature design depends heavily on prior knowledge of specific scenarios, transfers poorly across scenes, and is insufficiently robust to different sensor data and noise; its reliance on expert experience also tends to introduce subjective bias, resulting in inefficient feature engineering and limited model generalization. Consequently, multi-feature fusion techniques began to be employed for HSI classification. These fusion methods enhance classification performance; however, traditional early fusion (such as directly concatenating multimodal features) tends to introduce redundant information and incurs high computational costs, while late fusion (such as processing features independently and fusing the prediction results) ignores inter-modal dependencies. Even attention-based methods such as SENet and CBAM focus only on single-dimensional dependencies or use fixed fusion weights. Although statistical and probabilistic machine learning techniques (e.g., Support Vector Machines (SVMs) with handcrafted spectral features or Gaussian Mixture Models (GMMs)) have offered several practical solutions for HSI classification, these methods often struggle to model the nonlinear relationship between spectral bands and spatial textures. For example, SVMs fail to distinguish spectrally similar materials such as shadowed grass and asphalt in urban HSI datasets because of their reliance on linear decision boundaries, while GMMs cannot effectively represent the complex multimodal distributions inherent in mixed pixels. As a result, these methods face challenges in establishing complex functional representations and remain unable to adapt to the classification of intricate HSI scenes [16]. To optimize the extraction of spatial features, various mathematical morphology operators have been introduced, including morphological profiles (MPs) [17], extended MPs (EMPs) [18], and extended multi-attribute profiles (EMAPs) [19]. Sun et al. [20] developed a multi-scale spatial–spectral kernel method based on adjacent superpixels to improve classification results, fully leveraging the spatial and spectral information of HSI data. However, on datasets such as IP, where the ground-object classes are numerous and complex, these methods perform poorly.
To address the limitations of traditional methods, deep learning has achieved remarkable success in hyperspectral image (HSI) classification in recent years, with an increasing number of researchers applying these techniques to remote sensing image processing. Common deep learning approaches for classifying hyperspectral images include autoencoders, CNNs, and others. Chen et al. [21,22] used stacked autoencoders (SAEs) and deep belief networks (DBNs) for hierarchical extraction of spatial–spectral features. These methods require the image patches of the training samples to be flattened into one-dimensional features as input, which, however, destroys the spatial structure of the original image. Subsequently, with the widespread adoption of CNNs, several CNN-based network structures were investigated for distinguishing HSIs. Krizhevsky et al. [23] constructed AlexNet. Lin et al. proposed the network-in-network (NIN) model [24], which replaces traditional convolution layers that use linear filters and nonlinear activation functions with a more complex micro neural network structure called "MLPCONV". Later, He et al. [25] introduced the deep residual network (ResNet), which stacks multiple identity-mapping layers to ease network optimization. Zhong et al. [26] proposed a supervised spectral–spatial residual network (SSRN) that uses a series of 3D CNNs in spatial and spectral residual blocks to extract joint spectral–spatial features. To address the lack of spectral correlation in 2D CNNs and the complexity of 3D CNN models, Roy et al. [27] introduced a hybrid 3D-2D CNN architecture that first uses a 3D CNN to extract joint spectral–spatial features and then applies a 2D CNN to derive more complex spatial context features. Wang et al. [28] further improved the dense convolutional network structure by proposing a new CUBIC-CNN feature extraction framework. This method takes both the raw image patches and the one-dimensional convolution features after dimensionality reduction as input to the network, effectively reducing feature redundancy. Subsequently, other methods [29,30,31] were also proposed, which have gained significant attention because of their capacity to extract spatial and spectral features from hyperspectral images simultaneously. Although the above-mentioned deep learning methods have made progress in HSI classification, they suffer from spatial information loss, insufficient spectral–spatial coupling, high computational cost, and limited generalization ability, leaving room for the introduction of the Transformer.
In recent years, a model known as the vision transformer (ViT) [32] has shown promising efficiency in image processing tasks. The vision transformer is an image processing method based on the Transformer model. Traditional CNNs perform excellently in image processing tasks, but their limitation lies in the locality assumption of convolutional layers, which results in poor modeling of long-range dependencies. The vision transformer, on the other hand, leverages the self-attention mechanism to capture long-range dependencies in images, offering better image understanding and processing capabilities [33]. Touvron et al. proposed DeiT [34], a method that uses knowledge distillation (KD) to train a more robust ViT model; in DeiT, ResNet serves as the teacher model, helping ViT learn distillation tokens that differ from class tokens. In [35], a spatial–spectral transformer (SST) model was proposed, which uses a network structure similar to VGGNet [36] to obtain spatial features and establishes relationships between adjacent spectra with dense transformers; the classification results are derived by a multi-layer perceptron (MLP). Hong et al. [37] developed a new model called SpectralFormer (SF), which learns spectral representation information from grouped adjacent bands and constructs a cross-layer transformer encoder (TE) module. To obtain more refined local features and handle occluded objects, He et al. proposed the STranU model [38] in 2022, which combines a spatial transformer (ST) with U-Net to form a dual-encoder structure. Jiang et al. introduced GraphGST [39], a method aimed at capturing local-to-global correlations and enhancing the position encoding of transformers. However, vision transformer models suffer from high computational complexity [40,41], and in hyperspectral image analysis the high dimensionality of the data and limited feature representation capabilities present further challenges. Traditional spatial-domain feature extraction methods may lead to feature loss and increased computational complexity due to this high dimensionality. The Fourier transform, however, can address these issues effectively.
For decades, the Fourier transform has played a crucial role in the field of digital image processing [42,43,44]. With the significant breakthroughs achieved by CNNs in computer vision, an expanding body of research has begun to combine Fourier transforms with deep learning methods to enhance the effectiveness of visual tasks [45,46,47,48,49,50]. Certain studies apply the discrete Fourier transform (DFT) to shift images to the frequency domain, harnessing frequency-based insights to boost performance in specific tasks, while others use the convolution theorem to accelerate CNN computations by employing the fast Fourier transform (FFT). Fourier convolutional networks (FFCs) replace traditional convolution operations in CNNs with local Fourier units, enabling convolution processing in the frequency domain. Additionally, there have been attempts to apply Fourier transform to deep learning models to address partial differential equations [51] and natural language processing (NLP) tasks.
The Fourier transform therefore converts HSI features to the frequency domain, providing a novel solution for HSI classification. The Fourier transform can capture the frequency-domain information of an image; through filtering and training, it effectively reduces redundant information and highlights key frequency components to extract important features. Inspired by this, to tackle the limitation that CNNs struggle to capture global context due to their local receptive fields and the challenge that Transformers incur high computational costs while modeling long-range dependencies, we propose a dual-domain feature fusion network for HSI classification. At the outset, a deformable convolution [52] is applied to better fit the varied object shapes in hyperspectral images, which effectively improves classification accuracy. Then, frequency-domain and spatial-domain features are extracted by the FDformer and depthwise convolution, respectively. Next, a feature fusion module that combines channel and spatial information merges the two types of features. Finally, a linear classifier with softmax predicts the label of each pixel. This paper's main contributions are highlighted as follows:
  • We propose a novel module, the FDformer, for capturing frequency-domain features. This module not only reduces computational overhead by decreasing the number of parameters, but also overcomes the limitations of CNNs by capturing complementary frequency-domain features, which enhance the spatial-domain features for improved classification. It replaces the self-attention layer in vision transformers with three steps: a two-dimensional discrete Fourier transform, an adaptive filter, and a two-dimensional inverse Fourier transform. The subsequent ablation experiments show that the FDformer effectively captures frequency-domain features and improves classification accuracy.
  • We propose a novel module, the deformable gate feed-forward network (DGFN), for feature refinement. The traditional FFN [53] neglects the modeling of spatial information; in addition, redundant information in the channels weakens the expressive power of the features, and the FFN cannot adapt to local deformations. We replace the convolution module in the FFN with a gating mechanism based on deformable convolution. The subsequent ablation experiments demonstrate that the DGFN captures spatial information and further fits class shapes, improving classification accuracy.
  • We propose a dual-domain feature fusion network for HSI classification. The network integrates frequency-domain and spatial-domain features, enhancing classification accuracy and robustness. An image processed by the deformable convolution is passed through both the FDformer module and the depthwise convolution module. The feature fusion module then allows sufficient interaction between the frequency-domain and spatial-domain features of the HSI. Finally, these features are refined by the deformable gate feed-forward network, further enhancing classification accuracy. Experimental results on four public datasets confirm the advantages of the proposed network.

2. Materials and Methods

First, the basic architecture of FSFF-Net is introduced, followed by a detailed explanation of its core component, the FDformer. Next, we describe the dual feature fusion module. Finally, we present the deformable gate feed-forward network and the classification network.

2.1. Network Framework

The FSFF-Net proposed in this paper consists of four steps: PCA (principal component analysis) dimensionality reduction and block extraction, shallow feature extraction, deep feature extraction, and image classification, as shown in Figure 1. First, the original HSI data $I \in \mathbb{R}^{H \times W \times L}$ are given, where $H \times W$ is the spatial size and $L$ is the number of spectral bands. Each pixel in $I$ has $L$ spectral values and is associated with a class vector $Y = (y_1, y_2, y_3, \ldots, y_N) \in \mathbb{R}^{1 \times 1 \times N}$, where $N$ is the number of land cover types. Because the spectral bands of HSI overlap heavily, which may affect model performance, PCA is applied to reduce the number of bands to $B$ while keeping the spatial dimensions unchanged; the dimensionality-reduced HSI therefore has resolution $H \times W \times B$. From the dimensionality-reduced hyperspectral data, a 3D image block is extracted for each pixel. Each 3D image block is denoted $P_H \in \mathbb{R}^{a \times a \times C}$, where $a \times a$ is the spatial size of the block, and each block is assigned the ground-truth class of its central pixel.
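A minimal sketch of this preprocessing step (PCA reduction of the spectral dimension followed by per-pixel patch extraction) is given below. The function names, the reflect padding, and the convention that label 0 marks unlabelled pixels are illustrative assumptions, not the authors' implementation.

```python
# PCA band reduction and centred patch extraction, as described above.
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(hsi, n_components=30):
    """Reduce an (H, W, L) hyperspectral cube to (H, W, B) with PCA."""
    h, w, l = hsi.shape
    flat = hsi.reshape(-1, l)                      # treat every pixel as a sample
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)

def extract_patches(cube, labels, patch_size=36):
    """Return one patch per labelled pixel; the centre pixel's label is the patch label."""
    pad = patch_size // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    patches, targets = [], []
    for i, j in zip(*np.nonzero(labels)):          # label 0 is assumed to mean "unlabelled"
        patches.append(padded[i:i + patch_size, j:j + patch_size, :])
        targets.append(labels[i, j] - 1)
    return np.stack(patches), np.array(targets)
```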
Next, we apply deformable convolution (Figure 1a) to the dimension-reduced HSI, generating shallow features $F_S \in \mathbb{R}^{a \times a \times C}$. Applying deformable convolution at this early stage gives the network greater flexibility in the feature extraction process: it automatically adjusts the receptive field to accommodate local deformations and non-structural information in hyperspectral images.
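A hedged sketch of this early deformable-convolution stage is shown below: a small ordinary convolution predicts per-pixel sampling offsets, and torchvision's DeformConv2d samples the input at the shifted locations. The channel count, kernel size, and module name are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ShallowDeformable(nn.Module):
    def __init__(self, channels=30, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # 2 offsets (dx, dy) per kernel position
        self.offset_conv = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):                 # x: (N, C, a, a)
        offsets = self.offset_conv(x)     # learned per-pixel sampling offsets
        return self.deform(x, offsets)    # shallow features F_S

x = torch.randn(2, 30, 36, 36)
print(ShallowDeformable()(x).shape)       # torch.Size([2, 30, 36, 36])
```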
The shallow features $F_S$ are then processed in the deep feature extraction step to obtain deep features $F_P \in \mathbb{R}^{b \times b \times C}$. This step consists of a dual-branch structure formed by the FDformer and depthwise convolution. The FDformer performs transformer operations in the frequency domain, capturing frequency-domain and global features, while the depthwise convolutional network gradually captures features at different levels through multiple convolution layers, obtaining spatial and local features. The combination of the FDformer module and the depthwise convolutional network fully leverages both the frequency-domain and spatial-domain information in HSI, integrating global and local features to significantly improve classification accuracy. A dual-feature fusion module is then used to combine the features of the two branches.
Finally, we apply a deformable gate feed-forward network (DGFN) and introduce a convolutional layer to refine the features extracted in the previous steps. Subsequently, the softmax function is applied to the output data $F_{\mathrm{OUT}} \in \mathbb{R}^{H \times W \times C}$, thereby generating the final classification result.

2.2. Frequency Domain Transformer

Owing to the high dimensionality of the data and the limited feature representation capabilities in hyperspectral image analysis, the vision transformer may suffer from feature loss and increased computational complexity. To tackle these limitations, we propose a novel module, the FDformer, whose structure is shown in Figure 1b. After being processed by the deformable convolution, the shallow features are divided into non-overlapping patches of size $b \times b$, which serve as input. These flattened patches are then projected into $L = \frac{a}{b} \times \frac{a}{b}$ tokens of dimension $C$, which are fed into the FDformer.
We first introduce the two-dimensional discrete Fourier transform (DFT) [43]. Given a two-dimensional signal $x[a, b]$, where $0 \le a < A$ and $0 \le b < B$, the two-dimensional DFT of $x[a, b]$ is expressed as
$$X[u, v] = \sum_{a=0}^{A-1} \sum_{b=0}^{B-1} x[a, b]\, e^{-j 2\pi \left( \frac{u a}{A} + \frac{v b}{B} \right)}$$
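As a small numerical check (not part of the paper), the definition above can be verified against NumPy's built-in FFT, which is what such models use in practice for efficiency; the array sizes below are arbitrary.

```python
import numpy as np

A, B = 4, 5
x = np.random.rand(A, B)

# Direct evaluation of X[u, v] = sum_a sum_b x[a, b] * exp(-j*2*pi*(u*a/A + v*b/B))
u = np.arange(A)[:, None, None, None]
v = np.arange(B)[None, :, None, None]
a = np.arange(A)[None, None, :, None]
b = np.arange(B)[None, None, None, :]
X_direct = np.sum(x[None, None] * np.exp(-2j * np.pi * (u * a / A + v * b / B)), axis=(2, 3))

assert np.allclose(X_direct, np.fft.fft2(x))   # identical up to floating-point error
```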
Given a token $x \in \mathbb{R}^{b \times b \times C}$, we begin by applying a 2D FFT over the spatial dimensions to transform $x$ to the frequency domain:
$$X = \mathcal{D}[x] \in \mathbb{C}^{b \times b \times C}$$
where $\mathcal{D}[\cdot]$ denotes the 2D FFT. At this point, $X$ is a complex tensor that characterizes the frequency spectrum of $x$.
Afterwards, $X$ enters the adaptive filter layer, which consists of two steps. The first step performs a preliminary filtering of $X$. In signal processing, the energy of a frequency component is the square of its amplitude (or intensity) and reflects the contribution of that component to the overall signal energy; components with high energy generally have a significant impact on the signal. This step dynamically adjusts the threshold according to the characteristics of different datasets, effectively removing redundant frequency components. First, we calculate the power spectrum of $X$ to identify the main frequency components. The power spectrum $P$ is obtained by squaring the amplitude $F$ of the frequency components, $P = F^2$, and reflects the intensity of the different components. The key to effective filtering lies in suppressing the useless components of $P$. We achieve this through a trainable threshold $\theta$, which adapts to the spectral characteristics of the data. The threshold parameter is randomly initialized between 0 and 1. The energy of each frequency component is normalized by the median energy and compared with the threshold: components whose normalized energy exceeds the threshold are retained (mask value 1), and the rest are suppressed (mask value 0). A special gradient estimator is adopted so that the threshold remains trainable, achieving dynamic threshold filtering without manual parameter tuning. As a learnable parameter, the threshold is optimized through backpropagation via the gradient $\partial L / \partial \theta$. The formula is as follows:
$$X_f = X \odot (P > \theta)$$
where $\odot$ denotes element-wise multiplication, and $(P > \theta)$ is a binary mask in which frequencies with power greater than the threshold $\theta$ are retained, while the others are filtered out.
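A hedged sketch of this first adaptive-filter step is given below. The straight-through trick (hard mask in the forward pass, soft sigmoid gradient in the backward pass) is one plausible realisation of the "special gradient estimator" mentioned above; the paper's exact formulation may differ, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class AdaptiveFrequencyMask(nn.Module):
    def __init__(self):
        super().__init__()
        self.theta = nn.Parameter(torch.rand(1))        # trainable threshold in (0, 1)

    def forward(self, X):                               # X: complex tensor (N, b, b, C)
        power = X.abs() ** 2                            # power spectrum P = |F|^2
        # normalise by the median so theta is comparable across datasets
        norm_power = power / (power.median() + 1e-8)
        soft = torch.sigmoid(norm_power - self.theta)   # differentiable surrogate
        hard = (norm_power > self.theta).float()        # binary keep/suppress mask
        mask = hard + soft - soft.detach()              # straight-through estimator
        return X * mask                                 # X_f = X ⊙ (P > theta)
```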
Next, we adjust this frequency spectrum by applying the learnable filter $K \in \mathbb{C}^{b \times b \times C}$ to $X_f$:
$$\hat{X} = K \odot X_f$$
The filter $K$ has the same dimensions as $X$, allowing it to capture global information. Finally, the inverse FFT is applied to convert the modulated spectrum $\hat{X}$ back to the spatial domain, followed by an update of the tokens:
$$x \leftarrow \mathcal{F}^{-1}[\hat{X}]$$
This learnable filter is inspired by the frequency filters used in digital image processing [51], where $K$ is a set of learnable frequency filters, each with distinct hidden dimensions. The filter $K$ is equivalent to a depthwise global circular convolution with a filter size of $b \times b$; as a result, $K$ differs from standard convolution layers, which use comparatively small filter sizes to reinforce local inductive biases. Moreover, the frequency-domain implementation has a complexity of $O(DL\log L)$, whereas the equivalent depthwise global circular convolution computed directly in the spatial domain costs $O(DL^2)$; this allows efficient global information interaction while significantly reducing the parameter size. Here, $D$ denotes the feature dimension and $L$ the number of tokens. We add a residual connection in this module to stabilize training.
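A minimal sketch of the full frequency-domain mixing step under the assumptions above is shown below: 2D FFT, energy-based masking (the trainable-threshold version is sketched earlier; here it is simplified to a hard mask), element-wise multiplication with a learnable filter K of the same size as the spectrum, inverse FFT, and a residual connection. The module name, sizes, and initialisation are illustrative.

```python
import torch
import torch.nn as nn

class FrequencyMixer(nn.Module):
    def __init__(self, size=6, channels=64):
        super().__init__()
        # learnable filter K, stored as real+imag parts so it stays a real-valued Parameter
        self.K = nn.Parameter(torch.randn(size, size, channels, 2) * 0.02)
        self.theta = nn.Parameter(torch.rand(1))             # threshold of the adaptive filter

    def forward(self, x):                                    # x: (N, b, b, C) spatial tokens
        X = torch.fft.fft2(x, dim=(1, 2))                    # to the frequency domain
        power = X.abs() ** 2                                 # power spectrum
        mask = (power / (power.median() + 1e-8) > self.theta).float()
        X_f = X * mask                                       # keep dominant frequency components
        K = torch.view_as_complex(self.K)                    # (b, b, C) complex filter
        X_hat = K * X_f                                      # global circular convolution in frequency space
        out = torch.fft.ifft2(X_hat, dim=(1, 2)).real        # back to the spatial domain
        return x + out                                       # residual connection for stability

tokens = torch.randn(8, 6, 6, 64)
print(FrequencyMixer()(tokens).shape)                        # torch.Size([8, 6, 6, 64])
```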

2.3. Dual Feature Fusion (DFF)

To fuse global and local features efficiently and accurately, we use a dual feature fusion module. As illustrated in Figure 1, we reassign weights to the features of the two branches along the spatial and channel dimensions, respectively. The fusion module draws on the channel attention idea of SENet and extends it to joint spatial-channel interaction. Specifically, a spatial attention map generated by the spatial interaction branch re-weights the frequency-domain features by spatial position, while a channel attention map generated by the channel interaction branch adjusts the channel weights of the spatial-domain features. Through element-wise multiplication, the two branches jointly emphasize both spatially and channel-wise salient features, and the results are finally added together for fusion.
First, we denote the features produced by the FDformer as $Y_f \in \mathbb{R}^{b \times b \times C}$ and the features obtained from the depthwise convolution as $Y_d \in \mathbb{R}^{b \times b \times C}$. The features $Y_f$ and $Y_d$ are then fused through the DFF module. Specifically, the DFF consists of two operations: spatial fusion (SF, shown in Figure 1c) and channel fusion (CF, shown in Figure 1d). In SF, a spatial weight map (denoted $\mathrm{S\text{-}Weight}$, of size $b \times b \times 1$) is computed from $Y_d$ and applied to $Y_f$; in CF, a channel weight map (denoted $\mathrm{C\text{-}Weight}$, of size $1 \times 1 \times C$) is computed from $Y_f$ and applied to $Y_d$. The operations are shown in Figure 1c,d, and the formulas are given below:
$$\mathrm{S\text{-}Weight}(Y_d) = f\big(W_2\, \sigma(W_1 Y_d)\big), \qquad \mathrm{C\text{-}Weight}(Y_f) = f\big(W_4\, \sigma(W_3\, H_{GP}(Y_f))\big)$$
In the formula, $H_{GP}$ represents global average pooling, $f(\cdot)$ is the sigmoid function, and $\sigma(\cdot)$ is the GELU function. $W_{(\cdot)}$ denotes convolutional weights. The reduction ratios of $W_1$ and $W_2$ are $r_1$ and $C/r_1$, respectively; $W_3$ has a reduction ratio of $r_2$, and $W_4$ has an expansion ratio of $r_2$. The corresponding weight maps are then applied to the other branch's input to achieve fusion. This process is described as follows:
$$SF = Y_f \odot \mathrm{S\text{-}Weight}(Y_d), \qquad CF = Y_d \odot \mathrm{C\text{-}Weight}(Y_f)$$
Finally, the two fused features are element-wise added:
$$F_{add} = SF + CF$$
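A hedged PyTorch sketch of this dual feature fusion is given below. It follows the formulas above (spatial weight computed from $Y_d$ applied to $Y_f$, channel weight computed from $Y_f$ applied to $Y_d$); the use of 1 × 1 convolutions, the default reduction ratios, and the module name are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class DualFeatureFusion(nn.Module):
    def __init__(self, channels=64, r1=4, r2=4):
        super().__init__()
        # spatial branch: b×b×C -> b×b×1 weight map (W1, W2)
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // r1, 1), nn.GELU(),
            nn.Conv2d(channels // r1, 1, 1), nn.Sigmoid())
        # channel branch: global average pooling then a 1×1×C weight vector (W3, W4)
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r2, 1), nn.GELU(),
            nn.Conv2d(channels // r2, channels, 1), nn.Sigmoid())

    def forward(self, y_f, y_d):          # (N, C, b, b) frequency / spatial features
        sf = y_f * self.spatial(y_d)      # SF = Y_f ⊙ S-Weight(Y_d)
        cf = y_d * self.channel(y_f)      # CF = Y_d ⊙ C-Weight(Y_f)
        return sf + cf                    # F_add
```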

2.4. Deformable Gate Feed-Forward Network

Traditional feed-forward networks consist of a nonlinear activation layer and two linear projection layers to derive features. Nevertheless, they ignore the encoding of spatial relationships; furthermore, redundant information in the channels weakens the feature representation ability, and they cannot adapt to local deformations. To counter these challenges, we propose the deformable gate feed-forward network (DGFN, shown in Figure 1e) to replace the traditional FFN. As shown in Figure 1e, this module is a simple gating mechanism composed of deformable convolution and element-wise multiplication. We split the feature map into two segments along the channel dimension: (1) a convolution branch and (2) a multiplication branch. In summary, given an input $X \in \mathbb{R}^{b \times b \times C}$, the deformable gate feed-forward network is formulated as
$$\hat{X} = \sigma(W_{p1} X), \qquad [X_1, X_2] = \hat{X}, \qquad \mathrm{DGFN}(X) = W_{p2}\big(X_1 \odot W_d X_2\big)$$
where $W_{p1}$ and $W_{p2}$ represent the linear projections, $\sigma$ denotes the GELU activation function, and $W_d$ is the learnable parameter of the deformable convolution. Compared to the FFN, the DGFN is able to capture nonlinear spatial information, reduce channel redundancy in fully connected layers, and adapt to local deformations. We obtain the final output feature $F_{\mathrm{out}}$.
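A hedged sketch of the DGFN following the formula above is shown below: a 1 × 1 projection, GELU, a channel split, a deformable convolution on one half, element-wise gating, and an output projection. The offset-prediction convolution and the channel expansion factor are assumptions of this sketch.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DGFN(nn.Module):
    def __init__(self, channels=64, expansion=2, kernel_size=3):
        super().__init__()
        hidden = channels * expansion
        pad = kernel_size // 2
        self.proj_in = nn.Conv2d(channels, hidden, 1)              # W_p1
        self.act = nn.GELU()
        half = hidden // 2
        self.offset = nn.Conv2d(half, 2 * kernel_size ** 2, kernel_size, padding=pad)
        self.deform = DeformConv2d(half, half, kernel_size, padding=pad)  # W_d
        self.proj_out = nn.Conv2d(half, channels, 1)               # W_p2

    def forward(self, x):                                          # x: (N, C, b, b)
        x = self.act(self.proj_in(x))                              # sigma(W_p1 X)
        x1, x2 = x.chunk(2, dim=1)                                 # split along channels
        gated = x1 * self.deform(x2, self.offset(x2))              # X_1 ⊙ W_d X_2
        return self.proj_out(gated)                                # W_p2(...)
```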
$F_{\mathrm{out}}$ is passed through a convolutional layer to refine the features and then through a linear layer, where the softmax function produces the probability that the input belongs to each class. The label with the highest probability is taken as the sample's predicted class.

3. Results

3.1. Data Description

To validate the performance of the proposed model, we selected four classic HSI datasets that are commonly used to evaluate classification accuracy: Indian Pines (IP), Pavia University (PU), Salinas Scene (SA), and Houston 2013. Each experiment was repeated 10 times, and the averaged results are reported.
IP [54]: This dataset was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over the northwestern part of Indiana. It has spatial dimensions of 145 × 145 and a spectral range from 400 to 2500 nm with 220 spectral bands. After removing 20 water-absorption bands, 200 bands were used for the experiment. The dataset contains 10,249 ground-truth pixels, divided into sixteen classes. The false-color and ground-truth maps are displayed in Figure 2. Indian Pines is challenging due to its complex and irregular label distributions with interlaced patch sizes.
PU [55]: This dataset was collected by the ROSIS sensor over the Pavia University area in Italy. It has spatial dimensions of 610 × 340 and a spectrum spanning from 430 to 860 nm with 115 spectral bands. Following the removal of 12 noisy bands, 103 bands were utilized for the experiment. The dataset contains 42,776 ground truth pixels, divided into 9 classes. Displayed in Figure 3 are the false-color and ground truth maps. PU is challenging due to mixed spatial patterns and class imbalance.
SA [56]: This dataset was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over the Salinas Valley region in California. It has spatial dimensions of 512 × 217 and a spectral range from 360 to 2500 nm with 224 spectral bands. After removing 20 water-absorption bands, 204 bands were used for the experiment. The dataset contains 54,129 ground-truth pixels, divided into 16 classes. The false-color and ground-truth maps are displayed in Figure 4. The distribution of ground objects in the SA dataset is relatively uniform and regular. However, "Grapes untrained" and "Soil vinyard develop" have similar textures in most spectral bands and are often misclassified.
Houston 2013 dataset [57]: This dataset was provided by the Hyperspectral Image Analysis group of the University of Houston and the National Center for Airborne Laser Mapping (NCALM), funded by the U.S. National Science Foundation, and was initially used in the 2013 IEEE GRSS Data Fusion Contest. The spectral range is 0.38 to 1.05 μm, with 144 spectral bands. The dataset consists of 349 × 1905 pixels with a spatial resolution of 2.5 m and contains 15 classes. The false-color and ground-truth maps are displayed in Figure 5. Houston 2013 is challenging because its labelled samples are mostly discrete and localized.
Table 1, Table 2, Table 3 and Table 4 provide the land cover class names and the numbers of training and test samples for the four datasets. Each dataset was divided into training and test sets. In this paper, 10% of the pixels of each class in the IP and Houston 2013 datasets were selected at random to form the training set; for the SA and PU datasets, 0.5% and 5% of the pixels of each class, respectively, were randomly selected as the training set.
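A minimal sketch (not the authors' code) of this per-class random split is given below: a fixed fraction of the labelled pixels of each class is drawn for training and the remainder is kept for testing; the seed handling and the "label 0 = unlabelled" convention are assumptions.

```python
import numpy as np

def split_per_class(labels, train_ratio=0.1, seed=0):
    """labels: (H, W) array, 0 = unlabelled. Returns boolean train/test masks."""
    rng = np.random.default_rng(seed)
    train_mask = np.zeros(labels.shape, dtype=bool)
    for cls in np.unique(labels[labels > 0]):
        idx = np.argwhere(labels == cls)
        rng.shuffle(idx)
        n_train = max(1, int(round(train_ratio * len(idx))))   # at least one sample per class
        train_mask[tuple(idx[:n_train].T)] = True
    test_mask = (labels > 0) & ~train_mask
    return train_mask, test_mask
```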

3.2. Experimental Setting

  • Evaluation metrics: To evaluate the performance of the proposed model on the four datasets, four common HSI classification metrics were used: overall accuracy (OA), average accuracy (AA), the kappa coefficient, and the F1 score; a short computation sketch follows this list. OA is the percentage of correctly classified pixels and measures the overall performance of the classification model on the entire dataset. AA is the mean of the per-class classification accuracies and evaluates the model's performance across all individual classes. The kappa coefficient is a quantitative metric computed from the confusion matrix that quantifies the agreement between the predicted and true labels. The F1 score is the harmonic mean of precision and recall, used to balance the two.
  • Configuration: In the experiments, the number of spectral bands retained after PCA dimensionality reduction, denoted C, is set to 30. The 3D block size is 36 × 36, and the embedding block size is 6 × 6. The Adam optimizer with a learning rate of 0.001 is used for training. The batch size is set to 256 and training runs for 100 epochs. The experimental hardware consists of an i7-10500 CPU and an NVIDIA GeForce RTX 4090 GPU, with Python 3.8 and PyTorch 1.11.0 as the software environment.
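The sketch below computes the four metrics listed above from a confusion matrix; using macro-averaged F1 is one common convention and is assumed here, not stated by the paper.

```python
import numpy as np

def classification_metrics(conf):               # conf[i, j]: true class i predicted as class j
    total = conf.sum()
    per_class_acc = np.diag(conf) / np.maximum(conf.sum(axis=1), 1)
    oa = np.diag(conf).sum() / total            # overall accuracy
    aa = per_class_acc.mean()                   # average accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)                # kappa coefficient
    precision = np.diag(conf) / np.maximum(conf.sum(axis=0), 1)
    recall = per_class_acc
    f1 = (2 * precision * recall / np.maximum(precision + recall, 1e-8)).mean()
    return oa, aa, kappa, f1
```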

3.3. Comparative Experimentation

To validate the effectiveness of the proposed model, we selected several representative methods for comparative experiments: SVM, 1-D-CNN [58], 2-D-CNN [59,60], 3-D-CNN [61], HybridSN [27], SSFTT [62], SSFT [63], SMESC [64], MorphF [65], and the proposed FSFF-Net.

3.3.1. Experiment on the IP

Table 5 shows the quantitative results of the different methods on the IP dataset, with the best-performing method highlighted in bold. Our method achieved the highest OA, AA, and kappa values and was second only to SSFTT in F1 score. Specifically, our method shows relatively balanced classification results across all classes. Although other methods achieve better precision in certain classes, our overall performance is still the best, demonstrating the excellent classification ability of the frequency-domain features.
Figure 6 shows the classification results on the IP dataset. Because of the complexity of the IP dataset, the ground-truth labels are mostly irregular and intermixed in size, making the differences between methods more noticeable. Our method, which combines frequency-domain and spatial-domain features, achieves better results. The figure shows that our method has significantly fewer misclassifications than the others, thanks to the adaptive filter module, and its output is much smoother, with fewer noise points. The category "Alfalfa", shown in red, is also more distinct: in contrast to the considerable green clutter seen in several comparison methods for this category, our classification map is much cleaner.

3.3.2. Experiment on the PU

The performance of the various methods is shown in Table 6, and the classification results are presented in Figure 7. The PU dataset has larger spatial dimensions and fewer bands than the other datasets. Our method outperforms the others in all four metrics, and the accuracy for every category is also very high. For large images, the effect of the deformable convolution in the first step is more noticeable, allowing better fitting of the shapes of each class. Specifically, for categories with low shape repetition, such as gravel and bitumen, the figure shows that we have almost no misclassifications.

3.3.3. Experiment on the Salinas Scene Dataset

Table 7 shows the results of the different methods on the SA dataset. Because SA contains a large number of labelled samples with relatively regular distributions, we significantly reduced its sampling rate. Nevertheless, the method proposed in this paper still performs best on three metrics, with 8 of the 16 categories achieving the best results. The result maps for SA are shown in Figure 8. Misclassifications by the other methods are mostly concentrated in the upper-left region labeled "Grapes untrained", which has spectral characteristics similar to the adjacent "Soil vinyard develop" region. Our method has fewer misclassifications in these two regions, indicating that it can effectively recognize similar textures through their frequency-domain features.

3.3.4. Experiment on the Houston 2013 Dataset

Table 8 and Figure 9 present the quantitative and qualitative results of various methods on the Houston 2013 dataset. The sample points in the Houston 2013 dataset are mostly discrete and localized, unlike the previous three datasets where most regions belong to the same class. In this case, our frequency-domain feature extraction shows a clear advantage in classifying discrete samples, and our method still yields the best results in classification.
Overall, our method achieved the best classification accuracy, and the classification result maps were the closest to the ground truth, the smoothest, and cleanest, with the fewest misclassifications, validating the reliability of the proposed approach.

3.4. Ablation Study

In this section, we compare FSFF-Net with its variants to measure the necessity and effectiveness of each component. We designed the following variants: ViT + DCONV, in which ViT replaces the FDformer; without FAF, which removes the first step of the adaptive filter; and without DGFN, which replaces the DGFN with the traditional FFN, to verify the necessity of the corresponding components. We also analyze a variant in which the deformable convolution is placed at the end of the model (DF end) to verify the effect of its position.
The experimental outcomes of the first three variants are shown in Table 9 and Figure 10. From the results of ViT + DCONV, we can see that the classification results of the FDformer are superior to those of ViT, with the classified images being cleaner and smoother. From the results without FAF, it can be seen that removing the FAF marginally reduces classification performance, although it remains superior to ViT + DCONV. The FAF effectively reduces noise points in the input data and helps the model extract clearer and cleaner features. The figure shows that, in the absence of the FAF, the classified images contain noticeable noise points, indicating that the model cannot accurately focus on key features under noisy conditions, which negatively impacts classification accuracy. This suggests that the FAF significantly improves the model's classification precision. After replacing the DGFN with a traditional FFN in the without DGFN variant, the classification accuracy decreases significantly, validating the superior spatial modeling and shape-fitting capabilities of the DGFN.
The experimental results for the fourth variant are shown in Table 10. It can be seen that our method improves nearly every category, especially class 9 ("shadow"), whose shape repetition rate is extremely low and for which we achieved an improvement of nearly 6%. Therefore, placing the deformable convolution at the early stages of the network significantly boosts its adaptability to spatial information, especially for hyperspectral images, which are complex and contain challenging targets. This leads to a noticeable improvement in classification accuracy.

3.5. Parameter Study

To determine the optimal parameters for the model, we conducted parameter experiments on four datasets, including the number of spectral bands after PCA and the size of 3D image blocks. As shown in Figure 11 and Figure 12, our method performs best when the spectral dimension after PCA is 30 and the 3D image block size is 36.

3.6. Memory and Time

As the key focus of this paper is to replace the ViT method, we compared the testing time and parameter count of our proposed method with the ViT + DCONV method in Table 11. From the table, it is clear that our proposed method has significantly lower time cost than ViT + DCONV, and its parameter count is less than half of theirs.

3.7. Comparison of Classification Accuracy Under Different Training Sample Ratios

Figure 13 shows the effect of different training sample ratios on classification accuracy. As the number of training samples increases, the classification accuracy gradually improves. Notably, our method still maintains excellent performance at low sampling ratios, which demonstrates its robustness.

4. Limitations and Discussion

In the classification results generated by FSFF-Net, we observed low classification accuracy for classes with few samples, a problem we also encountered in other models. The primary reason is the scarcity of samples, which reduces the model's generalization potential: during training, the model struggles to extract sufficient features from the small number of samples, leading to poor performance on unseen samples. This problem is particularly prominent in high-dimensional hyperspectral data, where the feature space is sparse; a limited number of samples may not fully represent all the characteristics of a particular class, resulting in an inaccurate representation of that class by the model. In recent years, some research has attempted to address few-shot classification through methods such as data augmentation, transfer learning, or meta-learning. For example, a model can be trained using labeled source samples, unlabeled target samples, and a subset of labeled target samples, either by generating synthetic samples, by leveraging knowledge from large-scale existing datasets to assist training, or by additionally incorporating a small number of labeled target samples [66]. We are therefore encouraged to design new approaches for FSFF-Net to tackle the low classification accuracy of classes with few samples. For future work, we plan to incorporate strategies to mitigate class imbalance, such as the following:
  • Data-level adjustments: applying oversampling techniques (e.g., SMOTE) or undersampling to balance class distributions, or using data augmentation methods specifically designed for rare classes to generate synthetic samples.
  • Model-level modifications: introducing class-weighted loss functions that assign higher weights to underrepresented classes during training, or integrating attention mechanisms that prioritize features from minority classes.

5. Conclusions

In this article, we proposed FSFF-Net, a highly efficient HSI classification network. Its core structure, the FDformer, replaces the self-attention layers of the vision transformer with a 2D FFT, an adaptive filter, and a 2D IFFT.
Introducing deformable convolution at the early stage of the classification task gives the network stronger flexibility in the initial feature extraction phase, enabling it to automatically adjust the receptive field to accommodate categories with low shape repetitiveness in HSI. Thanks to the token-mixing operation, the FDformer can efficiently capture the global features of the image from the frequency domain. Combined with a depthwise convolutional branch that captures local features, it yields fused features from both domains, which are then refined by the DGFN to re-fit class shapes. Extensive comparative experiments demonstrate the superiority of this approach on four HSI datasets, showing excellent classification performance while effectively reducing complexity. Moreover, the proposed FSFF-Net exhibits good scalability in frequency-domain feature extraction, providing new insights for designing networks that combine the frequency and spatial domains for HSI classification.

Author Contributions

Conceptualization, X.P., C.Z. and Q.S.; methodology, X.P. and Q.S.; software, X.P., C.Z. and Q.S.; validation, X.P., C.Z. and Q.S.; formal analysis, X.P. and Q.S.; investigation, X.P. and C.Z.; data curation, X.P. and C.Z.; resources, Q.S., C.Z. and W.L.; writing—original draft, X.P., Q.S. and C.Z.; writing—review and editing, Q.S., C.Z., W.L. and G.J.; visualization, Q.S. and X.P.; supervision, Q.S. and W.L.; project administration, Q.S. and W.L.; funding acquisition, Q.S., W.L. and G.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Laboratory of Target Cognition and Application Technology, grant number 2023-CXPT-LC-005; and in part by the National Natural Science Foundation of China, grant numbers 62372421 and 62402465.

Data Availability Statement

The Indian Pines, Pavia University, and Salinas Valley datasets are available at http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes (accessed on 8 October 2024). The Houston 2013 dataset is available at https://github.com/songyz2019/fetch_houston2013 (accessed on 8 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Q.; Zheng, Y.; Yuan, Q.; Song, M.; Yu, H.; Xiao, Y. Hyperspectral image denoising: From model-driven, data-driven, to model-data-driven. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 13143–13163. [Google Scholar] [CrossRef] [PubMed]
  2. Huo, Y.; Cheng, X.; Lin, S.; Zhang, M.; Wang, H. Memory-augmented Autoencoder with Adaptive Reconstruction and Sample Attribution Mining for Hyperspectral Anomaly Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5518118. [Google Scholar] [CrossRef]
  3. Cheng, X.; Zhang, M.; Lin, S.; Li, Y.; Wang, H. Deep self-representation learning framework for hyperspectral anomaly detection. IEEE Trans. Instrum. Meas. 2023, 73, 5002016. [Google Scholar] [CrossRef]
  4. Liu, X.; Jiao, L.; Li, L.; Tang, X.; Guo, Y. Deep multi-level fusion network for multi-source image pixel-wise classification. Knowl.-Based Syst. 2021, 221, 106921. [Google Scholar] [CrossRef]
  5. Noor, S.S.M.; Michael, K.; Marshall, S.; Ren, J.; Tschannerl, J.; Kao, F.J. The properties of the cornea based on hyperspectral imaging: Optical biomedical engineering perspective. In Proceedings of the 2016 International Conference on Systems, Signals and Image Processing (IWSSIP), Bratislava, Slovakia, 23–25 May 2016; pp. 1–4. [Google Scholar] [CrossRef]
  6. Wang, J.; Zhang, L.; Tong, Q.; Sun, X. The Spectral Crust project—Research on new mineral exploration technology. In Proceedings of the 2012 4th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Shanghai, China, 4–7 June 2012; pp. 1–4. [Google Scholar] [CrossRef]
  7. Fong, A.; Shu, G.; McDonogh, B. Farm to table: Applications for new hyperspectral imaging technologies in precision agriculture, food quality and safety. In Proceedings of the Conference on Lasers and Electro-Optics, San Jose, CA, USA, 10–15 May 2020; p. AW3K.2. [Google Scholar] [CrossRef]
  8. Ardouin, J.P.; Lévesque, J.; Rea, T.A. A demonstration of hyperspectral image exploitation for military applications. In Proceedings of the 2007 10th International Conference on Information Fusion, Quebec, QC, Canada, 9–12 July 2007; pp. 1–8. [Google Scholar] [CrossRef]
  9. Sun, L.; He, C.; Zheng, Y.; Tang, S. SLRL4D: Joint Restoration of Subspace Low-Rank Learning and Non-Local 4-D Transform Filtering for Hyperspectral Image. Remote Sens. 2020, 12, 2979. [Google Scholar] [CrossRef]
  10. He, C.; Sun, L.; Huang, W.; Zhang, J.; Zheng, Y.; Jeon, B. TSLRLN: Tensor subspace low-rank learning with non-local prior for hyperspectral image mixed denoising. Signal Process. 2021, 184, 108060. [Google Scholar] [CrossRef]
  11. Sun, L.; Wu, F.; Zhan, T.; Liu, W.; Wang, J.; Jeon, B. Weighted nonlocal low-rank tensor decomposition method for sparse unmixing of hyperspectral images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1174–1188. [Google Scholar] [CrossRef]
  12. Yang, S.; Shi, Z. Hyperspectral image target detection improvement based on total variation. IEEE Trans. Image Process. 2016, 25, 2249–2258. [Google Scholar] [CrossRef]
  13. Sun, L.; Wu, Z.; Liu, J.; Xiao, L.; Wei, Z. Supervised spectral–spatial hyperspectral image classification with weighted Markov random fields. IEEE Trans. Geosci. Remote Sens. 2014, 53, 1490–1503. [Google Scholar] [CrossRef]
  14. Sun, L.; Ma, C.; Chen, Y.; Zheng, Y.; Shim, H.J.; Wu, Z.; Jeon, B. Low rank component induced spatial-spectral kernel method for hyperspectral image classification. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3829–3842. [Google Scholar] [CrossRef]
  15. Chen, M.; Feng, S.; Zhao, C.; Qu, B.; Su, N.; Li, W.; Tao, R. Fractional Fourier Based Frequency-Spatial-Spectral Prototype Network for Agricultural Hyperspectral Image Open-Set Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5514014. [Google Scholar] [CrossRef]
  16. Lv, Z.; Li, G.; Jin, Z.; Benediktsson, J.A.; Foody, G.M. Iterative training sample expansion to increase and balance the accuracy of land classification from VHR imagery. IEEE Trans. Geosci. Remote Sens. 2020, 59, 139–150. [Google Scholar] [CrossRef]
  17. Fauvel, M.; Benediktsson, J.A.; Chanussot, J.; Sveinsson, J.R. Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles. IEEE Trans. Geosci. Remote Sens. 2008, 46, 3804–3814. [Google Scholar] [CrossRef]
  18. Benediktsson, J.A.; Palmason, J.A.; Sveinsson, J.R. Classification of hyperspectral data from urban areas based on extended morphological profiles. IEEE Trans. Geosci. Remote Sens. 2005, 43, 480–491. [Google Scholar] [CrossRef]
  19. Tang, Y.; Feng, S.; Zhao, C.; Fan, Y.; Shi, Q.; Li, W.; Tao, R. An object fine-grained change detection method based on frequency decoupling interaction for high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5600213. [Google Scholar] [CrossRef]
  20. Sun, L.; Ma, C.; Chen, Y.; Shim, H.J.; Wu, Z.; Jeon, B. Adjacent superpixel-based multiscale spatial-spectral kernel for hyperspectral classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 1905–1919. [Google Scholar] [CrossRef]
  21. Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep learning-based classification of hyperspectral data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107. [Google Scholar] [CrossRef]
  22. Chen, Y.; Zhao, X.; Jia, X. Spectral–spatial classification of hyperspectral data based on deep belief network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2381–2392. [Google Scholar] [CrossRef]
  23. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2017, 60, 84–90. [Google Scholar] [CrossRef]
  24. Lin, M. Network in network. arXiv 2013, arXiv:1312.4400. [Google Scholar] [CrossRef]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. 2016, pp. 770–778. Available online: https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html (accessed on 15 December 2024).
  26. Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework. IEEE Trans. Geosci. Remote Sens. 2017, 56, 847–858. [Google Scholar] [CrossRef]
  27. Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2019, 17, 277–281. [Google Scholar] [CrossRef]
  28. Wang, J.; Song, X.; Sun, L.; Huang, W.; Wang, J. A novel cubic convolutional neural network for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4133–4148. [Google Scholar] [CrossRef]
  29. Zheng, Y.; Liu, S.; Bruzzone, L. An Attention-Enhanced Feature Fusion Network (AeF2N) for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5511005. [Google Scholar] [CrossRef]
  30. Yu, C.; Zhao, X.; Gong, B.; Hu, Y.; Song, M.; Yu, H.; Chang, C.I. Distillation-constrained prototype representation network for hyperspectral image incremental classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5507414. [Google Scholar] [CrossRef]
  31. Zhang, Q.; Dong, Y.; Yuan, Q.; Song, M.; Yu, H. Combined deep priors with low-rank tensor factorization for hyperspectral image restoration. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5500205. [Google Scholar] [CrossRef]
  32. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  33. Sun, Q.; Sun, Y.; Pan, C. AIDB-Net: An Attention-Interactive Dual-Branch Convolutional Neural Network for Hyperspectral Pansharpening. Remote Sens. 2024, 16, 1044. [Google Scholar] [CrossRef]
  34. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  35. He, X.; Chen, Y.; Lin, Z. Spatial-spectral transformer for hyperspectral image classification. Remote Sens. 2021, 13, 498. [Google Scholar] [CrossRef]
  36. Simonyan, K. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  37. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5518615. [Google Scholar] [CrossRef]
  38. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408715. [Google Scholar] [CrossRef]
  39. Jiang, M.; Su, Y.; Gao, L.; Plaza, A.; Zhao, X.L.; Sun, X.; Liu, G. GraphGST: Graph generative structure-aware transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5504016. [Google Scholar] [CrossRef]
  40. Wang, D.; Zhuang, L.; Gao, L.; Sun, X.; Zhao, X.; Plaza, A. Sliding dual-window-inspired reconstruction network for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5504115. [Google Scholar] [CrossRef]
  41. Xiang, P.; Ali, S.; Zhang, J.; Jung, S.K.; Zhou, H. Pixel-associated autoencoder for hyperspectral anomaly detection. Int. J. Appl. Earth Obs. Geoinf. 2024, 129, 103816. [Google Scholar] [CrossRef]
  42. Pitas, I. Digital Image Processing Algorithms and Applications; John Wiley & Sons Inc.: Hoboken, NJ, USA, 2000; Volume 2, pp. 133–138. [Google Scholar]
  43. Baxes, G.A. Digital Image Processing: Principles and Applications; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1994. [Google Scholar]
  44. Song, G.; Sun, Q.; Zhang, L.; Su, R.; Shi, J.; He, Y. OPE-SR: Orthogonal position encoding for designing a parameter-free upsampling module in arbitrary-scale image super-resolution. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 10009–10020. [Google Scholar] [CrossRef]
  45. Li, S.; Xue, K.; Zhu, B.; Ding, C.; Gao, X.; Wei, D.; Wan, T. Falcon: A fourier transform based approach for fast and secure convolutional neural network predictions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8705–8714. Available online: https://openaccess.thecvf.com/content_CVPR_2020/html/Li_FALCON_A_Fourier_Transform_Based_Approach_for_Fast_and_Secure_CVPR_2020_paper.html (accessed on 15 December 2024).
  46. Yang, Y.; Soatto, S. Fda: Fourier domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4085–4095. Available online: https://openaccess.thecvf.com/content_CVPR_2020/html/Yang_FDA_Fourier_Domain_Adaptation_for_Semantic_Segmentation_CVPR_2020_paper.html (accessed on 20 December 2024).
  47. Ding, C.; Liao, S.; Wang, Y.; Li, Z.; Liu, N.; Zhuo, Y.; Wang, C.; Qian, X.; Bai, Y.; Yuan, G.; et al. Circnn: Accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, USA, 14–18 October 2017; pp. 395–408. [Google Scholar] [CrossRef]
  48. Chi, L.; Jiang, B.; Mu, Y. Fast fourier convolution. Adv. Neural Inf. Process. Syst. 2020, 33, 4479–4488. [Google Scholar]
  49. Rao, Y.; Zhao, W.; Zhu, Z.; Lu, J.; Zhou, J. Global filter networks for image classification. Adv. Neural Inf. Process. Syst. 2021, 34, 980–993. [Google Scholar]
  50. Lee, J.H.; Heo, M.; Kim, K.R.; Kim, C.S. Single-image depth estimation based on fourier domain analysis. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 330–339. Available online: https://openaccess.thecvf.com/content_cvpr_2018/html/Lee_Single-Image_Depth_Estimation_CVPR_2018_paper.html (accessed on 20 December 2024).
  51. Rao, Y.; Zhao, W.; Zhu, Z.; Lu, J.; Zhou, J. Global Filter Networks for Image Classification. arXiv 2021, arXiv:2107.00645. [Google Scholar] [CrossRef]
  52. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. Available online: https://openaccess.thecvf.com/content_CVPR_2019/html/Zhu_Deformable_ConvNets_V2_More_Deformable_Better_Results_CVPR_2019_paper.html (accessed on 5 December 2024).
  53. Ashish, V. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, I. [Google Scholar]
  54. USGS. Hyperspectral Image Data Set: Indian Pines. In Proceedings of the AVIRIS Sensor Data Products. 1992. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes (accessed on 10 October 2024).
  55. DLR. Hyperspectral Image Data Set: University of Pavia. In Proceedings of the ROSIS-03 Sensor Data Products. 2003. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes (accessed on 10 October 2024).
  56. USGS. Hyperspectral Image Data Set: Salinas. In Proceedings of the A VIRIS Sensor Data Products. 1992. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes (accessed on 10 October 2024).
  57. Gader, P.; Zare, A.; Close, R.; Aitken, J.; Tuell, G. MU UFL Gulfport Hyperspectral and LiDAR Airborne Data Set; Technical Report Rep-2013-570; University of Florida: Gainesville, FL, USA, 2013; Available online: https://hyperspectral.ee.uh.edu/?page_id=459 (accessed on 5 December 2024).
  58. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep convolutional neural networks for hyperspectral image classification. J. Sens. 2015, 2015, 258619. [Google Scholar] [CrossRef]
  59. Zhao, W.; Du, S. Spectral–spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4544–4554. [Google Scholar] [CrossRef]
  60. Yue, J.; Zhao, W.; Mao, S.; Liu, H. Spectral–spatial classification of hyperspectral images using deep convolutional neural networks. Remote Sens. Lett. 2015, 6, 468–477. [Google Scholar] [CrossRef]
  61. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [Google Scholar] [CrossRef]
  62. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522214. [Google Scholar] [CrossRef]
  63. Qiao, X.; Huang, W. Spectral-Spatial-Frequency Transformer Network for Hyperspectral Image Classification. In Proceedings of the 2023 IEEE Sensors Applications Symposium (SAS), Ottawa, ON, Canada, 18–20 July 2023. [Google Scholar] [CrossRef]
  64. Yu, C.; Zhu, Y.; Song, M.; Wang, Y.; Zhang, Q. Unseen feature extraction: Spatial mapping expansion with spectral compression network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5521915. [Google Scholar] [CrossRef]
  65. Roy, S.K.; Deria, A.; Shah, C.; Haut, J.M.; Du, Q.; Plaza, A. Spectral–Spatial Morphological Attention Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5503615. [Google Scholar] [CrossRef]
  66. Jin, M.; Li, K.; Li, S.; He, C.; Li, X. Towards Realizing the Value of Labeled Target Samples: A Two-Stage Approach for Semi-Supervised Domain Adaptation. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023. [Google Scholar] [CrossRef]
Figure 1. The basic architecture of FSFF-Net. (a) Deformable Conv. (b) Frequency-domain transformer (FDformer). (c) Channel fusion (CF). (d) Spatial fusion (SF). (e) Deformable gate feed-forward network (DGFN).
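As a reading aid for the frequency-domain transformer (FDformer) named in Figure 1b, the snippet below is a minimal, hypothetical PyTorch layer in the spirit of a frequency-domain filtering block: it maps a feature map to the frequency domain with a 2D FFT, re-weights the frequency components with a learnable filter, and maps the result back with an inverse FFT. The class name, patch size, channel count, and initialization are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical frequency-domain filtering layer: 2D FFT -> learnable
# filter -> inverse 2D FFT. Shapes and names are illustrative only.
import torch
import torch.nn as nn


class FrequencyFilterBlock(nn.Module):
    def __init__(self, height: int, width: int, channels: int):
        super().__init__()
        # rfft2 keeps only width // 2 + 1 frequency columns for real inputs;
        # the last dimension of size 2 stores the real and imaginary parts.
        self.weight = nn.Parameter(
            torch.randn(channels, height, width // 2 + 1, 2) * 0.02
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) real-valued feature maps.
        freq = torch.fft.rfft2(x, dim=(-2, -1), norm="ortho")
        # Element-wise complex multiplication acts as an adaptive filter
        # that re-weights each frequency component.
        freq = freq * torch.view_as_complex(self.weight)
        # Back to the spatial domain; the output keeps the input size.
        return torch.fft.irfft2(freq, s=x.shape[-2:], dim=(-2, -1), norm="ortho")


if __name__ == "__main__":
    block = FrequencyFilterBlock(height=13, width=13, channels=64)
    patch = torch.randn(8, 64, 13, 13)   # a toy batch of embedded HSI patches
    print(block(patch).shape)            # torch.Size([8, 64, 13, 13])
```

Because the filter reduces to one complex weight per channel and frequency bin, its parameter cost is typically much smaller than the projection matrices of a self-attention block operating on the same feature map.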
Figure 2. Indian Pines dataset. (a) False-color map. (b) Ground truth.
Figure 3. Pavia University dataset. (a) False-color map. (b) Ground truth.
Figure 4. Salinas Scene dataset. (a) False-color map. (b) Ground truth.
Figure 5. The Houston 2013 dataset. (a) False-color map. (b) Ground truth.
Figure 6. Classification maps for the Indian Pines dataset. (a) Ground truth. (b) SVM. (c) 1D-CNN. (d) 2D-CNN. (e) 3D-CNN. (f) HybridSN. (g) SSFTT. (h) SSFT. (i) SMESC. (j) MorphF. (k) Ours.
Figure 7. Classification maps for the Pavia University dataset. (a) Ground truth. (b) SVM. (c) 1D-CNN. (d) 2D-CNN. (e) 3D-CNN. (f) HybridSN. (g) SSFTT. (h) SSFT. (i) SMESC. (j) MorphF. (k) Ours.
Figure 8. Classification maps for the Salinas Valley dataset. (a) Ground truth. (b) SVM. (c) 1D-CNN. (d) 2D-CNN. (e) 3D-CNN. (f) HybridSN. (g) SSFTT. (h) SSFT. (i) SMESC. (j) MorphF. (k) Ours.
Figure 9. Classification maps for the Houston 2013 dataset. (a) Ground truth. (b) SVM. (c) 1D-CNN. (d) 2D-CNN. (e) 3D-CNN. (f) HybridSN. (g) SSFTT. (h) SSFT. (i) SMESC. (j) MorphF. (k) Ours.
Figure 10. Classification maps of the ablation study on the PU dataset. (a) Ground truth. (b) ViT + DCONV. (c) Without FAF. (d) With FFN. (e) DF end. (f) Ours.
Figure 11. The impact of the size of the selected 3D image block on the results.
Figure 12. The impact of the number of spectral bands after PCA dimensionality reduction on the results.
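Figure 12 varies the number of spectral bands retained after PCA. A minimal sketch of that preprocessing step is shown below; the cube size and the choice of 30 components are placeholders for illustration, not the paper's fixed setting.

```python
# Sketch of PCA spectral reduction for an HSI cube: flatten to
# (pixels, bands), project to n_components, and reshape back.
import numpy as np
from sklearn.decomposition import PCA


def reduce_bands(cube: np.ndarray, n_components: int = 30) -> np.ndarray:
    h, w, bands = cube.shape
    flat = cube.reshape(-1, bands)                    # one spectrum per pixel
    reduced = PCA(n_components=n_components, whiten=True).fit_transform(flat)
    return reduced.reshape(h, w, n_components)


cube = np.random.rand(145, 145, 200)                  # Indian Pines-sized toy cube
print(reduce_bands(cube, 30).shape)                   # (145, 145, 30)
```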
Figure 13. The effect of different training sample ratios on OA. (a) Indian Pines. (b) Pavia University. (c) Salinas. (d) Houston 2013.
Table 1. Training and testing settings of IP.
| No. | Class Name | Training | Testing | Total |
| --- | --- | --- | --- | --- |
| 1 | Alfalfa | 5 | 41 | 46 |
| 2 | Corn-notill | 143 | 1285 | 1428 |
| 3 | Corn-mintill | 83 | 747 | 830 |
| 4 | Corn | 24 | 213 | 237 |
| 5 | Grass-pasture | 48 | 435 | 483 |
| 6 | Grass-trees | 73 | 657 | 730 |
| 7 | Grass-pasture-mowed | 3 | 25 | 28 |
| 8 | Hay-windrowed | 48 | 430 | 478 |
| 9 | Oats | 2 | 18 | 20 |
| 10 | Soybean-notill | 97 | 875 | 972 |
| 11 | Soybean-mintill | 245 | 2210 | 2455 |
| 12 | Soybean-clean | 59 | 534 | 593 |
| 13 | Wheat | 20 | 185 | 205 |
| 14 | Woods | 126 | 1139 | 1265 |
| 15 | Building-Grass-Trees-Drives | 39 | 347 | 386 |
| 16 | Stone-Steel-Towers | 9 | 84 | 93 |
|  | Total | 1024 | 9225 | 10,249 |
Table 2. Training and testing settings of PU.
| No. | Class Name | Training | Testing | Total |
| --- | --- | --- | --- | --- |
| 1 | Asphalt | 332 | 6299 | 6631 |
| 2 | Meadows | 932 | 17,717 | 18,649 |
| 3 | Gravel | 105 | 1994 | 2099 |
| 4 | Trees | 153 | 2911 | 3064 |
| 5 | Metal sheets | 67 | 1278 | 1345 |
| 6 | Bare soil | 251 | 4778 | 5029 |
| 7 | Bitumen | 67 | 1263 | 1330 |
| 8 | Bricks | 184 | 3498 | 3682 |
| 9 | Shadows | 47 | 900 | 947 |
|  | Total | 2138 | 40,638 | 42,776 |
Table 3. Training and testing settings of the SA.
| No. | Class Name | Training | Testing | Total |
| --- | --- | --- | --- | --- |
| 1 | Broccoli green weeds 1 | 10 | 1999 | 2009 |
| 2 | Broccoli green weeds 2 | 19 | 3707 | 3726 |
| 3 | Fallow | 10 | 1966 | 1976 |
| 4 | Fallow rough plow | 7 | 1387 | 1394 |
| 5 | Fallow smooth | 13 | 2665 | 2678 |
| 6 | Stubble | 20 | 3939 | 3959 |
| 7 | Celery | 18 | 3561 | 3579 |
| 8 | Grapes untrained | 56 | 11,215 | 11,271 |
| 9 | Soil vineyard develop | 31 | 6172 | 6203 |
| 10 | Corn senescence green | 16 | 3262 | 3278 |
| 11 | Lettuce romaine 4wk | 5 | 1063 | 1068 |
| 12 | Lettuce romaine 5wk | 10 | 1917 | 1927 |
| 13 | Lettuce romaine 6wk | 5 | 911 | 916 |
| 14 | Lettuce romaine 7wk | 5 | 1065 | 1070 |
| 15 | Vineyard untrained | 36 | 7232 | 7268 |
| 16 | Vineyard vertical trellis | 9 | 1798 | 1807 |
|  | Total | 270 | 53,859 | 54,129 |
Table 4. Training and testing settings of the Houston 2013 dataset.
| No. | Class Name | Training | Testing | Total |
| --- | --- | --- | --- | --- |
| 1 | Healthy Grass | 125 | 1126 | 1251 |
| 2 | Stressed Grass | 125 | 1129 | 1254 |
| 3 | Synthetic Grass | 70 | 627 | 697 |
| 4 | Tree | 124 | 1120 | 1244 |
| 5 | Soil | 124 | 1118 | 1242 |
| 6 | Water | 33 | 292 | 325 |
| 7 | Residential | 127 | 1141 | 1268 |
| 8 | Commercial | 124 | 1120 | 1244 |
| 9 | Road | 125 | 1127 | 1252 |
| 10 | Highway | 123 | 1104 | 1227 |
| 11 | Railway | 123 | 1112 | 1235 |
| 12 | Parking Lot 1 | 123 | 1110 | 1233 |
| 13 | Parking Lot 2 | 47 | 422 | 469 |
| 14 | Tennis Court | 43 | 385 | 428 |
| 15 | Running Track | 66 | 594 | 660 |
|  | Total | 1502 | 13,527 | 15,029 |
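Tables 1–4 list the per-class training and testing counts used on the four datasets. The sketch below illustrates how such a per-class random split can be drawn from a ground-truth map; the helper name, the toy class counts, and the fixed seed are assumptions for demonstration rather than the exact sampling protocol of the paper.

```python
# Illustrative per-class random split in the style of Tables 1-4: for each
# land-cover class, a fixed number of labeled pixels is drawn for training
# and the remainder is kept for testing.
import numpy as np


def split_per_class(labels: np.ndarray, train_counts: dict, seed: int = 0):
    """labels: 2-D ground-truth map with 0 = unlabeled, 1..K = classes."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls, n_train in train_counts.items():
        idx = np.flatnonzero(labels.ravel() == cls)   # flat pixel indices of this class
        rng.shuffle(idx)
        train_idx.append(idx[:n_train])
        test_idx.append(idx[n_train:])
    return np.concatenate(train_idx), np.concatenate(test_idx)


# Toy example with a random 3-class map; the counts are arbitrary.
gt = np.random.default_rng(1).integers(0, 4, size=(50, 50))
tr, te = split_per_class(gt, train_counts={1: 20, 2: 20, 3: 20})
print(len(tr), len(te))
```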
Table 5. Quantitative classification results of IP (the best results are shown in bold).
| Class | SVM | 1D-CNN | 2D-CNN | 3D-CNN | HybridSN | SSFTT | SSFT | SMESC | MorphF | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 70.27 | 100.00 | 100.00 | 96.97 | 86.36 | 97.22 | 90.24 | 67.44 | 84.09 | 95.12 |
| 2 | 61.82 | 71.49 | 78.99 | 78.48 | 93.92 | 95.09 | 98.94 | 96.88 | 95.96 | 98.03 |
| 3 | 38.70 | 82.76 | 90.56 | 84.57 | 89.30 | 96.65 | 98.25 | 92.01 | 97.35 | 98.27 |
| 4 | 25.40 | 73.23 | 85.09 | 85.94 | 96.39 | 93.19 | 88.73 | 85.91 | 94.76 | 94.26 |
| 5 | 91.47 | 91.08 | 94.84 | 95.10 | 97.01 | 98.33 | 94.85 | 96.71 | 93.60 | 98.46 |
| 6 | 92.98 | 97.74 | 98.61 | 98.11 | 98.72 | 99.15 | 98.98 | 98.48 | 95.98 | 98.10 |
| 7 | 77.27 | 100.00 | 69.23 | 93.33 | 100.00 | 100.00 | 60.61 | 100.00 | 100.00 | 60.00 |
| 8 | 95.55 | 94.94 | 99.20 | 98.19 | 99.12 | 99.22 | 98.44 | 98.47 | 99.21 | 99.22 |
| 9 | 37.55 | 0.00 | 90.00 | 100.00 | 100.00 | 100.00 | 100.00 | 76.19 | 99.63 | 51.52 |
| 10 | 57.97 | 77.79 | 75.77 | 83.54 | 91.32 | 91.45 | 95.33 | 92.60 | 98.51 | 99.50 |
| 11 | 73.49 | 81.05 | 86.96 | 84.72 | 94.52 | 97.45 | 99.23 | 98.46 | 98.27 | 99.28 |
| 12 | 31.58 | 72.60 | 79.86 | 81.17 | 87.42 | 95.57 | 94.92 | 89.47 | 95.54 | 95.62 |
| 13 | 90.85 | 98.18 | 96.93 | 100.00 | 98.77 | 100.00 | 100.00 | 100.00 | 100.00 | 96.69 |
| 14 | 94.07 | 95.02 | 95.31 | 95.59 | 99.06 | 99.00 | 99.80 | 99.40 | 98.92 | 99.35 |
| 15 | 41.56 | 78.01 | 82.69 | 85.17 | 90.93 | 91.12 | 96.86 | 92.40 | 98.09 | 97.89 |
| 16 | 81.33 | 94.37 | 88.46 | 96.10 | 92.41 | 97.33 | 87.14 | 92.50 | 92.96 | 92.41 |
| OA (%) | 68.99 | 83.32 | 87.41 | 87.43 | 94.47 | 96.71 | 97.67 | 95.85 | 97.20 | 98.03 |
| AA (%) | 66.36 | 72.58 | 83.87 | 82.55 | 92.89 | 95.83 | 96.68 | 93.87 | 97.29 | 97.95 |
| k × 100 | 64.34 | 80.91 | 85.62 | 86.62 | 93.69 | 96.24 | 97.34 | 95.26 | 96.81 | 97.76 |
| F1 (%) | 58.26 | 76.61 | 85.95 | 86.43 | 91.56 | 96.24 | 95.05 | 92.68 | 95.42 | 95.21 |
Table 6. Quantitative classification results of PU (the best results are shown in bold).
| Class | SVM | 1D-CNN | 2D-CNN | 3D-CNN | HybridSN | SSFTT | SSFT | SMESC | MorphF | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 92.29 | 97.01 | 97.92 | 96.97 | 99.51 | 99.48 | 99.22 | 97.30 | 99.28 | 99.72 |
| 2 | 96.86 | 98.41 | 99.21 | 99.45 | 99.55 | 99.77 | 99.79 | 99.84 | 99.66 | 99.72 |
| 3 | 68.14 | 88.74 | 92.86 | 92.93 | 93.41 | 97.67 | 95.17 | 98.65 | 97.88 | 98.89 |
| 4 | 87.32 | 97.52 | 96.23 | 99.30 | 99.52 | 99.50 | 99.26 | 99.65 | 99.32 | 99.29 |
| 5 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 99.91 | 100.00 | 97.20 | 98.21 | 99.22 |
| 6 | 76.78 | 96.38 | 97.87 | 99.11 | 99.67 | 99.37 | 99.75 | 99.63 | 100.00 | 100.00 |
| 7 | 59.68 | 93.63 | 97.38 | 97.47 | 98.83 | 99.12 | 99.71 | 98.59 | 96.84 | 98.52 |
| 8 | 86.15 | 85.95 | 85.98 | 90.56 | 92.15 | 93.24 | 94.17 | 89.45 | 98.77 | 99.36 |
| 9 | 98.87 | 99.38 | 96.58 | 99.63 | 94.83 | 99.75 | 99.34 | 97.83 | 97.52 | 98.96 |
| OA (%) | 89.79 | 96.24 | 97.02 | 97.80 | 98.48 | 98.95 | 98.93 | 98.21 | 99.27 | 99.57 |
| AA (%) | 85.23 | 94.66 | 95.48 | 96.59 | 97.54 | 98.27 | 98.35 | 96.34 | 98.25 | 98.93 |
| k × 100 | 86.31 | 95.01 | 96.05 | 97.16 | 97.99 | 98.61 | 98.58 | 97.63 | 99.03 | 99.44 |
| F1 (%) | 88.54 | 95.11 | 96.75 | 97.80 | 97.96 | 98.43 | 98.42 | 97.74 | 98.42 | 99.10 |
Table 7. Quantitative classification results of SA (the best results are shown in bold).
| Class | SVM | 1D-CNN | 2D-CNN | 3D-CNN | HybridSN | SSFTT | SSFT | SMESC | MorphF | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 99.60 | 98.74 | 100.00 | 100.00 | 99.10 | 84.39 | 100.00 | 99.40 | 99.95 | 98.51 |
| 2 | 96.43 | 95.69 | 99.51 | 98.18 | 100.00 | 98.46 | 99.73 | 99.32 | 99.92 | 98.03 |
| 3 | 94.64 | 99.49 | 97.48 | 99.79 | 91.62 | 100.00 | 98.47 | 89.08 | 78.97 | 92.16 |
| 4 | 92.90 | 94.19 | 87.18 | 92.58 | 94.74 | 58.81 | 89.97 | 81.79 | 79.74 | 86.95 |
| 5 | 78.12 | 94.31 | 97.53 | 84.72 | 97.97 | 87.77 | 98.74 | 94.14 | 95.55 | 97.61 |
| 6 | 99.00 | 96.33 | 99.39 | 95.24 | 99.44 | 97.07 | 99.42 | 98.66 | 99.40 | 99.46 |
| 7 | 98.00 | 99.00 | 98.97 | 99.78 | 99.57 | 98.47 | 99.92 | 99.92 | 96.19 | 99.97 |
| 8 | 68.64 | 79.14 | 77.27 | 79.88 | 78.44 | 95.67 | 89.15 | 99.43 | 94.81 | 95.00 |
| 9 | 86.03 | 99.39 | 98.57 | 99.09 | 98.69 | 100.00 | 96.29 | 98.63 | 99.78 | 100.00 |
| 10 | 96.41 | 96.12 | 98.31 | 92.73 | 98.34 | 95.06 | 98.07 | 98.29 | 83.51 | 98.50 |
| 11 | 92.79 | 94.00 | 98.20 | 92.87 | 83.33 | 99.83 | 95.14 | 89.89 | 96.70 | 92.34 |
| 12 | 90.07 | 93.72 | 88.82 | 96.52 | 92.00 | 94.94 | 91.24 | 84.90 | 99.37 | 96.69 |
| 13 | 57.54 | 49.75 | 65.30 | 82.95 | 81.61 | 85.96 | 64.78 | 85.96 | 97.57 | 100.00 |
| 14 | 87.84 | 50.70 | 73.89 | 90.47 | 98.67 | 75.97 | 91.35 | 94.94 | 97.48 | 95.75 |
| 15 | 69.64 | 73.85 | 75.77 | 79.89 | 80.53 | 98.49 | 67.19 | 95.70 | 94.59 | 96.50 |
| 16 | 99.42 | 82.55 | 97.35 | 87.25 | 98.69 | 100.00 | 100.00 | 96.06 | 97.86 | 100.00 |
| OA (%) | 82.83 | 87.59 | 89.55 | 90.02 | 90.89 | 94.73 | 90.10 | 96.42 | 94.84 | 97.05 |
| AA (%) | 83.11 | 87.61 | 90.22 | 91.20 | 92.08 | 93.41 | 93.25 | 94.69 | 95.06 | 96.52 |
| k × 100 | 82.09 | 86.18 | 88.33 | 88.86 | 89.84 | 94.13 | 90.01 | 96.08 | 94.26 | 96.72 |
| F1 (%) | 80.29 | 86.86 | 91.06 | 91.47 | 94.02 | 96.24 | 93.18 | 94.56 | 94.61 | 97.02 |
Table 8. Quantitative classification results of the Houston 2013 dataset (the best results are shown in bold).
| Class | SVM | 1D-CNN | 2D-CNN | 3D-CNN | HybridSN | SSFTT | SSFT | SMESC | MorphF | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 95.90 | 97.60 | 97.59 | 98.19 | 100.00 | 96.31 | 99.06 | 100.00 | 100.00 | 99.50 |
| 2 | 98.70 | 96.66 | 96.62 | 97.85 | 99.37 | 99.39 | 99.48 | 97.27 | 98.15 | 99.30 |
| 3 | 99.99 | 99.00 | 98.63 | 100.00 | 99.33 | 100.00 | 100.00 | 96.99 | 100.00 | 100.00 |
| 4 | 96.79 | 97.99 | 98.50 | 98.00 | 99.60 | 97.66 | 98.95 | 99.16 | 95.57 | 93.79 |
| 5 | 99.30 | 97.30 | 98.00 | 99.99 | 100.00 | 99.10 | 100.00 | 97.72 | 100.00 | 98.91 |
| 6 | 99.62 | 98.19 | 98.00 | 99.00 | 99.92 | 100.00 | 100.00 | 99.32 | 99.63 | 100.00 |
| 7 | 92.01 | 96.99 | 97.65 | 98.28 | 99.97 | 96.94 | 95.79 | 100.00 | 97.50 | 97.65 |
| 8 | 79.72 | 93.77 | 97.58 | 97.30 | 99.95 | 100.00 | 95.80 | 99.79 | 98.38 | 97.82 |
| 9 | 81.44 | 90.85 | 94.93 | 95.28 | 99.98 | 95.04 | 94.36 | 100.00 | 92.89 | 96.39 |
| 10 | 80.73 | 94.16 | 95.30 | 95.13 | 99.76 | 98.30 | 95.71 | 93.21 | 98.21 | 98.30 |
| 11 | 86.44 | 95.37 | 95.81 | 98.10 | 99.90 | 99.60 | 96.60 | 92.63 | 99.24 | 99.90 |
| 12 | 74.57 | 90.16 | 97.28 | 95.96 | 97.25 | 98.88 | 96.50 | 92.92 | 99.90 | 100.00 |
| 13 | 63.20 | 93.63 | 95.44 | 97.61 | 99.76 | 98.06 | 95.18 | 100.00 | 98.70 | 100.00 |
| 14 | 99.42 | 93.20 | 98.12 | 99.99 | 99.99 | 99.99 | 100.00 | 99.53 | 100.00 | 100.00 |
| 15 | 98.11 | 99.97 | 97.25 | 95.65 | 99.97 | 98.51 | 97.10 | 99.48 | 98.94 | 99.98 |
| OA (%) | 89.30 | 95.39 | 96.51 | 97.01 | 97.37 | 98.27 | 97.43 | 97.85 | 98.21 | 98.40 |
| AA (%) | 89.73 | 94.69 | 96.25 | 97.04 | 97.23 | 97.90 | 97.65 | 96.27 | 98.22 | 98.38 |
| k × 100 | 88.43 | 95.09 | 96.39 | 97.65 | 97.24 | 98.12 | 97.22 | 97.87 | 98.07 | 98.27 |
| F1 (%) | 88.26 | 96.14 | 96.61 | 97.22 | 97.47 | 98.18 | 97.63 | 97.54 | 98.34 | 98.50 |
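Tables 5–8 report overall accuracy (OA), average accuracy (AA), the kappa coefficient scaled by 100, and F1. The snippet below shows one standard way to compute these scores from the predicted and true labels of the test pixels using scikit-learn; macro averaging for F1 is an assumption here, since the averaging scheme is not restated in the tables.

```python
# Sketch of the evaluation metrics reported in Tables 5-8.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, f1_score


def summarize(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    cm = confusion_matrix(y_true, y_pred)
    per_class_acc = np.diag(cm) / cm.sum(axis=1)          # per-class recall
    return {
        "OA (%)": 100.0 * np.trace(cm) / cm.sum(),        # overall accuracy
        "AA (%)": 100.0 * per_class_acc.mean(),           # average accuracy
        "k x 100": 100.0 * cohen_kappa_score(y_true, y_pred),
        "F1 (%)": 100.0 * f1_score(y_true, y_pred, average="macro"),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 16, size=9225)               # e.g. the IP test-set size
    y_pred = y_true.copy()
    y_pred[rng.random(9225) < 0.02] = 0                   # inject roughly 2% errors
    print(summarize(y_true, y_pred))
```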
Table 9. Ablation study of the proposed model using PU (the best results are shown in bold).
| Indicators | ViT + DCONV | Without FAF | With FFN | Ours |
| --- | --- | --- | --- | --- |
| OA (%) | 98.25 | 99.27 | 99.13 | 99.57 |
| AA (%) | 97.32 | 98.25 | 98.09 | 98.93 |
| k × 100 | 98.00 | 99.03 | 98.84 | 99.44 |
| F1 (%) | 98.29 | 98.42 | 98.38 | 99.10 |
Table 10. Classification outcomes of PU by ours and DF end (the best results are shown in bold).
| Method | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | OA (%) | AA (%) | k × 100 | F1 (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DF end | 98.77 | 99.52 | 97.73 | 97.63 | 96.62 | 99.85 | 98.60 | 98.65 | 93.34 | 98.90 | 97.43 | 98.54 | 98.32 |
| Ours | 99.52 | 99.72 | 98.89 | 99.29 | 99.22 | 100.00 | 98.52 | 99.36 | 98.96 | 99.57 | 98.93 | 99.44 | 99.10 |
Table 11. Test time and parameter count of the proposed model and ViT + DCONV on PU.
Table 11. Test time and parameter count of the proposed model and ViT + DCONV on PU.
| Method | Test Time (s) | Parameter Count |
| --- | --- | --- |
| Ours | 48 | 3,172,003 |
| ViT + DCONV | 72 | 7,936,299 |
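The parameter counts and test times in Table 11 can be measured in the usual PyTorch way, sketched below; the toy model and the timing helper are placeholders rather than the actual FSFF-Net or its test loader.

```python
# Sketch of how parameter counts and test times like those in Table 11 are
# typically measured; `model` and `test_loader` stand in for any trained
# network and its test set.
import time
import torch


def count_parameters(model: torch.nn.Module) -> int:
    # Number of trainable parameters.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


@torch.no_grad()
def timed_inference(model: torch.nn.Module, test_loader) -> float:
    # Wall-clock time for one full pass over the test loader.
    model.eval()
    start = time.perf_counter()
    for batch, _ in test_loader:
        model(batch)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Toy stand-in model: a 3x3 conv on 13x13 patches yields 11x11 maps.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(30, 64, kernel_size=3),
        torch.nn.Flatten(),
        torch.nn.Linear(64 * 11 * 11, 16),
    )
    print(count_parameters(model))
```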