Frequency–Spatial Domain Jointly Guided Perceptual Network for Infrared Small Target Detection

Han, Yeteng; Ye, Minrui; Liu, Bohan; Li, Jie; Jia, Chaoxian; Cui, Wennan; Zhang, Tao

doi:10.3390/rs18071000

Open AccessArticle

Frequency–Spatial Domain Jointly Guided Perceptual Network for Infrared Small Target Detection

by

Yeteng Han

^1,2,†

,

Minrui Ye

^2,†,

Bohan Liu

^1,2,

Jie Li

²,

Chaoxian Jia

^3,4,

Wennan Cui

² and

Tao Zhang

^1,2,*

¹

University of Chinese Academy of Sciences, Beijing 101408, China

²

No. 1 Department of Engineering, Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China

³

Institute of Marine Equipment, Shanghai Jiao Tong University, Shanghai 200240, China

⁴

Shanghai Chang Xing Ocean Laboratory, Shanghai 201913, China

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Remote Sens. 2026, 18(7), 1000; https://doi.org/10.3390/rs18071000

Submission received: 12 February 2026 / Revised: 10 March 2026 / Accepted: 24 March 2026 / Published: 26 March 2026

(This article belongs to the Special Issue Deep Learning-Based Small-Target Detection in Remote Sensing)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

We proposed a novel framework FSGPNet that integrates frequency-domain enhancement and spatial-domain feature selection within the whole encoder–decoder backbone.
The GTAM module leverages the directional selectivity of Gabor filters integrated with Transformer attention.

What are the implications of the main findings?

This work provides a new architectural paradigm for coupling physics-informed diffusion equations with data-driven deep learning in remote sensing image restoration.
Extensive benchmarks on the IRSTD-1K and NUDT-SIRST datasets demonstrate that the proposed method significantly reduces false alarms and improves segmentation accuracy in low-SNR scenarios.

Abstract

Infrared small target detection is a critical task in remote sensing. However, it remains highly challenging due to low contrast, heavy background clutter, and large variations in target scale. Traditional convolutional networks are inadequate for joint modeling, as they cannot effectively capture both fine structural details and global contextual dependencies. To address these issues, we propose FSGPNet, a frequency–spatial domain jointly guided perceptual network that explicitly exploits complementary representations in both the frequency and spatial domains. Specifically, a Frequency–Spatial Enhancement Module (FSEM) is introduced to strengthen target details while suppressing background interference through high-frequency enhancement and Perona–Malik diffusion. To enhance global context modeling, we propose a Multi-Scale Global Perception (MSGP) module that integrates non-local attention with multi-scale dilated convolutions, enabling robust background modeling. Furthermore, a Gabor Transformer Attention Module (GTAM) is designed to achieve selective frequency–spatial feature aggregation via self-attention over multi-directional and multi-scale Gabor responses, effectively highlighting discriminative structures of various small targets. Extensive experiments are conducted on two benchmark datasets (IRSTD-1K and NUDT-SIRST) that cover typical remote sensing infrared scenarios. Quantitative and qualitative results demonstrate that FSGPNet consistently outperforms state-of-the-art methods across multiple evaluation metrics. These findings validate the effectiveness and robustness of the proposed FSGPNet for detecting small infrared targets in remote sensing applications.

Keywords:

infrared small target detection; attention mechanism; frequency-domain features; Gabor filtering

1. Introduction

Infrared small target detection (IRSTD) has attracted extensive attention in recent years due to its critical role in remote sensing, maritime search and rescue [1], and early-warning surveillance. According to the SPIE definition [2], an infrared small target typically occupies less than 0.13% of the image area (e.g., less than 9 × 9 pixels in a 256 × 256 image). Such an extremely limited spatial extent results in weak and sparse features that are easily submerged by strong background clutter and sensor noise.

Early methods in IRSTD relied mainly on handcrafted representations, such as filtering, saliency analysis, and low-rank decomposition [3,4,5,6,7,8,9,10,11,12,13,14]. Although these methods can yield acceptable performance in simple backgrounds, their strong dependence on prior assumptions limits robustness in complex scenes. With the rapid development of deep learning, CNN-based approaches have substantially improved detection accuracy [15,16,17,18] by automatically learning discriminative features. Representative works, such as ACM [19], DNA-Net [20], and UIU-Net [21], adopt encoder–decoder architectures to better retain fine-grained target details and spatial localization cues. Furthermore, attention mechanisms [22,23,24], multi-scale architectures [25,26,27,28], contrast-enhancement modules [29,30,31,32] and contextual modeling [33,34,35] have been widely explored to enhance the representation of weak, irregular, and low-contrast targets. Despite this progress, existing methods still face several limitations: (1) insufficient preservation of boundary information and structural details; (2) vulnerability to missed detections and false alarms when targets are submerged in low signal-to-noise ratio (SNR) backgrounds; (3) poor adaptability to variations in target scale and morphology under complex backgrounds. In addition, the scarcity of publicly annotated IRSTD datasets further restricts the generalizability of learning-based approaches [19,20,29,36].

Recently, a growing body of work has begun to explore enhancement of features in both the frequency and spatial domains [14,37,38,39]. In the spatial domain, researchers have improved traditional filtering schemes to suppress noise while preserving edges [40,41,42], or they have employed non-local attention and contextual modeling to strengthen spatial saliency [34,43,44]. Another line of work improves gradient-related cues to emphasize boundary and detailed information [45,46], for example, by introducing gradient-guided branches or combining Sobel/Laplacian responses with deep features to accentuate target–background differences [47,48]. In the frequency domain, methods based on the discrete cosine transform (DCT) and the discrete wavelet transform (DWT) extract spectral components to suppress the low-frequency background and highlight the high-frequency edge cues [38,49,50]. And, frequency attention mechanisms further refine the discrimination capability in the spectral space [51,52].

However, existing approaches reveal a clear dichotomy: some methods focus primarily on spatial-domain enhancement (e.g., spatial attention, non-local modeling, contextual refinement), while others emphasize explicit frequency-domain analysis (e.g., spectral decomposition, frequency attention). Although a few recent attempts have explored frequency–spatial joint modeling, they typically incorporate frequency or spatial enhancement merely as external preprocesses [37,40,53] or parallel auxiliary branches [54,55], failing to achieve a unified integration within the whole encoder–decoder backbone. This architectural flaw leads to a critical limitation: under SNR conditions, the weak features of small targets undergo continuous attenuation throughout the successive encoding and decoding processes. Therefore, we argue that feature enhancement and global-context reasoning must be co-designed at the architectural level.

To address these challenges, we propose a frequency–spatial domain jointly guided perceptual network. Specifically, we design a frequency–spatial feature enhancement encoder that applies DCT to extract high-frequency information for feature enhancement in detail while employing the Perona–Malik Diffusion (PMD) equation [56] to preserve the edge structure and smooth homogeneous regions. This complementary design strengthens edge cues and reduces detail loss in the encoding stage. In the bottleneck, we introduce a multi-scale global perception module that enhances long-range dependency modeling and stabilizes target localization under weak contrast. During the decoding stage, we integrate multi-scale, multi-orientation Gabor filters [57] using a self-attention mechanism, which aggregates differential target features. The entire end-to-end architecture enables small-target detail preservation, scale adaptability, and robustness against background clutter. The key innovation of this work is the unified integration of frequency-domain enhancement and spatial-domain feature selection within the whole encoder–decoder backbone, rather than incorporating them merely as external preprocessing or auxiliary branches.

The main contributions of this work are summarized as follows:

We propose a Frequency–Spatial Enhancement Module (FSEM) to strengthen target details while suppressing background interference through high-frequency enhancement and Perona–Malik diffusion, effectively preserving edge and shape characteristics while suppressing noise.
We design a Multi-Scale Global Perception (MSGP) module that integrates non-local attention with multi-scale dilated convolutions, enabling robust background modeling and improving detection stability under low-SNR conditions.
We introduce a Gabor Transformer Attention Module (GTAM) to achieve selective frequency–spatial feature aggregation via self-attention over multi-directional and multi-scale Gabor responses, effectively highlighting discriminative structures of various small targets. It collaboratively enhances spatial global modeling and frequency selectivity, improving the discriminability of irregularly shaped targets.

Experiments on public datasets covered typical remote sensing infrared scenarios demonstrate that the proposed method achieves comparable performance with state-of-the-art approaches in both segmentation accuracy and robustness.

2. Related Work

2.1. Infrared Small Target Detection

Existing infrared small target detection (IRSTD) methods can be broadly categorized according to their modeling paradigms and representational capacity.

Early studies were dominated by handcrafted, prior-driven approaches that exploit local contrast, structural saliency, or low-rank properties to separate small targets from cluttered backgrounds [3,4,5,6,7,8,9,10,11,12,13,14]. These methods are computationally efficient and interpretable but rely heavily on fixed assumptions, making them fragile under background complexity, scale variation, and noise corruption.

Data-driven convolutional neural networks (CNNs) subsequently became the mainstream solution [17,19,20,58,59]. By learning hierarchical representations from the data, CNN-based models significantly improve robustness to background clutter and target diversity. Architectures such as DNANet [20] and UIU-Net [21] enhance small-target perception through multi-scale fusion and encoder–decoder designs. Nevertheless, CNNs inherently favor local feature aggregation, which limits their ability to capture long-range context and often leads to detail attenuation for extremely small or low-contrast targets.

To overcome locality constraints, recent works introduce Transformer-based components into IRSTD frameworks [33]. Vision Transformers (ViTs) [60] offer strong global modeling capability, and hybrid designs such as MTU-Net [61] and SCTransNet [62] demonstrate improved contextual reasoning in complex scenes. However, the high computational burden of self-attention, together with feature alignment issues between CNN and Transformer representations, restricts their efficiency and scalability for high-resolution infrared imagery.

Beyond architectural innovations, several methods incorporate physical or statistical priors to guide feature learning. TCI-Former [48] leverages heat-conduction-inspired directional attention, while PConv [63] exploits the Gaussian distribution of infrared targets to expand the effective receptive field. These studies suggest that physically motivated constraints can improve detection reliability, but they are typically embedded as isolated modules rather than being integrated throughout the feature learning pipeline.

Overall, existing IRSTD approaches reveal complementary strengths and limitations across handcrafted, CNN-based, Transformer-based, and physics-guided paradigms. Effectively integrating fine-grained local discrimination, global contextual reasoning, and physically meaningful priors within a unified framework remains an open challenge—directly motivating the design of jointly guided feature enhancement architectures.

2.2. Frequency-Domain Feature Enhancement

Frequency-domain feature enhancement has recently attracted increasing attention in infrared small target detection (IRSTD), motivated by the observation that small targets are typically characterized by localized high-frequency responses, whereas background clutter is dominated by low-frequency structures. Explicit frequency modeling, therefore, provides an effective mechanism for suppressing background interference and enhancing target saliency.

One line of research focuses on explicit frequency-domain decomposition. WTAPNet [64] replaces conventional downsampling and feature fusion with the discrete wavelet transform (DWT) and the inverse wavelet transform (IWT), mitigating information loss in deep layers and improving small-target preservation. FDConv [65] introduces Fourier-domain parameterization into convolutional kernel learning by grouping weights according to frequency bands and applying dynamic modulation, allowing frequency-diverse representations with minimal parameter overhead.

Another category emphasizes spatial–frequency joint modeling to leverage complementary information between domains. FreqFusion [66] adaptively integrates high-pass and low-pass filtering to enhance boundary consistency and structural details during upsampling. HS-FPN [67] incorporates high-frequency enhancement and spatial dependency modeling within a feature pyramid, improving responsiveness to tiny targets. FreqODEs [52] combine interactive attention in the frequency-domain with ordinary neural differential equations to suppress high-frequency noise while strengthening target discrimination, particularly in scenarios with strong target–background similarity. SFDTNet [68] further integrates self-attention in the frequency-domain and adaptive frequency selection, effectively preserving target contours while reducing background interference.

In summary, frequency-domain enhancement methods explicitly decouple and model spectral components, enabling effective clutter suppression and edge amplification. Compared with purely spatial-domain approaches, frequency-based modeling provides more globally discriminative representations and exhibits clear advantages in detecting infrared small targets under low-SNR conditions.

2.3. Advanced Paradigms for Remote Sensing Small Target Detection

In the broader context of remote sensing, small target detection remains a formidable challenge across various imaging modalities (e.g., optical, radar and infrared). The inherent characteristics of remote sensing small targets—such as extreme scale variations, lack of explicit textures, and vulnerability to dense background clutter—often lead to severe feature attenuation in deep networks. To mitigate these bottlenecks, recent research has introduced advanced paradigms that enhance model generalization and task adaptability, providing crucial methodological insights for IRSTD.

For instance, the weak features of small targets are highly sensitive to domain shifts caused by varying sensors or environmental conditions. To address such heterogeneity, MASDG [69] utilizes a multiview augmented single-source domain generalization approach, employing texture-level and style-level augmentations to ensure robust building extraction across unseen domains. This strategy of enforcing semantic-invariant representation is particularly relevant to IRSTD, where dim targets must be robustly detected against unpredictable and heterogeneous backgrounds.

Furthermore, accurately isolating small targets from complex remote sensing scenes requires highly targeted feature perception. The integration of multimodal information and task-specific prompts has shown great potential in this regard. PMTSeg [70] introduces a prompt-driven multimodal Transformer that adaptively bridges the modality and task gaps for task-adapted remote sensing image segmentation. For small target detection, such affinity approximation provides a novel perspective on using specific “prompts” (similar to our Gabor-guided directional priors) to guide the network’s attention toward target-like structures while ignoring isotropic clutter.

Additionally, maintaining the semantic consistency of small targets across different remote sensing platforms or altitudes is a major hurdle. To address the semantic discrepancies between multiple domains, Zhang et al. [71] proposed a domain-common approximation learning method for multitarget domain adaptation. By aligning inter-domain and intra-domain features through style transfer and feature approximation, this approach significantly improves the stability of instance extraction.

Drawing inspiration from both the inherent requirements of remote sensing small target detection and these robust learning strategies, our proposed FSGPNet explicitly leverages frequency-domain saliency and physical priors. By doing so, it approximates a domain-common target representation to decouple small targets from complex remote sensing clutter, ensuring stable and superior performance across diverse infrared scenes.

3. Materials and Methods

3.1. Method Overview

As illustrated in Figure 1, FSGPNet is built on the U-Net [59] architecture and introduces three key components: Frequency–Spatial Feature Enhancement Module (FSEM), Multi-Scale Global Perception module (MSGP), and Gabor Transformer Attention Module (GTAM). Given an infrared image I, four groups of FSEM and a max-pooling layer are used to extract high-level features

E_{i} \in R^{C_{i} \times (H / 2^{i}) \times (W / 2^{i})}

, where the channel dimensions are

C_{1} = 16, C_{2} = 32, C_{3} = 64

and

C_{4} = 128

. The deepest encoder feature is further processed by the MSGP module to obtain the global-aware representation. In the decoding stage, inputs from the skip connection are fed into the GTAM and then concatenated with the upsampled features from the previous decoder block [72], producing the decoder feature maps

F_{i} \in R^{C_{i} \times (H / 2^{i}) \times (W / 2^{i})}

for

i = 1, 2, 3, 4

.

The design of FSGPNet is not a simple concatenation of independent modules; rather, FSEM, MSGP, and GTAM are deeply coupled following a “Local Enhancement → Global Contextualization → Selective Reconstruction” paradigm. In the encoder, FSEM acts as the foundational processor, explicitly extracting high-frequency details while utilizing Perona–Malik Diffusion (PMD) to physically diffuse background noise, ensuring that small target signatures are not lost during downsampling. At the structural bottleneck, MSGP leverages these purified features to establish long-range spatial dependencies and model the global background context, effectively preventing high-frequency clutter from causing false alarms. Finally, during the decoding stage, GTAM serves as the ultimate selective gate. It bridges the fine-grained local cues from FSEM and the global semantics from MSGP by utilizing Gabor-guided self-attention, adaptively selecting the most discriminative frequency–spatial structures for precise target reconstruction. Together, these three modules form a closed-loop mechanism that ensures target features are preserved, contextualized, and selectively refined.

3.2. Frequency–Spatial Domain Feature Enhancement Module

As CNN depth increases, resblocks progressively lose fine edge details of small targets, causing them to be overwhelmed by background clutter. To address this issue, we propose the Frequency–Spatial Domain Feature Enhancement Module (FSEM), a multi-branch architecture designed to enhance small-target feature and suppress background noise, as shown in Figure 2.

In FSEM, the convolution operation of the residual convolutional block is replaced with PConv [63], whose radially decaying receptive field conforms to the Gaussian spatial distribution of infrared small targets, enabling focused target activation.

To further mitigate the effects of low contrast, noise and target-background similarity in infrared images, we introduce a Perona–Malik Diffusion (PMD) branch, which employs a spatially adaptive diffusion coefficient

c (| \nabla I |)

to regulate smoothing strength across regions, enabling effective noise suppression while preserving sharp boundaries and the geometric integrity of small targets.

\frac{\partial I}{\partial t} = d i v (c (| \nabla I |) \nabla I)

(1)

\begin{matrix} c (| \nabla I |) & = \frac{1}{1 + {(\frac{\nabla I}{K})}^{2}} \end{matrix}

(2)

where

I (x, y, t)

denotes the input image, and

\nabla I

is the spatial gradient.

c (| \nabla I |)

is the diffusion coefficient, which is used to control the degree of diffusion. K denotes the gradient threshold. The discrete form of the PMD equation is provided in [73], upon which we incorporate an additional regularization term based on the magnitude of the gradient:

{PMD}_{e n h} = {PMD}_{r a w} + λ | \nabla I |, | \nabla I | = \sqrt{u_{x}^{2} + u_{y}^{2} + ε}

(3)

{PMD}_{r a w} = \frac{u_{x}^{2} u_{y y} - 2 u_{x} u_{y} u_{x y} + u_{y}^{2} u_{x x}}{2 (u_{x}^{2} + u_{y}^{2} + ε)}

(4)

where

u_{x}

and

u_{y}

denote the first-order partial derivatives along the X and Y directions, respectively.

u_{x x}

and

u_{y y}

denote second-order partial derivatives, and

u_{x y}

is the mixed partial derivative.

λ

is a learnable parameter that balances the raw diffusion and the regularization of the magnitude of the gradient.

ε

is a small constant used to prevent division by zero in the denominator, which is strictly set to

1 \times 10^{- 6}

in our implementation.

The discrete PMD formulation reflects local shape variations through iso-intensity contour curvature, yielding strong responses at point-like targets or sharp edges while remaining negligible in smooth backgrounds. However, its reliance on second-order derivatives makes it sensitive to high-frequency noise, which can obscure weak targets or induce false responses. To alleviate this issue, we introduce gradient-magnitude regularization, using stable first-order contrast cues to constrain the curvature term, thereby preserving discriminative structures while improving robustness in low-contrast and noisy infrared scenes.

In infrared images, small targets mainly appear as localized high-frequency components, whereas background clutter is dominated by low-frequency structures. Conventional feature extraction methods have fixed frequency responses and are therefore easily interfered with by low-frequency background information. To overcome this limitation, we design a Dynamic High-Frequency Perception module (Dynamic HFP).

Given an input feature map

X \in R^{C \times H \times W}

, we initiate the process by mapping it to the frequency domain using the two-dimensional Discrete Cosine Transform (2D-DCT):

X_{f} = DCT (X)

(5)

In the DCT spectral domain, the zero-frequency and low-frequency components, which correspond to the slowly varying background, are naturally concentrated at the top-left corner. To suppress this interference, we define a binary high-pass mask

W (h, w)

to erase the low-frequency background clutter.

W (h, w) = \{\begin{matrix} 0, h < H \times ρ_{h}, w < W \times ρ_{w}, \\ 1, otherwise, \end{matrix}

(6)

where

ρ_{h}

and

ρ_{w}

denote the frequency ratio control parameters. Experimental results on the validation set showed that performance remained stable for

ρ \in [0.2, 0.3]

, and

ρ = 0.25

was selected as the default for its consistent robustness.

Multiplied by W to erase the low-frequency energy, the high-frequency response map is reconstructed via the Inverse Discrete Cosine Transform (2D-IDCT):

{\tilde{X}}_{f} = X_{f} ⊙ W,

(7)

X_{H F} = IDCT ({\tilde{X}}_{f})

(8)

Subsequently, the high-frequency response map

X_{H F}

is fed into two parallel paths: the Channel-wise Interaction and the Spatial-wise Dynamic Interaction.

Channel-wise Interaction: First, global statistical features are extracted by applying both max pooling and average pooling to the high-frequency response map

X_{H F}

:

P_{m a x} = MaxPool (X_{H F}), P_{avg} = AvgPool (X_{H F})

(9)

Subsequently,

P_{m a x}

and

P_{avg}

are fused and transformed through a non-linear mapping to generate the channel attention weight

A_{c}

:

A_{c} = σ ({Conv}_{2 \times 1} ({Conv}_{1 \times 1} (ReLU (P_{max})) + {Conv}_{1 \times 1} (ReLU (P_{avg}))))

(10)

where

{Conv}_{1 \times 1}

and

{Conv}_{2 \times 1}

denote 1 × 1 and 2 × 1 convolutions used for cross-channel modeling. We utilize the

ReLU (x) = max (0, x)

activation to introduce non-linearity and the Sigmoid function

σ (x) = 1 / (1 + e^{- x})

to normalize the attention weights into the range

(0, 1)

. The resulting

A_{c}

rescales the input features to emphasize channels containing significant high-frequency target signatures.

Spatial-wise Dynamic Interaction: Parallel to the channel path, the spatial-wise interaction pathway focuses on localizing high-frequency saliency. To account for the diverse morphologies of small targets across different samples, we employ a sample-adaptive dynamic convolution strategy [74].

First, we compute the mean along the channel dimension to obtain the spatial attention feature:

X_{c a} = R e s h a p e (\frac{1}{C} \sum_{i = 1}^{C} X_{H F, i})

(11)

where

X_{c a} \in R^{1 \times H \times W}

explicitly highlights the high-frequency structural residuals while suppressing low-frequency background trends.

To achieve sample-specific spatial enhancement, we generate a dynamic convolution kernel

K_{d}

based on the original input feature X. As illustrated in the implementation details, a kernel generator branch produces

K_{d} \in R^{B \times 1 \times 3 \times 3}

through global pooling and non-linear projection:

K_{d} = R e s h a p e ({Conv}_{1 \times 1} (ReLU ({Conv}_{1 \times 1} (AvgPool (X)))))

(12)

A_{s} = σ (Conv 2 d (X_{c a}, K_{d}, groups = B))

(13)

where

AvgPool

denotes global average pooling, and the output is reshaped from

[B, 3 \times 3, 1, 1]

to

[B, 1, 3, 3]

as the dynamic convolution kernel. And, B represents the current number of batches. The dynamic kernel

K_{d}

facilitates sample-wise convolution over the high-frequency response map

X_{c a}

, yielding sample-dependent spatial weight

A_{s} \in R^{1 \times H \times W}

. This allows the model to adaptively “search” for high-frequency targets based on the specific context of each individual image, which is more effective than standard isotropic convolutions.

The final output is formulated as:

E_{o u t} = σ (GroupNorm (X ⊙ A_{s} + X ⊙ A_{c}))

(14)

where ⊙ denotes element-wise multiplication.

In summary, Dynamic HFP performs explicit high-frequency enhancement at the frequency domain level. Furthermore, it leverages a dynamic convolution kernel to achieve sample-dependent feature enhancement in the spatial dimension. It effectively amplifies the high-frequency response intensity and the adaptive capability towards targets required for robust infrared small target detection.

3.3. Multi-Scale Global Perception Module

Accurate background modeling is essential for infrared small-target detection, yet conventional CNNs are limited by local receptive fields and weak global context modeling, leading to clutter interference and false alarms. To address this issue, we propose a Multi-Scale Global Perception module (MSGP), as shown in Figure 3. It combines non-local attention [24] with multi-scale dilated convolutions [28,75], enabling long-range dependency capture and multi-scale context aggregation. Given an input feature map

E \in R^{c_{3} \times h_{3} \times w_{3}}

, MSGP produces two complementary feature branches:

X_{M S G P} = C o n ν_{3 \times 3} (X + S E (X_{N L A} \otimes X_{M S D C}))

(15)

where

X_{N L A}

denotes the branch of the non-local attention feature,

X_{M S D C}

denotes the branch of the multi-scale dilated convolution feature, and

S E (\cdot)

represents the squeeze-and-excitation operation.

In the non-local attention branch, we employ Non-local Attention (NLA) [24] to compute spatial attention weights and model long-range dependencies. This mechanism captures global contextual relations between targets and background, which are crucial in infrared scenes where local cues are weak. To further strengthen pixel-level discrimination, the output of the NLA module is fed into Pixel Attention (PA) [76]. PA aggregates channel responses at each spatial position, enriching fine-grained representations and enhancing sensitivity to small target patterns.

X_{N L A} = P A (N L A (X))

(16)

Here,

N L A (\cdot)

denotes the non-local attention operator and

P A (\cdot)

denotes pixel attention.

The second branch adopts a multi-scale dilated convolution scheme with dilation rates

(k = 1, 2, 3, 4)

[77], effectively enlarging the receptive field without increasing parameter complexity. This design enables multi-scale context modeling and enhances robustness to target size variations while capturing broader structural dependencies to suppress false alarms.

X_{M S D C} = C o n c a t (D c o n v_{d = 1} (X), . . ., D c o n v_{d = 4} (X))

(17)

Here,

D c o n ν_{k}

represents dilated convolution with dilation rate d.

Finally, a Squeeze-and-Excitation (SE) module [22] is used for channel-wise recalibration, adaptively focusing on informative features and further improving detection accuracy in complex infrared scenes.

3.4. Gabor Transformer Attention Module

Conventional methods are ineffective in modeling the frequency characteristics and spatial characteristics (including orientation, size, and location) of infrared small targets. Therefore, we propose a Gabor Transformer Attention Module (GTAM), which achieves selective frequency–spatial modeling by integrating Gabor-based feature selection with a self-attention mechanism, as shown in Figure 4.

Specifically, Gabor filters provide orientation-selective and band-pass frequency responses to isolate discriminative target structures, while self-attention facilitates global context information aggregation. Their joint design enables precise target–background separation and robust localization, as illustrated in Figure 5.

The Gabor kernel is formulated as follows:

G_{λ_{g}, σ, γ, θ} (x, y) = exp (- \frac{x_{θ}^{2} + γ^{2} y_{θ}^{2}}{2 σ^{2}}) cos (2 π \frac{x_{θ}}{λ_{g}})

(18)

where

λ_{g}

denotes the wavelength controlling the feature scale,

σ

is the standard deviation of the Gaussian window to regulate spatial spread, and

γ

is the aspect ratio to adjust the directional sensitivity.

λ_{g}

,

σ

and

γ

are all learnable parameters,

θ

represents the orientation angle, and the rotated coordinates

x_{θ}

,

y_{θ}

are defined as:

x_{θ} = x cos θ + y sin θ, y_{θ} = - x sin θ + y cos θ

(19)

Infrared small targets typically exhibit abrupt intensity transitions at their boundaries, which correspond to localized high-frequency structures in the spatial domain. To emphasize these discriminative features, we introduce a phase-aware response modulation mechanism within the Gabor filtering framework. In a Gabor function, the phase of the sinusoidal carrier along the dominant orientation

x_{θ}

is defined as:

ϕ (x_{θ}) = 2 π \frac{x_{θ}}{λ_{g}}

(20)

where

λ_{g}

denotes the wavelength that determines the center spatial frequency of the Gabor filter. The gradient magnitude of the phase is given by

| \nabla ϕ | = |\frac{\partial ϕ}{\partial x_{θ}}| = \frac{2 π}{λ_{g}}

(21)

which reflects the intrinsic spatial frequency of the sinusoidal carrier. A smaller wavelength

λ_{g}

corresponds to a higher-frequency filter that is more sensitive to rapid spatial variations and fine structural details. Motivated by this property, we design a phase-enhanced response formulation to adaptively emphasize high-frequency Gabor responses. The enhanced output is defined as:

G E (x, y) = | G_{λ_{g}, σ, γ, θ} (x, y) | \cdot (1 + w_{p} | \nabla ϕ |)

(22)

where

w_{p}

is a learnable phase-enhancement weight constrained to positive through the Softplus function. This formulation effectively amplifies responses associated with higher-frequency filters, thereby improving the sensitivity of the detector to sharp edge transitions and subtle target structures commonly observed in infrared small targets. As illustrated in Figure 5, trained Gabor kernels adapt to different target characteristics in both frequency and orientation dimensions.

Given an input feature map

E \in R^{c \times h \times w}

, GTAM performs multi-directional convolutions using Gabor kernels at scale k, with orientations

θ = \frac{i π}{8}

where

i = 0, \dots, N - 1

. The results are then reshaped to form the Gabor feature matrix

D^{k} \in R^{8 c \times w h}

:

D^{k} (x, y) = R e s h a p e (C o n c a t_{i = 0, \dots, 7} (G E_{θ = i π / 8} (E)))

(23)

The Query, Key, and Value vectors are constructed as follows:

Q^{k} = W_{Q}^{k} T, K^{k} = W_{K}^{k} D^{k}, V^{k} = W_{V}^{k} D^{k}

(24)

T = R e s h a p e (E)

(25)

where

W_{Q}^{k}

,

W_{K}^{k}

and

W_{V}^{k}

denote learnable

1 \times 1

convolutions,

T \in R^{c \times h w}

. The features are normalized after convolution, and the self-attention weight

A^{k}

is computed as follows:

{\hat{Q}}^{k} = Norm (Q^{k}), {\hat{K}}^{k} = Norm (K^{k}),

(26)

A^{k} = Softmax (\frac{{\hat{Q}}^{k} {({\hat{K}}^{k})}^{T}}{\sqrt{h w}})

(27)

The scale-specific output

O u t^{k}

is defined as follows:

O u t^{k} = A^{k} V^{k}

(28)

Finally, the results in two scales are concatenated and fused using a 1×1 convolution:

F_{o u t} = R e l u (B N (C o n v_{1 \times 1} (r e s h a p e (C o n c a t (O u t^{k = 5}, O u t^{k = 7})))))

(29)

Overall, the GTAM constructs joint orientation–frequency-scale features through multi-scale Gabor convolutions and then employs a Transformer-style multi-head attention mechanism to adaptively adjust the weights across scales. This enables the model to automatically focus on the most discriminative locations, orientations, and frequencies, thereby enhancing robustness and background suppression capability.

3.5. Loss Function

In this study, we adopt the SoftloU loss function [78] for model training. In FSGPNet, the predicted output is first mapped to the [0, 1] range through a Sigmoid function. Let

I_{i, j}

denote the raw prediction at pixel location

(i, j)

, and the Sigmoid-activated probability is expressed as

p_{i, j}

representing the predicted likelihood that the pixel belongs to the target. The computation is given by:

p_{i, j} = \frac{1}{1 + e^{- I_{i, j}}}

(30)

The SoftIoU loss is designed to measure the overlap between prediction and ground truth. It is defined as:

l o s s = 1 - \frac{\sum_{i, j} p_{i, j} \cdot g_{i, j}}{\sum_{i, j} p_{i, j} + \sum_{i, j} g_{i, j} - \sum_{i, j} p_{i, j} \cdot g_{i, j}}

(31)

Here,

g_{i, j}

denotes the ground truth label at pixel

(i, j)

, where the value is typically 0 or 1. Compared to the standard loU, SoftloU is differentiable and thus suitable for gradient-based optimization. Moreover, it alleviates the severe imbalance issue between small targets and large backgrounds, making it particularly effective for infrared small target detection.

4. Experiments

4.1. Experimental Setup

Datasets: The validation experiments utilized two public remote sensing datasets: NUDT-SIRST [19] and IRSTD-1K [20], comprising 1327 and 1000 images, respectively. To ensure a fair comparison, we followed the dataset splits method defined in [20] for NUDT-SIRST (50% for training) and in [47] for IRSTD-1k (80% for training), consistent with the official splitting standards provided by the authors of the original dataset.

Evaluation Metrics: Five pixel-level metrics are employed for the quantitative assessment of detection performance. While the Intersection over Union (IoU) is susceptible to target scale effects in small-target scenarios, the normalized IoU (nIoU) is introduced. It provides a fair and stable reflection of detection accuracy even for extremely small targets by normalizing the IoU with respect to scale. Simultaneously, the F1-score is used to balance precision and recall, the Probability of Detection (PD) characterizes detection sensitivity, and the False Alarm Rate (Fa) quantifies the algorithm’s robustness in complex backgrounds.

Implementation Details: The proposed method uses U-Net as baseline. The original convolutional modules in the encoder are replaced with FSEM to improve feature extraction capability. The number of down-sampling layers is 4, and the base channel is set to 16. FSGPNet was trained from scratch (without using pre-trained weights). Each input image is normalized using dataset-specific mean and variance statistics, with patch sizes set to 256 × 256 for NUDT-SIRST and 512 × 512 for IRSTD-1K. During training, random flipping and rotation are applied with a 0.5 probability to enhance the model’s spatial invariance. No brightness changes or noise addition is employed, ensuring the spectral characteristics of the infrared small targets are strictly preserved. Model weights and biases were initialized using the Kaiming initialization method [79]. The segmentation threshold was set to 0.5. SoftIoU loss [78] was used as the optimization objective, and the Adam optimizer [80] was employed for parameter updates, with an initial learning rate of 0.001. The learning rate was gradually decayed to 1 ×

10^{- 5}

using a cosine annealing schedule [81]. The training used a batch size of 8 for 400 epochs on the both datasets. The proposed FSGPNet is implemented with PyTorch 2.1.2 on a NVIDIA RTX GeForce 4090 GPU, an Intel Xeon(R) Platinum 8352V CPU (Intel Corporation, Santa Clara, CA, USA), and 90 GB of RAM.

4.2. Quantitative Results

We evaluate FSGPNet against representative state-of-the-art methods on the IRSTD-1K and NUDT-SIRST datasets. Quantitative results are summarized in Table 1 and Table 2, including detection accuracy metrics (IoU, nIoU and F1-score), detection performance (PD and Fa) and computational efficiency indicators (parameters, FLOPs and latency).

On IRSTD-1K (Table 1), FSGPNet achieves the best IoU (68.14%), nIoU (69.08%), and F1-score (81.07%) while also producing the lowest false alarm rate (Fa = 9.68 ×

10^{- 6}

). These results demonstrate its strong capability for accurate target segmentation and effective background suppression. Although its PD is slightly lower than that of ISTDU [82], FSGPNet consistently outperforms Transformer-based methods such as SCTransNet [62] in most segmentation-related metrics, indicating more precise target delineation under complex cluttered backgrounds.

From an efficiency perspective, FSGPNet has moderate computational complexity compared with extremely lightweight CNN-based methods such as ACM [19] and ALCNet [83], which achieve faster inference with significantly fewer parameters. However, those lightweight models suffer from noticeable performance degradation in segmentation accuracy and false alarm suppression. In contrast, FSGPNet provides a more favorable trade-off by significantly improving detection accuracy while maintaining a manageable computational cost.

On NUDT-SIRST (Table 2), FSGPNet achieves the best IoU (94.93%), nIoU (94.96%), and F1-score (97.39%), along with the lowest false alarm rate (Fa = 2.96 ×

10^{- 6}

). Compared with GSFANet [84], which also integrates frequency-domain and spatial-domain representations, FSGPNet improves IoU by over 1.51% while maintaining a relatively compact parameter scale. In addition, its substantially lower Fa compared with lightweight detectors such as RDIAN [85] further confirms the robustness of the proposed frequency–spatial jointly guided framework against background interference.

Overall, although FSGPNet is not the most lightweight model among all compared approaches, it consistently delivers superior segmentation accuracy and strong false alarm suppression on both datasets, demonstrating an effective balance between detection performance and computational complexity.

Table 1. Quantitative comparison with other state-of-the-art methods on the IRSTD-1K dataset (best results highlighted in bold, second-best results underlined).

Methods	Params (M)	Flops (G)	Lat. (ms)	IoU (%)	nIoU (%)	F1 (%)	Pd (%)	Fa ( $10^{- 6}$ )
ACM [19]	0.40	1.61	2.89	60.33	57.22	75.28	93.27	68.49
ALCNet [83]	0.43	1.51	2.73	58.08	58.74	73.51	92.93	74.45
AGPCNet [86]	0.22	14.87	17.81	67.73	65.94	79.85	93.26	14.52
DNANet [20]	4.70	57.04	131.71	65.73	66.43	79.21	89.56	12.34
UIUNet [21]	50.54	217.70	33.87	65.69	65.94	79.31	91.25	13.47
RDIAN [85]	0.22	14.87	17.88	59.93	61.05	74.96	87.21	33.35
ISTDU [82]	2.75	31.78	22.19	65.01	64.81	78.78	93.94	26.44
SCTransNet [62]	11.19	40.46	60.36	66.58	66.51	80.38	93.27	22.17
DATransNet [72]	2.18	32.73	17.97	65.69	64.74	79.31	90.91	17.52
GSFANet [84]	6.12	20.46	123.03	65.23	65.77	79.00	89.56	15.01
FM-Net [55]	3.61	14.17	54.56	66.99	68.39	80.25	93.60	15.54
FSGPNet	4.75	50.02	49.74	68.14	69.08	81.07	91.58	9.68

Table 2. Quantitative comparison with other state-of-the-art methods on the NUDT-SIRST dataset (best results in bold, second-best results underlined).

Methods	Params (M)	FLOPs (G)	Lat. (ms)	IoU (%)	nIoU (%)	F1 (%)	Pd (%)	Fa ( $10^{- 6}$ )
ACM [19]	0.40	0.40	2.94	64.86	66.91	78.66	96.83	29.14
ALCNet [83]	0.43	0.38	2.75	61.13	60.61	38.99	97.25	29.07
AGPCNet [86]	0.22	3.72	4.10	88.79	88.48	93.88	97.33	8.14
DNANet [20]	4.70	14.26	26.50	93.08	94.39	96.41	99.15	9.08
UIUNet [21]	50.54	54.43	25.30	90.51	90.58	95.02	98.84	8.34
RDIAN [85]	0.22	3.72	4.41	82.42	83.70	90.36	98.84	14.85
ISTDU [82]	2.75	7.94	18.74	91.76	92.22	95.70	98.52	3.77
SCTransNet [62]	11.19	10.12	32.50	94.28	94.54	97.06	99.37	3.49
DATransNet [72]	2.18	8.18	13.43	93.86	93.68	96.82	98.41	4.78
GSFANet [84]	2.97	5.25	35.20	93.42	93.75	96.60	99.26	7.86
FM-Net [55]	3.61	3.54	26.71	94.56	94.34	96.30	98.62	4.76
FSGPNet	4.75	12.51	25.46	94.93	94.96	97.39	98.84	2.96

Figure 6 shows the Receiver Operating Characteristic (ROC) curves of various advanced algorithms on the NUDT-SIRST dataset. Evidently, the ROC curve of FSGPNet surpasses those of all other algorithms. To further verify its robustness and stability, we also plotted and compared ROC curves on the IRSTD-1K dataset, as illustrated in Figure 7. By selecting an appropriate threshold interval, FSGPNet achieves the highest accuracy while maintaining the lowest false alarm rate.

4.3. Qualitative Evaluation

Figure 8 presents qualitative results from eight representative algorithms on the NUDT-SIRST and IRSTD-1K datasets. Among them, algorithms such as UIU-Net [21] and AGCPNet [86] often generate several false alarms and missed detections. Furthermore, even when targets are detected, their contours are typically unclear, which hinders accurate further identification of the target type. Among the learning-based algorithms, FSGPNet achieves precise target detection and effective contour segmentation. In the last row of Figure 8, only our method accurately separates the shape of the aircraft from the similar cloud background. This can be attributed to the fact that our method not only learns the characteristics of the target but also adapts to different target-scale variations, thereby accurately capturing the edge continuity of the target. In the second-to-last row of Figure 8, apart from our method and ACM [19] network, all other methods failed to detect the small target obscured behind tree branches, while ACM also generated false alarms on the branch areas. This can be attributed to their limitations in constructing only local contrast information and their lack of capability to establish long-range dependencies within the image.

4.4. Ablation Study

To validate the efficacy of each proposed component, we conducted comprehensive ablation experiments on NUDT-SIRST and IRSTD-1K. As shown in Table 3 and Table 4, the baseline model employs a U-Net architecture constructed solely from residual convolutional modules.

First, FSEM is introduced, which specifically targets the optimization of local feature representations. As a result, the probability of detection (Pd) significantly improved from 84.85% to 90.57% on IRSTD-1K, demonstrating that strengthening local details effectively boosts the model’s sensitivity to dim small targets. Subsequently, MSGP was incorporated to complement the local operations. By employing a multi-scale global perception mechanism, MSGP enables the model to capture the spatial distribution characteristics of targets more comprehensively. This is evidenced by the steady growth in segmentation metrics, with IoU and nIoU increasing 1.33% and 0.97% on NUDT-SIRST, respectively. Finally, GTAM was integrated into the architecture. While GTAM shares the goal of refining local features to ensure precision, its addition proved critical for suppressing background noise. The final FSGPNet achieves state-of-the-art performance: it yields the highest IoU of 68.14% and Pd of 91.58% while drastically reducing the false alarm rate to

9.68 \times 10^{- 6}

on IRSTD-1K. This confirms that the synergistic combination of local feature optimization (FSEM, GTAM) and global context perception (MSGP) allows the model to achieve robust detection with minimal false alarms.

Ablation Study of FSEM: To dissect the internal mechanisms of the frequency–spatial feature enhancement module (FSEM), we conducted an ablation study on its core components: Pinwheel Convolution (Pconv), Dynamic High Frequency Perception (DHFP), and Perona–Malik Diffusion (PMD). As shown in Table 5, while Pconv alone improves feature extraction by aligning with the Gaussian characteristics of targets (yielding an IoU of 94.70%), it results in a high false alarm rate of

7.63 \times 10^{- 6}

. The subsequent addition of DHFP mitigates this by enhancing high-frequency details. And, PMD utilizes adaptive diffusion to suppress noise while preserving essential edge details. Notably, the complete integration of all three components achieves the optimal performance, with IoU reaching 94.93% and Fa dropping significantly to

2.96 \times 10^{- 6}

.

Ablation Study of MSGP: To investigate the internal mechanisms of the Multi-Scale Global Perception (MSGP) module, we conducted an ablation study on its components: Non-Local Attention (NLA), Multi-Scale Dilated Convolution (MSDC), and Squeeze-and-Excitation (SE). The results in Table 6 show that while individual components offer limited gains, their integration yields optimal performance. Mechanistically, NLA enhances global modeling by capturing long-range dependencies, while MSDC enriches multi-scale feature extraction through varied dilation rates. The SE mechanism further refines the expression of critical channel features. The synergistic interaction of these components significantly boosts the model’s segmentation accuracy, enabling FSGPNet to achieve its highest IoU of 94.93%.

Ablation Study of GTAM: Table 7 presents the ablation results of the Gabor Transformer Attention Module (GTAM). GTAM performs joint spatial–frequency selection via self-attention over multi-directional Gabor responses, where the number of directions N controls directional frequency selectivity and the filter scale k determines scale adaptability. Starting from the baseline, introducing directional Gabor filtering (N = 4, k = [5]) significantly improves IoU from 92.51% to 93.67% while sharply reducing Fa from

5.79 \times 10^{- 6}

to

2.55 \times 10^{- 6}

. Moreover, adopting a multi-scale design (k = [5, 7]) yields additional gains, achieving an IoU of 94.52% and the lowest Fa of

2.16 \times 10^{- 6}

, indicating increased robustness to variation in the target scale. Increasing the number of directions to N = 8 further strengthens the directional sensitivity, and the combined multi-directional and multi-scale configuration (N = 8, k = [5, 7]) achieves the best overall performance, with the highest IoU (94.93%) and nIoU (94.96%). These results demonstrate that the joint modeling of directional and scale-aware frequency information enables GTAM to achieve a favorable trade-off between detection accuracy and false alarms.

4.5. Robustness

In practical infrared imaging systems, captured images are inevitably degraded by sensor thermal noise and atmospheric scattering, which often overwhelm the weak signatures of small targets. To comprehensively evaluate the robustness of the proposed FSGPNet, we conduct simulation experiments by introducing two typical types of noise into the IRSTD-1K test set: additive Gaussian noise to simulate thermal noise and multiplicative speckle noise to simulate electronic noise. Both noise types are configured with a zero mean and a variance of 0.001 [84].

To establish a comprehensive and convincing benchmark, we carefully select three state-of-the-art methods for comparison: DNANet [20], representing the typical paradigm of U-shape cascaded channel and spatial attention networks; SCTransNet [62], serving as a representative of Transformer-like architectures; and GSFANet [84], which represents networks based on spatial–frequency domain fusion. Quantitative results are summarized in Table 8 using nIoU, F1-score, Pd and Fa.

As shown in Table 8, extreme noise interference inevitably leads to a performance drop across all methods; however, FSGPNet demonstrates exceptional resilience. Under Gaussian noise, FSGPNet achieves the highest nIoU (62.00%) and F1-score (74.46%), maintaining a solid balance between target detection and false alarm suppression compared to models like SCTransNet. Furthermore, under speckle noise, our method clearly dominates, yielding the best performance across all four metrics, including an impressive nIoU of 67.41%. This superior performance underscores the significant advantages of FSGPNet in handling complex interference.

To further unveil the internal mechanism of the proposed method, Figure 9 presents a visualization of the decoder feature maps. During the initial decoding stage, the model is inevitably disturbed by various noise patterns, leading to inconsistent feature map outputs. Nevertheless, as the decoding process progresses, FSGPNet continuously disentangles the target from background interference and noise while maintaining consistent background modeling. Consequently, the outputs of the subsequent decoder layers gradually align with the actual target, effectively eliminating false detections.

This remarkable stability can be attributed to the synergistic effect of our frequency–spatial joint guidance. Specifically, the dynamic frequency masking in the HFP module effectively truncates high-frequency broadband noise, while the Gabor-guided structural priors prevent the model from overfitting to isolated noise pixels, ensuring stable target decoupling even in heavily corrupted environments. In conclusion, quantitative and qualitative analyses demonstrate that our proposed FSGPNet possesses outstanding anti-noise capabilities and robust generalization.

5. Discussion

In this section, we provide a deeper theoretical analysis of the proposed FSGPNet, focusing on the synergy between physics-informed modeling, spectral analysis, and directional perception. We also contrast our approach with existing dual-domain IRSTD methods to highlight our architectural advantages.

5.1. The Rationale for Frequency–Spatial Joint Modeling

In the IRSTD task, the spatial domain provides crucial local intensity and geometric information. However, complex backgrounds such as cloud edges or man-made structures often exhibit high local contrast, which easily leads to high false-alarm rates for purely spatial-domain methods. By introducing the frequency domain, we can treat the infrared image as a distribution of different spectral components. Since small targets typically occupy the high-frequency spectrum while continuous backgrounds reside in the low-frequency region, joint modeling allows the network to:

Decouple targets from clutter: Frequency saliency acts as a “global filter” to highlight potential target areas before spatial refinement.
Enhance edge integrity: While spatial smoothing might blur tiny targets, frequency-domain enhancement helps maintain the high-frequency “sharpness” essential for pinpointing dim targets.

5.2. Comparative Analysis with SOTA Dual-Domain Methods

Current frequency–spatial methods in IRSTD like AFSENet [54], GSFANet [84], SFDTNet [68], and FM-Net [55] have pioneered the use of frequency information. These methods can be categorized into three paradigms:

Parallel-Branch Enhancement (AFSENet [54], FM-Net [55]): These methods often use frequency transforms (such as DCT or Wavelet) as auxiliary branches to supplement spatial features. However, they often treat the two domains as independent entities, leading to a “loose coupling” that may fail to suppress non-stationary noise in extreme low-SNR scenarios.
Frequency-Guided Downsampling (GSFANet [84]): These methods utilize frequency components to replace traditional pooling, aiming to preserve small target details. Although effective, they lack the ability to adaptively model the directional features of targets.
Transform-Domain Filtering (SFDTNet [68]): These methods perform global filtering in the frequency domain. A common drawback is that aggressive high-frequency suppression to remove noise may inadvertently blur the sharp edges of tiny targets.

Compared with the above methods, FSGPNet introduces a more “deeply coupled” and “physically constrained” architecture:

Physics-informed Frequency Learning: Unlike AFSENet [54] or FM-Net [55], which are purely data-driven, our FSEM module integrates the Perona–Malik Diffusion (PMD) equation. By combining PMD’s edge-preserving smoothing with DCT’s frequency saliency, FSGPNet achieves precise and clean feature extraction-enhancing targets while suppressing background clutter based on physical continuity.
Directional Perception: While SFDTNet [68] uses isotropic frequency masks and GSFANet uses wavelets, our GTAM module leverages the directional selectivity of Gabor filters integrated with Transformer attention. This mimics the human visual system’s ability to distinguish between target-like textures and complex background edges [87].
Hierarchical Joint Guidance Strategy: Instead of fusing domains only at the bottleneck or the final layer, our frequency and spatial information interact at every stage of the U-shaped encoder–decoder. This ensures that the global frequency-domain saliency and the local spatial-domain geometry are mutually refined.

The core philosophy of FSGPNet is to move beyond purely data-driven black-box learning by introducing explicit mathematical and signal-processing constraints. By combining the physical constraints of PMD, the global frequency awareness of DCT, and the directional perception of Gabor filters, our method achieves a superior balance between detection sensitivity and false-alarm suppression.

6. Conclusions

In this paper, we developed a frequency–spatial domain jointly guided perceptual network (FSGPNet) to overcome the limitations inherent in infrared remote sensing imagery. A key contribution is the dual-domain strategy that leverages FSEM, MSGP and GTAM to enhance target perceptibility and minimize false alarms by learning discriminative features across domains. Extensive experiments conducted on public benchmark datasets demonstrate that the proposed method achieves significant improvements over state-of-the-art methods in both accuracy and robustness. The proposed method also provides a general framework for integrating information from the frequency and spatial domains, which can be extended to other challenging detection tasks in remote sensing and computer vision. Future work will explore the application of this framework to multi-modal data and real-time detection scenarios.

Author Contributions

Conceptualization, Y.H. and M.Y.; methodology, Y.H. and M.Y.; software, Y.H., M.Y. and B.L.; validation, B.L., J.L. and C.J.; formal analysis, Y.H. and B.L.; investigation, Y.H. and M.Y.; resources, W.C. and T.Z.; data curation, J.L. and C.J.; writing—original draft preparation, Y.H. and M.Y.; writing—review and editing, J.L., W.C. and T.Z.; visualization, Y.H.; supervision, T.Z.; project administration, T.Z.; funding acquisition, W.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Program of the Chinese Academy of Sciences (KGFZD-145-23-05-03). And the APC was funded by Innovation Project Fund of SITP.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Teutsch, M.; Krüger, W. Classification of small boats in infrared images for maritime surveillance. In Proceedings of the 2010 International Waterside Security Conference; IEEE: Piscataway, NJ, USA, 2010; pp. 1–7. [Google Scholar]
Wei, Z.; Cong, M.; Wang, L. Algorithms for optical weak small targets detection and tracking: Review. In Proceedings of the International Conference on Neural Networks and Signal Processing, Nanjing, China, 14–17 December 2003. [Google Scholar]
Zhao, M.; Li, W.; Li, L.; Ma, P.; Cai, Z.; Tao, R. Three-order tensor creation and tucker decomposition for infrared small-target detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
Zhu, H.; Liu, S.; Deng, L.; Li, Y.; Xiao, F. Infrared small target detection via low-rank tensor completion with top-hat regularization. IEEE Trans. Geosci. Remote Sens. 2019, 58, 1004–1016. [Google Scholar] [CrossRef]
Qin, Y.; Bruzzone, L.; Gao, C.; Li, B. Infrared small target detection based on facet kernel and random walker. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7104–7118. [Google Scholar] [CrossRef]
Bai, X.; Bi, Y. Derivative entropy-based contrast measure for infrared small-target detection. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2452–2466. [Google Scholar] [CrossRef]
Sun, Y.; Yang, J.; An, W. Infrared dim and small target detection via multiple subspace learning and spatial-temporal patch-tensor model. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3737–3752. [Google Scholar] [CrossRef]
Marvasti, F.S.; Mosavi, M.R.; Nasiri, M. Flying small target detection in IR images based on adaptive toggle operator. IET Comput. Vis. 2018, 12, 527–534. [Google Scholar] [CrossRef]
Zhang, L.; Peng, Z. Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sens. 2019, 11, 382. [Google Scholar] [CrossRef]
Han, J.; Liu, S.; Qin, G.; Zhao, Q.; Zhang, H.; Li, N. A local contrast method combined with adaptive background estimation for infrared small target detection. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1442–1446. [Google Scholar] [CrossRef]
Rivest, J.F.; Fortin, R. Detection of dim targets in digital infrared imagery by morphological image processing. Opt. Eng. 1996, 35, 1886–1893. [Google Scholar] [CrossRef]
Chen, C.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581. [Google Scholar] [CrossRef]
Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the Signal and Data Processing of Small Targets 1999; SPIE: Bellingham, WA, USA, 1999; Volume 3809, pp. 74–83. [Google Scholar]
Han, J.; Ma, Y.; Zhou, B.; Fan, F.; Liang, K.; Fang, Y. A robust infrared small target detection algorithm based on human visual system. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2168–2172. [Google Scholar]
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 950–959. [Google Scholar]
Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758. [Google Scholar] [CrossRef] [PubMed]
Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376. [Google Scholar] [CrossRef]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
Guan, S.; Zhou, D.; Wang, X. Infrared Small Target Detection Based on Hierarchical Terrace Contrast Measure. IEEE Access 2024, 12, 92268–92280. [Google Scholar] [CrossRef]
Zhang, K.; Yang, K.; Li, S.; Chen, H.B. A difference-based local contrast method for infrared small target detection under complex background. IEEE Access 2019, 7, 105503–105513. [Google Scholar] [CrossRef]
Yang, B.; Zhang, X.; Zhang, J.; Luo, J.; Zhou, M.; Pi, Y. EFLNet: Enhancing feature learning network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11. [Google Scholar] [CrossRef]
Lin, F.; Bao, K.; Li, Y.; Zeng, D.; Ge, S. Learning contrast-enhanced shape-biased representations for infrared small target detection. IEEE Trans. Image Process. 2024, 33, 3047–3058. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. Hcf-net: Hierarchical context fusion network for infrared small object detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME); IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
Zhong, S.; Zhang, F.; Duan, J. Context-guided reverse attention network with multiscale aggregation for infrared small target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 9725–9734. [Google Scholar] [CrossRef]
Yang, Z.; Yu, H.; Zhang, J.; Tang, Q.; Mian, A. Deep learning based infrared small object segmentation: Challenges and future directions. Inf. Fusion 2025, 118, 103007. [Google Scholar] [CrossRef]
Xu, W.; Ding, Z.; Wang, Z.; Cui, Z.; Hu, Y.; Jiang, F. Think Locally, Act Globally: A Frequency-Spatial Fusion Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4109917. [Google Scholar] [CrossRef]
Duan, C.; Hu, B.; Liu, W.; Ma, T.; Ma, Q.; Wang, H. Infrared small target detection method based on frequency domain clutter suppression and spatial feature extraction. IEEE Access 2023, 11, 85549–85560. [Google Scholar] [CrossRef]
Zhao, D.; Sun, H.; Hu, Y.; Bai, X. Frequency-Gradient Collaborative Network with Channel Correction and Background Guidance for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5006615. [Google Scholar] [CrossRef]
Zhu, Y.; Ma, Y.; Fan, F.; Huang, J.; Yao, Y.; Zhou, X.; Huang, R. Towards robust infrared small target detection via frequency and spatial feature fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 2001115. [Google Scholar] [CrossRef]
Gao, X.; Zhang, Y.; Zhang, L.; Jiang, Y.; Xi, Y.; Tan, F.; Hou, Q. Infrared small target detection algorithm based on filter kernel combination optimization learning method. Infrared Phys. Technol. 2024, 139, 105346. [Google Scholar] [CrossRef]
Xu, Y.; Shao, A.; Kong, X.; Wu, J.; Chen, Q.; Gu, G.; Wan, M. Infrared small target detection based on sub-maximum filtering and local intensity weighted gradient measure. IEEE Sens. J. 2024, 24, 22236–22248. [Google Scholar] [CrossRef]
Liu, R.; Liu, Q.; Wang, X.; Fu, Y. GLCANet: Context Attention for Infrared Small Target Detection. In Proceedings of the CAAI International Conference on Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2023; pp. 244–255. [Google Scholar]
Zhang, T.; Cao, S.; Pu, T.; Peng, Z. AGPCNet: Attention-guided pyramid context networks for infrared small target detection. arXiv 2021, arXiv:2111.03580. [Google Scholar] [CrossRef]
Liu, J.; Zhang, J.; Wei, Y.; Zhang, L. Infrared small target detection based on multidirectional gradient. IEEE Geosci. Remote Sens. Lett. 2022, 20, 1–5. [Google Scholar] [CrossRef]
Li, Y.; Li, Z.; Li, W.; Liu, Y. Infrared small target detection based on gradient-intensity joint saliency measure. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7687–7699. [Google Scholar] [CrossRef]
Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 877–886. [Google Scholar]
Chen, T.; Tan, Z.; Chu, Q.; Wu, Y.; Liu, B.; Yu, N. Tci-former: Thermal conduction-inspired transformer for infrared small target detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Montréal, QC, Canada, 16–23 February 2024; Volume 38, pp. 1201–1209. [Google Scholar]
Ma, T.; Cheng, K.; Chai, T.; Wu, Y.; Zhou, H. An Wavelet Steered network for efficient infrared small target detection. Infrared Phys. Technol. 2025, 148, 105850. [Google Scholar] [CrossRef]
Ma, Q.; Deng, S.; Li, B.; Zhu, Z.; Song, Z.; Li, X.; Hu, H. DWTFreqNet: Infrared Small Target Detection via Wavelet-Driven Frequency Matching and Saliency-Difference Optimization. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5007815. [Google Scholar] [CrossRef]
Zhuang, S.; Hou, Y.; Qi, M.; Wang, D. FAA-Net: A Frequency-Aware Attention Network for Single-Frame Infrared Small Target Detection. IEEE Trans. Instrum. Meas. 2025, 74, 4514216. [Google Scholar] [CrossRef]
Chen, T.; Ye, Z. FreqODEs: Frequency neural ODE networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5005912. [Google Scholar] [CrossRef]
Li, Y.; Wang, L.; Chen, S. Frequency-Spatial Interaction Reinforcement Paradigm for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5008915. [Google Scholar] [CrossRef]
Li, S.; Liu, Z.; Wang, W.; Li, Q. Adaptive Frequency Separation Enhancement Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5642613. [Google Scholar] [CrossRef]
Liu, Y.; Lin, Z.; Li, B.; Liu, T.; An, W. FM-Net: Frequency-Aware Masked-Attention Network for Infrared Small Target Detection. Remote Sens. 2025, 17, 2264. [Google Scholar] [CrossRef]
Perona, P.; Malik, J. Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 12, 629–639. [Google Scholar] [CrossRef]
Gabor, D. Theory of communication. Part 1: The analysis of information. J. Inst. Electr.-Eng.-Part III Radio Commun. Eng. 1946, 93, 429–441. [Google Scholar] [CrossRef]
McIntosh, B.; Venkataramanan, S.; Mahalanobis, A. Infrared target detection in cluttered environments by maximization of a target to clutter ratio (TCR) metric using a convolutional neural network. IEEE Trans. Aerosp. Electron. Syst. 2020, 57, 485–496. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
Dosovitskiy, A. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Wu, T.; Li, B.; Luo, Y.; Wang, Y.; Xiao, C.; Liu, T.; Yang, J.; An, W.; Guo, Y. MTU-Net: Multilevel TransUNet for space-based infrared tiny ship detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
Yuan, S.; Qin, H.; Yan, X.; Akhtar, N.; Mian, A. Sctransnet: Spatial-channel cross transformer network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped convolution and scale-based dynamic loss for infrared small target detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 10–14 November 2025; Volume 39, pp. 9202–9210. [Google Scholar]
He, H.; Wan, M.; Xu, Y.; Kong, X.; Liu, Z.; Chen, Q.; Gu, G. WTAPNet: Wavelet Transform-based Augmented Perception Network for Infrared Small Target Detection. IEEE Trans. Instrum. Meas. 2024, 73, 5037217. [Google Scholar] [CrossRef]
Chen, L.; Gu, L.; Li, L.; Yan, C.; Fu, Y. Frequency Dynamic Convolution for Dense Image Prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 30178–30188. [Google Scholar]
Chen, L.; Fu, Y.; Gu, L.; Yan, C.; Harada, T.; Huang, G. Frequency-aware feature fusion for dense image prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10763–10780. [Google Scholar] [CrossRef]
Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; He, J.; Ji, B.; Guo, J. HS-FPN: High frequency and spatial perception FPN for tiny object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 10–14 November 2025; Volume 39, pp. 6896–6904. [Google Scholar]
Liu, Y.; Tu, B.; Liu, B.; He, Y.; Li, J.; Plaza, A. Spatial frequency domain transformation for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5634916. [Google Scholar] [CrossRef]
Liu, Y.; Liu, Y.; Liu, K.; Huang, Y.; Tang, C.; Zhou, W.; Chen, Z.; Xiang, W.; Zhang, H. MASDG: Multiview Augmented Single-Source Domain Generalization Method for Robust Remote Sensing Building Extraction. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15. [Google Scholar] [CrossRef]
Liu, K.; Yan, X.; Liu, Y.; Tang, C.; Zhan, Y.; Luo, W.; Zhou, W.; Zhang, H. PMTSeg: Prompt-Driven Multimodal Transformer for Task-Adapted Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4705615. [Google Scholar] [CrossRef]
Zhang, F.; Liu, K.; Liu, Y.; Wang, C.; Zhou, W.; Zhang, H.; Wang, L. Multitarget domain adaptation building instance extraction of remote sensing imagery with domain-common approximation learning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
Hu, C.; Huang, Y.; Li, K.; Zhang, L.; Long, C.; Zhu, Y.; Pu, T.; Peng, Z. Datransnet: Dynamic attention transformer network for infrared small target detection. IEEE Geosci. Remote Sens. Lett. 2025, 22, 7001005. [Google Scholar] [CrossRef]
Chen, T.; Tan, Z.; Gong, T.; Chu, Q.; Liu, B.; Yu, N. Feature preservation and shape cues assist infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5006412. [Google Scholar] [CrossRef]
Xiong, Z.; Zhou, F.; Wu, F.; Yuan, S.; Fu, M.; Peng, Z.; Yang, J.; Dai, Y. DRPCA-Net: Make Robust PCA Great Again for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–16. [Google Scholar] [CrossRef]
Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI conference on artificial intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11908–11915. [Google Scholar]
Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
Rahman, M.A.; Wang, Y. Optimizing intersection-over-union in deep neural networks for image segmentation. In Proceedings of the International Symposium on Visual Computing; Springer: Berlin/Heidelberg, Germany, 2016; pp. 234–244. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
Hou, Q.; Zhang, L.; Tan, F.; Xi, Y.; Zheng, H.; Li, N. ISTDU-Net: Infrared Small-Target Detection U-Net. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824. [Google Scholar] [CrossRef]
Deng, C.; Zhao, Z.; Xu, X.; Xia, Y.; Li, J.; Plaza, A. Gsfanet: Global spatial-frequency attention network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5007017. [Google Scholar] [CrossRef]
Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention-guided pyramid context networks for detecting infrared small target under complex background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261. [Google Scholar] [CrossRef]
Daugman, J.G. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A 1985, 2, 1160–1169. [Google Scholar] [CrossRef]

Figure 1. The overall architecture of our proposed frequency–spatial domain jointly guided perceptual network (FSGPNet).

Figure 2. The proposed frequency–spatial domain feature enhancement module (FSEM).

Figure 3. The proposed Multi-Scale Global Perception module (MSGP).

Figure 4. The proposed Gabor Transformer Attention Module (GTAM).

Figure 5. Initial parameters and trained parameters of Gabor filters.

Figure 6. ROC curves on NUDT-SIRST.

Figure 7. ROC curves on IRSTD-1K.

Figure 8. Visual results obtained using different IRSTD methods on the NUDT-SIRST and IRSTD-1K datasets. Circles in red, green, and yellow represent correctly detected targets, miss detection, and false alarms, respectively. (a) Input. (b) ACM. (c) ISTDU-Net. (d) DNANet. (e) UIUNet. (f) AGPCNet. (g) FSGPNet(Ours). (h) Ground truth.

Figure 9. Visualization of the deep feature maps and output results from the decoder. (a,b) Image with additive Gaussian noise; (c,d) Image with multiplicative speckle noise.

Table 3. Ablation studies of main model components on the NUDT-SIRST dataset (best results highlighted in bold).

FSEM	MSGP	GTAM	Params (M)	FLOPs (G)	Lat. (ms)	IoU (%)	nIoU (%)	F1 (%)	Pd (%)	Fa ( $10^{- 6}$ )
			0.87	2.09	2.18	90.94	90.01	95.23	96.40	4.27
✓			1.71	3.74	19.41	91.42	91.97	95.85	98.52	5.77
✓	✓		1.92	3.78	23.26	92.75	92.94	96.23	99.37	4.48
✓	✓	✓	4.75	12.51	26.93	94.93	94.96	97.39	98.84	2.96

Table 4. Ablation studies of main model components on the IRSTD-1K dataset (best results highlighted in bold).

FSEM	MSGP	GTAM	Params (M)	FLOPs (G)	Lat. (ms)	IoU (%)	nIoU (%)	F1 (%)	Pd (%)	Fa ( $10^{- 6}$ )
			0.87	8.35	4.40	64.02	63.10	77.64	84.85	26.10
✓			1.71	14.93	30.12	66.41	67.28	79.80	90.57	18.54
✓	✓		1.92	15.09	33.84	67.07	68.61	79.97	91.25	15.05
✓	✓	✓	4.75	50.02	50.32	68.14	69.08	81.07	91.58	9.68

Table 5. Quantitative results of FSEM on the NUDT-SIRST dataset (best results highlighted in bold).

Pconv	DHFP	PMD	IoU (%)	nIoU (%)	F1 (%)	Pd (%)	Fa ( $10^{- 6}$ )
			93.80	94.00	96.80	98.94	2.25
✓			94.70	93.31	98.52	96.21	7.63
✓	✓		94.62	94.06	99.05	96.71	4.25
✓	✓	✓	94.93	94.96	97.39	98.84	2.96

Table 6. Quantitative results of MSGP on the NUDT-SIRST dataset (best results highlighted in bold).

NLA	MSDC	SE	IoU (%)	nIoU (%)	F1 (%)	Pd (%)	Fa ( $10^{- 6}$ )
			93.97	94.16	96.89	98.73	1.52
✓			94.15	94.04	98.73	98.83	2.32
✓	✓		94.56	94.38	98.84	98.99	1.88
✓	✓	✓	94.93	94.96	97.39	98.84	2.96

Table 7. Quantitative results of GTAM on the NUDT-SIRST dataset (best results highlighted in bold).

N = 4	N = 8	k = [5]	k = [5, 7]	IoU (%)	nIoU (%)	F1 (%)	Pd (%)	Fa ( $10^{- 6}$ )
				92.51	92.76	97.99	96.11	5.79
✓		✓		93.67	93.91	98.94	96.73	2.55
✓			✓	94.52	94.76	99.15	97.18	2.16
	✓	✓		93.57	94.57	99.15	96.68	3.08
	✓		✓	94.93	94.96	97.39	98.84	2.96

Table 8. Quantitative comparison of noise robustness on the IRSTD-1K dataset (best results highlighted in bold).

Method	Gaussian Noise				Speckle Noise
Method	nIoU (%)	F1 (%)	Pd (%)	Fa ( $10^{- 6}$ )	nIoU (%)	F1 (%)	Pd (%)	Fa ( $10^{- 6}$ )
DNANet [20]	59.62	74.39	84.85	30.67	63.64	76.80	86.87	20.63
SCTransNet [62]	56.51	68.05	89.56	110.80	64.89	78.31	90.24	38.30
GSFANet [84]	58.32	74.04	84.85	57.81	64.84	78.84	89.56	24.01
FSGPNet	62.00	74.46	87.54	38.32	67.41	79.63	90.91	19.74

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Han, Y.; Ye, M.; Liu, B.; Li, J.; Jia, C.; Cui, W.; Zhang, T. Frequency–Spatial Domain Jointly Guided Perceptual Network for Infrared Small Target Detection. Remote Sens. 2026, 18, 1000. https://doi.org/10.3390/rs18071000

AMA Style

Han Y, Ye M, Liu B, Li J, Jia C, Cui W, Zhang T. Frequency–Spatial Domain Jointly Guided Perceptual Network for Infrared Small Target Detection. Remote Sensing. 2026; 18(7):1000. https://doi.org/10.3390/rs18071000

Chicago/Turabian Style

Han, Yeteng, Minrui Ye, Bohan Liu, Jie Li, Chaoxian Jia, Wennan Cui, and Tao Zhang. 2026. "Frequency–Spatial Domain Jointly Guided Perceptual Network for Infrared Small Target Detection" Remote Sensing 18, no. 7: 1000. https://doi.org/10.3390/rs18071000

APA Style

Han, Y., Ye, M., Liu, B., Li, J., Jia, C., Cui, W., & Zhang, T. (2026). Frequency–Spatial Domain Jointly Guided Perceptual Network for Infrared Small Target Detection. Remote Sensing, 18(7), 1000. https://doi.org/10.3390/rs18071000

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Frequency–Spatial Domain Jointly Guided Perceptual Network for Infrared Small Target Detection

Highlights

Abstract

1. Introduction

2. Related Work

2.1. Infrared Small Target Detection

2.2. Frequency-Domain Feature Enhancement

2.3. Advanced Paradigms for Remote Sensing Small Target Detection

3. Materials and Methods

3.1. Method Overview

3.2. Frequency–Spatial Domain Feature Enhancement Module

3.3. Multi-Scale Global Perception Module

3.4. Gabor Transformer Attention Module

3.5. Loss Function

4. Experiments

4.1. Experimental Setup

4.2. Quantitative Results

4.3. Qualitative Evaluation

4.4. Ablation Study

4.5. Robustness

5. Discussion

5.1. The Rationale for Frequency–Spatial Joint Modeling

5.2. Comparative Analysis with SOTA Dual-Domain Methods

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI