3.1. Method Overview
As illustrated in Figure 1, FSGPNet is built on the U-Net [59] architecture and introduces three key components: the Frequency–Spatial Feature Enhancement Module (FSEM), the Multi-Scale Global Perception module (MSGP), and the Gabor Transformer Attention Module (GTAM). Given an infrared image $I$, four groups of FSEM with max-pooling layers extract hierarchical encoder features $F_i$ ($i = 1, \dots, 4$), whose channel dimensions $C_i$ increase with depth. The deepest encoder feature is further processed by the MSGP module to obtain a global-aware representation. In the decoding stage, features from the skip connections are fed into the GTAM and then concatenated with the upsampled features from the previous decoder block [72], producing the decoder feature maps $D_i$.
The design of FSGPNet is not a simple concatenation of independent modules; rather, FSEM, MSGP, and GTAM are deeply coupled following a “Local Enhancement → Global Contextualization → Selective Reconstruction” paradigm. In the encoder, FSEM acts as the foundational processor, explicitly extracting high-frequency details while utilizing Perona–Malik Diffusion (PMD) to physically diffuse background noise, ensuring that small target signatures are not lost during downsampling. At the structural bottleneck, MSGP leverages these purified features to establish long-range spatial dependencies and model the global background context, effectively preventing high-frequency clutter from causing false alarms. Finally, during the decoding stage, GTAM serves as the ultimate selective gate. It bridges the fine-grained local cues from FSEM and the global semantics from MSGP by utilizing Gabor-guided self-attention, adaptively selecting the most discriminative frequency–spatial structures for precise target reconstruction. Together, these three modules form a closed-loop mechanism that ensures target features are preserved, contextualized, and selectively refined.
3.2. Frequency–Spatial Domain Feature Enhancement Module
As CNN depth increases, residual blocks progressively lose the fine edge details of small targets, causing them to be overwhelmed by background clutter. To address this issue, we propose the Frequency–Spatial Domain Feature Enhancement Module (FSEM), a multi-branch architecture designed to enhance small-target features and suppress background noise, as shown in Figure 2.
In FSEM, the convolution operation of the residual convolutional block is replaced with PConv [
63], whose radially decaying receptive field conforms to the Gaussian spatial distribution of infrared small targets, enabling focused target activation.
To further mitigate the effects of low contrast, noise, and target–background similarity in infrared images, we introduce a Perona–Malik Diffusion (PMD) branch, which employs a spatially adaptive diffusion coefficient $c(\lVert \nabla I \rVert)$ to regulate the smoothing strength across regions, enabling effective noise suppression while preserving sharp boundaries and the geometric integrity of small targets:
$$\frac{\partial I}{\partial t} = \mathrm{div}\big(c(\lVert \nabla I \rVert)\,\nabla I\big), \qquad c(\lVert \nabla I \rVert) = \frac{1}{1 + \left(\lVert \nabla I \rVert / K\right)^{2}},$$
where $I$ denotes the input image, $\nabla I$ is the spatial gradient, $c(\cdot)$ is the diffusion coefficient that controls the degree of diffusion, and $K$ denotes the gradient threshold. The discrete form of the PMD equation is provided in [73], upon which we incorporate an additional regularization term based on the gradient magnitude:
$$f_{\mathrm{PMD}} = \frac{I_{xx} I_y^{2} - 2 I_x I_y I_{xy} + I_{yy} I_x^{2}}{I_x^{2} + I_y^{2} + \epsilon} + \alpha \sqrt{I_x^{2} + I_y^{2}},$$
where $I_x$ and $I_y$ denote the first-order partial derivatives along the $x$ and $y$ directions, respectively, $I_{xx}$ and $I_{yy}$ denote the second-order partial derivatives, and $I_{xy}$ is the mixed partial derivative. $\alpha$ is a learnable parameter that balances the raw diffusion term and the gradient-magnitude regularization, and $\epsilon$ is a small constant that prevents division by zero in the denominator and is fixed in our implementation.
The discrete PMD formulation reflects local shape variations through iso-intensity contour curvature, yielding strong responses at point-like targets or sharp edges while remaining negligible in smooth backgrounds. However, its reliance on second-order derivatives makes it sensitive to high-frequency noise, which can obscure weak targets or induce false responses. To alleviate this issue, we introduce gradient-magnitude regularization, using stable first-order contrast cues to constrain the curvature term, thereby preserving discriminative structures while improving robustness in low-contrast and noisy infrared scenes.
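The regularized discrete PMD response can be sketched in a few lines of NumPy. This is a minimal illustration of the curvature-plus-gradient-magnitude formulation described above, not the paper's implementation: `alpha` is learnable in the network but fixed here, and finite differences stand in for the learned layers.

```python
import numpy as np

def pmd_response(img, alpha=0.1, eps=1e-6):
    """Discrete PMD-style response: iso-intensity contour curvature
    regularized by the first-order gradient magnitude (illustrative)."""
    Iy, Ix = np.gradient(img)        # np.gradient returns (d/drow, d/dcol)
    Ixy, Ixx = np.gradient(Ix)       # d(Ix)/dy and d(Ix)/dx
    Iyy, _ = np.gradient(Iy)         # d(Iy)/dy
    # Curvature-like term: strong at point targets, ~0 in smooth regions.
    curv = (Ixx * Iy**2 - 2.0 * Ix * Iy * Ixy + Iyy * Ix**2) / (
        Ix**2 + Iy**2 + eps)
    # Gradient-magnitude regularization stabilizes the second-order term.
    return curv + alpha * np.sqrt(Ix**2 + Iy**2)
```

On a flat background the response vanishes, while a point-like target produces a localized peak, matching the behavior described above.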
In infrared images, small targets mainly appear as localized high-frequency components, whereas background clutter is dominated by low-frequency structures. Conventional feature extraction methods have fixed frequency responses and are therefore easily interfered with by low-frequency background information. To overcome this limitation, we design a Dynamic High-Frequency Perception module (Dynamic HFP).
Given an input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$, we first map it to the frequency domain using the two-dimensional Discrete Cosine Transform (2D-DCT):
$$F = \mathrm{DCT}_{2D}(X).$$
In the DCT spectral domain, the zero-frequency and low-frequency components, which correspond to the slowly varying background, are concentrated at the top-left corner. To suppress this interference, we define a binary high-pass mask $\mathbf{W}$ that erases the low-frequency background clutter:
$$\mathbf{W}(u, v) = \begin{cases} 0, & u \le \rho_h H \ \text{and} \ v \le \rho_w W, \\ 1, & \text{otherwise}, \end{cases}$$
where $\rho_h$ and $\rho_w$ denote the frequency-ratio control parameters. Experimental results on the validation set showed that performance remained stable over a range of ratios, and the default setting was selected for its consistent robustness.
The spectrum is multiplied by $\mathbf{W}$ to erase the low-frequency energy, and the high-frequency response map $X_h$ is reconstructed via the Inverse Discrete Cosine Transform (2D-IDCT):
$$X_h = \mathrm{IDCT}_{2D}(F \odot \mathbf{W}).$$
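The DCT high-pass step above can be sketched with an orthonormal DCT-II basis matrix. This is an illustrative single-channel version; the ratio parameter `rho` plays the role of the frequency-ratio controls, and its value here is arbitrary rather than the paper's tuned default.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size (n, n)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0, :] = np.sqrt(1.0 / n)
    return D

def high_pass_dct(x, rho=0.25):
    """Zero out the top-left (low-frequency) DCT coefficients and
    reconstruct the high-frequency response map via the inverse DCT."""
    H, W = x.shape
    Dh, Dw = dct_matrix(H), dct_matrix(W)
    F = Dh @ x @ Dw.T                             # 2D-DCT
    mask = np.ones((H, W))
    mask[: int(rho * H), : int(rho * W)] = 0.0    # binary high-pass mask
    return Dh.T @ (F * mask) @ Dw                 # 2D-IDCT of masked spectrum
```

A constant (pure low-frequency) input is erased entirely, while setting `rho = 0` recovers the input exactly, confirming the transform pair is lossless.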
Subsequently, the high-frequency response map is fed into two parallel paths: the Channel-wise Interaction and the Spatial-wise Dynamic Interaction.
Channel-wise Interaction: First, global statistical features are extracted by applying both max pooling and average pooling to the high-frequency response map $X_h$:
$$F_{\max} = \mathrm{MaxPool}(X_h), \qquad F_{\mathrm{avg}} = \mathrm{AvgPool}(X_h).$$
Subsequently, $F_{\max}$ and $F_{\mathrm{avg}}$ are fused and transformed through a non-linear mapping to generate the channel attention weight $A_c$:
$$A_c = \sigma\!\left(\mathrm{Conv}_{1 \times 1}\!\left(\delta\!\left(\mathrm{Conv}_{2 \times 1}\!\left([F_{\max};\, F_{\mathrm{avg}}]\right)\right)\right)\right),$$
where $\mathrm{Conv}_{1 \times 1}$ and $\mathrm{Conv}_{2 \times 1}$ denote the $1 \times 1$ and $2 \times 1$ convolutions used for cross-channel modeling. We utilize the activation $\delta(\cdot)$ to introduce non-linearity and the Sigmoid function $\sigma(\cdot)$ to normalize the attention weights into the range $(0, 1)$. The resulting $A_c$ rescales the input features to emphasize channels containing significant high-frequency target signatures.
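The channel-wise interaction can be illustrated with a small NumPy sketch. The convolutions of the paper are reduced to matrix products, a ReLU stands in for the unspecified non-linearity, and the weight shapes `w1`, `w2` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(xh, w1, w2):
    """Channel-wise interaction over a high-frequency map xh of shape
    (C, H, W): stacked max/avg statistics are fused (cf. the 2x1 conv),
    passed through a non-linearity, and squashed to per-channel gates."""
    f_max = xh.max(axis=(1, 2))            # (C,) max-pooled statistics
    f_avg = xh.mean(axis=(1, 2))           # (C,) average-pooled statistics
    fused = np.stack([f_max, f_avg], 0)    # (2, C), the two rows to be fused
    z = np.maximum(0.0, (w1 @ fused).ravel())   # fuse rows, ReLU stand-in
    a = sigmoid(w2 @ z)                    # channel attention weights in (0, 1)
    return xh * a[:, None, None], a        # rescaled features and the gates
```

The gates lie strictly in (0, 1), so informative high-frequency channels are emphasized without ever inverting or unboundedly amplifying a channel.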
Spatial-wise Dynamic Interaction: Parallel to the channel path, the spatial-wise interaction pathway focuses on localizing high-frequency saliency. To account for the diverse morphologies of small targets across different samples, we employ a sample-adaptive dynamic convolution strategy [
74].
First, we compute the mean along the channel dimension to obtain the spatial attention feature:
$$F_s = \mathrm{Mean}_{c}(X_h),$$
where $F_s$ explicitly highlights the high-frequency structural residuals while suppressing low-frequency background trends.
To achieve sample-specific spatial enhancement, we generate a dynamic convolution kernel $K_d$ from the original input feature $X$. As illustrated in the implementation details, a kernel-generator branch produces $K_d$ through global pooling and a non-linear projection:
$$K_d = \mathrm{Reshape}\!\left(\mathrm{MLP}\!\left(\mathrm{GAP}(X)\right)\right),$$
where $\mathrm{GAP}(\cdot)$ denotes global average pooling, and the output is reshaped from $\mathbb{R}^{B \times k^2}$ to $\mathbb{R}^{B \times 1 \times k \times k}$ as the dynamic convolution kernel, with $B$ the batch size. The dynamic kernel $K_d$ performs sample-wise convolution over the high-frequency response map $X_h$, yielding the sample-dependent spatial weight $A_s$. This allows the model to adaptively “search” for high-frequency targets based on the specific context of each image, which is more effective than standard isotropic convolutions.
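A single-sample NumPy sketch of this dynamic path follows. The generator is reduced to one linear map `w_gen` (an assumption standing in for the pooling-plus-projection branch), and the convolution is written as an explicit loop for clarity.

```python
import numpy as np

def dynamic_spatial_weight(x, xh, w_gen, k=3):
    """Sample-adaptive spatial interaction: a k x k kernel is generated
    from globally pooled statistics of the input x, then convolved with
    the channel-mean of the high-frequency map xh. Shapes and the
    generator w_gen ((k*k, C)) are illustrative, not the paper's."""
    C, H, W = x.shape
    g = x.mean(axis=(1, 2))                 # global average pooling, (C,)
    kern = (w_gen @ g).reshape(k, k)        # sample-specific dynamic kernel
    fs = xh.mean(axis=0)                    # spatial attention feature, (H, W)
    pad = k // 2
    fp = np.pad(fs, pad)
    out = np.zeros_like(fs)
    for i in range(H):                      # naive 'same' convolution
        for j in range(W):
            out[i, j] = np.sum(fp[i:i + k, j:j + k] * kern)
    return 1.0 / (1.0 + np.exp(-out))       # sigmoid -> spatial weights in (0, 1)
```

Because the kernel depends on the pooled statistics of each input, two different images yield two different kernels, which is the sample-wise behavior the text describes.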
The final output is formulated as:
$$Y = X \odot A_c \odot A_s,$$
where $\odot$ denotes element-wise multiplication.
In summary, Dynamic HFP performs explicit high-frequency enhancement in the frequency domain and leverages a dynamic convolution kernel to achieve sample-dependent feature enhancement in the spatial dimension, effectively amplifying the high-frequency responses and the adaptability to targets required for robust infrared small target detection.
3.3. Multi-Scale Global Perception Module
Accurate background modeling is essential for infrared small-target detection, yet conventional CNNs are limited by local receptive fields and weak global context modeling, leading to clutter interference and false alarms. To address this issue, we propose a Multi-Scale Global Perception module (MSGP), as shown in
Figure 3. It combines non-local attention [
24] with multi-scale dilated convolutions [
28,
75], enabling long-range dependency capture and multi-scale context aggregation. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, MSGP produces two complementary feature branches that are fused as:
$$Y = \mathrm{SE}\!\left([F_{nl};\, F_{ms}]\right),$$
where $F_{nl}$ denotes the non-local attention branch, $F_{ms}$ denotes the multi-scale dilated convolution branch, and $\mathrm{SE}(\cdot)$ represents the squeeze-and-excitation operation.
In the non-local attention branch, we employ Non-local Attention (NLA) [
24] to compute spatial attention weights and model long-range dependencies. This mechanism captures global contextual relations between targets and background, which are crucial in infrared scenes where local cues are weak. To further strengthen pixel-level discrimination, the output of the NLA module is fed into Pixel Attention (PA) [
76]. PA aggregates channel responses at each spatial position, enriching fine-grained representations and enhancing sensitivity to small target patterns.
$$F_{nl} = \mathrm{PA}\!\left(\mathrm{NLA}(X)\right),$$
where $\mathrm{NLA}(\cdot)$ denotes the non-local attention operator and $\mathrm{PA}(\cdot)$ denotes pixel attention.
The second branch adopts a multi-scale dilated convolution scheme with multiple dilation rates [77], effectively enlarging the receptive field without increasing parameter complexity. This design enables multi-scale context modeling and enhances robustness to target size variations while capturing broader structural dependencies to suppress false alarms:
$$F_{ms} = \sum_{d} \mathrm{DConv}_{d}(X),$$
where $\mathrm{DConv}_{d}(\cdot)$ represents a dilated convolution with dilation rate $d$.
Finally, a Squeeze-and-Excitation (SE) module [
22] is used for channel-wise recalibration, adaptively focusing on informative features and further improving detection accuracy in complex infrared scenes.
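The SE recalibration at the end of MSGP follows the standard squeeze-and-excitation recipe, sketched here in NumPy. The two-layer bottleneck weights `w1`, `w2` and the reduction ratio are assumptions for illustration.

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """Squeeze-and-Excitation over a (C, H, W) feature map: squeeze by
    global average pooling, excite through a two-layer bottleneck
    (ReLU then sigmoid), and rescale each channel by its gate.
    w1: (C//r, C), w2: (C, C//r) with an assumed reduction ratio r."""
    s = x.mean(axis=(1, 2))                     # squeeze: per-channel statistics
    e = np.maximum(0.0, w1 @ s)                 # excitation bottleneck, ReLU
    a = 1.0 / (1.0 + np.exp(-(w2 @ e)))         # per-channel gates in (0, 1)
    return x * a[:, None, None]                 # channel-wise recalibration
```

Since every gate lies in (0, 1), recalibration can only attenuate uninformative channels relative to informative ones, never amplify a channel beyond its input magnitude.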
3.4. Gabor Transformer Attention Module
Conventional methods struggle to jointly model the frequency characteristics and spatial characteristics (including orientation, size, and location) of infrared small targets. We therefore propose a Gabor Transformer Attention Module (GTAM), which achieves selective frequency–spatial modeling by integrating Gabor-based feature selection with a self-attention mechanism, as shown in Figure 4.
Specifically, Gabor filters provide orientation-selective and band-pass frequency responses to isolate discriminative target structures, while self-attention facilitates global context information aggregation. Their joint design enables precise target–background separation and robust localization, as illustrated in
Figure 5.
The Gabor kernel is formulated as follows:
$$g(x, y) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\!\left(2\pi \frac{x'}{\lambda}\right),$$
where $\lambda$ denotes the wavelength controlling the feature scale, $\sigma$ is the standard deviation of the Gaussian window regulating the spatial spread, and $\gamma$ is the aspect ratio adjusting the directional sensitivity. $\lambda$, $\sigma$, and $\gamma$ are all learnable parameters, $\theta$ represents the orientation angle, and the rotated coordinates $x'$, $y'$ are defined as:
$$x' = x \cos\theta + y \sin\theta, \qquad y' = -x \sin\theta + y \cos\theta.$$
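Constructing such a kernel is straightforward; the NumPy sketch below builds the real Gabor kernel from the formulation above. The parameter values are illustrative defaults, whereas in GTAM $\lambda$, $\sigma$, and $\gamma$ are learned.

```python
import numpy as np

def gabor_kernel(size=11, lam=4.0, sigma=2.0, gamma=0.5, theta=0.0):
    """Real Gabor kernel: a Gaussian envelope modulated by a cosine
    carrier along the rotated x-axis. lam, sigma, gamma, theta mirror
    the learnable parameters of the text; values here are illustrative."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)      # rotated coordinate x'
    yr = -x * np.sin(theta) + y * np.cos(theta)     # rotated coordinate y'
    env = np.exp(-(xr**2 + (gamma * yr)**2) / (2.0 * sigma**2))
    return env * np.cos(2.0 * np.pi * xr / lam)
```

At the kernel center both the envelope and the carrier equal one, and for $\theta = 0$ the kernel is symmetric about the horizontal axis, as expected from the closed form.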
Infrared small targets typically exhibit abrupt intensity transitions at their boundaries, which correspond to localized high-frequency structures in the spatial domain. To emphasize these discriminative features, we introduce a phase-aware response modulation mechanism within the Gabor filtering framework. In a Gabor function, the phase of the sinusoidal carrier along the dominant orientation $\theta$ is defined as:
$$\phi(x') = \frac{2\pi x'}{\lambda},$$
where $\lambda$ denotes the wavelength that determines the center spatial frequency of the Gabor filter. The gradient magnitude of the phase is given by
$$\lVert \nabla \phi \rVert = \frac{2\pi}{\lambda},$$
which reflects the intrinsic spatial frequency of the sinusoidal carrier. A smaller wavelength $\lambda$ corresponds to a higher-frequency filter that is more sensitive to rapid spatial variations and fine structural details. Motivated by this property, we design a phase-enhanced response formulation to adaptively emphasize high-frequency Gabor responses. The enhanced output is defined as:
$$\tilde{g}(x, y) = g(x, y)\left(1 + \beta \lVert \nabla \phi \rVert\right),$$
where $\beta$ is a learnable phase-enhancement weight constrained to be positive through the Softplus function. This formulation amplifies the responses of higher-frequency filters, thereby improving the sensitivity of the detector to sharp edge transitions and the subtle target structures commonly observed in infrared small targets. As illustrated in Figure 5, the trained Gabor kernels adapt to different target characteristics in both the frequency and orientation dimensions.
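The phase-aware modulation reduces to a per-filter scalar gain, since $\lVert \nabla \phi \rVert = 2\pi/\lambda$ is constant for a given wavelength. The sketch below is our reading of that combination rule, with `beta_raw` as the raw learnable scalar fed through Softplus.

```python
import numpy as np

def phase_enhanced(response, lam, beta_raw=0.0):
    """Scale a Gabor response by the carrier's phase-gradient magnitude
    2*pi/lam, weighted by a Softplus-constrained positive coefficient.
    The exact combination is illustrative, not a verbatim reproduction."""
    w = np.log1p(np.exp(beta_raw))      # Softplus -> strictly positive weight
    grad_phase = 2.0 * np.pi / lam      # intrinsic spatial frequency of carrier
    return response * (1.0 + w * grad_phase)
```

Filters with smaller wavelengths (higher frequencies) receive proportionally larger gains, which is the intended bias toward sharp edge transitions.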
Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, GTAM performs multi-directional convolutions using Gabor kernels at scale $k$, with orientations $\theta_n = \frac{n\pi}{N}$, where $n = 0, 1, \dots, N-1$. The results are then reshaped to form the Gabor feature matrix $G_k$:
$$G_k = \mathrm{Reshape}\!\left(\left[X * g_{k,\theta_0},\, \dots,\, X * g_{k,\theta_{N-1}}\right]\right).$$
The Query, Key, and Value vectors are constructed as follows:
$$Q = W_Q(G_k), \qquad K = W_K(G_k), \qquad V = W_V(G_k),$$
where $W_Q$, $W_K$, and $W_V$ denote learnable $1 \times 1$ convolutions. The features are normalized after convolution, and the self-attention weight $A$ is computed as follows:
$$A = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right).$$
The scale-specific output $Y_k$ is defined as follows:
$$Y_k = A V.$$
Finally, the results at the two scales are concatenated and fused using a $1 \times 1$ convolution:
$$Y = \mathrm{Conv}_{1 \times 1}\!\left([Y_{k_1};\, Y_{k_2}]\right).$$
Overall, the GTAM constructs joint orientation–frequency-scale features through multi-scale Gabor convolutions and then employs a Transformer-style multi-head attention mechanism to adaptively adjust the weights across scales. This enables the model to automatically focus on the most discriminative locations, orientations, and frequencies, thereby enhancing robustness and background suppression capability.
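The attention step over a Gabor feature matrix can be sketched as a single-head scaled dot-product attention in NumPy. The $1 \times 1$ convolutions of GTAM reduce to the matrix projections `Wq`, `Wk`, `Wv` in this sketch; multi-head splitting and normalization are omitted for brevity.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gabor_self_attention(G, Wq, Wk, Wv):
    """Single-head self-attention over a Gabor feature matrix G of shape
    (N, d): rows index positions/orientations, d is the embedding size."""
    Q, K, V = G @ Wq, G @ Wk, G @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)   # attention weights
    return A @ V                                           # re-weighted values
```

Each row of the attention matrix is a probability distribution over the Gabor responses, so the output adaptively mixes locations, orientations, and frequencies, which is the selective behavior GTAM relies on.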