2.2.1. Overall Architecture
As illustrated in Figure 2, the overall pipeline consists of four main stages. First, given the infrared image, multi-scale features F are extracted by a Hiera-based image encoder and then passed through a convolution to serve as the input for the following stages. Second, the Contrast Query Generator (CQG) employs a filtering module to transform the multi-scale feature maps into high-pass spatial features, and then performs saliency sampling on them to obtain dynamic queries. These queries serve as the input to the two-branch Transformer decoder, which localizes targets and produces deep features. Third, these features are forwarded to a multi-scale decoder, where each scale uses independent convolutions with non-shared parameters to capture scale-specific mapping patterns. After spatial resizing and channel reduction, the decoder produces a set of multi-scale predicted masks for subsequent stages. Fourth, a cascade of Radial High-Pass Modulator (RHPM) modules is integrated across the mask generation stages to progressively suppress low-frequency clutter and produce the final high-resolution segmentation mask.
2.2.2. Contrast Query Generator
The performance of transformer-based decoders is heavily dependent on the quality of object queries. For infrared small target detection, directly using randomly initialized queries often leads to slow convergence in the early stages, especially for weak infrared targets. To mitigate this issue, we propose a Contrast Query Generator (CQG) that converts high-frequency salient responses into location-aware query tokens, facilitating more accurate target localization.
To extract frequency-domain information, the CQG first uses a filtering module to suppress low-frequency components of the input multi-scale features, yielding high-pass spatial features. These features are then compressed into a 2D energy map S via channel-wise averaging, so that high-energy regions in S indicate potential target locations.
To capture all potential targets, we then flatten S into a vector and select the Top-K points with the largest values as the candidate set. Each candidate consists of the normalized spatial coordinate (e.g., scaled to [0,1]) of the k-th point and its corresponding saliency intensity.
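The candidate selection described above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the function name `topk_candidates`, the `(C, H, W)` array layout, and the division by `W - 1` and `H - 1` to normalize coordinates are assumptions made for the example.

```python
import numpy as np

def topk_candidates(high_pass, k):
    """Return normalized (x, y) coordinates and saliency values of the k
    strongest responses in the channel-averaged energy map."""
    # high_pass: (C, H, W) array standing in for the high-pass features.
    energy = high_pass.mean(axis=0)            # energy map S, shape (H, W)
    h, w = energy.shape
    flat = energy.ravel()                      # flatten S into a vector
    idx = np.argsort(flat)[::-1][:k]           # indices of the k largest values
    rows, cols = np.unravel_index(idx, (h, w))
    # Normalize to [0, 1]; max(..., 1) guards against a one-pixel axis.
    coords = np.stack([cols / max(w - 1, 1), rows / max(h - 1, 1)], axis=1)
    return coords, flat[idx]
```

Sorting the full vector is the simplest way to express the selection; for large maps, a partial selection such as `np.argpartition` would avoid the full sort.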
For each candidate point, its spatial coordinates are mapped via sine and cosine functions to generate a D-dimensional positional embedding vector, where D is the feature channel dimension, thereby facilitating the subsequent two-branch interaction.
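One common way to realize such a sine and cosine mapping is sketched below. The frequency schedule, the `temperature` value, and the ordering of the sine and cosine channels are assumptions borrowed from standard Transformer positional encodings; the text does not specify them.

```python
import numpy as np

def sincos_embed_2d(coords, dim, temperature=10000.0):
    """Map normalized (x, y) coordinates of shape (K, 2) to (K, dim) embeddings."""
    assert dim % 4 == 0, "dim must be divisible by 4 (sin and cos for x and y)"
    n_freq = dim // 4
    # Geometrically spaced frequencies (assumed schedule).
    freqs = temperature ** (-np.arange(n_freq) / n_freq)              # (n_freq,)
    angles = 2.0 * np.pi * coords[:, :, None] * freqs                 # (K, 2, n_freq)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)   # (K, 2, 2*n_freq)
    return emb.reshape(coords.shape[0], dim)
```

Each coordinate contributes `dim / 2` channels, so nearby points receive similar embeddings while distant points are separated across several frequencies.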
Unlike traditional Top-K queries that directly use multiple tokens, we use the saliency values to compute normalized aggregation weights with the Softmax function. The positional embeddings of all candidate points are then weighted and summed to obtain the aggregated position features. This weighted aggregation can be interpreted as estimating the centroid of the salient high-frequency energy, allowing the query to focus on the most likely target region. Finally, the aggregated position features are fused with the learnable content query via a linear projection, yielding the final query vector.
The final query vector, which encodes both task-related semantic context and a positional prior that indicates likely target locations, is then fed into the two-branch Transformer as the decoder input, where it is updated by self-attention (SA), cross-attention (CA), and multilayer perceptron (MLP) modules.
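The construction of the decoder input query (Softmax weighting of the saliency values, weighted sum of the positional embeddings, and fusion with the content query) can be sketched as follows. The fusion rule used here, adding a linearly projected position feature to the content query, is an assumption, as are the names `build_query`, `proj_w`, and `proj_b`; the text only states that a linear projection is involved.

```python
import numpy as np

def build_query(pos_emb, saliency, content_query, proj_w, proj_b):
    """Aggregate K positional embeddings into one query vector.

    pos_emb: (K, D), saliency: (K,), content_query: (D,)
    proj_w: (D, D) and proj_b: (D,) form a hypothetical linear projection.
    """
    z = saliency - saliency.max()                  # shift for numerical stability
    weights = np.exp(z) / np.exp(z).sum()          # Softmax weights, sum to 1
    pos_feat = weights @ pos_emb                   # weighted sum, shape (D,)
    query = content_query + proj_w @ pos_feat + proj_b   # assumed fusion rule
    return query, weights
```

Because the weights come from a Softmax, a single dominant saliency peak makes the aggregated feature close to that point's embedding, whereas several comparable peaks yield a blend of their embeddings.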
After several iterations, the refined features undergo spatial upsampling to reconstruct fine-grained resolution. Specifically, a hypernetwork MLP transforms the updated query into a set of channel-wise scaling weights, which serve as modulation factors. Applying these factors to the upsampled features through channel-wise scaling infuses the dense features with target-aware guidance. Finally, the modulated features pass through a projection layer for channel reduction and smoothing, yielding the deep features. In this formulation, ⊗ denotes element-wise multiplication along the channel dimension between the output of the hypernetwork MLP and the spatially upsampled features.
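The query-driven modulation can be illustrated with a small NumPy sketch. Everything beyond the three named operations (hypernetwork MLP, upsampling, channel-wise multiplication, followed by a projection) is assumed: the one-hidden-layer ReLU MLP, nearest-neighbour upsampling, the 1×1 projection expressed as a matrix, and all parameter names are hypothetical.

```python
import numpy as np

def modulate(features, query, w1, b1, w2, b2, proj, scale=2):
    """Scale upsampled features channel-wise with weights predicted from the query.

    features: (C, H, W), query: (D,)
    w1: (Hd, D), b1: (Hd,), w2: (C, Hd), b2: (C,)  hypothetical hypernetwork MLP
    proj: (C_out, C)  hypothetical 1x1 projection for channel reduction
    """
    hidden = np.maximum(w1 @ query + b1, 0.0)       # ReLU hidden layer
    chan_w = w2 @ hidden + b2                       # one scaling weight per channel
    # Nearest-neighbour upsampling by an integer factor.
    up = features.repeat(scale, axis=1).repeat(scale, axis=2)
    modulated = chan_w[:, None, None] * up          # channel-wise multiplication
    return np.tensordot(proj, modulated, axes=1)    # (C_out, scale*H, scale*W)
```

The query influences the dense features only through `C` scalars, which keeps the interaction cheap compared with applying attention at full resolution.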
To ensure precise target perception, we adopt a resolution recovery strategy that captures fine-grained features by establishing skip-convs between the two stages. Specifically, the framework uses Up-Blocks to progressively restore the spatial resolution of the deep features, integrating feature feedback from the skip-convs at each stage. This process generates a series of multi-scale predicted masks, which serve as the input for subsequent processes.
2.2.3. Radial High-Pass Modulator
Small targets in infrared images usually appear as high-frequency transients, while large-scale backgrounds, such as cloud layers and sea surfaces, are mainly concentrated in low-frequency components. To better separate weak target responses from dominant low-frequency clutter, we introduce the RHPM module, which performs adaptive radial high-pass filtering in the frequency domain. It adaptively determines the cutoff radius according to the energy distribution of the feature map, enabling scene-dependent suppression of low-frequency interference.
As shown in Figure 3, the cascaded RHPM takes the input image and the multi-scale masks from Section 2.2.2 as inputs.
Specifically, the mask from the deepest layer is used for initial spatial filtering to highlight potential targets. The resulting enhanced features are then projected into the frequency domain through a two-dimensional Fast Fourier Transform (FFT), yielding the frequency feature map.
The process of high-pass filtering is illustrated in Figure 4.
To evaluate the frequency distribution, we compute the power spectrum as the squared amplitude of the frequency feature map at each pixel. The total spectral energy is then obtained by aggregating the energy across all frequency components.
To achieve adaptive background suppression, we introduce stage-specific hyperparameters (the energy filtering rates) and then optimize the cutoff radius so that the low-frequency energy accounts for at least the prescribed fraction of the total energy. In the objective for the first RHPM stage, the low-frequency energy is the cumulative energy within a radius r centered at the spectral origin. We set the low-frequency components within the cutoff region to zero while retaining the high-frequency components. This yields a dynamic filtering mask that eliminates low-frequency components while preserving high-frequency details.
Considering that the amount of low-frequency background diminishes as the decoder transitions to shallower layers, we decrease the energy filtering rate across stages, using a separate setting for each level, with the levels indexed from shallow to deep. The sensitivity analysis of this hyperparameter is provided in Section 3.3.4.
The enhanced high-frequency features are then reconstructed via the Inverse Fast Fourier Transform (iFFT). The subsequent stages follow a similar procedure to yield the final output; the detailed workflow is illustrated in Figure 3.
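A single RHPM stage, as described above, can be sketched as an FFT, an energy-based choice of cutoff radius, a binary radial mask, and an inverse FFT. This is a minimal sketch under stated assumptions: the cutoff is taken as the smallest discrete radius whose enclosed energy reaches the fraction `alpha` of the total, the spectrum is centred with `fftshift`, and the name `radial_high_pass` and the value of `alpha` are placeholders rather than the paper's settings.

```python
import numpy as np

def radial_high_pass(x, alpha):
    """Zero the smallest centred disc holding at least `alpha` of the spectral
    energy; return the filtered image and the chosen cutoff radius."""
    h, w = x.shape
    spec = np.fft.fftshift(np.fft.fft2(x))         # zero frequency moved to the centre
    power = np.abs(spec) ** 2                      # power spectrum
    total = power.sum()                            # total spectral energy
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)    # distance from the spectral origin
    order = np.argsort(radius, axis=None)          # frequencies from low to high radius
    r_sorted = radius.ravel()[order]
    cum = np.cumsum(power.ravel()[order])          # cumulative energy versus radius
    # First position where the cumulative energy reaches alpha * total.
    i = min(np.searchsorted(cum, alpha * total), cum.size - 1)
    cutoff = r_sorted[i]
    mask = (radius > cutoff).astype(float)         # 0 inside the cutoff disc, 1 outside
    filtered = np.fft.ifft2(np.fft.ifftshift(spec * mask)).real
    return filtered, cutoff
```

In the cascade, `x` would be the image after spatial filtering with the corresponding predicted mask, and `alpha` would be set per stage. A smooth, low-frequency background concentrates its energy near the origin, so it is removed with a small cutoff radius while isolated bright points survive.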
Finally, the spatial prediction P is refined using the frequency-domain features generated by the third-stage RHPM to yield the final output, where P is obtained by cascading the multi-scale prediction maps from the intermediate decoder stage.
2.2.4. Model Optimization
To address extreme foreground–background imbalance and weak local contrast in infrared imagery, we design a Contrast and Shape-Aware Adaptive (CSA) loss. We adopt the Scale and Location Sensitive (SLS) loss as the primary supervision, and use the Target-Driven Adaptive (TDA) loss as auxiliary supervision to improve robustness under challenging conditions.
Primary Supervision. The SLS loss optimizes IoU while incorporating polar coordinate constraints to refine the target’s geometric center. It is defined as the sum of a scale-sensitive loss and a location-sensitive loss. The scale-sensitive loss is built on I and U, which denote the intersection and union areas between the prediction and the ground truth, respectively; its weighting factor is adaptively calculated based on the total pixel deviation to smooth the loss landscape and facilitate more stable optimization. The location-sensitive loss combines two terms: the first represents the radial distance consistency between the predicted and true centroids, while the second measures the deviation in azimuth angle of the predicted centroid relative to the true centroid. This polar-coordinate supervision can effectively correct the deformation of the target, making the predicted mask better aligned with the ground truth in terms of shape.
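For orientation, the sketch below shows one form in which a scale-sensitive IoU term and a polar-coordinate location term can be combined. The exact expressions are assumptions based on how SLS-style losses are commonly written (an area-ratio weight with a variance term, a distance ratio, and a squared angle difference scaled by 4/π²); they are not taken from this text and may differ from the formulation actually used.

```python
import numpy as np

def centroid_polar(mask, eps=1e-6):
    """Centroid of a soft mask as (distance, angle) about the image origin."""
    yy, xx = np.indices(mask.shape)
    m = mask.sum() + eps
    cx, cy = (mask * xx).sum() / m, (mask * yy).sum() / m
    return np.hypot(cx, cy), np.arctan2(cy, cx)

def sls_like_loss(pred, gt, eps=1e-6):
    """Assumed SLS-style loss: scale-sensitive term plus location-sensitive term."""
    inter = (pred * gt).sum()                       # intersection I
    union = pred.sum() + gt.sum() - inter           # union U
    a_p, a_g = pred.sum(), gt.sum()
    var = (a_p - a_g) ** 2 / 4.0                    # variance of the two areas
    weight = (min(a_p, a_g) + var) / (max(a_p, a_g) + var + eps)
    scale_term = 1.0 - weight * inter / (union + eps)
    d_p, t_p = centroid_polar(pred)
    d_g, t_g = centroid_polar(gt)
    radial = 1.0 - min(d_p, d_g) / (max(d_p, d_g) + eps)     # radial consistency
    angular = 4.0 / np.pi ** 2 * (t_p - t_g) ** 2            # azimuth deviation
    return scale_term + radial + angular
```

A perfect prediction drives both terms to approximately zero, while a prediction with the correct area but a displaced centroid is still penalized by the location term.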
Auxiliary Supervision. Inspired by the TDA loss [41], we introduce an auxiliary loss. Unlike conventional global image-level supervision, it operates on local patches centered around target regions, employing a patch-based supervision strategy coupled with an adaptive weighting mechanism.
To implement hard sample mining, we assign higher weights to targets with smaller scales or lower contrasts. For the t-th connected component (target), we extract its pixel area and local contrast. An adaptive importance index is then defined by comparing these attributes against their mean values calculated from the entire dataset and passing the result through the sigmoid function. This formulation ensures that the index increases as the target size decreases or the local contrast decreases, effectively forcing the model to focus on the most challenging samples.
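The monotonic behaviour described above can be illustrated with a small function. The specific expression, a sum of two sigmoids of relative deviations from the dataset means, is hypothetical and chosen only because it increases as area or contrast decreases; the formula used in the paper is not recoverable from this text.

```python
import numpy as np

def importance_index(area, contrast, mean_area, mean_contrast):
    """Hypothetical importance index: larger for small or low-contrast targets."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    # Relative deviation below the dataset mean is positive for hard targets.
    size_term = sigmoid((mean_area - area) / mean_area)
    contrast_term = sigmoid((mean_contrast - contrast) / mean_contrast)
    return size_term + contrast_term
```

With this form, a target exactly at the dataset means receives an index of 1.0, and the index is bounded between 0 and 2, which keeps any single target from dominating the loss.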
Global prediction maps often suffer from severe class imbalance, where the overwhelming majority of background gradients can dilute the sparse positive gradients from extremely small targets. To mitigate this, we limit the supervision to local patches that contain the target and its nearby surroundings.
For a target with a given centroid and bounding box dimensions, the patch is defined around the centroid from the bounding box dimensions and a random dilation factor that provides surrounding context.
Based on this patch, we crop the corresponding regions from the global predicted mask and the ground truth. To eliminate the influence of target size and ensure scale invariance during loss calculation, both crops are uniformly rescaled to a fixed resolution. Within this normalized local region, we supervise the model using a weighted soft IoU loss, which incorporates a focal-like modulating factor to adaptively adjust the gradient contribution based on target difficulty, thereby prioritizing the optimization of hard-to-segment targets. Here, the IoU is computed between the rescaled patches, and an importance weight emphasizes hard targets. The final auxiliary loss is defined as the average loss across all N targets in the image.
This local focus encourages the model to capture fine-grained morphology and edge features while reducing potential false alarms.
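The patch-based supervision can be sketched as a crop around each target, a resampling to a fixed size, and a weighted soft IoU loss. Several details below are assumptions for illustration only: nearest-neighbour resampling, the patch size passed as `size`, the focal exponent `gamma`, and the form `weight * (1 - IoU)^gamma * (1 - IoU)` of the loss.

```python
import numpy as np

def crop_and_resize(img, cx, cy, bw, bh, dilation, size):
    """Crop a dilated box around (cx, cy) and resample it to (size, size)
    with nearest-neighbour indexing."""
    h, w = img.shape
    half_w, half_h = dilation * bw / 2.0, dilation * bh / 2.0
    x0 = max(int(np.floor(cx - half_w)), 0)
    x1 = min(int(np.ceil(cx + half_w)) + 1, w)
    y0 = max(int(np.floor(cy - half_h)), 0)
    y1 = min(int(np.ceil(cy + half_h)) + 1, h)
    patch = img[y0:y1, x0:x1]
    rows = np.arange(size) * patch.shape[0] // size   # source row for each output row
    cols = np.arange(size) * patch.shape[1] // size   # source column for each output column
    return patch[np.ix_(rows, cols)]

def weighted_soft_iou_loss(pred_patch, gt_patch, weight, gamma=2.0, eps=1e-6):
    """Assumed focal-style soft IoU loss for one target patch."""
    inter = (pred_patch * gt_patch).sum()
    union = pred_patch.sum() + gt_patch.sum() - inter
    iou = inter / (union + eps)
    return weight * (1.0 - iou) ** gamma * (1.0 - iou)
```

The auxiliary loss for an image would then be the mean of `weighted_soft_iou_loss` over its N targets, with `weight` set per target. Rescaling every patch to the same size means a two-pixel target and a twenty-pixel target contribute comparable numbers of pixels to the IoU.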
Contrast and Shape-Aware Adaptive Loss. Finally, the CSA loss integrates the global and local supervision described above, combining the SLS loss and the auxiliary loss through a weighting factor. This hybrid strategy improves localization capability while boosting target-level recall for faint targets in complex backgrounds.