Article

IR-SAM2: Target Enhancement with SAM2 for Infrared Small Target Detection

1
Quan Cheng Laboratory, Jinan 250103, China
2
Shandong Sacred Sun Power Sources Co., Ltd., Qufu 273100, China
3
State Key Laboratory of Infrared Physics, Shanghai Institute of Technical Physics of the Chinese Academy of Sciences, Hongkou, Shanghai 200083, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(12), 1891; https://doi.org/10.3390/rs18121891 (registering DOI)
Submission received: 15 April 2026 / Revised: 25 May 2026 / Accepted: 26 May 2026 / Published: 8 June 2026
(This article belongs to the Section AI Remote Sensing)

Highlights

What are the main findings?
  • IR-SAM2 introduces a novel target enhancement framework for infrared small target detection (IRSTD) by integrating a Contrast Query Generator (CQG) and a Radial High-Pass Modulator (RHPM), effectively incorporating frequency domain context to suppress low-frequency background clutter and enhance small target saliency.
  • IR-SAM2 achieves state-of-the-art performance on IRSTD-1k (IoU: 69.75%, $P_d$: 96.60%) and NUDT-SIRST (IoU: 94.18%, $P_d$: 98.94%), while striking an optimal balance between $P_d$ and $F_a$ on NUAA-SIRST, demonstrating superior robustness in complex backgrounds.
What are the implications of the main findings?
  • This work validates the feasibility of adapting vision foundation models (SAM2) to specialized infrared tasks by bridging the domain gap through spatio-frequency learning, offering a new paradigm for high-precision segmentation of extremely small targets with sparse visual cues.
  • The proposed target-centric CSA loss and frequency driven optimization provide a robust and transferable solution for real-world infrared monitoring systems, such as maritime surveillance and airspace early warning, where distinguishing dim targets from heavy low-frequency interference is a critical requirement.

Abstract

Foundation models such as the Segment Anything Model (SAM) have substantially advanced promptable object segmentation in remote sensing. However, extending these capabilities to infrared small target detection (IRSTD) remains highly challenging in the presence of severe background clutter and extremely low target visibility. In this paper, we propose IR-SAM2, an effective target enhancement framework for mask-level infrared small target segmentation in the IRSTD setting. Specifically, IR-SAM2 equips the SAM2 decoder with a dedicated frequency branch, facilitating simultaneous spatio-frequency learning and deep spatio-frequency fusion, while preserving SAM2’s pre-trained knowledge. Moreover, we introduce a target-centric loss to better guide the model in distinguishing small targets from complex backgrounds. Extensive experiments show that IR-SAM2 achieves highly competitive performance on the IRSTD-1k and NUDT-SIRST benchmarks, while striking an optimal balance between detection probability and false alarm rate on NUAA-SIRST. The results further demonstrate the effectiveness of spatio-frequency cues for complex-scene infrared small target segmentation. The source code has been made publicly available to support reproducibility.

1. Introduction

As an important technology, infrared target detection (IRTD) [1] has been widely used in military and civilian fields, such as monitoring sea areas [2,3], infrared ship detection [4,5,6], and implementing early warning in airspace [7,8]. In this paper, we focus on mask-level infrared small target segmentation, often referred to as infrared small target detection (IRSTD). As illustrated in Figure 1, infrared small targets typically occupy only a few to several dozen pixels, which poses significant challenges for precise target-background separation. Furthermore, dominant low-frequency background components markedly diminish target contrast and visibility, rendering the decoder highly susceptible to background-induced activations and false positives in heavily cluttered environments. These impediments often lead to elevated false alarm rates or missed detections.
Meanwhile, foundation models have advanced artificial intelligence by leveraging massive datasets and Transformer-based architectures (e.g., LLMs), achieving remarkable performance across various tasks. In computer vision, the Segment Anything Model (SAM) [10] has emerged as a prominent foundation model. Its successor, SAM2 [11], excels in segmentation with its fine-grained contour capture and impressive zero-shot capabilities, requiring only sparse prompts such as points or bounding boxes. This makes SAM2 well suited to downstream segmentation tasks that demand precise boundary delineation, strong generalization, and reduced annotation dependency.
In the context of IRSTD, where training samples are limited, several previous works have explored strategies for adapting SAM to IRSTD benchmarks. For example, IRSAM [12] incorporates a Perona–Malik diffusion (PMD)-based block into SAM to better capture essential structural features while suppressing noise. Subsequently, SAM-SPL [13] introduced a unified self-prompt learning (SPL) framework to adapt SAM visual priors to IRSTD. However, the large domain gap between infrared and natural images still constrains the adaptation of foundation models to IRSTD.
We observe that existing SAM-based adaptations primarily focus on spatial domain prompt engineering or structural tuning. Although these strategies are effective for natural images with rich textures and clear boundaries, they may still be insufficient to address the unique challenges of IRSTD. Infrared small targets typically occupy only a few to several dozen pixels and lack distinct structural semantics, manifesting merely as weak thermal radiation submerged in severe background clutter. Consequently, SAM’s native spatial attention mechanisms are easily overwhelmed by the dominant low-frequency energy of the background, leading to false activations and poor localization of dim targets. To bridge this fundamental domain gap, we argue that the target enhancement must go beyond the spatial domain and leverage the inherent physical characteristics of infrared imaging. From a frequency domain perspective, the distinction between targets and complex backgrounds becomes highly discriminative: large-scale background clutter is predominantly concentrated in low-frequency components, whereas small targets and local details inherently appear as high-frequency transients. By decoupling the feature representation into the frequency domain, we can explicitly isolate and suppress the overpowering low-frequency background interference before it misguides the Transformer decoder.
Motivated by this, we propose IR-SAM2, a SAM2-based framework that injects frequency domain cues into the decoder to enhance foreground target awareness in cluttered infrared scenes. Through a learnable high-pass filter module, IR-SAM2 can suppress low-frequency components that usually correspond to background noise. Moreover, IR-SAM2 introduces a target-centric CSA loss to guide the whole model to achieve robust perception across varying sizes and shapes. In this way, IR-SAM2 captures the small targets more accurately. Equipped with both feature-level and supervision-level enhancement strategies, our IR-SAM2 demonstrates superior performance across multiple IR datasets, including IRSTD-1k [14], NUAA-SIRST [15], and NUDT-SIRST [9].
Our main contributions are as follows:
  • We augment SAM-based IRSTD frameworks with frequency domain context, which is helpful for foreground target enhancement. Specifically, we propose a filter module embedded within the critical paths of feature extraction and mask generation. By leveraging adaptive frequency domain filtering, this module suppresses low-frequency background clutter. Unlike existing frequency domain methods that rely on static thresholds or are limited to single-channel rectangular cutoffs, our module introduces multi-channel, radial high-pass filtering to accommodate isotropic target characteristics.
  • We propose a target-centric CSA loss that shifts from conventional global image-level metrics to localized supervision centered on individual targets. By adaptively modulating loss contributions based on each target’s specific scale and local contrast, this approach ensures a more balanced optimization across all instances. This mechanism significantly improves instance-level separation in IR-SAM2, even in scenarios with dense target distributions.
  • Extensive experiments across IRSTD benchmarks verify that our method generates highly discriminative features that capture instance-level semantics and morphological details by combining frequency domain refinement with localized adaptive supervision.

2. Materials and Methods

2.1. Related Work

In this section, we first survey the historical and current progress within the IRSTD field. After that, we introduce recent innovations in SAM adaptations and loss functions specific to IRSTD, which provide the theoretical foundation for our proposed methodology.

2.1.1. Infrared Small Target Detection

Existing infrared small target detection (IRSTD) methods are generally categorized into traditional model-driven approaches and data-driven deep learning approaches. Early model-driven methods primarily leveraged hand-crafted features and mathematical priors. For instance, filter-based approaches [16,17,18] apply spatial or frequency domain kernels to suppress background clutter. Meanwhile, methods inspired by the human visual system [19,20] enhance target saliency by measuring local contrast. Furthermore, optimization-based frameworks, such as low-rank and sparse decomposition [21,22,23,24,25], model an infrared image as the superposition of a low-rank background component and a sparse target component. Although traditional models offer strong interpretability and computational efficiency, their performance heavily depends on rigid prior assumptions, making them struggle to adapt to dynamic scenes.
With the advancement of deep learning, data-driven methods have become mainstream, shifting the focus toward designing sophisticated network architectures for discriminative feature extraction. For instance, ISNet [14] introduces Taylor finite differences to capture target shape features, DNANet [9] employs densely nested interaction modules for multi-level feature fusion, and ACMNet [15] proposes asymmetric contextual modulation. However, convolutional neural networks (CNNs) often suffer from limited receptive fields and repeated down-sampling, which restricts their ability to model global context and low-frequency background interference. Liu et al. [26] pioneered the application of Transformers to IRSTD by modeling the long-range dependencies of convolutional features. Building upon this, Wu et al. [27] introduced a multi-scale fusion paradigm, proposing the Multi-scale Vision Transformer Model (MVTM). However, in scenarios where target sparsity is less pronounced or the background is severely disrupted by heavy clutter, these methods may exhibit elevated false alarm rates and excessive target suppression.
Recently, some methods based on deep unfolding have also emerged, such as RPCANet [28] and DRPCA-Net [29]. DRPCA-Net overcomes the limitations of fixed parameters in conventional deep unfolding networks by introducing a hypernetwork-driven dynamic parameter generation mechanism, and it enhances the interpretability of infrared small target detection. However, its performance remains constrained by the strict consistency requirements of model-driven approaches regarding low-rank and sparsity priors. Moreover, these operations are primarily confined to the spatial domain and remain less sensitive to the inherent frequency characteristics of infrared images.
While some approaches such as FDA-IRSTD [30], FSGPNet [31], and HLSR-Net [32] attempt to integrate frequency domain analysis by separating high and low frequencies, they often rely on fixed thresholds and therefore lack adaptability to varying low-frequency interference across scenes. To overcome the constraints of static frequency processing, HDNet [33] introduced the dynamic high-pass filtering (DHPF) module, which adaptively computes and removes a specified proportion of low-frequency energy. However, HDNet is limited to single-channel processing and employs a rectangular frequency cutoff, which may be suboptimal for capturing the isotropic characteristics typical of infrared small targets.

2.1.2. Segment Anything Model

The Segment Anything Model (SAM) [10] introduced a prompt-based segmentation framework that demonstrates strong zero-shot generalization under diverse prompt types. However, SAM’s practical application is often hindered by high computational overhead and limited efficiency in parallel processing. To address these efficiency bottlenecks, SAM2 [11] decouples the image encoder and mask decoder to enable batched prompt processing. Building on this foundation, several specialized variants have been developed: Dual-SAM [34] integrates dual encoders with multi-level coupled prompting to improve performance in specific scenarios, while MAS-SAM [35] employs encoder adapters and multi-scale feature extractors to strengthen representation capabilities.
Regarding IRSTD-specific tasks, IRSAM [12] redesigns the SAM architecture; however, it focuses primarily on structural modifications, and its utilization of SAM’s general visual priors for IRSTD-specific contextual understanding remains limited. SAM-SPL [13] leverages pre-trained general knowledge from SAM to facilitate task-specific contextual understanding. However, SAM-based adaptations may still exhibit limited generalization and struggle to segment fine-grained morphological structures, which are critical for infrared small targets. While prompt learning offers a promising approach for efficient knowledge transfer by modulating backbone behavior via learnable vectors, its application in IRSTD remains challenging. Specifically, generating task-aware prompts that can simultaneously supplement low-level target details and coordinate with high-level semantic guidance remains an open problem in the field.

2.1.3. Loss Function in IRSTD

The design of the loss function plays a pivotal role in an IRSTD task. Conventional loss functions, such as BCE loss [36], IoU loss [37], and Dice loss [38], are susceptible to extreme class imbalance and exhibit limited sensitivity to minute targets. To mitigate scale inconsistency, the scale and location sensitive (SLS) loss [39] incorporates scale and location-sensitive terms; however, it fundamentally remains a global image-level objective. Consequently, SLS lacks the ability to explicitly guide the model to focus on the local regions surrounding each target or to capture the distinctive local contrast characteristics essential for infrared small target detection.
Recently, SD loss [40] and TDA loss [41] dynamically adjust weights for targets of different scales. The SD loss introduces a dynamic weighting coefficient that adaptively adjusts according to the target’s pixel size; meanwhile, TDA loss employs an image-patch-based mechanism and an adaptive weighting strategy that adjusts the contribution of each target. These loss-driven refinements provide a potent strategy for enhancing overall IRSTD performance without increasing the model’s computational complexity during inference.

2.2. Methodology

In this section, we first present the overall pipeline of our method in Section 2.2.1. Section 2.2.2 and Section 2.2.3 then detail the CQG and RHPM modules, respectively. Finally, Section 2.2.4 describes the model optimization.

2.2.1. Overall Architecture

As illustrated in Figure 2, the overall pipeline consists of four main stages. First, given the infrared image, multi-scale features $F$ are extracted by a Hiera-based image encoder. Subsequently, the features are processed through a $1 \times 1$ convolution to serve as the input $F_{in}$ for the following stages. Second, the Contrast Query Generator (CQG) employs a filtering module to transform the multi-scale feature maps $F_{in}$ into high-pass spatial features $F_{\lambda}$, and subsequently performs saliency sampling on this map to obtain dynamic queries $Q_{final}$. These queries then serve as the input for the two-branch Transformer decoder to localize targets and produce deep features $F_{deep}$. Third, these features are forwarded to a multi-scale decoder, where each scale uses independent convolutions with non-shared parameters to capture scale-specific mapping patterns. After spatial resizing and channel reduction, the decoder produces a set of multi-scale predicted masks $\mathrm{Mask}_i$ for subsequent stages. Fourth, a cascaded sequence of Radial High-Pass Modulator (RHPM) modules is integrated across the mask generation stages to progressively suppress low-frequency clutter and produce the final high-resolution segmentation mask $P_{final}$.

2.2.2. Contrast Query Generator

The performance of transformer-based decoders is heavily dependent on the quality of object queries. For infrared small target detection, directly using randomly initialized queries often leads to slow convergence in the early stages, especially for weak infrared targets. To mitigate this issue, we propose a Contrast Query Generator (CQG) that converts high-frequency salient responses into location-aware query tokens, facilitating more accurate target localization.
To extract frequency domain information, the CQG first utilizes a filtering module to suppress low-frequency components from the input multi-scale features $F_{in} \in \mathbb{R}^{H \times W \times C}$, yielding high-pass spatial features $F_{\lambda} \in \mathbb{R}^{H \times W \times C}$. These features are then compressed into a 2D energy map $S \in \mathbb{R}^{H \times W}$ via channel-wise averaging:
$$S(x, y) = \frac{1}{C} \sum_{c=1}^{C} F_{\lambda}^{(c)}(x, y).$$
Consequently, high-energy regions in $S(x, y)$ indicate potential target locations.
To capture all potential targets, we then flatten $S$ into a vector and select the Top-K points with the largest values as the candidate set $P_{cand}$:
$$P_{cand} = \{(p_k, s_k)\}_{k=1}^{K},$$
where $p_k = (x_k, y_k)$ represents the normalized spatial coordinate (e.g., scaled to $[0, 1]$) of the $k$-th point, and $s_k$ denotes the corresponding saliency intensity.
For each candidate point $p_k$, its spatial coordinates are mapped via sine and cosine functions to generate a positional embedding vector $PE(p_k) \in \mathbb{R}^{D}$, where $D$ is the channel dimension of $F_{in}$, thereby facilitating subsequent two-branch interaction.
Unlike traditional Top-K queries that directly use multiple tokens, we leverage the saliency values $s_k$ to compute normalized aggregation weights with the Softmax function:
$$\alpha_k = \frac{\exp(s_k)}{\sum_{j=1}^{K} \exp(s_j)}.$$
Subsequently, the position embeddings of all candidate points are weighted and summed to obtain the aggregated position features $T_{agg}$:
$$T_{agg} = \sum_{k=1}^{K} \alpha_k \cdot PE(p_k).$$
This weighted aggregation can be interpreted as estimating the centroid of the salient high-frequency energy, allowing the query to focus on the most likely target region. Finally, $T_{agg}$ is normalized, projected by an MLP, and added to the learnable content query $Q_{task}$, yielding the final query vector $Q_{final}$:
$$Q_{final} = Q_{task} + \mathrm{MLP}(\mathrm{LayerNorm}(T_{agg})).$$
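The query construction above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the power-of-two frequency ladder in the sinusoidal embedding, and the choice of normalizing coordinates by $W-1$ and $H-1$ are hypothetical, and the learned parts ($Q_{task}$, LayerNorm, and the MLP) are omitted, so the sketch stops at $T_{agg}$.

```python
import numpy as np

def sinusoidal_embedding(points, dim):
    """Map normalized (x, y) points of shape (K, 2) to sine/cosine embeddings (K, dim)."""
    assert dim % 4 == 0, "dim must be divisible by 4 (sin and cos for x and y)"
    n_freq = dim // 4
    freqs = 2.0 ** np.arange(n_freq)                      # assumed frequency ladder
    angles = 2.0 * np.pi * points[:, :, None] * freqs     # (K, 2, n_freq)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(points.shape[0], dim)

def contrast_query_prior(f_hp, k, dim):
    """f_hp: high-pass features (H, W, C). Returns T_agg (dim,), weights (k,), points (k, 2)."""
    h, w, _ = f_hp.shape
    energy = f_hp.mean(axis=-1)                           # S(x, y): channel-wise average
    flat = energy.ravel()
    top = np.argsort(flat)[-k:][::-1]                     # indices of the K largest values
    ys, xs = np.unravel_index(top, (h, w))
    points = np.stack([xs / max(w - 1, 1), ys / max(h - 1, 1)], axis=1)  # scaled to [0, 1]
    s = flat[top]                                         # saliency intensities s_k
    weights = np.exp(s - s.max())                         # numerically stable softmax
    weights /= weights.sum()
    t_agg = (weights[:, None] * sinusoidal_embedding(points, dim)).sum(axis=0)
    return t_agg, weights, points
```

Because the weights come from a softmax over the saliency values, a single dominant peak in $S$ pulls $T_{agg}$ almost entirely toward the embedding of that location.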
After that, $Q_{final}$, which encodes both task-related semantic context and a positional prior that indicates likely target locations, is fed into the two-branch Transformer as the decoder input:
$$Q_{final} \leftarrow \mathrm{MLP}(\mathrm{CA}(\mathrm{SA}(Q_{final}), F_{in}, F_{in})), \quad F_{in} \leftarrow \mathrm{CA}(F_{in}, Q_{final}, Q_{final}),$$
where $\mathrm{SA}(\cdot)$, $\mathrm{CA}(\cdot)$, and $\mathrm{MLP}(\cdot)$ represent the self-attention, cross-attention, and multilayer perceptron modules, respectively.
After several iterations, the refined features $F_{in}$ undergo spatial upsampling to reconstruct fine-grained resolution. Specifically, a hypernetwork MLP is employed to transform the updated query $Q_{final}$ into a set of channel-wise scaling weights, which serve as modulation factors. By applying these factors to $F_{in}$ through channel-wise scaling, we effectively infuse the dense features with target-aware guidance. Finally, the modulated features pass through a projection layer for channel reduction and smoothing, yielding the deep features $F_{deep}$:
$$F_{deep} = \mathrm{Proj}\big(\mathrm{MLP}_{hyper}(Q_{final}) \otimes \mathrm{Up}(F_{in})\big) \in \mathbb{R}^{(2H) \times (2W) \times \frac{C}{4}},$$
where $\mathrm{MLP}_{hyper}$ denotes the hypernetwork MLP, $\mathrm{Up}(\cdot)$ represents the spatial upsampling operation, and $\otimes$ denotes the element-wise multiplication along the channel dimension.
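To make the tensor shapes in the modulation step concrete, the following NumPy sketch may help. It is illustrative only: the hypernetwork MLP and the projection layer are each replaced by a single linear map (`w_hyper`, `w_proj`, both hypothetical names), and nearest-neighbour interpolation stands in for the unspecified upsampling operator.

```python
import numpy as np

def modulate_features(f_in, q_final, w_hyper, w_proj):
    """f_in: (H, W, C); q_final: (D,); w_hyper: (D, C); w_proj: (C, C // 4).

    Returns features of shape (2H, 2W, C // 4).
    """
    scale = q_final @ w_hyper                         # (C,) channel-wise scaling weights
    up = f_in.repeat(2, axis=0).repeat(2, axis=1)     # nearest-neighbour 2x upsampling
    modulated = up * scale                            # broadcast over the channel axis
    return modulated @ w_proj                         # per-pixel (1x1) projection to C/4
```

With $H=4$, $W=6$, and $C=8$, for instance, the output has shape $(8, 12, 2)$, matching $(2H) \times (2W) \times C/4$.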
To ensure precise target perception, we adopt a resolution recovery strategy that captures fine-grained features by establishing skip-convs between the two stages. Specifically, the framework utilizes Up-Blocks to progressively restore the spatial resolution of $F_{deep}$, integrating feature feedback from skip-convs at each stage. This process generates a series of multi-scale predicted masks $\{\mathrm{Mask}_0, \mathrm{Mask}_1, \mathrm{Mask}_2\}$, which serve as the input for subsequent processes.

2.2.3. Radial High-Pass Modulator

Small targets in infrared images usually appear as high-frequency transients, while large-scale backgrounds, such as cloud layers and sea surfaces, are mainly concentrated in low-frequency components. To better separate weak target responses from dominant low-frequency clutter, we introduce the RHPM module, which performs adaptive radial high-pass filtering in the frequency domain. It adaptively determines the cutoff radius according to the energy distribution of the feature map, enabling scene-dependent suppression of low-frequency interference.
As shown in Figure 3, the cascaded RHPM takes the input image $I_{RI} \in \mathbb{R}^{H \times W \times 1}$ and the multi-scale masks $\{\mathrm{Mask}_0, \mathrm{Mask}_1, \mathrm{Mask}_2\}$ from Section 2.2.2 as inputs.
Specifically, the mask from the deepest layer, denoted as $\mathrm{Mask}_2 \in \mathbb{R}^{H \times W \times 1}$, is utilized for initial spatial filtering to highlight potential targets. The resulting enhanced features are then projected into the frequency domain through a two-dimensional Fast Fourier Transform (FFT):
$$F_{f,2} = \mathcal{F}(I_{RI} \otimes \mathrm{Mask}_2 + I_{RI}),$$
where $\mathcal{F}(\cdot)$ represents the FFT operation and $F_{f,2} \in \mathbb{R}^{H \times W \times 1}$ represents the frequency feature map.
The process of high-pass filtering is illustrated in Figure 4.
To evaluate the frequency distribution, we calculate the squared magnitude of the frequency feature map as the power spectrum $E(u, v)$:
$$E(u, v) = |F_{f,2}(u, v)|^{2},$$
where $|F_{f,2}(u, v)|$ is the amplitude at frequency coordinate $(u, v)$. The total spectral energy $E_{total}$ is then obtained by aggregating the energy across all frequency components:
$$E_{total} = \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} E(u, v).$$
To achieve adaptive background suppression, we introduce stage-specific hyperparameters $\Lambda = [\lambda_0, \lambda_1, \lambda_2]$ and choose the smallest cutoff radius $r_{cut}$ for which the low-frequency energy $E_{low}(r)$ accounts for at least the stage-specific fraction of the total energy $E_{total}$. For the first RHPM stage (corresponding to $\lambda_2$), the optimization objective is defined as:
$$r_{cut} = \arg\min_{r} \; r \quad \text{s.t.} \quad 1 \le r \le \frac{\min(H, W)}{2}, \quad \frac{E_{low}(r)}{E_{total}} \ge \lambda_2,$$
where $E_{low}(r)$ denotes the cumulative energy within a radius $r$ centered at the spectral origin $(c_u, c_v)$. We set the low-frequency components within the cutoff region to zero while retaining the high-frequency components. Based on this, we construct a dynamic filtering mask $M(u, v)$ to eliminate low-frequency components while preserving high-frequency details:
$$M(u, v) = \begin{cases} 0, & \text{if } (u - c_u)^{2} + (v - c_v)^{2} < r_{cut}^{2}, \\ 1, & \text{otherwise}. \end{cases}$$
Considering that the amount of low-frequency background diminishes as the decoder transitions to shallower layers, we decrease the energy filtering rate $\lambda_i$ from deep to shallow stages, setting $\Lambda = [\lambda_0, \lambda_1, \lambda_2] = [0.1, 0.2, 0.4]$, where the indices $\{0, 1, 2\}$ represent the layers from shallow to deep. The sensitivity analysis of the hyperparameter is provided in Section 3.3.4.
The enhanced high-frequency features are then reconstructed via the Inverse Fast Fourier Transform (iFFT):
$$X_{\lambda_2} = |\mathcal{F}^{-1}(F_{f,2} \otimes M)|,$$
where $\mathcal{F}^{-1}(\cdot)$ denotes the iFFT. The subsequent stages follow a similar procedure to yield the final output $X_{\lambda_0}$, the detailed workflow of which is illustrated in Figure 3.
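A single-channel version of one RHPM stage can be sketched with NumPy's FFT routines. This is a minimal sketch rather than the authors' multi-channel module: the function name is hypothetical, the spectrum is centred with `fftshift` so that the spectral origin sits at $(H/2, W/2)$, and $E_{low}(r)$ is taken over the open disc of radius $r$ to match the strict inequality in the mask definition.

```python
import numpy as np

def radial_high_pass(x, lam):
    """Zero the smallest centred disc holding at least `lam` of the spectral energy.

    x: real 2D float array (H, W); lam: energy fraction in (0, 1).
    Returns the filtered magnitude image and the chosen cutoff radius.
    """
    h, w = x.shape
    spec = np.fft.fftshift(np.fft.fft2(x))            # move the zero frequency to the centre
    energy = np.abs(spec) ** 2                        # power spectrum E(u, v)
    cu, cv = h // 2, w // 2                           # spectral origin after fftshift
    u, v = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    dist2 = (u - cu) ** 2 + (v - cv) ** 2
    total = energy.sum()                              # E_total
    r_max = min(h, w) // 2
    r_cut = r_max                                     # fallback if no radius meets the constraint
    for r in range(1, r_max + 1):                     # smallest r with E_low(r) / E_total >= lam
        if energy[dist2 < r ** 2].sum() >= lam * total:
            r_cut = r
            break
    mask = (dist2 >= r_cut ** 2).astype(float)        # M(u, v): 0 inside the disc, 1 outside
    out = np.abs(np.fft.ifft2(np.fft.ifftshift(spec * mask)))
    return out, r_cut
```

For a flat background with one bright pixel, nearly all spectral energy sits at the zero frequency, so the cutoff collapses to the smallest radius and the filter removes the mean while leaving the point target as the strongest response.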
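Finally, the spatial prediction $P$ is refined using the frequency domain features $X_{\lambda_0}$ generated by the third-stage RHPM to yield the final output $P_{final}$:
$$P = \mathrm{Conv}_{1 \times 1}(\mathrm{Concat}[\mathrm{Mask}_0, \mathrm{Mask}_1, \mathrm{Mask}_2]), \quad P_{final} = P \otimes \sigma(X_{\lambda_0}) + P \in \mathbb{R}^{H \times W \times 1},$$
where $P$ is obtained by concatenating the multi-scale prediction maps from the intermediate decoder stage and fusing them with a $1 \times 1$ convolution, and $\sigma(\cdot)$ denotes the sigmoid function.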

2.2.4. Model Optimization

To address extreme foreground–background imbalance and weak local contrast in infrared imagery, we design a Contrast and Shape-Aware Adaptive (CSA) loss, denoted as $\mathcal{L}_{CSA}$. We adopt the Scale and Location Sensitive (SLS) loss as the primary supervision, and use the Target-Driven Adaptive (TDA) loss as auxiliary supervision to improve robustness under challenging conditions.
Primary Supervision. The SLS loss optimizes IoU while incorporating polar coordinate constraints to refine the target’s geometric center. It is defined as the sum of a scale-sensitive loss $\mathcal{L}_S$ and a location-sensitive loss $\mathcal{L}_L$:
$$\mathcal{L}_{SLS} = \mathcal{L}_S + \mathcal{L}_L,$$
where $\mathcal{L}_S$ is defined as
$$\mathcal{L}_S = 1 - \beta \, \frac{I + \epsilon}{U + \epsilon},$$
where $I$ and $U$ denote the intersection and union areas between the prediction and the ground truth, respectively. The weighting factor $\beta$ is adaptively calculated based on the total pixel deviation to smooth the loss landscape and facilitate more stable optimization. Meanwhile, $\mathcal{L}_L$ is defined as:
$$\mathcal{L}_L = \sum_{i=1}^{N} (1 - \rho_i + \theta_i),$$
where $\rho_i$ represents the radial distance consistency between the predicted and true centroid, while $\theta_i$ measures the deviation in azimuth angle of the predicted centroid relative to the true centroid. This polar coordinate supervision mechanism can effectively correct the deformation of the target, making the predicted mask better aligned with the ground truth in terms of shape.
Auxiliary Supervision. Inspired by the TDA loss [41], we introduce an auxiliary loss $\mathcal{L}_{aux}$. Unlike conventional global image-level supervision, $\mathcal{L}_{aux}$ operates on local patches centered around target regions, employing a patch-based supervision strategy coupled with an adaptive weighting mechanism.
To implement hard sample mining, we assign higher weights to targets with smaller scales or lower contrasts. For the $t$-th connected component (target), we extract its pixel area $s_t$ and local contrast $c_t$. An adaptive importance index $p_t$ is defined by comparing these attributes against the mean values $s_{mean}$ and $c_{mean}$ calculated from the entire dataset:
$$p_t = 1 + \sigma\!\left(-\frac{s_t}{s_{mean}}\right) + \sigma\!\left(-\frac{c_t}{c_{mean}}\right),$$
where $\sigma(\cdot)$ denotes the sigmoid function. This formulation ensures that $p_t$ increases as the target size decreases or the local contrast decreases, effectively forcing the model to focus on the most challenging samples.
Global prediction maps often suffer from severe class imbalance, where the overwhelming majority of background gradients can dilute the sparse positive gradients from extremely small targets. To mitigate this, we limit the supervision to local patches that contain the target and its nearby surroundings.
For a target with centroid $(C_x, C_y)$ and bounding box dimensions $(w, h)$, the patch is defined as
$$\mathrm{Patch}_{box} = \left[ C_x - \frac{w}{2} - d, \; C_y - \frac{h}{2} - d, \; C_x + \frac{w}{2} + d, \; C_y + \frac{h}{2} + d \right],$$
where $d \in [2, 5]$ is a random dilation factor providing surrounding context.
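The patch definition translates directly into a few lines of Python. In this sketch the function name is hypothetical, and two details are assumptions not stated in the text: $d$ is drawn as an integer, and the box is clipped to the image bounds so that it can be used for cropping.

```python
import math
import random

def patch_box(cx, cy, w, h, img_w, img_h, rng=None):
    """Return (x0, y0, x1, y1) around a target, dilated by a random margin d in [2, 5]."""
    rng = rng or random.Random()
    d = rng.randint(2, 5)                             # integer dilation is an assumption
    x0 = max(0, math.floor(cx - w / 2 - d))           # clip to the image (assumption)
    y0 = max(0, math.floor(cy - h / 2 - d))
    x1 = min(img_w, math.ceil(cx + w / 2 + d))
    y1 = min(img_h, math.ceil(cy + h / 2 + d))
    return x0, y0, x1, y1
```

Passing a seeded `random.Random` instance makes the dilation reproducible, which can be useful when debugging the cropped patches.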
Based on this $\mathrm{Patch}_{box}$, we crop the corresponding regions from the global predicted mask and the ground truth, denoted as $\hat{y}_t$ and $y_t$. To eliminate the influence of target size and ensure scale invariance during loss calculation, both $\hat{y}_t$ and $y_t$ are uniformly rescaled to a fixed resolution of $48 \times 48$. Within this normalized local region, we supervise the model using a weighted soft IoU loss $\mathcal{L}_t$. Specifically, $\mathcal{L}_t$ incorporates a focal-like modulating factor to adaptively adjust the gradient contribution based on target difficulty, thereby prioritizing the optimization of hard-to-segment targets:
$$\mathcal{L}_t = -\left(1 - I_t^{\,p_t}\right) \cdot \log(I_t + \varepsilon),$$
where $I_t$ denotes the IoU between the rescaled patches, and $p_t$ serves as an importance weight to emphasize hard targets. The final auxiliary loss is defined as the average loss across all $N$ targets in the image:
$$\mathcal{L}_{aux} = \frac{1}{N} \sum_{t=1}^{N} \mathcal{L}_t.$$
This local focus encourages the model to capture fine-grained morphology and edge features while reducing potential false alarms.
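The auxiliary supervision can be illustrated with a small pure-Python sketch. It rests on two assumed readings of the formulas, namely $p_t = 1 + \sigma(-s_t/s_{mean}) + \sigma(-c_t/c_{mean})$ and $\mathcal{L}_t = -(1 - I_t^{p_t}) \log(I_t + \varepsilon)$, chosen so that $p_t$ grows for small, low-contrast targets and the loss stays non-negative. All function names are hypothetical, patches are passed as flattened lists of probabilities, and the rescaling to $48 \times 48$ is omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def importance(s_t, c_t, s_mean, c_mean):
    """Importance index p_t (assumed form): larger for small or low-contrast targets."""
    return 1.0 + sigmoid(-s_t / s_mean) + sigmoid(-c_t / c_mean)

def soft_iou(pred, gt, eps=1e-6):
    """Soft IoU between flattened patches of probabilities (pred) and labels (gt)."""
    inter = sum(p * g for p, g in zip(pred, gt))
    union = sum(pred) + sum(gt) - inter
    return (inter + eps) / (union + eps)

def target_loss(iou, p_t, eps=1e-6):
    """Per-target loss L_t (assumed form) with a focal-like modulating factor."""
    return -(1.0 - iou ** p_t) * math.log(iou + eps)

def aux_loss(patches, stats, s_mean, c_mean):
    """Average L_t over targets; patches: [(pred, gt)], stats: [(s_t, c_t)]."""
    losses = [
        target_loss(soft_iou(pred, gt), importance(s, c, s_mean, c_mean))
        for (pred, gt), (s, c) in zip(patches, stats)
    ]
    return sum(losses) / len(losses) if losses else 0.0
```

Under these assumptions a poorly segmented patch (low $I_t$) contributes a much larger loss than a well segmented one, and the loss vanishes as $I_t$ approaches 1.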
Contrast and Shape-Aware Adaptive Loss. Finally, the Contrast and Shape-Aware Adaptive (CSA) loss function integrates both global and local supervision mentioned above:
$$\mathcal{L}_{CSA} = \mathcal{L}_{SLS} + w_{aux} \cdot \mathcal{L}_{aux},$$
where $w_{aux}$ is a weighting factor that balances the auxiliary term against the primary supervision. This hybrid strategy improves localization capability while boosting target-level recall for faint targets in complex backgrounds.

3. Results

3.1. Experimental Setup

Datasets. We evaluate our approach on three public IRSTD benchmarks: IRSTD-1k [14], NUAA-SIRST [15], and NUDT-SIRST [9], with sample sizes of 1001, 427, and 1327, respectively. Following prior works [42], the NUAA-SIRST and NUDT-SIRST datasets are partitioned into training and testing sets with a 1:1 ratio. For IRSTD-1k, we adopt a training-to-testing split ratio of 4:1.
Evaluation Metrics. We evaluate our model using standard metrics categorized into two dimensions: pixel-level segmentation quality, measured by intersection over union (IoU), and target-level detection performance, represented by probability of detection ($P_d$) and false alarm rate ($F_a$). Additionally, Receiver Operating Characteristic (ROC) curves are generated to provide a comprehensive comparison between our method and current state-of-the-art solutions.
The IoU is defined as
$$\mathrm{IoU} = \frac{A_{inter}}{A_{union}} = \frac{\sum_{i=1}^{N} TP[i]}{\sum_{i=1}^{N} \left( T[i] + P[i] - TP[i] \right)},$$
where $A_{inter}$ and $A_{union}$ denote the cumulative areas of intersection and union across $N$ samples, respectively. The variables $T[i]$, $P[i]$, and $TP[i]$ denote the numbers of ground-truth, predicted, and true positive pixels for the $i$-th sample, respectively.
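Note that this IoU accumulates intersections and unions over the whole test set before dividing, which differs from averaging per-image IoU values. A minimal sketch, assuming binary masks given as flattened 0/1 lists (the function name is hypothetical):

```python
def dataset_iou(preds, gts):
    """Cumulative IoU over samples: sum of TP[i] divided by sum of (T[i] + P[i] - TP[i])."""
    inter = 0
    union = 0
    for pred, gt in zip(preds, gts):
        tp = sum(1 for p, g in zip(pred, gt) if p and g)   # TP[i]
        t = sum(1 for g in gt if g)                        # T[i]: ground-truth pixels
        n_pred = sum(1 for p in pred if p)                 # P[i]: predicted pixels
        inter += tp
        union += t + n_pred - tp
    return inter / union if union else 0.0
```

For example, one perfectly segmented four-pixel target and one completely missed single-pixel target (with one false pixel) give $4 / 6 \approx 0.667$ under this definition, whereas the mean of the per-image IoUs would be $0.5$.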
The probability of detection is defined as
$$P_d = \frac{N_{detect}}{N_{total}},$$
where $N_{detect}$ is the count of accurately predicted targets and $N_{total}$ is the total number of all targets.
The false alarm rate is defined as
$$F_a = \frac{P_{false}}{P_{total}},$$
where $P_{false}$ is the number of falsely predicted target pixels and $P_{total}$ is the total number of pixels in the image.
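The two target-level metrics can be sketched for a single image as follows. The matching rule is not specified in the text, so this sketch makes simplifying assumptions: ground-truth targets are 8-connected components, a target counts as detected if the prediction overlaps any of its pixels, and false pixels are predicted pixels lying outside every ground-truth target. Function names are hypothetical and masks are 2D lists of 0/1 values.

```python
def connected_components(mask):
    """Return the 8-connected foreground regions of a 2D 0/1 list as sets of (y, x)."""
    h, w = len(mask), len(mask[0])
    seen, comps = set(), []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or (y, x) in seen:
                continue
            stack, comp = [(y, x)], set()
            seen.add((y, x))
            while stack:                                  # iterative flood fill
                cy, cx = stack.pop()
                comp.add((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            comps.append(comp)
    return comps

def pd_fa(pred, gt):
    """Return (P_d, F_a) for one image under the assumed overlap-based matching rule."""
    h, w = len(gt), len(gt[0])
    pred_px = {(y, x) for y in range(h) for x in range(w) if pred[y][x]}
    targets = connected_components(gt)
    n_detect = sum(1 for t in targets if t & pred_px)     # any overlap counts as a hit
    gt_px = set().union(*targets)
    p_false = len(pred_px - gt_px)                        # predicted pixels off every target
    p_d = n_detect / len(targets) if targets else 0.0
    f_a = p_false / (h * w)
    return p_d, f_a
```

Benchmark implementations often match predictions to targets by centroid distance instead, so values from this sketch are not directly comparable with the reported numbers.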
The ROC curves are generated by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) across a range of decision thresholds.
Implementation Details. Following prior works [42], input images are resized to $256 \times 256$. The model is optimized using the adaptive Nesterov momentum algorithm (Adan) [43] for 300 epochs, with a batch size of 12. We set the initial learning rate to 0.01 and employ a CosineAnnealingLR scheduler to facilitate a smooth decay to $1 \times 10^{-6}$. We adopt the first three stages of the SAM2 (Hiera–Tiny) image encoder as the backbone feature extractor. All experiments are implemented within the PyTorch framework and accelerated by an NVIDIA GeForce RTX 4090 GPU.
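For reference, the learning-rate trajectory implied by this schedule can be written in closed form. The sketch below assumes the scheduler is stepped once per epoch with a period equal to the full 300 epochs and no warm-up or restarts; these details, and the function name, are assumptions rather than reported settings.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=300, lr_max=0.01, lr_min=1e-6):
    """Closed-form cosine annealing: lr_max at epoch 0, decaying to lr_min at total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)
```

Under these assumptions the rate is 0.01 at the start, roughly 0.005 at epoch 150, and $1 \times 10^{-6}$ at epoch 300.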

3.2. Performance Comparison

3.2.1. Compared Methods

To validate the effectiveness of the proposed method on infrared small target segmentation across different sensing platforms, we compare IR-SAM2 with a wide range of state-of-the-art IRSTD methods:
  • Model-driven methods: top-hat [44], max–median [16], IPI [21], RIPT [22], NRAM [45] and PSTNN [23].
  • Deep learning methods: ACMNet [15], ALCNet [46], ISNet [14], RDIAN [42], MDvsFA [47], DNANet [9], AGPCNet [48], UIUNet [49], MSHNet [39], SCTransNet [50], PConv [40], MMLNet [51], L2SKNet [52], SAM-SPL [13], HDNet [33] and MTUNet [27].
  • Deep unfolding-based methods: RPCANet [28] and DRPCA-Net [29].

3.2.2. Quantitative Comparison

Table 1 summarizes the quantitative comparison between our IR-SAM2 and 24 state-of-the-art (SOTA) methods across three benchmark datasets. Overall, IR-SAM2 delivers the best or highly competitive IoU, $P_d$, and $F_a$ on all three datasets. Specifically, on IRSTD-1k [14], IR-SAM2 obtains the best IoU (69.75%) and $P_d$ (96.60%), showing that the combination of frequency-aware enhancement and target-centric supervision is particularly effective in cluttered scenes. Despite not attaining the peak IoU on NUAA-SIRST [15], IR-SAM2 offers a more favorable $P_d$/$F_a$ balance than leading methods such as UIUNet, SCTransNet, and MMLNet, highlighting its robust discriminative capability against complex clutter.
On NUDT-SIRST [9], although SAM-SPL achieves slightly better IoU and $F_a$, IR-SAM2 still attains the highest $P_d$ (98.94%), suggesting that our method is more sensitive to target responses but may occasionally suppress extremely weak targets during frequency filtering. Nonetheless, IR-SAM2 consistently surpasses other filtering-based models, such as HDNet, on both the IRSTD-1k and NUDT-SIRST datasets. This validates the effectiveness of adapting the SAM2 [11] framework for infrared small target segmentation.
We also present ROC curves for several advanced methods on the IRSTD-1k dataset. As illustrated in Figure 5, where a curve positioned closer to the top-left corner indicates better performance, IR-SAM2 consistently maintains the uppermost position. Notably, our method attains a higher TPR at significantly lower FPR levels, reflecting its high sensitivity and its ability to suppress false alarms in challenging scenarios.

3.2.3. Qualitative Comparison

Figure 6 provides a qualitative comparison between our IR-SAM2 and six representative state-of-the-art methods. We select representative samples from the IRSTD-1k [14] and NUDT-SIRST [9] datasets to visually illustrate the segmentation performance. In the visualization, ground truth targets, false alarms, and missed detections are highlighted with red, yellow, and blue boxes, respectively.
The inherent difficulty of IRSTD is clearly illustrated in the raw infrared images shown in Figure 6. As observed in the first, fifth, and sixth rows, the targets are extremely minute and often occupy only a few pixels, causing them to blend into the background and making them exceptionally difficult to distinguish from their surroundings. Furthermore, the irregular morphologies of the targets (as seen in the second, third, and fourth rows) pose a significant challenge to the multi-scale perception capabilities of existing models.
In contrast, our proposed IR-SAM2 achieves the strongest overall qualitative performance, showing more accurate localization and fewer false alarms across targets of different scales. Whether dealing with typical small targets or those with relatively larger dimensions, our method accurately localizes visually blurred targets within complex backgrounds while effectively suppressing false alarms. Although the visualizations indicate that IR-SAM2 still produces a few false positives, similar behavior can also be observed in other methods to varying degrees.
The qualitative results in Figure 6 show that competing methods often either miss dim targets or generate spurious responses in structurally complex backgrounds. This is particularly evident in cluttered scenes, where background structures exhibit intensity patterns similar to target responses. In contrast, IR-SAM2 produces more compact and accurate predictions around the target region, with fewer spurious activations in the surrounding background. These observations suggest that the proposed frequency-aware filtering helps suppress low-frequency clutter while preserving target-related high-frequency cues.

3.3. Ablation Study

In this section, we conduct comprehensive ablation studies to validate the effectiveness of our proposed modules on the three public IRSTD benchmarks (IRSTD-1k [14], NUDT-SIRST [9], and NUAA-SIRST [15]).

3.3.1. Importance of Each Component

We validate the effectiveness of our proposed components by comparing four network configurations: (i) a baseline model that retains the same SAM2 encoder–decoder backbone without RHPM and CQG; (ii) baseline + CQG, which adds the CQG module to the decoder; (iii) baseline + RHPM, which inserts the RHPM module along the mask generation path; and (iv) the full IR-SAM2 model. Table 2 reports the corresponding quantitative comparisons.
Compared with the baseline model, the integration of either the RHPM or CQG module yields consistent performance gains in most metrics. Specifically, on the IRSTD-1k dataset, the inclusion of the RHPM module increases the IoU from 65.43% to 67.20% while roughly halving the false alarm rate from $17.15 \times 10^{-6}$ to $8.50 \times 10^{-6}$. This demonstrates the efficacy of dynamic high-pass filtering in suppressing background clutter. Similarly, the CQG module independently enhances the IoU to 66.84%, validating its capability in refining target representation through contextual guidance. The full IR-SAM2 model, which combines both components, achieves the overall best performance across all three datasets. Notably, on the IRSTD-1k dataset, IR-SAM2 reaches an IoU of 69.75% and a $P_d$ of 96.60%, outperforming the baseline by 4.32 and 3.07 percentage points, respectively. Furthermore, on the NUAA-SIRST dataset, the full model achieves the lowest false alarm rate ($4.79 \times 10^{-6}$) among all configurations. These results indicate that RHPM and CQG are not only individually effective but also highly complementary when combined: RHPM primarily contributes to clutter suppression, as reflected by the reduction in $F_a$, whereas CQG mainly improves target localization by providing more informative decoder queries.

3.3.2. Different Loss Functions

To validate the effectiveness of the proposed CSA loss, we conducted comparative experiments against four commonly used loss functions in IRSTD: BCE loss [36], IoU loss [37], TDA loss [41], and SLS loss [39]. The quantitative results are summarized in Table 3.
Overall, the proposed CSA loss performs better in the majority of evaluation metrics and datasets compared to the baseline loss functions. While the BCE loss shows a marginal advantage on the NUDT-SIRST [9] dataset for specific metrics, the CSA loss consistently maintains more robust segmentation results across the broader benchmarks.
For a further qualitative assessment, we visualized the segmentation results of our method alongside several representative baselines in Figure 7.
It can be observed that models trained with conventional loss functions generally suffer from missed targets when encountering targets with small scales and low local contrast. Although employing the TDA loss alone demonstrates some capability in capturing the features of these dim and small targets, it is highly susceptible to high-frequency noise and background clutter. This vulnerability inevitably drives up the $F_a$. Moreover, as illustrated in Figure 7, the TDA loss exhibits a weak regression capability for target shapes and boundaries. Consequently, its fundamental pixel-level segmentation performance remains insufficient, directly leading to relatively lower IoU scores. Therefore, the TDA loss is primarily suitable only as an auxiliary loss rather than a standalone objective. In contrast, the model integrated with CSA loss successfully identifies even smaller and more blurred targets in multi-target scenarios. These observations suggest that CSA loss improves learning on hard targets by adaptively emphasizing low-contrast and small-scale instances in local regions, a capability that is difficult to achieve with purely global objectives such as BCE or IoU loss.
We also evaluated the sensitivity of the weight $w_{aux}$ within the CSA loss on the IRSTD-1k [14] and NUDT-SIRST [9] datasets. By varying $w_{aux}$ across values of 0.1, 0.2, 0.3, and 0.5, we analyzed the response of evaluation metrics (see Table 4).
It was observed that an excessively large weight (e.g., $w_{aux} = 0.5$) causes the model to over-prioritize the auxiliary task, thereby compromising the learning efficacy of the primary branch. Conversely, an overly small weight (e.g., $w_{aux} = 0.1$) fails to provide sufficient supervision signals to facilitate effective target refinement. The optimal balance is achieved at $w_{aux} = 0.2$, where the model attains the highest $P_d$ while maintaining a consistently low $F_a$ across both datasets.

3.3.3. Ablation Study of the CQG Module

To investigate the impact of the number of candidate points $K$ in the CQG on model performance, a parameter sensitivity analysis is conducted on the IRSTD-1k and NUDT-SIRST datasets. The quantitative evaluation results for different values of $K$ ($K \in \{1, 5, 10, 20\}$) are summarized in Table 5.
As shown in the table, when $K = 1$, the model exhibits the lowest segmentation accuracy (IoUs of 67.38% and 91.95%) and the highest false alarm rates ($F_a$ up to $14.12 \times 10^{-6}$ and $8.09 \times 10^{-6}$) on both datasets. Mechanistically, selecting only a single point with the highest energy makes the model highly susceptible to interference from isolated high-frequency noise or heavy background clutter. Furthermore, a single point cannot effectively cover the holistic features of dim and small targets that possess irregular morphologies or span multiple pixels, thereby resulting in severe deviations in position estimation.
As $K$ increases to 10, all performance metrics of the model reach their optimum, with the IoUs improving to 69.75% and 94.18% and $P_d$ reaching 96.60% and 98.94% on IRSTD-1k and NUDT-SIRST, respectively. This validates the rationality of the CQG module design: an appropriate number of candidate points can sufficiently cover the salient regions of potential targets. By leveraging the saliency intensity $s_k$ to compute the Softmax-normalized aggregation weights $\alpha_k$ and performing a weighted summation over the positional encodings, the model can more robustly estimate the centroid of the high-frequency energy. This enables the final query vector $Q_{final}$ to precisely focus on the most likely target regions, effectively enhancing the detection rate and segmentation accuracy under complex backgrounds.
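The aggregation step described above can be sketched in a few lines. This is only an illustration of Top-K selection followed by Softmax-weighted summation: the positional encoding used by CQG and the projection that produces the final query are not reproduced, so `toy_encode` is a hypothetical stand-in that returns raw coordinates, and all function names are ours.

```python
import math


def topk_points(saliency, k):
    """Return the k highest-saliency (row, col, value) triples."""
    flat = [(val, r, c) for r, row in enumerate(saliency) for c, val in enumerate(row)]
    flat.sort(key=lambda item: item[0], reverse=True)
    return [(r, c, val) for val, r, c in flat[:k]]


def softmax(values):
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_query(saliency, k, encode):
    """Softmax-weighted sum of positional encodings over the top-k salient points."""
    points = topk_points(saliency, k)
    alphas = softmax([val for _, _, val in points])  # weights alpha_k from s_k
    encodings = [encode(r, c) for r, c, _ in points]
    dim = len(encodings[0])
    return [sum(a * e[d] for a, e in zip(alphas, encodings)) for d in range(dim)]


def toy_encode(r, c):
    # Hypothetical stand-in for the positional encoding: raw coordinates.
    return [float(r), float(c)]


# Toy 4x4 high-frequency saliency map with a two-pixel target in row 1.
saliency = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 2.0, 2.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
query = aggregate_query(saliency, k=2, encode=toy_encode)
```

With `k=2` the two equally salient target pixels receive equal weights, so the aggregated position falls midway between them, which a single point (`k=1`) cannot represent.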
However, when $K$ is further increased to 20, although the $F_a$ decreases due to broader contextual aggregation, both the IoU and $P_d$ exhibit a slight degradation. This performance decline is attributed to the fact that an excessively large $K$ inevitably introduces background edges surrounding the target or non-target high-frequency noise into the candidate set $P_{cand}$. During the weighted aggregation computation, the inclusion of these redundant points dilutes the attention weights of the true targets, causing a shift in the position prior estimation and subsequently weakening the model’s capability for precise perception and localization of extremely small targets.
In summary, K = 10 achieves the optimal balance between target feature coverage and noise exclusion. Therefore, in all subsequent experiments in this paper, the default value of K is set to 10.

3.3.4. Ablation Study of the RHPM Module

To investigate the influence of the energy filtering rate across layers from shallow to deep, denoted as $\Lambda = [\lambda_0, \lambda_1, \lambda_2]$, we conducted parameter sensitivity experiments on the IRSTD-1k and NUDT-SIRST datasets. As summarized in Table 6, we first evaluated equal rate configurations where $\lambda_0 = \lambda_1 = \lambda_2$. Increasing the energy filtering rate from 0.1 to 0.2 improved the IoU from 67.17% to 68.50% on IRSTD-1k and from 92.06% to 93.47% on NUDT-SIRST. These gains suggest that moderate filtering effectively suppresses background clutter while accentuating target features. However, further increasing the rate to 0.4 led to significant performance degradation. Specifically, the IoU on NUDT-SIRST plummeted to 89.04%, indicating that excessive filtering may over-smooth the morphological details of minute targets and compromise localization precision.
Notably, the non-uniform configuration $\Lambda = [0.1, 0.2, 0.4]$ yielded the best overall performance by applying the strongest filtering at the deepest layer and progressively relaxing it toward shallower layers. On IRSTD-1k [14], this setting achieved a peak IoU of 69.75% and a minimum $F_a$ of $9.03 \times 10^{-6}$. Similarly, the IoU and $P_d$ on NUDT-SIRST [9] reached 94.18% and 98.94%, respectively. This improvement is primarily attributed to the adaptive filtering requirements during the decoding process. In the initial stage, where deep features are utilized for filtering, the input $I_{RI}$ contains an abundance of low-frequency background clutter, necessitating a more aggressive energy filtering rate of 0.4 to ensure effective suppression. As the decoder progressively transitions to shallower layers, the amount of low-frequency background within the predicted maps gradually diminishes. Consequently, a lower filtering rate (0.1) is sufficient for these subsequent stages, allowing the model to maintain target-related feature gain while avoiding the over-smoothing of fine details.
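To make the role of the rate concrete, the following sketch shows one way an energy filtering rate could be turned into a radial high-pass mask. It rests on an assumption not confirmed by this section: that a rate $\lambda$ means suppressing the innermost (lowest-frequency) rings of a centered power spectrum until at least a fraction $\lambda$ of the total energy has been removed. It is not the RHPM module itself, and `radial_highpass_mask` is a hypothetical helper.

```python
import math


def radial_highpass_mask(energy, rate):
    """Binary mask that zeroes the lowest-frequency rings of a centered spectrum.

    Rings are removed from the center outwards until the removed energy
    reaches at least `rate` times the total energy (assumed definition).
    """
    h, w = len(energy), len(energy[0])
    cy, cx = h // 2, w // 2
    dist = [[math.hypot(r - cy, c - cx) for c in range(w)] for r in range(h)]
    total = sum(sum(row) for row in energy)
    cutoff = 0.0
    for radius in sorted({d for row in dist for d in row}):
        inside = sum(
            energy[r][c] for r in range(h) for c in range(w) if dist[r][c] <= radius
        )
        cutoff = radius
        if inside >= rate * total:
            break
    return [[0 if dist[r][c] <= cutoff else 1 for c in range(w)] for r in range(h)]


# Toy 5x5 centered power spectrum: energy concentrated at low frequencies.
energy = [[1.5] * 5 for _ in range(5)]
energy[2][2] = 30.0  # DC term
for r, c in [(1, 2), (3, 2), (2, 1), (2, 3)]:
    energy[r][c] = 10.0  # first ring around the DC term

mask_shallow = radial_highpass_mask(energy, 0.1)  # rate used at the shallowest stage
mask_deep = radial_highpass_mask(energy, 0.4)     # rate used at the deepest stage
suppressed_shallow = sum(row.count(0) for row in mask_shallow)
suppressed_deep = sum(row.count(0) for row in mask_deep)
```

On this toy spectrum the rate of 0.1 removes only the DC bin, whereas 0.4 also removes the first ring, which mirrors the more aggressive suppression applied to deep features.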

3.4. Computational Efficiency

To evaluate the architectural complexity and computational efficiency of our proposed IR-SAM2, we employ Params, FLOPs, and FPS as quantitative metrics. As summarized in Table 7, our model is compared against several state-of-the-art (SOTA) methods from the past five years on the IRSTD-1k [14] dataset, encompassing both representative CNN-based and Transformer-based architectures.
Notably, compared to the recent SAM-based method SAM-SPL, our IR-SAM2 achieves a significant performance gain of 5.18% in IoU while maintaining a nearly identical computational overhead (approx. 25.5 M Params and 30.5 G FLOPs) and a comparable inference speed (54.35 vs. 56.18 FPS). Furthermore, when compared to SCTransNet, which holds the second-best IoU, our model not only achieves higher accuracy (69.75% vs. 68.03%) but also delivers a much higher frame rate (54.35 vs. 34.36 FPS) with less than half the FLOPs (30.47 G vs. 67.40 G). Although some lightweight CNNs like DNANet offer higher FPS, their segmentation accuracy falls drastically behind.
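The relative figures quoted in this comparison follow directly from the reported values; a short check, using only the IoU, FPS, and FLOPs numbers stated above for IR-SAM2 and SCTransNet:

```python
# Values copied from the comparison above (IRSTD-1k).
ir_sam2 = {"iou": 69.75, "fps": 54.35, "gflops": 30.47}
sctransnet = {"iou": 68.03, "fps": 34.36, "gflops": 67.40}

iou_gain = ir_sam2["iou"] - sctransnet["iou"]            # percentage points
speedup = ir_sam2["fps"] / sctransnet["fps"]             # frame-rate ratio
flops_ratio = ir_sam2["gflops"] / sctransnet["gflops"]   # fraction of the FLOPs
```

This gives a gain of 1.72 percentage points in IoU, a frame rate about 1.58 times higher, and about 45% of the FLOPs, consistent with the statement that IR-SAM2 needs less than half the computation.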
To provide a more intuitive and comprehensive evaluation, the trade-off among segmentation accuracy, computational cost, and model scale is illustrated in Figure 8.
Specifically, the overall detection performance is quantified by the IoU on the IRSTD-1k dataset, while the area of each circle corresponds to the parameter size of the respective method. As can be observed from the chart, IR-SAM2 achieves superior overall performance without introducing excessive computational overhead, thereby showcasing high computational efficiency and validating its effectiveness for practical infrared small target detection tasks.

4. Discussion

Although IR-SAM2 demonstrates highly competitive performance across multiple public IRSTD benchmarks, the framework still has several notable limitations in extreme scenarios and under specific hardware constraints.
First, in extremely low signal-to-noise ratio (SNR) scenarios, the frequency domain filtering module (RHPM) may inadvertently suppress exceedingly faint targets. This partially explains why IR-SAM2 achieves a relatively lower IoU on the NUAA-SIRST [15] dataset compared to its performance on IRSTD-1k [14] and NUDT-SIRST [9]. By examining the typical failure cases illustrated in Figure 9, the underlying causes of this phenomenon can be clearly observed.
As shown in the first two rows of Figure 9, when the radiation intensity of small targets is extremely low and highly blended with complex backgrounds, their high-frequency characteristics become barely distinguishable. During the adaptive suppression of low-frequency background energy by RHPM, these weak target responses are prone to being erased alongside background clutter, leading to missed detections.
As illustrated in the last row of Figure 9, for infrared targets with blurred edges, the intensity transition from the center to the periphery is gradual. These blurred edges often exhibit relatively low-frequency characteristics in the frequency domain. When performing high-pass filtering, the RHPM module tends to misclassify these low-gradient blurred edges as background and remove them. Consequently, the final predicted mask retains only the brightest core region of the target. This “over-suppression” phenomenon results in a predicted mask area that is significantly smaller than the ground truth, directly contributing to the degradation of the IoU metric in certain scenarios.
Second, the design of CQG may be insufficient for dense or multiple small targets. Based on the principles detailed in Section 2.2.2, CQG extracts the Top-K high-frequency salient points and employs Softmax weights to perform global weighted aggregation of their positional encodings, thereby estimating the high-frequency energy centroid of potential target regions. However, when multiple spatially dispersed, discrete small targets are present in the image, this global weighted aggregation approach may weaken the positional awareness of each individual point within the calculated query features. Consequently, it fails to provide precise instance-level positional guidance for the Transformer decoder. This single-query aggregation mechanism inherently limits the model’s capability for efficient separation and precise localization of dense multiple targets.
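A toy calculation illustrates this limitation. For simplicity the sketch averages raw pixel coordinates rather than positional encodings, which is an assumption made only to keep the effect visible; the function names are illustrative.

```python
import math


def softmax(values):
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_centroid(points, saliencies):
    """Softmax-weighted average of point coordinates (single global query)."""
    alphas = softmax(saliencies)
    y = sum(a * p[0] for a, p in zip(alphas, points))
    x = sum(a * p[1] for a, p in zip(alphas, points))
    return (y, x)


# Two equally salient, spatially dispersed targets.
points = [(10.0, 10.0), (50.0, 50.0)]
centroid = weighted_centroid(points, [3.0, 3.0])
nearest = min(math.dist(centroid, p) for p in points)
```

The single aggregated position lands at (30, 30), about 28 pixels from either target, so it points at background rather than at any individual instance.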
Furthermore, regarding computational efficiency, although our method holds an advantage over other Transformer-based models, the architecture built upon the SAM2 backbone still introduces relatively high computational complexity (as summarized in Table 7, with approximately 25.55 M Params and 30.47 G FLOPs). In airborne or spaceborne infrared monitoring edge devices, such computational overhead may hinder true real-time deployment. To address the aforementioned limitations, future work will focus on the following directions: (1) developing more perception-aware adaptive filtering strategies that effectively suppress low-frequency clutter while better preserving the complete morphological structure of targets with blurred edges; (2) exploring multi-query generation or local-clustering-based contextual aggregation mechanisms for dense multi-target scenarios to replace the current global centroid estimation, thereby enhancing instance-level perception capabilities; (3) further investigating model compression and knowledge distillation techniques to advance the lightweight design of the framework; and (4) conducting systematic cross-dataset evaluation, such as training on one IRSTD benchmark and testing on another, to further assess and improve cross-domain generalization capability, thereby meeting the practical deployment requirements of real-world monitoring systems.

5. Conclusions

In this work, we presented IR-SAM2, a SAM2-based framework for infrared small target segmentation. The proposed method introduces the Radial High-Pass Modulator (RHPM) module to incorporate frequency domain priors for better target–background discrimination, and the Contrast Query Generator (CQG) to convert high-frequency salient responses into location-aware query tokens for precise localization. Furthermore, a target-centric Contrast and Shape-Aware Adaptive (CSA) loss is designed to address the limitations of global objectives under extreme foreground–background imbalance. Extensive ablation studies comprehensively verify the effectiveness and complementary nature of each proposed component, enabling our unified pipeline to achieve competitive performance on multiple public IRSTD benchmarks.

Author Contributions

Conceptualization, Z.H.; methodology, Z.H.; software, Z.H.; validation, Z.H.; formal analysis, Z.H.; investigation, Z.H.; resources, Y.Z. and Z.L.; data curation, Z.H. and J.M.; writing—original draft preparation, Z.H. and J.M.; writing—review and editing, Z.H., X.D. and X.L.; visualization, Z.H.; supervision, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

The work is supported by the Research Project of Quancheng Laboratory, China (Grant No. QCL20250107) and the Open Fund of State Key Laboratory of Infrared Physics (Grant No. SITP-SKLIP-YB-2025-06).

Data Availability Statement

The IRSTD-1k dataset is available at https://github.com/RuiZhang97/ISNet (accessed on 2 November 2025). The NUDT-SIRST dataset is available at https://github.com/YeRen123455/Infrared-Small-Target-Detection (accessed on 2 November 2025). The NUAA-SIRST dataset is available at https://github.com/YimianDai/sirst (accessed on 24 November 2025).

Conflicts of Interest

Author Yanyu Zhang was employed by the company Shandong Sacred Sun Power Sources Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhao, M.; Li, W.; Li, L.; Hu, J.; Ma, P.; Tao, R. Single-frame infrared small-target detection: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 87–119.
  2. Strickland, R.N. Infrared techniques for military applications. In Infrared Methodology and Technology; CRC Press: Boca Raton, FL, USA, 2023; pp. 397–427.
  3. Chen, F.; Gao, C.; Liu, F.; Zhao, Y.; Zhou, Y.; Meng, D.; Zuo, W. Local patch network with global attention for infrared small target detection. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 3979–3991.
  4. Chen, L.; Wu, G.; Wu, T.; Qiu, Z.; Liu, H.; Wang, S.; Huang, F. Towards Robust Infrared Ship Detection via Hierarchical Frequency and Spatial Feature Attention. Remote Sens. 2026, 18, 605.
  5. Sun, Y.; Lian, J. IRSD-Net: An Adaptive Infrared Ship Detection Network for Small Targets in Complex Maritime Environments. Remote Sens. 2025, 17, 2643.
  6. Guo, L.; Wang, Y.; Guo, M.; Zhou, X. YOLO-IRS: Infrared ship detection algorithm based on self-attention mechanism and KAN in complex marine background. Remote Sens. 2024, 17, 20.
  7. Yi, H.; Yang, C.; Qie, R.; Liao, J.; Wu, F.; Pu, T.; Peng, Z. Spatial-temporal tensor ring norm regularization for infrared small target detection. IEEE Geosci. Remote Sens. Lett. 2023, 20, 7000205.
  8. Guo, L.; Chen, X.; Gao, C.; Zhao, Z.; Rao, P. Infrared Temporal Differential Perception for Space-Based Aerial Targets. Remote Sens. 2025, 17, 3487.
  9. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758.
  10. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026.
  11. Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. Sam 2: Segment anything in images and videos. arXiv 2024, arXiv:2408.00714.
  12. Zhang, M.; Wang, Y.; Guo, J.; Li, Y.; Gao, X.; Zhang, J. IRSAM: Advancing segment anything model for infrared small target detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 233–249.
  13. Fu, Y.; Lyu, J.; Ma, P.; Liu, Z.; Ng, M.K. A unified SAM-guided self-prompt learning framework for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5008014.
  14. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 877–886.
  15. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual Conference, 5–9 January 2021; pp. 950–959.
  16. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the Signal and Data Processing of Small Targets, Denver, CO, USA, 20–22 July 1999; Volume 3809, pp. 74–83.
  17. Rivest, J.F.; Fortin, R. Detection of dim targets in digital infrared imagery by morphological image processing. Opt. Eng. 1996, 35, 1886–1893.
  18. Deng, L.; Zhang, J.; Xu, G.; Zhu, H. Infrared small target detection via adaptive M-estimator ring top-hat transformation. Pattern Recognit. 2021, 112, 107729.
  19. Han, J.; Liu, S.; Qin, G.; Zhao, Q.; Zhang, H.; Li, N. A local contrast method combined with adaptive background estimation for infrared small target detection. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1442–1446.
  20. Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Infrared small target detection based on the weighted strengthened local contrast measure. IEEE Geosci. Remote Sens. Lett. 2020, 18, 1670–1674.
  21. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch-image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009.
  22. Dai, Y.; Wu, Y. Reweighted infrared patch-tensor model with both nonlocal and local priors for single-frame small target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3752–3767.
  23. Zhang, L.; Peng, Z. Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sens. 2019, 11, 382.
  24. Sun, Y.; Yang, J.; An, W. Infrared dim and small target detection via multiple subspace learning and spatial-temporal patch-tensor model. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3737–3752.
  25. Zhu, H.; Xiangchu, F. Multiframe Infrared Small Target Detection via Novel Low-Rank Approximation and Robust CUR Decomposition. Remote Sens. 2026, 18, 892.
  26. Liu, F.; Gao, C.; Chen, F.; Meng, D.; Zuo, W.; Gao, X. Infrared small and dim target detection with transformer under complex backgrounds. IEEE Trans. Image Process. 2023, 32, 5921–5932.
  27. Wu, T.; Li, B.; Luo, Y.; Wang, Y.; Xiao, C.; Liu, T.; Yang, J.; An, W.; Guo, Y. MTU-Net: Multilevel TransUNet for space-based infrared tiny ship detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5601015.
  28. Wu, F.; Zhang, T.; Li, L.; Huang, Y.; Peng, Z. RPCANet: Deep unfolding RPCA based infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 4809–4818.
  29. Xiong, Z.; Zhou, F.; Wu, F.; Yuan, S.; Fu, M.; Peng, Z.; Yang, J.; Dai, Y. DRPCA-Net: Make robust PCA great again for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5005516.
  30. Zhu, Y.; Ma, Y.; Fan, F.; Huang, J.; Yao, Y.; Zhou, X.; Huang, R. Toward robust infrared small target detection via frequency and spatial feature fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 2001115.
  31. Han, Y.; Ye, M.; Liu, B.; Li, J.; Jia, C.; Cui, W.; Zhang, T. Frequency–Spatial Domain Jointly Guided Perceptual Network for Infrared Small Target Detection. Remote Sens. 2026, 18, 1000.
  32. Ma, T.; Guo, G.; Li, Z.; Yang, Z. Infrared small target detection method based on high-low frequency semantic reconstruction. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6012505.
  33. Xu, M.; Yu, C.; Li, Z.; Tang, H.; Hu, Y.; Nie, L. Hdnet: A hybrid domain network with multi-scale high-frequency information enhancement for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5004115.
  34. Zhang, P.; Yan, T.; Liu, Y.; Lu, H. Fantastic animals and where to find them: Segment any marine animal with dual sam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 2578–2587.
  35. Yan, T.; Wan, Z.; Deng, X.; Zhang, P.; Liu, Y.; Lu, H. MAS-SAM: Segment any marine animal with aggregated features. arXiv 2024, arXiv:2404.15700.
  36. Zhang, K.; Ni, S.; Yan, D.; Zhang, A. Review of dim small target detection algorithms in single-frame infrared images. In Proceedings of the 2021 IEEE 4th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 18–20 June 2021; Volume 4, pp. 2115–2120.
  37. Huang, Y.; Tang, Z.; Chen, D.; Su, K.; Chen, C. Batching soft IoU for training semantic segmentation networks. IEEE Signal Process. Lett. 2019, 27, 66–70.
  38. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Proceedings of the International Workshop on Deep Learning in Medical Image Analysis, Québec City, QC, Canada, 14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 240–248.
  39. Liu, Q.; Liu, R.; Zheng, B.; Wang, H.; Fu, Y. Infrared small target detection with scale and location sensitivity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 17490–17499.
  40. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped convolution and scale-based dynamic loss for infrared small target detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 9202–9210.
  41. Shoji, Y.; Toizumi, T.; Ito, A. Target Driven Adaptive Loss For Infrared Small Target Detection. arXiv 2025, arXiv:2506.01349.
  42. Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000513.
  43. Xie, X.; Zhou, P.; Li, H.; Lin, Z.; Yan, S. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9508–9520.
  44. Bai, X.; Zhou, F. Analysis of new top-hat transformation and the application for infrared dim small target detection. Pattern Recognit. 2010, 43, 2145–2156.
  45. Zhang, L.; Peng, L.; Zhang, T.; Cao, S.; Peng, Z. Infrared small target detection via non-convex rank approximation minimization joint $l_{2,1}$ norm. Remote Sens. 2018, 10, 1821.
  46. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824.
  47. Wang, H.; Zhou, L.; Wang, L. Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8509–8518.
  48. Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention-guided pyramid context networks for detecting infrared small target under complex background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261.
  49. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376.
  50. Yuan, S.; Qin, H.; Yan, X.; Akhtar, N.; Mian, A. Sctransnet: Spatial-channel cross transformer network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
  51. Li, Q.; Zhang, W.; Lu, W.; Wang, Q. Multi-branch mutual-guiding learning for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5605710.
  52. Wu, F.; Liu, A.; Zhang, T.; Zhang, L.; Luo, J.; Peng, Z. Saliency at the helm: Steering infrared small target detection with learnable kernels. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5000514.
Figure 1. Spatial and frequency domain analysis of infrared small targets from NUDT-SIRST [9]. The top row displays the original grayscale image with zoomed-in target regions and its corresponding 3D intensity surface. The targets are extremely small and easily submerged in heavy background clutter. The bottom row presents the 2D and 3D frequency spectra; the central concentration signifies low-frequency background clutter, whereas the peripheral high-frequency components contain the small targets and noise.
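The low-frequency versus high-frequency split described in the Figure 1 caption can be illustrated numerically. The following is a minimal sketch, not the authors' analysis code: the function name `radial_energy_split`, the cutoff of 0.1, and the synthetic 64 × 64 inputs (a smooth sinusoidal background and a separate 2 × 2 target) are all assumptions chosen for illustration.

```python
import numpy as np

def radial_energy_split(image, cutoff=0.1):
    """Return (low, high) fractions of spectral energy, split at a normalised radius.

    The radius is normalised so that 1.0 is the Nyquist frequency along one axis.
    """
    h, w = image.shape
    # Centre the zero-frequency bin, as in the usual 2D spectrum plots.
    energy = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2) / 0.5
    low = energy[radius <= cutoff].sum() / energy.sum()
    return low, 1.0 - low

# Hypothetical toy inputs: slowly varying background, and a tiny bright target.
_, xx = np.mgrid[0:64, 0:64]
background = 100.0 + 20.0 * np.sin(2 * np.pi * xx / 64.0)
target = np.zeros((64, 64))
target[30:32, 40:42] = 50.0

bg_low, bg_high = radial_energy_split(background)
tg_low, tg_high = radial_energy_split(target)
print(f"background low-frequency share: {bg_low:.3f}")
print(f"target high-frequency share:    {tg_high:.3f}")
```

On this toy input nearly all of the background energy falls inside the central low-frequency disc, while most of the target energy lies outside it, which is the qualitative pattern the caption describes.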
Figure 2. Overview of the proposed IR-SAM2. IR-SAM2 adopts a typical “encoder–decoder” structure and introduces a frequency-driven optimization mechanism specifically for small object detection tasks.
Figure 3. Internal structure and workflow of Residual Cascaded RHPM.
Figure 4. Illustration of the low-frequency background suppression process within the RHPM. The workflow details the processing pipeline for the deepest mask (Mask₂) as a representative example; the analogous operations for Mask₁ and Mask₀ are omitted for brevity.
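The Figure 4 caption refers to suppressing low-frequency background in the frequency domain. As a rough illustration of that general idea only (the internals of the RHPM are not reproduced here), the sketch below applies a hard radial high-pass mask with NumPy. The function name `radial_high_pass`, the binary mask shape, the cutoff value, and the synthetic scene are assumptions.

```python
import numpy as np

def radial_high_pass(image, cutoff):
    """Zero every frequency whose normalised radius is <= cutoff, then invert.

    Radius 1.0 corresponds to the Nyquist frequency along one axis.
    """
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2) / 0.5
    keep = radius > cutoff  # boolean mask, broadcast to (h, w)
    return np.fft.ifft2(np.fft.fft2(image) * keep).real

# Hypothetical scene: smooth clutter plus a 2 x 2 target at rows 30-31, cols 40-41.
_, xx = np.mgrid[0:64, 0:64]
scene = 100.0 + 20.0 * np.sin(2 * np.pi * xx / 64.0)
scene[30:32, 40:42] += 50.0

filtered = radial_high_pass(scene, cutoff=0.1)
print("response at target:      ", round(float(filtered[30, 40]), 2))
print("max response far from it:", round(float(np.abs(filtered[:, :20]).max()), 2))
```

A hard cutoff is the simplest possible choice; a smooth or learned radial profile would reduce ringing but is outside the scope of this sketch.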
Figure 5. ROC curves of IR-SAM2 and other advanced methods on the IRSTD-1k dataset. Curves closer to the top left represent better performance.
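For readers unfamiliar with how ROC curves such as those in Figure 5 are traced, the sketch below sweeps a threshold over predicted scores and records one point per threshold. It is a simplified pixel-level version (true-positive rate against false-positive rate); the exact protocol behind Figure 5 is not stated in this section, and the function name `roc_points` and the toy scores are assumptions.

```python
def roc_points(scores, labels, thresholds):
    """Return (false_positive_rate, true_positive_rate) for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical per-pixel scores and ground-truth labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels, thresholds=[0.85, 0.35, 0.05])
for fpr, tpr in curve:
    print(f"FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Lowering the threshold moves along the curve from the lower left toward the upper right; a curve that rises quickly while the false-positive rate is still small is the "closer to the top left" behaviour mentioned in the caption.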
Figure 6. Qualitative comparisons between our IR-SAM2 and six representative state-of-the-art methods. Correctly detected targets, false alarms, and missed targets are framed by red, yellow, and blue bounding boxes, respectively. For better visualization, a close-up view of the target is shown in the image corners.
Figure 7. Qualitative comparisons between our CSA loss and other commonly used loss functions. The correctly detected targets, false alarms, and missed targets are framed by red, yellow, and blue bounding boxes, respectively. For better visualization, a close-up view of the target is shown in the image corners.
Figure 8. Computational efficiency analysis of different methods on IRSTD-1k.
Figure 9. Visual examples of typical low-SNR scenarios where faint targets are over-suppressed. Red bounding boxes indicate correctly detected targets, while blue bounding boxes indicate missed targets. Close-up views of the target regions are shown in the image corners for better visualization.
Table 1. Performance comparison of different models. Blue, orange, and green indicate the best, second-best, and third-best performance, respectively.
| Method | Publish | IRSTD-1k IoU (%) | IRSTD-1k Pd (%) | IRSTD-1k Fa (10⁻⁶) | NUDT-SIRST IoU (%) | NUDT-SIRST Pd (%) | NUDT-SIRST Fa (10⁻⁶) | NUAA-SIRST IoU (%) | NUAA-SIRST Pd (%) | NUAA-SIRST Fa (10⁻⁶) |
| Traditional Methods |
| Top-hat [44] | OE’96 | 5.59 | 69.36 | 702.36 | 22.26 | 91.32 | 771.5179.7416456 |
| Max–median [16] | SPIE’99 | 7.00 | 65.21 | 59.73 | 4.20 | 58.41 | 36.89 | 6.02 | 84.34 | 774.3 |
| IPI [21] | TIP’13 | 27.92 | 81.37 | 16.18 | 28.63 | 74.49 | 41.23 | 1.09 | 87.05 | 30467 |
| RIPT [22] | JSTARS’17 | 14.11 | 77.55 | 28.31 | 29.17 | 91.85 | 344.3 | 16.79 | 69.76 | 59.33 |
| NRAM [45] | RS’18 | 9.882 | 72.48 | 24.73 | 12.08 | 72.58 | 84.77 | 6.93 | 56.40 | 19.27 |
| PSTNN [23] | RS’19 | 24.57 | 71.99 | 35.26 | 27.72 | 66.13 | 44.17 | 30.30 | 72.80 | 48.99 |
| Deep Learning Methods |
| MDvsFA [47] | ICCV’19 | 49.50 | 82.11 | 80.33 | 75.14 | 90.47 | 25.34 | 60.30 | 89.35 | 56.35 |
| ACMNet [15] | WACV’21 | 60.33 | 93.27 | 68.49 | 68.48 | 96.26 | 10.27 | 69.44 | 92.02 | 22.71 |
| ALCNet [46] | TGRS’21 | 63.52 | 85.86 | 12.96 | 85.03 | 97.89 | 18.20 | 73.74 | 97.25 | 26.79 |
| DNANet [9] | TIP’22 | 65.71 | 91.84 | 17.61 | 79.98 | 96.93 | 12.78 | 74.81 | 93.53 | 38.27 |
| ISNet [14] | CVPR’22 | 64.73 | 87.21 | 15.03 | 87.77 | 95.56 | 9.42 | 70.49 | 95.06 | 67.98 |
| UIUNet [49] | TIP’22 | 65.37 | 89.23 | 21.52 | 90.52 | 98.83 | 8.34 | 77.53 | 92.40 | 9.33 |
| MTUNet [27] | TGRS’23 | 65.24 | 89.56 | 9.51 | 84.02 | 97.99 | 7.49 | 74.85 | 99.08 | 7.09 |
| AGPCNet [48] | TAES’23 | 64.92 | 89.56 | 18.56 | 85.35 | 97.35 | 6.78 | – | – | – |
| RDIAN [42] | TGRS’23 | 64.37 | 92.26 | 18.2 | 81.06 | 98.36 | 14.09 | 70.74 | 95.06 | 48.16 |
| MSHNet [39] | CVPR’24 | 67.16 | 93.88 | 15.03 | 80.55 | 97.99 | 11.77 | 73.5 | 97.25 | 31.05 |
| SCTransNet [50] | TGRS’24 | 68.03 | 93.27 | 10.74 | 94.09 | 98.62 | 4.29 | 77.5 | 96.95 | 13.92 |
| L2SKNet [52] | TGRS’25 | 67.81 | 90.24 | 17.46 | 93.58 | 97.57 | 5.33 | 73.43 | 98.17 | 20.82 |
| PConv [40] | AAAI’25 | 67.45 | 92.20 | 10.70 | – | – | – | – | – | – |
| MMLNet [51] | TGRS’25 | 67.21 | 94.28 | 14.00 | 81.81 | 98.43 | 11.77 | 78.71 | 98.88 | 25.71 |
| SAM-SPL [13] | TGRS’25 | 64.57 | 89.80 | 8.35 | 94.63 | 97.24 | 2.55 | 73.8 | 97.25 | 1.95 |
| HDNet [33] | TGRS’25 | 66.66 | 93.88 | 17.46 | 85.17 | 98.52 | 2.78 | 77.05 | 100 | 12.95 |
| Deep Unfolding-Based Methods |
| RPCANet [28] | CVPR’24 | 63.21 | 88.31 | 43.9 | 89.31 | 97.14 | 28.7 | 65.08 | 93.58 | 10.85 |
| DRPCA-Net [29] | TGRS’25 | 64.14 | 92.09 | 17.92 | 94.16 | 98.41 | 2.55 | – | – | – |
| Our Proposed Method |
| Ours | – | 69.75 | 96.60 | 9.03 | 94.18 | 98.94 | 2.97 | 75.16 | 99.08 | 4.79 |
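The three metrics reported in Table 1 can be computed along the following lines. This is a minimal sketch assuming definitions that are common in the IRSTD literature: IoU at the pixel level, Pd as the fraction of ground-truth targets that are detected, and Fa as the fraction of image pixels that are falsely predicted. The overlap-based matching rule used for Pd here is a simplification (a centroid-distance criterion is often used instead), and the function names `connected_components` and `irstd_metrics` are hypothetical.

```python
def connected_components(mask):
    """Group 8-connected foreground pixels; returns a list of coordinate sets."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            stack, region = [(r, c)], set()
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                region.add((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            components.append(region)
    return components

def irstd_metrics(pred, gt):
    """Return (IoU, Pd, Fa) for two binary masks given as nested lists."""
    h, w = len(gt), len(gt[0])
    pred_px = {(r, c) for r in range(h) for c in range(w) if pred[r][c]}
    gt_px = {(r, c) for r in range(h) for c in range(w) if gt[r][c]}
    union = pred_px | gt_px
    iou = len(pred_px & gt_px) / len(union) if union else 1.0
    targets = connected_components(gt)
    # Assumption: a target counts as detected if any predicted pixel overlaps it.
    detected = sum(1 for t in targets if t & pred_px)
    pd = detected / len(targets) if targets else 1.0
    fa = len(pred_px - gt_px) / (h * w)
    return iou, pd, fa

# Toy 8 x 8 example: two ground-truth targets, one found, one false alarm pixel.
gt = [[0] * 8 for _ in range(8)]
pred = [[0] * 8 for _ in range(8)]
gt[1][1] = gt[1][2] = gt[5][5] = 1
pred[1][1] = pred[1][2] = pred[6][0] = 1
iou, pd, fa = irstd_metrics(pred, gt)
print(iou, pd, fa)
```

To compare with the table, IoU and Pd would be multiplied by 100 and Fa by 10⁶, and the counts would be accumulated over the whole test set rather than a single image.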
Table 2. Ablation study of different components on three datasets. The symbol “–” indicates that the corresponding component is not used, while “✓” indicates that the component is used. The best results are highlighted in bold.
| RHPM | CQG | IRSTD-1k IoU (%) | IRSTD-1k Pd (%) | IRSTD-1k Fa (10⁻⁶) | NUDT-SIRST IoU (%) | NUDT-SIRST Pd (%) | NUDT-SIRST Fa (10⁻⁶) | NUAA-SIRST IoU (%) | NUAA-SIRST Pd (%) | NUAA-SIRST Fa (10⁻⁶) |
| | | 65.43 | 93.53 | 17.15 | 91.47 | 98.41 | 3.84 | 73.68 | 99.08 | 6.56 |
| | | 67.20 | 88.77 | 8.50 | 87.36 | 98.94 | 3.24 | 72.03 | 100 | 32.65 |
| | | 66.84 | 92.51 | 7.51 | 86.22 | 98.41 | 4.59 | 73.65 | 97.24 | 21.29 |
| | | 69.75 | 96.60 | 9.03 | 94.18 | 98.94 | 2.97 | 75.16 | 99.08 | 4.79 |
Table 3. Performance comparison of different loss functions. The best results are highlighted in bold.
| Method | IRSTD-1k IoU (%) | IRSTD-1k Pd (%) | IRSTD-1k Fa (10⁻⁶) | NUDT-SIRST IoU (%) | NUDT-SIRST Pd (%) | NUDT-SIRST Fa (10⁻⁶) | NUAA-SIRST IoU (%) | NUAA-SIRST Pd (%) | NUAA-SIRST Fa (10⁻⁶) |
| BCE loss [36] | 63.44 | 84.69 | 4.55 | 94.52 | 98.94 | 1.84 | 72.77 | 98.16 | 17.92 |
| IoU loss [37] | 63.15 | 93.54 | 36.97 | 90.65 | 98.20 | 3.77 | 74.12 | 97.25 | 9.94 |
| TDA loss [41] | 65.28 | 95.93 | 25.81 | 88.98 | 98.73 | 7.63 | 71.40 | 98.16 | 21.11 |
| SLS loss [39] | 67.64 | 90.14 | 7.52 | 85.68 | 98.73 | 8.16 | 73.62 | 98.24 | 14.58 |
| CSA loss (Ours) | 69.75 | 96.60 | 9.03 | 94.18 | 98.94 | 2.97 | 75.16 | 99.08 | 4.79 |
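To make the two simplest baselines in Table 3 concrete, the sketch below implements the standard binary cross-entropy and a generic soft-IoU loss on flat lists of probabilities. It is not the CSA, TDA, or SLS loss, whose definitions are not given in this section, and the soft-IoU form shown is one common formulation rather than necessarily the exact variant of [37]. The function names and the `eps` smoothing constants are assumptions.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped away from 0 and 1."""
    total = 0.0
    for p, g in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
    return total / len(probs)

def soft_iou_loss(probs, targets, eps=1e-6):
    """1 - soft IoU, using products and sums in place of set operations."""
    inter = sum(p * g for p, g in zip(probs, targets))
    union = sum(p + g - p * g for p, g in zip(probs, targets))
    return 1.0 - (inter + eps) / (union + eps)

# Hypothetical predictions for four pixels, two of which belong to a target.
probs = [0.9, 0.1, 0.8, 0.2]
targets = [1, 0, 1, 0]
bce = bce_loss(probs, targets)
siou = soft_iou_loss(probs, targets)
print(f"BCE = {bce:.4f}, soft-IoU loss = {siou:.4f}")
```

BCE averages over every pixel, so the many background pixels dominate it, whereas the soft-IoU loss is normalised by the predicted and true target area. That difference is one reason the two are usually compared separately for very small targets.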
Table 4. Parameter sensitivity analysis of w_aux on IRSTD-1k and NUDT-SIRST datasets. The best results are highlighted in bold.
| w_aux | IRSTD-1k IoU (%) | IRSTD-1k Pd (%) | IRSTD-1k Fa (10⁻⁶) | NUDT-SIRST IoU (%) | NUDT-SIRST Pd (%) | NUDT-SIRST Fa (10⁻⁶) |
| 0.5 | 65.95 | 90.81 | 10.24 | 90.33 | 97.24 | 2.18 |
| 0.3 | 67.46 | 93.93 | 7.06 | 90.84 | 98.51 | 11.23 |
| 0.2 | 69.75 | 96.60 | 9.03 | 94.18 | 98.94 | 2.97 |
| 0.1 | 68.91 | 95.59 | 8.35 | 92.77 | 98.94 | 4.25 |
Table 5. Parameter sensitivity analysis of K in the CQG on IRSTD-1k and NUDT-SIRST datasets. The best results are highlighted in bold.
| K | IRSTD-1k IoU (%) | IRSTD-1k Pd (%) | IRSTD-1k Fa (10⁻⁶) | NUDT-SIRST IoU (%) | NUDT-SIRST Pd (%) | NUDT-SIRST Fa (10⁻⁶) |
| 1 | 67.38 | 95.91 | 14.12 | 91.95 | 98.73 | 8.09 |
| 5 | 69.30 | 96.41 | 5.62 | 93.98 | 98.62 | 2.29 |
| 10 | 69.75 | 96.60 | 9.03 | 94.18 | 98.94 | 2.97 |
| 20 | 68.19 | 95.79 | 10.93 | 93.29 | 98.94 | 4.61 |
Table 6. Sensitivity analysis of energy filtering rate Λ on IRSTD-1k and NUDT-SIRST datasets. The best results are highlighted in bold.
| Λ = [λ0, λ1, λ2] | IRSTD-1k IoU (%) | IRSTD-1k Pd (%) | IRSTD-1k Fa (10⁻⁶) | NUDT-SIRST IoU (%) | NUDT-SIRST Pd (%) | NUDT-SIRST Fa (10⁻⁶) |
| [0.1, 0.1, 0.1] | 67.17 | 95.91 | 14.12 | 92.06 | 98.73 | 7.14 |
| [0.2, 0.2, 0.2] | 68.50 | 96.26 | 9.50 | 93.47 | 98.51 | 1.37 |
| [0.4, 0.4, 0.4] | 65.81 | 97.27 | 11.69 | 89.04 | 98.09 | 8.06 |
| [0.1, 0.2, 0.4] | 69.75 | 96.60 | 9.03 | 94.18 | 98.94 | 2.97 |
Table 7. Comparison of model complexity and efficiency in terms of Params, FLOPs, and FPS between our IR-SAM2 and recent advanced models. ↑ indicates that higher values are better, while ↓ indicates that lower values are better. “–” indicates that the corresponding data are not available.
| Method | Type | Year | IoU (%) ↑ | Params (M) ↓ | FLOPs (G) ↓ | FPS (f/s) ↑ |
| ALCNet [46] | CNN | 2021 | 63.52 | 0.51 | 6.09 | – |
| DNANet [9] | CNN | 2022 | 65.71 | 4.69 | 14.26 | 68.96 |
| UIUNet [49] | CNN | 2023 | 65.37 | 50.54 | 54.42 | 9.87 |
| MTUNet [27] | Transformer | 2023 | 65.24 | 12.75 | 34.43 | 10.87 |
| AGPCNet [48] | CNN | 2023 | 64.92 | 12.36 | 43.18 | – |
| RPCANet [28] | CNN | 2024 | 63.21 | 0.68 | 44.572.80 |
| SCTransNet [50] | Transformer | 2024 | 68.03 | 11.19 | 67.40 | 34.36 |
| MMLNet [51] | CNN | 2025 | 67.8 | 3.58 | 20.41 | 38.17 |
| SAM-SPL [13] | SAM | 2025 | 64.57 | 25.48 | 30.46 | 56.18 |
| IR-SAM2 (Ours) | SAM | – | 69.75 | 25.55 | 30.47 | 54.35 |
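The quantities in Table 7 are typically obtained by counting learnable weights, counting arithmetic operations per forward pass, and timing repeated inference. The sketch below shows each idea for a single convolution layer and a stand-in workload; it is illustrative only and does not reproduce the measurement setup behind the table. The helper names, the use of multiply-accumulate operations (MACs) as the operation count, and the dummy `infer` callable are assumptions. Conventions differ on whether one MAC counts as one or two FLOPs.

```python
import time

def conv2d_params(c_in, c_out, k, bias=True):
    """Learnable parameters of a k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def conv2d_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate operations for one forward pass of that layer."""
    return k * k * c_in * c_out * h_out * w_out

def measure_fps(infer, n_runs=50, warmup=5):
    """Frames per second of a zero-argument callable, after a short warm-up."""
    for _ in range(warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed

params = conv2d_params(3, 64, 3)        # 3x3 conv, 3 -> 64 channels, with bias
macs = conv2d_macs(3, 64, 3, 224, 224)  # same layer on a 224 x 224 output map
fps = measure_fps(lambda: sum(range(1000)))  # stand-in for a model forward pass
print(params, macs, round(fps, 1))
```

Summing these counts over all layers gives model-level Params and FLOPs. FPS depends on hardware, input size, and batch size, so values are only comparable when measured under the same conditions.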
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

MDPI and ACS Style

Hao, Z.; Dang, X.; Zhang, Y.; Miao, J.; Li, Z.; Lu, X. IR-SAM2: Target Enhancement with SAM2 for Infrared Small Target Detection. Remote Sens. 2026, 18, 1891. https://doi.org/10.3390/rs18121891
