Article

Semantic Density-Guided ResNet for Dense Infrared Small Target Detection

College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(9), 1397; https://doi.org/10.3390/rs18091397
Submission received: 2 February 2026 / Revised: 12 March 2026 / Accepted: 24 March 2026 / Published: 1 May 2026
(This article belongs to the Section Remote Sensing Image Processing)

Highlights

What are the main findings?
  • A Semantic Density-Guided ResNet (SDG-ResNet) is proposed to explicitly exploit high-level semantic density for improving infrared small target detection.
  • The proposed method consistently improves detection performance, especially in dense-target scenarios, while maintaining competitive results in sparse scenes.
What are the implications of the main findings?
  • High-level semantic density information can serve as an effective global prior to guide low-level feature refinement for dense infrared target detection.
  • The proposed SDG-ResNet can be seamlessly integrated into existing transformer-based detectors, offering a practical and lightweight solution for space-based remote sensing applications.

Abstract

Dense infrared small target detection (ISTD) in long-range remote sensing is critical for multi-target surveillance, yet existing benchmarks mostly contain only sparsely distributed targets and rarely reflect dense scenes. To address this limitation, we construct a new dense satellite ISTD dataset, IR-SatDense, by compositing small targets onto real satellite infrared backgrounds and partitioning it into subsets using the Average Minimum Inter-Target Distance (AMID) to explicitly control target density. By visualizing multi-stage backbone features, we observe that in dense scenes the deepest stage naturally forms compact, high-response target clusters in the semantic feature maps, while low- and middle-level features remain heavily cluttered. This motivates us to treat high-level semantic density as a global prior to guide low-level feature enhancement. Therefore, we propose Semantic Density-Guided ResNet (SDG-ResNet), a plug-in backbone that attaches a lightweight semantic density head to the deepest stage and injects the predicted density map into intermediate layers through Semantic Density-Guided Refine (SDGR) blocks with residual spatial gating. Integrated into representative transformer-based detectors, including Deformable DETR, DETA, and DINO, SDG-ResNet consistently improves the probability of detection (PD) at comparable false alarm (FA) levels on IR-SatDense while maintaining competitive performance on the sparse dataset IRSTD-1K.

1. Introduction

Infrared imaging does not rely on external illumination, enabling all-day and all-weather operation, strong penetration through smoke and haze, and high robustness under complex lighting conditions. Owing to these advantages, it has been widely adopted in long-distance target detection, environmental perception, and animal protection [1,2,3,4]. Among these applications, small infrared target detection plays a critical role in remote sensing situational awareness, long-range object monitoring, and maritime search and rescue and has emerged as a core research topic in intelligent infrared image analysis [5,6].
In recent remote sensing missions, a single infrared frame often contains numerous small targets that are extremely tiny, exhibit very low signal-to-noise ratios (SNRs), and are densely distributed amid complex background clutter. Under such dense conditions, conventional infrared small target detection methods typically suffer from degraded detection accuracy, increased miss rates, and elevated false alarm levels [7,8,9]. In particular, when the minimum inter-target distance becomes comparable to or smaller than the target size, responses from adjacent targets tend to overlap, and fine-grained target details are progressively lost during backbone downsampling, making accurate detection and reliable separation of neighboring targets especially challenging [10].
To better understand how deep backbones behave in dense and sparse scenes, Figure 1 visualizes feature responses at low, middle, and high stages for a dense target cluster and for a single-target scene. In the dense case, the low- and middle-level feature maps contain both clutter and target responses, whereas the high-level feature map shows a clear, compact semantic focus around the dense target cluster. In contrast, for the single-target case, the high-level semantic responses are much weaker and more diffuse. This phenomenon suggests that, in dense infrared scenes, dense target clusters naturally generate strong, well-aggregated semantic information at high levels, which can serve as a reliable prior to enhance low-level target details while suppressing background noise. However, most existing infrared small target detection networks still rely on generic multi-scale fusion or attention mechanisms built on ResNet features [1,2,11,12], where all spatial locations are processed in a largely density-agnostic manner. Although these methods aggregate multi-level semantics, they do not explicitly construct or reuse a high-level semantic density map of dense target clusters as a global prior to guide the refinement of low-level features in dense regions.
Existing infrared small target detection models, therefore, exhibit several limitations in dense scenes:
(1) Local spatial saliency limitation: Most current approaches are built upon local spatial saliency, enhancing targets by exploiting intensity contrast between a target and its surrounding background. When multiple adjacent targets are densely distributed in the same region, these methods struggle to distinguish subtle energy differences between targets and background, resulting in target adhesion, blurred responses, and incomplete separation in the detection maps.
(2) Density-aware semantic guidance deficiency: Although recent deep models employ powerful backbones and transformer-based heads, the backbone feature extraction process is typically bottom-up and does not explicitly encode target density. Features at different spatial locations are processed in a homogeneous manner, regardless of whether they belong to dense target clusters or mostly background. As illustrated in Figure 1, dense target clusters already induce strong, aggregated responses in high-level semantic features, but these cues are not reused to guide lower layers. Consequently, mid- and low-level features in dense regions are easily dominated by clutter-like responses, causing a sharp drop in probability of detection (PD) and a rise in false alarms (FAs) when the Average Minimum Inter-Target Distance (AMID) becomes small.
(3) Dense-target structural limitation: Traditional sparse-target models do not explicitly model the structural hierarchy of densely distributed targets in the spatial domain. In multi-target scenes with high spatial density, they often produce ambiguous spatial structures and mutual feature interference, which leads to degraded localization accuracy and unstable detection performance.
To overcome the above limitations in dense-target detection, this paper proposes a Semantic Density-Guided (SDG) backbone that explicitly leverages high-level semantic density to guide low-level feature enhancement. Instead of introducing complex attention blocks or modifying the detection head, SDG estimates a semantic density map from the deepest backbone stage and reuses it as a global prior to refine intermediate features. Concretely, a lightweight semantic density head predicts a one-channel semantic density map from the highest-level feature, and a set of Semantic Density-Guided Refine (SDGR) blocks injects this prior into mid- and low-level feature maps via residual spatial gating. In this way, dense target clusters that are already prominent in high-level semantics are used to enhance fine-grained target details in dense regions while suppressing noise responses in complex background areas. As a result, the backbone can respond differently in dense and sparse regions while preserving the benefits of existing transformer-based detectors.
The main contributions of this paper can be summarized as follows:
(1) Algorithmic contribution: We propose a Semantic Density-Guided backbone (SDG-ResNet) that augments a standard ResNet with a semantic density head and lightweight SDGR blocks. The deepest-stage feature is compressed into a semantic density map, which is then reused to modulate intermediate features through residual spatial gating. This design explicitly exploits high-level semantic density from dense target clusters to enhance low-level target details, providing density-aware semantic guidance for dense infrared small target detection while introducing only negligible additional parameters and FLOPs.
(2) Dataset contribution: We construct a novel simulated dataset named IR-SatDense (Infrared Satellite-Based Dense Small Target Dataset). Built on real satellite-based infrared backgrounds, IR-SatDense comprehensively simulates diverse target densities, signal-to-noise ratios (SNRs), and morphologies and is organized into subsets according to the Average Minimum Inter-Target Distance (AMID). This dataset provides controllable and realistic experimental scenarios for systematically evaluating detection algorithms under complex dense infrared conditions.
(3) Experimental contribution: We integrate the proposed SDG-ResNet into several transformer-based detectors, including DINO, Deformable DETR, and DETA, and conduct extensive experiments on the IR-SatDense dataset as well as on the sparse-target benchmark IRSTD-1K. Experimental results demonstrate that SDG-based detectors consistently improve PD at comparable FA levels, with particularly large gains in small-AMID (high-density) regimes, while maintaining strong performance on sparse infrared small target datasets.
The structure of this paper is as follows. Section 2 reviews existing infrared small target detection datasets and methods. Section 3 describes the construction and statistical analysis of the proposed IR-SatDense dataset. Section 4 presents the Semantic Density-Guided ResNet (SDG-ResNet) backbone and its integration into DETR-style detectors. Section 5 reports implementation details, benchmark comparisons, ablation studies, and visual analyses on IR-SatDense and IRSTD-1K. Finally, Section 6 concludes the paper and discusses future research directions.

2. Related Works

2.1. Existing Infrared Small Target Detection Datasets

Most publicly available datasets for infrared small target detection (ISTD) have been primarily designed for sparse-target scenarios, where each frame contains only a few isolated targets and the minimum inter-target distance is relatively large. Although these datasets have significantly advanced early ISTD research, they still fail to represent the spatial distribution characteristics of dense targets under complex backgrounds. In recent years, researchers have increasingly shifted their attention toward dense infrared small target detection (dense ISTD), leading to the creation of several new datasets tailored for this task. Overall, existing datasets can be broadly divided into two categories: sparse-target datasets and dense-target datasets.
(1) Sparse infrared small target datasets: Representative sparse datasets include NUDT-SIRST [1], SIRST and its extensions (SIRSTv2, SIRST-AUG) [11,13,14], and IRSTD-1K [2]. These datasets mainly consist of single-frame images containing a few, independently distributed targets, with most images including only one or two targets. For instance, NUDT-SIRST contains approximately 1000 images with an average of 1.2 targets per frame, SIRST provides 427 images (about 90% single-target), and IRSTD-1K offers 1000 pixel-level annotated images with multiple target categories and higher annotation precision. However, these datasets remain focused on sparse-target conditions and do not reflect scenarios where multiple small targets appear densely within a single frame.
(2) Dense infrared small target datasets: To overcome the above limitations, several dense or multi-target datasets have been developed, including DMIST-60/100 [15] and DenseSIRST [16]. The DMIST series provides multi-frame sequences containing varying numbers of targets to simulate realistic detection and tracking scenarios, while DenseSIRST is a single-frame dataset featuring densely distributed targets and pixel-level semantic background annotations, enabling studies on how semantic priors contribute to dense ISTD. Nevertheless, even in these datasets, the Average Minimum Inter-Target Distance is still several times larger than the target size, indicating that their density is insufficient to represent highly crowded distributions where targets are almost contiguous.
Compared with the aforementioned datasets, our IR-SatDense dataset exhibits significantly higher target density and stronger realism. It is synthesized using the proposed Dense Single-Frame Target Dataset Generator (DSTDGen) based on real satellite infrared backgrounds and is organized into multiple subsets corresponding to different density levels measured by the Average Minimum Inter-Target Distance (AMID). This dataset provides a comprehensive and realistic benchmark for evaluating ISTD algorithms in complex and densely populated scenarios and is particularly suitable for analyzing how detection performance changes as AMID decreases from sparse to extremely dense regimes.

2.2. Infrared Small Target Detection Methods

Infrared small target detection (ISTD), as a core component of infrared sensing systems, remains a fundamental yet challenging research topic. According to their underlying mechanisms, existing ISTD approaches can be broadly classified into two categories: model-driven methods and data-driven methods.
(1) Model-driven methods: These methods rely on handcrafted features and prior assumptions to distinguish targets from background clutter. Feature-based approaches, including LCM, ILCM, NLCM, and RLCM [17,18,19,20], enhance local contrast between targets and background by exploiting intensity differences. Background modeling methods, such as top-hat filtering and max–median filtering [21,22], leverage local background consistency to suppress low-frequency components and highlight potential targets. Low-rank sparse decomposition (LRSD)-based models, including IPI and PSTNN [23,24], assume a low-rank background and sparse targets to achieve separation. Although these model-driven algorithms have achieved considerable progress, they often struggle in complex or dense-target scenes where the structural hierarchy of multiple targets and background interactions is not explicitly modeled.
(2) Data-driven methods: With the rapid development of deep learning, data-driven approaches have become the dominant paradigm in ISTD. These methods use neural networks to automatically learn multi-level features and complex mappings between targets and backgrounds, offering stronger robustness and adaptability than traditional handcrafted approaches. Recent works are mostly segmentation-based, where small targets are localized via pixel-wise prediction. Representative methods include SANet [25], ACM [13], AGPCNet [11], DNANet [1], IAANet [26], ISNet [2], and UIUNet [12]. Although these models achieve impressive performance in sparse-target scenes, they often fail in dense scenes where closely spaced targets tend to merge in segmentation masks, leading to degraded localization accuracy.
To address these challenges, anchor-based detection frameworks have been introduced as more explicit and discriminative alternatives. By directly regressing bounding-box coordinates and class labels, anchor-based detectors effectively separate adjacent targets and support both single-frame and multi-frame detection. Recently, transformer-based end-to-end detectors (e.g., DETR, Deformable DETR, DAB-DETR, DN-DETR, and DINO) [27,28,29,30,31] have further advanced this field by incorporating deformable attention, dynamic anchors, and denoising strategies to jointly improve accuracy and convergence speed. Overall, anchor-based and transformer-based detectors demonstrate superior discriminative capability and spatial precision in dense infrared small target detection. However, their backbones typically extract features in a purely bottom-up and density-agnostic manner, treating all spatial locations uniformly without explicit semantic modeling of target density. This limits their robustness when targets become extremely dense (small AMID) and motivates the semantic density-guided backbone proposed in this work.

3. IR-SatDense Dataset Synthesis

In this section, we describe the proposed IR-SatDense dataset in detail.

3.1. Dataset Construction

(1) Data Collection: To construct a dense infrared small target dataset that reflects realistic on-orbit imaging conditions with complex backgrounds, a total of 2154 background images were collected from public satellite infrared imagery sources. These images cover diverse Earth observation scenarios and represent various background characteristics commonly encountered in infrared imaging tasks. All background images were cropped and preprocessed to a uniform resolution of 512 × 512 pixels. According to background complexity, the dataset is categorized into four levels: (i) easy, (ii) medium, (iii) complex, and (iv) extremely complex scenes. Representative examples and the statistical distribution of each background level are shown in Figure 2.
(2) Targets and Annotations: To construct the proposed IR-SatDense dataset, we develop a Dense Single-Frame Target Dataset Generator (DSTDGen) algorithm. Its pseudocode is presented in Algorithm 1; the procedure consists of four main stages designed to automatically generate infrared small target samples with controllable density distributions and precise annotations on diverse backgrounds.
Step 1: Parameter and Template Initialization. The input includes a background infrared image $I \in \mathbb{R}^{H \times W}$ and a set of preselected small-target templates $\{T_1, T_2, \dots, T_n\}$. The mean target number $\mu_a$ and its variance $\sigma_a$ determine the number of targets $N$ generated on each image. The mean nearest inter-target distance $d$ controls spatial density, while the SNR range $[SNR_{\min}, SNR_{\max}]$ defines target intensity. Each image initializes a random start position $p_1 = (y_1, x_1)$ and selects $N$ random templates for subsequent placement (corresponding to lines 1–5 in Algorithm 1).
Step 2: Target Placement under Spatial Constraints. Targets are iteratively placed on the background while satisfying the mean nearest distance $d$. The first target is randomly rotated and placed at $p_1$. For each subsequent target $j$, its position $p_j$ is sampled within the convex envelope of previously placed targets $\{P_1, \dots, P_{j-1}\}$. If the mean inter-target distance deviates from $d$ by more than 0.5 pixels, a local coordinate adjustment $(\pm\Delta x, \pm\Delta y)$ is applied to fine-tune $p_j$. If a valid position is found, the image and statistics are updated; otherwise, a fail counter is increased to maintain stability. This process ensures controllable inter-target spacing and stable dense distribution (corresponding to lines 6–22 in Algorithm 1).
Step 3: Target Composition with Controllable Signal-to-Noise Ratio (SNR). To simulate the radiometric properties of real infrared point targets, each template $T_i$ is normalized and enhanced according to a sampled SNR within $[SNR_{\min}, SNR_{\max}]$. For each target position, the local background mean $\mu_{bg}$ and standard deviation $\sigma_{bg}$ are calculated, and target brightness is adjusted as
$$T_i = SNR_i \cdot \sigma_{bg} \cdot \frac{T_i}{\max(T_i)}.$$
The adjusted target is then superimposed on the background image to form a composite result, while generating a binary mask and coordinate file to record precise pixel positions. This ensures that synthetic targets exhibit realistic contrast and radiometric characteristics consistent with real infrared imagery (corresponding to lines 9–15 in Algorithm 1).
Step 4: Output Generation and Statistical Annotation. After all $N$ targets are added, the algorithm outputs four components: (1) the synthesized infrared image $I$, (2) its corresponding binary mask $M$, (3) the coordinate file $C$ containing pixel positions, and (4) the statistical table $S$ with $\{SNR, \mu_{bg}, \sigma_{bg}, \text{peak}, \text{size}, d\}$. Together, these outputs constitute one complete IR-SatDense sample with controllable density, brightness, and SNR (corresponding to line 23 in Algorithm 1).
Algorithm 1 Pseudocode of DSTDGen Algorithm
  • Input: Background image $I \in \mathbb{R}^{H \times W}$, candidate target set $\{T_1, \dots, T_n\}$, mean target number $\mu_a$, variance $\sigma_a$, mean nearest distance $d$, SNR range $[SNR_{\min}, SNR_{\max}]$.
 1: Initialize output folders and parameters.
 2: for each background image $I$ in IR-SatDense do
 3:    $N \leftarrow \mathrm{random\_normal}(\mu_a, \sigma_a)$
 4:    Choose start position $p_1$, select $N$ templates $\{T_i\}$
 5:    Initialize $STATS = \emptyset$, $output = I$
 6:    for $j = 1$ to $N$ do
 7:       $T_j \leftarrow \mathrm{random\_rotate}(T_j)$
 8:       if $j = 1$ then
 9:          $(I_j, P_j, info) \leftarrow \mathrm{add\_target\_SNR}(output, T_j, p_j)$
10:          $V_j \leftarrow \mathrm{expand\_envelope}(P_j, d)$
11:       else
12:          $p_j \leftarrow \mathrm{random\_point}(V_{j-1})$
13:          $(I_j, P_j, info) \leftarrow \mathrm{add\_target\_SNR}(output, T_j, p_j)$
14:          if $|\mathrm{mean\_distance}(\{P_1, \dots, P_j\}) - d| > 0.5$ then
15:             Adjust $p_j$ within $(\pm\Delta x, \pm\Delta y)$; update $I_j$ if valid, else increase fail counter.
16:          end if
17:          $V_j \leftarrow \mathrm{expand\_envelope}(P_j, d)$
18:       end if
19:       Append $info$ to $STATS$
20:       $output \leftarrow I_j$
21:    end for
22:    Save $I$, $M$, $C$, and $S$
23: end for
  • Output: Synthetic image $I$, mask $M$, coordinate file $C$, and statistics $S$.
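The brightness adjustment of Step 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the DSTDGen implementation: the function names `local_background_stats` and `scale_template_to_snr`, the square window around the target position, and the use of the population standard deviation are all hypothetical choices made for the example.

```python
import math

def local_background_stats(image, cy, cx, half=2):
    """Mean and population std of a square window centred at (cy, cx).

    The window shape and size are assumptions; the text only states that a
    local background mean and standard deviation are computed.
    """
    h, w = len(image), len(image[0])
    vals = [image[y][x]
            for y in range(max(0, cy - half), min(h, cy + half + 1))
            for x in range(max(0, cx - half), min(w, cx + half + 1))]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, math.sqrt(var)

def scale_template_to_snr(template, snr, sigma_bg):
    """Apply T_i <- SNR_i * sigma_bg * T_i / max(T_i) element-wise."""
    peak = max(max(row) for row in template)
    if peak <= 0:
        raise ValueError("template must have a positive peak")
    return [[snr * sigma_bg * v / peak for v in row] for row in template]

# Toy data: a 3x3 point-like template and a small textured background.
template = [[0.0, 1.0, 0.0],
            [1.0, 4.0, 1.0],
            [0.0, 1.0, 0.0]]
background = [[10.0, 12.0, 10.0, 12.0],
              [12.0, 10.0, 12.0, 10.0],
              [10.0, 12.0, 10.0, 12.0],
              [12.0, 10.0, 12.0, 10.0]]

mu_bg, sigma_bg = local_background_stats(background, 1, 1, half=1)
scaled = scale_template_to_snr(template, snr=5.0, sigma_bg=sigma_bg)
```

After scaling, the template peak equals $SNR_i \cdot \sigma_{bg}$, so if the target is added to the background, its peak exceeds the local background level by the sampled number of standard deviations.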

3.2. Statistical Analysis

To comprehensively evaluate the representativeness and effectiveness of the proposed IR-SatDense dataset, a statistical comparison is conducted against several widely used benchmark datasets for infrared small target detection (ISTD). In dense-target scenarios, conventional metrics such as target count or area are insufficient to describe the degree of spatial compactness. Therefore, we introduce the Average Minimum Inter-Target Distance (AMID) metric to quantitatively characterize the density of target distributions within each image.
Definition 1.
Average Minimum Inter-Target Distance (AMID). For the $i$-th image $I_i$ in the dataset containing $M_i$ targets $\{T_j\}_{j=1}^{M_i}$, the minimum Euclidean distance $d_j$ from the $j$-th target to all other targets is defined as
$$d_j = \min_{k \neq j} \; \min_{y_k \in T_k,\; y_j \in T_j} \lVert y_k - y_j \rVert_2 .$$
The image-level average minimum distance is then given by
$$AMID_i = \frac{1}{M_i} \sum_{j=1}^{M_i} d_j .$$
Finally, the overall dataset-level AMID is computed as
$$AMID_{all} = \frac{1}{N} \sum_{i=1}^{N} AMID_i ,$$
where $N$ denotes the total number of images in the dataset. A smaller AMID value indicates stronger spatial compactness among targets, corresponding to a higher-density and more challenging detection scenario. Representative visual examples for different AMID ranges are shown in Figure 3.
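Definition 1 translates directly into code. The sketch below assumes each target is given as a list of `(y, x)` pixel coordinates; the function names are hypothetical, and raising an error for images with fewer than two targets is a choice made here, since the definition does not say how such images are handled.

```python
import math

def min_set_distance(target_a, target_b):
    """Smallest Euclidean distance between any pixel of A and any pixel of B."""
    return min(math.dist(p, q) for p in target_a for q in target_b)

def amid_image(targets):
    """Image-level AMID: mean over targets of the distance to the nearest other target."""
    if len(targets) < 2:
        # Assumption: d_j is undefined without a second target.
        raise ValueError("AMID needs at least two targets in the image")
    d = [min(min_set_distance(tj, tk)
             for k, tk in enumerate(targets) if k != j)
         for j, tj in enumerate(targets)]
    return sum(d) / len(d)

def amid_dataset(images):
    """Dataset-level AMID: mean of the image-level values."""
    return sum(amid_image(targets) for targets in images) / len(images)

# Three targets: a two-pixel target and two single-pixel targets.
example = [[(0, 0), (0, 1)], [(0, 3)], [(4, 3)]]
value = amid_image(example)  # d = [2, 2, 4], so AMID_i = 8/3
```

Because the distance is taken between the closest pixels of two targets rather than between their centres, touching targets give $d_j$ close to 1 pixel, which is consistent with very small AMID values in the densest subsets.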
As summarized in Table 1, the proposed IR-SatDense dataset contains 2154 images, divided into 50% for training, 25% for validation, and 25% for testing. To facilitate more detailed performance evaluation across different spatial density levels, the test subset is further partitioned based on the AMID metric, allowing a systematic analysis of model robustness under varying target compactness.
Compared with existing datasets, IR-SatDense exhibits substantially higher target density and smaller average target size, while simultaneously covering multiple background complexity levels. Specifically, its average target area is approximately 10.42 pixels (corresponding to an average target width of about 3 pixels), which accurately reflects the small-scale and low-intensity nature of real infrared small targets. Furthermore, the dataset’s AMID value of 1.51 is significantly smaller than that of previous dense-target datasets, indicating much closer inter-target spacing and a higher degree of detection difficulty. Figure 2b illustrates the proportional distribution of target density levels across images with different background complexities, demonstrating that IR-SatDense provides a comprehensive benchmark for dense-target infrared detection research and for studying how detection performance degrades as AMID decreases.

4. Proposed Baseline

In this section, we present the proposed Semantic Density-Guided ResNet (SDG-ResNet) backbone and its integration into a DINO-based detector for dense infrared small target detection. The core idea is to estimate a semantic density map from the deepest ResNet stage and use it as a global prior to refine mid- and low-level features via lightweight residual gating.

4.1. Motivation

To quantitatively assess the influence of target density on infrared small target detection, we employ DINO to derive PD and FA curves under varying Average Minimum Inter-Target Distance (AMID) and IoU thresholds. In this context, AMID characterizes the average distance to the nearest neighboring target in the image plane, where a smaller AMID corresponds to a denser target distribution.
As shown in Figure 4, when the AMID decreases from sparse to dense intervals, the PD of existing detectors consistently drops, especially under stricter IoU thresholds. At the same time, the FA curves show the opposite trend: dense scenes (small AMID) yield significantly more false alarms than sparse scenes. These observations clearly demonstrate that current architectures are not robust enough in dense infrared small-target scenarios, even if they perform well when targets are relatively sparse.
Although different detectors adopt different backbones and heads, most of them share a common design philosophy: backbone features are extracted in a purely bottom-up manner, and the notion of “how dense the targets are” is not explicitly encoded in the feature representation. All spatial locations are essentially treated in the same way, regardless of whether they belong to dense target regions or mostly background. As a result, when many small targets appear in close proximity, mid- and low-level features tend to be dominated by clutter-like responses, and the detector has difficulty maintaining high PD and low FA in such dense regimes.
Motivated by these observations, we aim to endow the backbone with an explicit awareness of target density and a simple mechanism to adapt its feature responses in dense regions. Instead of relying solely on the detection head, we introduce a Semantic Density-Guided ResNet (SDG-ResNet). In SDG-ResNet, the deepest ResNet stage is used to estimate a semantic density map that reflects the spatial distribution of potential target clusters, and this map is then employed to refine intermediate features through lightweight residual gating. In this way, the backbone can respond differently in dense and sparse areas, with the specific goal of improving PD and suppressing FA under small-AMID, dense infrared small-target scenarios.

4.2. Network Overview

Given an input infrared image
$$I \in \mathbb{R}^{3 \times H \times W},$$
a standard ResNet-50 backbone extracts three feature maps
$$X_3 \in \mathbb{R}^{C_3 \times H_3 \times W_3}, \quad X_4 \in \mathbb{R}^{C_4 \times H_4 \times W_4}, \quad X_5 \in \mathbb{R}^{C_5 \times H_5 \times W_5},$$
where $(C_3, C_4, C_5) = (512, 1024, 2048)$ for ResNet-50. In dense infrared scenes, the deepest feature $X_5$ encodes strong semantic responses of target clusters, while $X_3$ and $X_4$ contain detailed structures but are heavily contaminated by background clutter.
To exploit this property, we augment the backbone with two components:
  • A semantic density head attached to $X_5$ that predicts a one-channel semantic density map $D$;
  • Two Semantic Density-Guided Refine (SDGR) blocks that use $D$ to refine $X_4$ and $X_3$ in a residual manner.
Figure 5 illustrates the overall detection framework, the semantic density head, and the SDGR blocks used for cross-stage feature refinement. Formally, the refined feature maps are given by
$$\tilde{X}_4 = \mathrm{SDGR}(X_4, D), \quad \tilde{X}_3 = \mathrm{SDGR}(X_3, D), \quad \tilde{X}_5 = X_5 .$$
The set $\{\tilde{X}_3, \tilde{X}_4, \tilde{X}_5\}$ is then fed into a ChannelMapper neck and a DINO transformer head, which remain unchanged with respect to the baseline detector. We refer to the resulting detector as SDG DINO.

4.3. Semantic Density Head

4.3.1. Architecture

The goal of the semantic density head is to compress the high-level feature $X_5$ into a scalar field that reflects the spatial distribution of targets or target clusters. The head consists of a 1 × 1 convolution for channel reduction, followed by a 3 × 3 convolution for local context aggregation.
Given $X_5 \in \mathbb{R}^{C_5 \times H_5 \times W_5}$, the intermediate feature $F_5$ and the semantic density map $D$ are computed as
$$F_5 = \mathrm{ReLU}\big(\mathrm{BN}_1(W_1 * X_5)\big), \quad D = \sigma(W_2 * F_5),$$
where $W_1 \in \mathbb{R}^{C_m \times C_5 \times 1 \times 1}$ is a 1 × 1 convolution kernel with $C_m = 256$, $W_2 \in \mathbb{R}^{1 \times C_m \times 3 \times 3}$ is a 3 × 3 convolution kernel, $*$ denotes convolution, $\mathrm{BN}_1(\cdot)$ is batch normalization, $\mathrm{ReLU}(\cdot)$ is the rectified linear unit, and $\sigma(\cdot)$ is the sigmoid function.
The output $D \in [0, 1]^{1 \times H_5 \times W_5}$ can be interpreted as a cluster-aware objectness prior:
$$D(u, v) \approx p_{\text{cluster}}(u, v \mid X_5),$$
where $p_{\text{cluster}}(\cdot)$ denotes the probability that location $(u, v)$ belongs to a target or target cluster.

4.3.2. Design Rationale

The semantic density head is intentionally shallow and linear in the channel dimension. It does not attempt to re-learn complex patterns but rather projects the existing high-level semantics into a single-channel prior. This design has three advantages:
  • It preserves the original $X_5$ for the detection head, avoiding interference with high-level semantics.
  • It provides an interpretable, spatially dense prior that can be reused across multiple backbone stages.
  • It adds only a negligible number of parameters and FLOPs.
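To give a sense of scale for the parameter overhead, the head's learnable parameters can be counted from the kernel shapes $W_1$ and $W_2$ given in Section 4.3.1. The count below is a rough sketch under assumptions not stated in the text: both convolutions are taken to be bias-free, and batch normalization is taken to contribute a learnable scale and shift per channel. The helper name `conv_params` is hypothetical.

```python
def conv_params(c_in, c_out, k, bias=False):
    """Number of learnable parameters in a k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)

C5, CM = 2048, 256                       # channel widths from Section 4.3.1

reduce_1x1 = conv_params(C5, CM, 1)      # W_1: 2048 * 256 * 1 * 1
bn1 = 2 * CM                             # BN_1 scale and shift (assumed affine)
predict_3x3 = conv_params(CM, 1, 3)      # W_2: 256 * 1 * 3 * 3

head_total = reduce_1x1 + bn1 + predict_3x3
print(head_total)                        # 527104
```

Under these assumptions the head adds about 0.53 million parameters, which is roughly 2% of the approximately 25.6 million parameters of a standard ResNet-50 classifier. Almost all of them sit in the 1 × 1 channel-reduction kernel.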

4.4. Semantic Density-Guided Refine Block

The Semantic Density-Guided Refine Block (SDGR) injects the semantic prior $D$ into a low-level feature map $X_\ell$ ($\ell \in \{3, 4\}$) to suppress background clutter and selectively enhance responses near dense semantic regions. Intuitively, $D$ plays the role of a high-level gating signal that indicates where small-target clusters are likely to appear, while $X_\ell$ provides fine-grained local texture and contrast information.

4.4.1. Density Upsampling and Embedding Alignment

Because $D$ is defined at the spatial resolution of $X_5$, it is first upsampled to match the resolution of $X_\ell$:
$$D_\ell = U(D;\, H_\ell, W_\ell),$$
where $U(\cdot)$ denotes bilinear interpolation and $D_\ell \in [0, 1]^{1 \times H_\ell \times W_\ell}$.
To reduce computational cost and to learn a compact joint representation, both $X_\ell$ and $D_\ell$ are projected into a shared low-dimensional embedding space with $r = C_\ell / R$ channels (we set $R = 4$):
$$Z_\ell = W_{\text{low}}^{(\ell)} * X_\ell, \quad Z_d = W_d * D_\ell,$$
where $Z_\ell, Z_d \in \mathbb{R}^{r \times H_\ell \times W_\ell}$, and both $W_{\text{low}}^{(\ell)}$ and $W_d$ are 1 × 1 kernels corresponding to the low_proj and d_proj layers in the network implementation.
This step is analogous to the linear projections used in attention gates [32], where low-level and high-level features are first mapped into a common intermediate space before computing attention coefficients.
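The upsampling operator $U(\cdot)$ and the embedding width $r$ can be illustrated with a small pure-Python sketch. The half-pixel-centre sampling convention with border clamping used here is an assumption, since the text only specifies bilinear interpolation; the function name `bilinear_upsample` is hypothetical.

```python
def bilinear_upsample(d, out_h, out_w):
    """Bilinearly resize a 2-D map (list of rows) to out_h x out_w.

    Assumes half-pixel centres and clamps source coordinates to the border.
    """
    in_h, in_w = len(d), len(d[0])
    out = []
    for i in range(out_h):
        # Map the output pixel centre back into input coordinates.
        sy = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for j in range(out_w):
            sx = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            top = d[y0][x0] * (1 - wx) + d[y0][x1] * wx
            bottom = d[y1][x0] * (1 - wx) + d[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

# A 2x2 density map with a high-density right column, upsampled to 4x4.
density = [[0.0, 1.0],
           [0.0, 1.0]]
upsampled = bilinear_upsample(density, 4, 4)

# Embedding width r = C_l / R for the two refined stages, with R = 4.
R = 4
embed_width = {stage: c // R for stage, c in {"X3": 512, "X4": 1024}.items()}
```

Since every output value is a convex combination of input values, the upsampled map stays within $[0, 1]$, so it can still be read as a density prior. With $R = 4$, the embeddings have 128 channels for $X_3$ and 256 channels for $X_4$.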

4.4.2. Feature–Density Fusion and Gate Prediction

The two embeddings are concatenated along the channel dimension and fused by a 3 × 3 convolution with batch normalization and ReLU:
$$Z_f = \mathrm{ReLU}\big(\mathrm{BN}_2\big(W_f \ast [Z_\ell, Z_d]\big)\big),$$
where $[Z_\ell, Z_d]$ denotes channel-wise concatenation and $W_f$ is a 3 × 3 kernel. This fusion stage allows the network to jointly reason about local appearance (from $Z_\ell$) and semantic density (from $Z_d$) within a 3 × 3 neighborhood, instead of making gating decisions based on a single pixel.
A spatial gate is then generated by a 1 × 1 convolution followed by a sigmoid function:
$$G_\ell = \sigma\big(W_g \ast Z_f\big),$$
where $W_g$ is a 1 × 1 kernel and $G_\ell \in [0, 1]^{1 \times H_\ell \times W_\ell}$. The gate value $G_\ell(x, y)$ measures how strongly the low-level feature at position $(x, y)$ should be preserved or suppressed, conditioned jointly on local appearance and high-level semantic density.
Although the SDGR block produces a spatial gating map, it is fundamentally different from conventional spatial attention mechanisms. Typical attention modules (e.g., CBAM [33] or attention gates [32]) estimate attention weights directly from the same feature map that is being refined, focusing on local saliency or channel interactions. In contrast, our SDGR block is driven by an explicitly constructed semantic density prior derived from the deepest backbone stage. The gating signal is therefore not computed from the low-level feature itself but projected from high-level semantic clustering responses that encode global target-density information. This cross-stage prior injection enables density-aware modulation of intermediate features, rather than generic saliency reweighting. Consequently, SDG-ResNet explicitly models spatial target density as a structural property of dense scenes, instead of treating attention as a purely local feature recalibration mechanism.

4.4.3. Residual Refinement

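Finally, we refine $X_\ell$ using a residual gating formulation:
$$\tilde{X}_\ell = X_\ell + \gamma_\ell \left( X_\ell \odot G_\ell - X_\ell \right) = X_\ell \odot \left( \mathbf{1} + \gamma_\ell \left( G_\ell - \mathbf{1} \right) \right),$$
where $\odot$ denotes element-wise multiplication, $\mathbf{1}$ is an all-ones map broadcastable to the shape of $G_\ell$, and $\gamma_\ell$ is a learnable scalar parameter initialized to $10^{-2}$.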
At the beginning of training, γ is close to zero, and the SDGR block behaves almost as an identity mapping, which stabilizes optimization and preserves the benefits of ImageNet pre-training. As training proceeds, γ is automatically adjusted such that features in high-density regions (where G is close to 1) are preserved or slightly enhanced, while responses in low-density regions (where G tends to be smaller) are progressively suppressed. This residual formulation thus realizes a semantic density-guided, spatially adaptive modulation of low-level features while avoiding aggressive modifications that could harm the backbone representation in ambiguous areas.
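The near-identity behavior of the residual gate can be checked numerically. The following is a minimal sketch, assuming plain nested lists in place of tensors; the function name `residual_gate`, the toy values, and the later value of gamma are hypothetical and chosen only for illustration.

```python
def residual_gate(x, g, gamma):
    """Apply x_tilde = x * (1 + gamma * (g - 1)) element-wise.

    x: list of C channel maps (each a list of rows).
    g: a single map of gate values in [0, 1], broadcast over channels.
    gamma: scalar residual weight.
    """
    return [
        [[x_c[i][j] * (1.0 + gamma * (g[i][j] - 1.0))
          for j in range(len(g[0]))]
         for i in range(len(g))]
        for x_c in x
    ]


x = [[[2.0, 2.0]], [[-1.0, 4.0]]]  # 2 channels, 1x2 spatial map (toy values)
g = [[1.0, 0.0]]                   # gate keeps the left pixel, suppresses the right

near_identity = residual_gate(x, g, gamma=1e-2)  # small initial gamma
strong = residual_gate(x, g, gamma=0.5)          # hypothetical later value
```

With a small gamma the right-hand pixel is scaled by 0.99, so the block is almost an identity mapping. For 0 ≤ gamma ≤ 1 and gate values in [0, 1], the multiplier 1 + gamma (G − 1) lies in [1 − gamma, 1]: a gate of 1 leaves the feature unchanged and a gate of 0 scales it by 1 − gamma.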
It should be clarified that the semantic density head is not designed as an independent density regression branch. Instead, the predicted density map serves as an intermediate prior for feature modulation and is optimized implicitly through the overall detection objective. During backpropagation, gradients from the detection loss propagate through the SDGR blocks to the density head, enabling it to learn density-aware representations without requiring explicit density annotations. This implicit supervision mechanism is consistent with many attention-based modules that are trained end-to-end without auxiliary losses.

4.5. Integration with DINO and Complexity Analysis

The proposed SDG-ResNet is integrated into a DINO-style transformer detector without modifying the detection head. The refined feature maps $\{\tilde{X}_3, \tilde{X}_4, \tilde{X}_5\}$ are first converted by a ChannelMapper to a unified channel dimension, and then fed into the DINO encoder–decoder. Let $\mathcal{L}_{\mathrm{det}}$ denote the original DINO detection loss, which combines classification, bounding-box regression, and IoU/GIoU terms, including auxiliary losses for intermediate layers. We do not introduce any extra loss terms, and the overall training objective is
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{det}}.$$
Therefore, the semantic density head and the SDGR blocks are supervised implicitly through the detection objective.
In terms of complexity, SDG-ResNet introduces:
  • One 1 × 1 and one 3 × 3 convolution on $X_5$ for semantic density estimation;
  • For each of $X_3$ and $X_4$, two 1 × 1 convolutions, one 3 × 3 convolution, and one 1 × 1 gating convolution;
  • Two scalar parameters $\gamma_3$ and $\gamma_4$.
All the added operations act on feature maps that are already computed by the backbone, and the extra FLOPs are negligible compared with the ResNet and transformer encoder–decoder. This makes SDG-ResNet a practical and efficient backbone for dense infrared small target detection.
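A back-of-the-envelope parameter count for the two SDGR blocks can be sketched as follows. Several details are assumptions not stated in this section: bias-free convolutions, a fusion convolution that maps 2r channels back to r, a single batch-normalization layer with scale and shift, and ResNet-50 channel widths of 512 and 1024 for the two refined stages. The density head is left out because its layer widths are not given here.

```python
def sdgr_params(c_in, reduction=4):
    """Rough parameter count of one SDGR block (bias-free convolutions assumed)."""
    r = c_in // reduction
    low_proj = c_in * r          # 1x1 conv: C -> r
    d_proj = 1 * r               # 1x1 conv: 1 -> r
    fuse = (2 * r) * r * 3 * 3   # 3x3 conv on concatenated embeddings: 2r -> r (assumed)
    bn = 2 * r                   # batch-norm scale and shift
    gate = r * 1                 # 1x1 conv: r -> 1
    gamma = 1                    # learnable residual scale
    return low_proj + d_proj + fuse + bn + gate + gamma


# Assumed ResNet-50 channel widths for the two refined stages.
channels = {"X3": 512, "X4": 1024}
per_block = {name: sdgr_params(c) for name, c in channels.items()}
total = sum(per_block.values())
```

Under these assumptions the two blocks contribute roughly 0.36M and 1.44M parameters, about 1.8M in total, with the 3 × 3 fusion convolution dominating the count.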

5. Experiments

In this section, we introduce the evaluation metrics, experimental settings, comparisons with state-of-the-art (SOTA) methods, and ablation studies.

5.1. Implementation Details

(1) Dataset: Experiments are conducted on the IR-SatDense dataset, which contains a large number of small infrared targets distributed across diverse background scenes. The targets are very small (average width about 3 pixels) and have low signal-to-noise ratios, enabling evaluation under complex and cluttered infrared conditions. According to the Average Minimum Inter-Target Distance (AMID), the test set is divided into multiple subsets to assess detection performance under varying density levels, with particular focus on the challenging dense regime ($\mathrm{AMID} \le 3$).
(2) Implementation: All detectors are implemented within the DINO framework using ResNet-50 or SDG-ResNet as the backbone. For the proposed variants (SDG Deformable DETR, SDG DETA, SDG DINO), we simply replace the standard ResNet-50 backbone with SDG-ResNet while keeping all other hyperparameters unchanged to ensure a fair comparison. The AdamW optimizer is adopted with a learning rate of $1 \times 10^{-4}$ and a batch size of 2 for 180,000 iterations. Input images are normalized to match DINO’s default preprocessing pipeline. All experiments are conducted on a single NVIDIA RTX 4090 GPU.
(3) Evaluation Metrics: To comprehensively evaluate detection performance, we adopt three metrics: probability of detection (PD), false alarm rate (FA), and FLOPs/Params. PD and FA measure detection capability and robustness, while FLOPs and Params characterize computational complexity.
All methods are evaluated under a unified box-level protocol. For anchor-based detectors (DETR, Deformable DETR, DETA, DINO and our SDG variants), the network directly predicts a set of bounding boxes $\{B_p\}$. For segmentation-based ISTD methods (e.g., ACM, ALCNet, RDIAN, DNA_Net, ISTDU-Net, UIUNet, U-Net, ResUNet), the network outputs a binary mask for each test image. We first extract all connected components from the predicted mask and, for each component, compute its tight axis-aligned enclosing rectangle. These rectangles are treated as the predicted boxes $B_p$. Ground-truth annotations are also represented as axis-aligned bounding boxes. In this way, both detection and segmentation methods are evaluated with exactly the same box-based criteria.
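The mask-to-box conversion for segmentation-based methods can be sketched with a breadth-first flood fill. The function name `mask_to_boxes`, the default 8-connectivity, and the inclusive pixel-index box format are assumptions for illustration; the text does not specify which connectivity is used.

```python
from collections import deque


def mask_to_boxes(mask, connectivity=8):
    """Return one tight box (x_min, y_min, x_max, y_max) per connected component.

    mask is a binary image given as a list of rows; box coordinates are
    inclusive pixel indices. The connectivity choice is an assumption.
    """
    h, w = len(mask), len(mask[0])
    if connectivity == 8:
        steps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if (dy, dx) != (0, 0)]
    else:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Flood-fill one component while tracking its extent.
            seen[y][x] = True
            queue = deque([(y, x)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy, dx in steps:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes


mask = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
boxes = mask_to_boxes(mask)
```

In dense scenes the connectivity choice matters: two diagonally touching targets merge into one box under 8-connectivity but stay separate under 4-connectivity.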
The matching between a predicted box $B_p$ and a ground-truth box $B_g$ is determined by the intersection over union (IoU):
$$\mathrm{IoU} = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}.$$
A prediction is counted as a true positive (TP) if $\mathrm{IoU} \ge T_{\mathrm{IoU}}$. A one-to-one matching strategy is adopted: each ground-truth target is matched to at most one prediction (the one with the highest IoU), and unmatched predictions are treated as false alarms. For small targets, IoU is highly sensitive to positional offsets, so the IoU threshold is uniformly set to
$$T_{\mathrm{IoU}} = 0.50,$$
which provides a reasonable balance between localization precision and tolerance.
All predicted boxes whose confidence scores exceed a fixed threshold are counted as detections. Let $N_{\mathrm{gt}}$ denote the total number of ground-truth targets in the test set, and let $N_{\mathrm{det}}$ denote the total number of predicted boxes. Among all detections, $N_{TP}$ are matched as true positives, and the rest belong to the false alarm set $\mathcal{F}$. The probability of detection (PD) and the false alarm rate (FA) are defined as
$$\mathrm{PD} = \frac{N_{TP}}{N_{\mathrm{gt}}}, \qquad \mathrm{FA} = \frac{\sum_{k \in \mathcal{F}} |B_k|}{A_{\mathrm{total}}},$$
where $|B_k|$ is the area (in pixels) of the $k$-th false-alarm box and $A_{\mathrm{total}}$ is the total image area over the whole test set (i.e., the sum of the pixel numbers of all test images).
In other words, PD measures the fraction of correctly detected targets among all ground-truth targets, while FA measures the proportion of image area occupied by false-alarm boxes.
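The box-level protocol can be sketched for a single image as follows. The box format (exclusive maximum edges), the greedy processing order over ground-truth boxes, and the names `iou`, `area`, and `pd_fa` are assumptions for illustration; predictions are assumed to have already passed the confidence threshold. For a full test set, the true-positive count, false-alarm area, and image area would be accumulated over all images before dividing.

```python
def area(box):
    """Area of a box (x_min, y_min, x_max, y_max) with exclusive max edges."""
    return max(box[2] - box[0], 0) * max(box[3] - box[1], 0)


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0) * max(ih, 0)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def pd_fa(preds, gts, total_area, t_iou=0.5):
    """Return (PD, FA) under one-to-one matching.

    Each ground-truth box claims the unused prediction with the highest IoU;
    the pair is a true positive if that IoU >= t_iou. Unmatched predictions
    form the false-alarm set, and FA is their summed area over total_area.
    """
    used = set()
    n_tp = 0
    for g in gts:
        best_k, best_iou = None, 0.0
        for k, p in enumerate(preds):
            if k in used:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_k, best_iou = k, v
        if best_k is not None and best_iou >= t_iou:
            used.add(best_k)
            n_tp += 1
    fa_area = sum(area(p) for k, p in enumerate(preds) if k not in used)
    prob_det = n_tp / len(gts) if gts else 0.0
    return prob_det, fa_area / total_area


gts = [(10, 10, 13, 13), (20, 20, 23, 23)]                       # two 3x3 targets
preds = [(10, 10, 13, 13), (21, 21, 24, 24), (50, 50, 52, 52)]   # exact, shifted, spurious
pd_value, fa_value = pd_fa(preds, gts, total_area=100 * 100)
```

In this toy case the second prediction is shifted by one pixel in each direction, which already drops its IoU to 4/14 ≈ 0.29, so it counts as a false alarm rather than a true positive. This illustrates how sensitive IoU is to positional offsets for 3 × 3 targets.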

5.2. Benchmark Results

Table 2 reports the detection performance at $T_{\mathrm{IoU}} = 0.50$ on the IR-SatDense test set, including both the overall results (All) and the three density intervals defined by AMID. Overall, integrating the proposed SDG-ResNet backbone into DETR-style detectors leads to consistent PD improvements across most density regimes while keeping FA at a comparable or even lower level than the corresponding baselines.
It is worth noting that the Average Minimum Inter-Target Distance (AMID) is inversely related to the target density in the scene. A smaller AMID value indicates that targets are more densely distributed, while a larger AMID corresponds to relatively sparse target configurations. Therefore, the AMID-based evaluation provides a quantitative analysis of the detector performance under different target density conditions.
For DINO, replacing the vanilla ResNet-50 with SDG-ResNet yields a PD increase from 86.44% to 86.82% on the whole test set (+0.38%), with FA changing only slightly from $1.30 \times 10^{-4}$ to $1.34 \times 10^{-4}$. In terms of density-specific results, SDG DINO improves PD in all AMID intervals: from 85.55% to 85.71% (+0.16%) for $\mathrm{AMID} \le 1$, from 87.43% to 87.51% (+0.08%) for $1 < \mathrm{AMID} \le 2$, and from 86.84% to 87.41% (+0.57%) for $2 < \mathrm{AMID} \le 3$.
For Deformable DETR, the baseline model performs poorly on IR-SatDense, achieving only 5.49% PD overall. After introducing SDG-ResNet, SDG Deformable DETR improves the overall PD to 7.13% (+1.64%), with similar trends across all three AMID intervals (e.g., from 6.42% to 7.90% for $\mathrm{AMID} \le 1$). Although the absolute PD values remain modest, this relative gain demonstrates that SDG can noticeably strengthen the detection capability of weaker DETR-style baselines.
Table 3 further shows that the SDG-equipped detectors consistently achieve higher PD across IoU thresholds from 0.3 to 0.7 while maintaining comparable FA. This confirms that the performance gain is not limited to IoU = 0.50 but reflects improved localization robustness.
In summary, the benchmark results confirm that semantic density guidance provides clear and consistent improvements for DETR-style detectors on IR-SatDense under different density conditions.

5.3. Comparison with State-of-the-Art Methods

To further evaluate the effectiveness and efficiency of the proposed SDG-ResNet, we compare SDG-enhanced detectors with representative segmentation-based ISTD networks and DETR-style detectors on IR-SatDense. Table 4 summarizes the probability of detection (PD), false alarm rate (FA), and the model complexity in terms of parameters and FLOPs.
Among all compared methods, DINO already provides a very strong baseline, achieving 86.44% PD and $1.30 \times 10^{-4}$ FA with 47.5M parameters and 178.5G FLOPs. After inserting SDG-ResNet, SDG DINO further improves PD to 86.82% (+0.38%) with only a small increase in model size and computation (49.8M params, +4.8%; 186.1G FLOPs, +4.3%), while FA remains at a similar level ($1.34 \times 10^{-4}$, +0.04). This shows that SDG brings measurable accuracy gains at a very modest additional cost.
For Deformable DETR, the baseline obtains 5.49% PD and $9.37 \times 10^{-4}$ FA with 40.0M parameters and 123.3G FLOPs. The SDG version, SDG D-DETR, increases PD to 7.13% (+1.64%) with 42.4M parameters (+6.0%) and 130.8G FLOPs (+6.1%), while FA remains on the same order ($9.72 \times 10^{-4}$). Although the absolute performance is still lower than that of DINO, this relative improvement verifies that SDG can noticeably strengthen the detection capability of weaker DETR-style baselines in dense small-target scenarios.
For DETA, introducing SDG-ResNet brings clear gains in both accuracy and robustness. The overall PD increases from 63.70% to 64.75% (+1.05%), while FA is reduced from $1.15 \times 10^{-4}$ to $1.08 \times 10^{-4}$ ($-0.07$). This improvement is achieved with a moderate overhead in complexity: the number of parameters grows from 48.3M to 50.6M (about +4.8%), and FLOPs from 182.0G to 189.5G (about +4.1%). These results indicate that even for a head-optimized detector like DETA, semantic density guidance at the backbone level can still provide a favorable accuracy–complexity trade-off.
Compared with the segmentation-based ISTD methods (e.g., DNA_Net, ISTDU-Net, UIUNet), SDG DINO achieves the highest PD on IR-SatDense while maintaining a competitive FA, despite having a larger model size. Taken together, these results demonstrate that the proposed SDG-ResNet is a lightweight yet effective plug-in backbone for dense infrared small target detection: it yields clear PD improvements for strong DETR-based detectors at a negligible cost in parameters and FLOPs and can be seamlessly integrated into existing architectures.
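The relative overheads quoted in this subsection can be reproduced from the reported parameter and FLOP counts. The helper name `relative_increase` and the dictionary layout are illustrative; the numbers are those given in the text.

```python
def relative_increase(before, after):
    """Percentage increase from `before` to `after`."""
    return 100.0 * (after - before) / before


# (params in M, FLOPs in G) for each baseline and its SDG variant, as quoted in the text.
models = {
    "DINO": ((47.5, 178.5), (49.8, 186.1)),
    "Deformable DETR": ((40.0, 123.3), (42.4, 130.8)),
    "DETA": ((48.3, 182.0), (50.6, 189.5)),
}

# Map each detector to (parameter overhead %, FLOP overhead %), rounded to one decimal.
overhead = {
    name: tuple(round(relative_increase(b, a), 1) for b, a in zip(base, sdg))
    for name, (base, sdg) in models.items()
}
```

The resulting overheads lie between about 4% and 6% for all three detectors, corresponding to roughly 2.3M to 2.4M extra parameters per model.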

5.4. Ablation Study

We conduct ablation experiments on IR-SatDense based on the DINO detector at $T_{\mathrm{IoU}} = 0.50$. The test set is divided into three density ranges according to AMID ($\mathrm{AMID} \le 1$, $1 < \mathrm{AMID} \le 2$, $2 < \mathrm{AMID} \le 3$) plus the overall set (All). We compare four variants: the original DINO (baseline), SDG@Res4 (only an SDGR block on Res4), SDG@Res3 (only on Res3), and SDG (full), which inserts SDGR blocks at both Res3 and Res4.
As shown in Table 5, all SDG variants improve PD over the DINO baseline (86.44% PD, $1.30 \times 10^{-4}$ FA) on the whole test set. SDG@Res4 and SDG@Res3 increase PD to 86.50% (+0.06%) and 86.67% (+0.23%), while slightly reducing FA to $1.26 \times 10^{-4}$ and $1.22 \times 10^{-4}$, respectively. The full SDG configuration achieves the highest PD of 86.82% (+0.38%) with a marginal FA change to $1.34 \times 10^{-4}$ (+0.04).
Gains are observed in the densest regime ($\mathrm{AMID} \le 1$), where many targets are tightly clustered: PD increases from 85.55% (baseline) to 85.96%, 85.78%, and 85.71% for SDG@Res4, SDG@Res3, and full SDG, with FA staying around the baseline level. In the $2 < \mathrm{AMID} \le 3$ interval, the least dense of the three since AMID is inversely related to density, PD improves from 86.84% to 86.97%, 87.12%, and 87.41% (+0.57% for full SDG). Overall, these results show that injecting a shared semantic density prior into Res3/Res4 consistently enhances detection performance, and that jointly refining both stages (full SDG) provides the best PD–FA trade-off.
Although SDG-ResNet introduces additional parameters and computational overhead, the increase remains modest relative to the baseline backbone. For example, when integrated into DINO, the number of parameters increases from 47.5M to 49.8M (approximately +4.8%), and the FLOPs increase from 178.5G to 186.1G (approximately +4.3%). Compared with the overall computational scale of transformer-based detectors, this additional cost is relatively small and does not affect practical deployment feasibility. Meanwhile, SDG-ResNet consistently improves PD across dense scenarios. These results indicate a favorable performance–efficiency trade-off for dense infrared small target detection.
To examine whether the proposed SDG module affects optimization stability, we compare the training loss curves between baseline detectors and their SDG-equipped variants. As shown in Figure 6, all SDG-equipped models exhibit smooth convergence behavior that closely follows their corresponding baselines. No noticeable oscillation or divergence is observed during training. Moreover, the convergence speed remains comparable across all models, indicating that the introduced density-guided refinement does not adversely impact training stability.

5.5. Comparison Results on Sparse Target Dataset IRSTD-1K

To further verify the generalization ability of SDG on conventional sparse infrared small targets, we also conduct experiments on the IRSTD-1K dataset [2]. The quantitative results are summarized in Table 6. We can observe that classical segmentation-based methods (ISTDU-Net, UIUNet, U-Net, etc.) already achieve very high PD values above 80% on this sparse benchmark, while DINO attains a strong trade-off with 85.46% PD and the lowest FA of $0.17 \times 10^{-4}$. Introducing SDG-ResNet into DINO further improves PD slightly to 85.81% while keeping FA unchanged.
These results indicate that IRSTD-1K is relatively easy in terms of target density and that transformer-based detectors such as DINO remain highly competitive even without explicit density modeling. The performance improvement on IRSTD-1K is relatively modest compared with that on IR-SatDense. This is expected because IRSTD-1K is primarily a sparse-target dataset, where most images contain only one or a few isolated targets with relatively large inter-target distances. In such scenarios, dense semantic clustering rarely occurs at high-level feature maps, and the predicted density prior tends to be spatially diffuse. As a result, the SDGR blocks behave close to identity mappings, leading to stable but limited gains. Importantly, SDG-ResNet does not degrade performance in sparse scenes, indicating that the density-guided mechanism remains compatible with conventional sparse-target detection settings.

5.6. Performance Under Different Background Complexity

To further analyze the robustness of the proposed semantic density guidance (SDG) mechanism under different background conditions, we evaluate the detection performance on the predefined complexity subsets of the IR-SatDense dataset.
The results are summarized in Table 7. Overall, the proposed SDG brings consistent improvements for Deformable DETR and DETA across all background complexity levels. For DINO, SDG also improves performance in most cases, namely under the easy, medium, and complex scene conditions, while only a marginal fluctuation is observed under the most challenging, extremely complex scene condition.
In particular, the improvements are more noticeable for relatively weaker baselines such as Deformable DETR and DETA, indicating that the proposed semantic density guidance effectively enhances the robustness of dense infrared small target detection under varying background complexities.

5.7. Performance Under Different Target Sizes

To further analyze the detection capability for extremely small infrared targets, we evaluate the detection probability (PD) under different target size intervals. The bounding-box areas are divided into three groups: $A \le 3 \times 3$, $3 \times 3 < A \le 4 \times 4$, and $A > 4 \times 4$ pixels. The proportions of ground-truth targets in these intervals are 38.16%, 41.76%, and 20.08%, respectively.
As shown in Table 8, detection performance increases with target size for all methods, indicating that extremely small targets remain the most challenging scenario in infrared imagery. Nevertheless, the proposed SDG module consistently improves detection performance across different size intervals. In particular, more noticeable improvements are observed for extremely small targets ($A \le 3 \times 3$), demonstrating that the semantic density guidance mechanism effectively enhances the representation of weak target signals.

5.8. Visual Analysis

To further illustrate the detection effectiveness of the proposed SDG-ResNet, we conduct a qualitative comparison on the four background complexity levels defined in Figure 2, namely easy, medium, complex, and extremely complex on-orbit scenes. The visualization results are shown in Figure 7, where each column corresponds to one background level, and the rows compare the baseline DINO with DINO + (equipped with SDG-ResNet).
Across all four background types, the baseline DINO either misses part of the densely distributed targets or produces spurious responses in cluttered non-target regions. By contrast, DINO + detects more true targets within dense clusters and effectively suppresses false alarms on background structures. This confirms that introducing SDG-ResNet can simultaneously enhance dense-target detection and reduce false alarms under diverse on-orbit background conditions.

6. Conclusions and Further Analysis

In this paper, we addressed the challenging problem of dense infrared small target detection, where tiny low-SNR targets appear in highly crowded configurations. We first constructed a new satellite dense infrared dataset, IR-SatDense, in which target density, inter-target spacing, and SNR can be flexibly controlled. Based on the proposed AMID metric, IR-SatDense reveals that the probability of detection (PD) of existing detectors degrades sharply while the false alarm rate (FA) increases as targets become more densely packed.
To mitigate this density-induced degradation, we proposed a Semantic Density-Guided ResNet (SDG-ResNet) backbone. SDG-ResNet predicts a semantic density map from the deepest ResNet stage and reuses it as a global prior to refine mid- and low-level features via lightweight Semantic Density-Guided Refine (SDGR) blocks. Integrated into representative DETR-like detectors such as Deformable DETR, DETA, and DINO, SDG consistently improves PD at comparable FA, especially in the most challenging small-AMID regime, while introducing only negligible additional parameters and FLOPs.
Experiments on both the dense IR-SatDense and the sparse IRSTD-1K datasets demonstrate that SDG-ResNet enhances robustness in dense-target regimes without sacrificing performance in sparse scenarios. In future work, we plan to extend semantic density guidance to multi-frame and multi-scale settings and to explore joint modeling of temporal density evolution and long-range motion patterns in satellite infrared sensing.

Author Contributions

Conceptualization, X.Z. and W.A.; methodology, X.Z. and X.Y.; software, X.Z.; validation, X.Z., X.Y. and N.C.; formal analysis, X.Z., X.Y. and B.L.; investigation, B.L., R.L., C.X. and M.L.; resources, W.A.; writing—original draft preparation, X.Z.; writing—review and editing, X.Z., X.Y., N.C., B.L. and W.A.; visualization, X.Z.; supervision, M.L. and W.A.; project administration, M.L. and W.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant No. 12503098.

Data Availability Statement

The proposed IR-SatDense dataset and implementation code are publicly available at https://github.com/Lucifer094/SDG (accessed on 20 March 2026). All other datasets used in this study (e.g., NUDT-SIRST, SIRST, IRSTD-1K, DenseSIRST, DMIST) are public and cited in the corresponding references.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758.
  2. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 877–886.
  3. Lin, J.; Li, S.; Zhang, L.; Yang, X.; Yan, B.; Meng, Z. IR-TransDet: Infrared dim and small target detection with IR-transformer. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5004813.
  4. Chen, N.; Li, B.; Wang, Y.; Ying, X.; Wang, L.; Zhang, C.; Guo, Y.; Li, M.; An, W. Motion and Appearance Decoupling Representation for Event Cameras. IEEE Trans. Image Process. 2025, 34, 5964–5977.
  5. Li, R.; An, W.; Xiao, C.; Li, B.; Wang, Y.; Li, M.; Guo, Y. Direction-coded temporal U-shape module for multiframe infrared small target detection. IEEE Trans. Neural Netw. Learn. Syst. 2023, 36, 555–568.
  6. Li, R.; An, W.; Ying, X.; Wang, Y.; Dai, Y.; Wang, L.; Li, M.; Guo, Y.; Liu, L. Probing Deep into Temporal Profile Makes the Infrared Small Target Detector Much Better. arXiv 2025, arXiv:2506.12766.
  7. Ying, X.; Xiao, C.; An, W.; Li, R.; He, X.; Li, B.; Cao, X.; Li, Z.; Wang, Y.; Hu, M.; et al. Visible-thermal tiny object detection: A benchmark dataset and baselines. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 6088–6096.
  8. Ying, X.; Liu, L.; Lin, Z.; Shi, Y.; Wang, Y.; Li, R.; Cao, X.; Li, B.; Zhou, S.; An, W. Infrared small target detection in satellite videos: A new dataset and a novel recurrent feature refinement framework. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5002818.
  9. Ying, X.; Liu, L.; Wang, Y.; Li, R.; Chen, N.; Lin, Z.; Sheng, W.; Zhou, S. Mapping degeneration meets label evolution: Learning infrared small target detection with single point supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; p. 15528.
  10. Li, B.; Wang, L.; Wang, Y.; Wu, T.; Lin, Z.; Li, M.; An, W.; Guo, Y. Mixed-precision network quantization for infrared small target segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5000812.
  11. Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention-guided pyramid context networks for detecting infrared small target under complex background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261.
  12. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376.
  13. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 950–959.
  14. Dai, Y.; Li, X.; Zhou, F.; Qian, Y.; Chen, Y.; Yang, J. One-stage cascade refinement networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000917.
  15. Chen, S.; Ji, L.; Zhu, S.; Ye, M.; Ren, H.; Sang, Y. Towards dense moving infrared small target detection: New datasets and baseline. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5005513.
  16. Xiao, M.; Dai, Q.; Zhu, Y.; Guo, K.; Wang, H.; Shu, X.; Yang, J.; Dai, Y. Background semantics matter: Cross-task feature exchange network for clustered infrared small target detection with sky-annotated dataset. arXiv 2024, arXiv:2407.20078.
  17. Chen, C.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581.
  18. Han, J.; Ma, Y.; Zhou, B.; Fan, F.; Liang, K.; Fang, Y. A robust infrared small target detection algorithm based on human visual system. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2168–2172.
  19. Qin, Y.; Li, B. Effective infrared small target detection utilizing a novel local contrast method. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1890–1894.
  20. Han, J.; Liang, K.; Zhou, B.; Zhu, X.; Zhao, J.; Zhao, L. Infrared small target detection utilizing the multiscale relative local contrast measure. IEEE Geosci. Remote Sens. Lett. 2018, 15, 612–616.
  21. Bai, X.; Zhou, F. Analysis of new top-hat transformation and the application for infrared dim small target detection. Pattern Recognit. 2010, 43, 2145–2156.
  22. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the Signal and Data Processing of Small Targets 1999; SPIE: Bellingham, WA, USA, 1999; Volume 3809, pp. 74–83.
  23. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch-image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009.
  24. Zhang, L.; Peng, Z. Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sens. 2019, 11, 382.
  25. Zhu, J.; Chen, S.; Li, L.; Ji, L. Sanet: Spatial attention network with global average contrast learning for infrared small target detection. In Proceedings of the ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
  26. Wang, K.; Du, S.; Liu, C.; Cao, Z. Interior attention-aware network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5002013.
  27. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
  28. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
  29. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv 2022, arXiv:2201.12329.
  30. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627.
  31. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605.
  32. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999.
  33. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  34. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824.
  35. Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000513.
  36. Hou, Q.; Zhang, L.; Tan, F.; Xi, Y.; Zheng, H.; Li, N. ISTDU-Net: Infrared Small-Target Detection U-Net. IEEE Geosci. Remote Sens. Lett. 2022, 19, 7506205.
  37. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
  38. Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted res-unet for high-quality retina vessel segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19–21 October 2018; pp. 327–331.
  39. Xiong, Z.; Zhou, F.; Wu, F.; Yuan, S.; Fu, M.; Peng, Z.; Yang, J.; Dai, Y. DRPCA-Net: Make robust PCA great again for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5005516.
  40. Ouyang-Zhang, J.; Cho, J.H.; Zhou, X.; Krähenbühl, P. Nms strikes back. arXiv 2022, arXiv:2212.06137.
Figure 1. Visualization of backbone feature responses at low-, middle-, and high-level stages for dense and single-target infrared scenes. The left column shows a dense target cluster, where the high-level feature map exhibits strong and compact semantic responses. The right column shows a single-target scene, where the high-level semantic responses are weaker and more diffuse. This observation suggests that dense target clusters naturally concentrate semantic information at high levels, which can be exploited to guide low-level feature enhancement and suppress noise in non-target regions.
Figure 2. Different background complexities under realistic on-orbit imaging conditions. (a) Representative examples of four levels: (i) easy, (ii) medium, (iii) complex, and (iv) extremely complex scenes; (b) Statistical distribution of each background level in the dataset.
Figure 3. Visualization examples of infrared images with different target density intervals based on the Average Minimum Inter-Target Distance (AMID). From left to right: (a) $AMID \leq 1$, (b) $1 < AMID \leq 2$, and (c) $2 < AMID \leq 3$. The red boxes indicate annotated target regions.
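As an illustration of how the density intervals in Figure 3 could be assigned, the following minimal Python sketch computes an AMID value for one image and maps it to an interval. It assumes that AMID is the mean, over all targets, of the Euclidean distance from each target center to its nearest neighbor, measured in pixels; the exact definition used for IR-SatDense (distance units, center versus edge distance, any normalization) may differ. The function names `amid` and `density_interval` and the closed upper bounds of the intervals are assumptions made for this sketch.

```python
import math


def amid(centers):
    """Average minimum inter-target distance for one image.

    centers: list of (x, y) target centers in pixels (assumed representation).
    Returns None when fewer than two targets exist (assumed convention).
    """
    if len(centers) < 2:
        return None
    total = 0.0
    for i, (xi, yi) in enumerate(centers):
        # Distance from target i to its nearest other target.
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(centers)
            if j != i
        )
        total += nearest
    return total / len(centers)


def density_interval(value):
    """Map an AMID value to one of the intervals shown in Figure 3."""
    if value is None:
        return "undefined"
    if value <= 1:
        return "AMID <= 1"
    if value <= 2:
        return "1 < AMID <= 2"
    if value <= 3:
        return "2 < AMID <= 3"
    return "AMID > 3"


# Three targets: nearest-neighbor distances are 5, 1 and 1.
example = amid([(0, 0), (3, 4), (3, 5)])
label = density_interval(example)
```

For the three example targets the nearest-neighbor distances are 5, 1, and 1, so the sketch returns 7/3 (about 2.33) and assigns the image to the interval 2 < AMID ≤ 3.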
Figure 4. Experimental results of the DINO model trained on the IR-SatDense dataset and evaluated on subsets with different target density intervals. (a) Probability of detection (PD) curves with respect to the intersection over union (IoU) threshold under different AMID levels; (b) False alarm rate (FA) curves with respect to the IoU threshold under different AMID levels.
Figure 5. Overview of the proposed SDG-ResNet and its integration into a DINO-based detector. (a) Overall framework: the deepest feature $X_5$ is converted by the semantic density head into a density map $D$, which guides SDGR blocks to refine $X_4$ and $X_3$; all three features are then fed to the DETR-style head for final detection. (b) Semantic density head: a lightweight stack of 1 × 1 and 3 × 3 convolutions compresses $X_5$ into a single-channel semantic density map $D$. (c) Semantic Density-Guided Refine Block (SDGR): the upsampled density map is fused with intermediate features to predict a spatial gate, and the gated response is added residually to obtain refined features.
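The residual spatial gating described in Figure 5c can be illustrated with a toy, dependency-free Python sketch. It is not the SDGR implementation: the learned convolutions are replaced by three hypothetical scalar weights (`w_feat`, `w_dens`, `bias`), the feature is a single-channel 2D list, nearest-neighbor upsampling is assumed, and the refined feature is assumed to take the form feature + gate × feature. All function and parameter names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def upsample_nearest(density, factor):
    """Nearest-neighbor upsampling of a 2D list by an integer factor."""
    rows, cols = len(density), len(density[0])
    return [
        [density[r // factor][c // factor] for c in range(cols * factor)]
        for r in range(rows * factor)
    ]


def sdgr_toy(feature, density, w_feat=1.0, w_dens=1.0, bias=0.0):
    """Toy residual spatial gating on a single-channel feature map.

    Assumes the feature size is an exact integer multiple of the
    density-map size. The scalar weights stand in for learned layers.
    """
    factor = len(feature) // len(density)
    dens_up = upsample_nearest(density, factor)
    refined = []
    for r, row in enumerate(feature):
        refined_row = []
        for c, value in enumerate(row):
            # Fuse feature and upsampled density into a gate in (0, 1).
            gate = sigmoid(w_feat * value + w_dens * dens_up[r][c] + bias)
            # Residual connection: the gated response is added to the input.
            refined_row.append(value + gate * value)
        refined.append(refined_row)
    return refined


feature = [[1.0, 1.0, 1.0, 1.0],
           [1.0, 1.0, 1.0, 1.0]]
density = [[4.0, -4.0]]  # high density on the left, low on the right
refined = sdgr_toy(feature, density, w_feat=0.0)
```

In this toy setting the left half of the feature map, where the density value is high, is amplified more strongly than the right half, while the residual path keeps every response at least as large as its input. This mirrors the intended effect of enhancing features inside dense target regions without discarding the original signal.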
Figure 6. Training loss curves of baseline detectors and their SDG-equipped variants. All models exhibit smooth and stable convergence behavior, and the convergence speed remains comparable after introducing the SDG module.
Figure 7. Qualitative comparison of baseline detectors and SDG-ResNet variants on four background complexity levels in IR-SatDense. From left to right, each column corresponds to one background type: (i) easy, (ii) medium, (iii) complex, and (iv) extremely complex scenes under realistic on-orbit imaging conditions. For each detector, the first row shows the detection results of the baseline model, and the second row shows the corresponding results when equipped with the SDG-ResNet backbone. The detectors include Deformable DETR, DETA, and DINO. Red bounding boxes indicate ground-truth targets, and green bounding boxes indicate detected targets. Compared with their respective baselines, the SDG-equipped variants recover more densely distributed targets and suppress false alarms in non-target regions across all background levels, leading to clearer and more reliable detection in dense infrared scenes.
Table 1. Main characteristics of representative infrared small target detection datasets. Dens.: indicates whether the dataset includes explicit stratification of target density levels. AutoG.: automatically generated dataset using algorithmic synthesis. Pix.: pixel-level annotations. BBox: bounding-box annotations. Pt.: point-level annotations. ImgNum: number of images in the dataset. TgtAvg.: average number of targets per image.

| Dataset | AMID | Area | Dens. | AutoG. | Pix. | BBox | Pt. | ImgNum | TgtAvg. |
|---|---|---|---|---|---|---|---|---|---|
| NUDT-SIRST [1] | – | 30.95 | No | No | Yes | No | No | 1000 | 1.40 |
| SIRST [13] | – | 33.03 | No | No | Yes | No | No | 427 | 1.25 |
| IRSTD-1K [2] | – | 23.00 | No | No | Yes | No | No | 1000 | 1.50 |
| SIRST-AUG [11] | – | 83.00 | No | No | Yes | No | No | 9070 | 1.02 |
| SIRSTv2 [14] | – | 19.00 | No | No | Yes | Yes | Yes | 1024 | 0.68 |
| DMIST-60 [15] | 16.01 | 36.46 | No | Yes | No | Yes | No | 13,779 | 61.0 |
| DMIST-100 [15] | 12.26 | 36.28 | No | Yes | No | Yes | No | 13,779 | 101.0 |
| DenseSIRST [16] | 11.01 | 11.52 | No | Yes | Yes | Yes | Yes | 1024 | 13.38 |
| IR-SatDense (Ours) | 1.51 | 10.42 | Yes | Yes | Yes | Yes | Yes | 2154 | 58.04 |
Table 2. Quantitative comparison of detection performance at IoU = 0.50 across different density intervals. The probability of detection (PD) and false alarm rate (FA) are reported for the entire test set (All) and for subsets divided by AMID ranges. “Deformable DETR +”, “DETA +” and “DINO +” denote the corresponding baseline detectors equipped with the proposed SDG-ResNet backbone; the values in parentheses indicate changes relative to the corresponding baselines. ↑ indicates higher is better, and ↓ indicates lower is better.

| Methods | Backbone | All: PD (%) ↑ | All: FA ($10^{-4}$) ↓ | $AMID \leq 1$: PD (%) ↑ | $AMID \leq 1$: FA ($10^{-4}$) ↓ | $1 < AMID \leq 2$: PD (%) ↑ | $1 < AMID \leq 2$: FA ($10^{-4}$) ↓ | $2 < AMID \leq 3$: PD (%) ↑ | $2 < AMID \leq 3$: FA ($10^{-4}$) ↓ |
|---|---|---|---|---|---|---|---|---|---|
| ACM [13] | – | 28.96 | 22.01 | 5.89 | 30.53 | 26.54 | 22.29 | 52.83 | 12.65 |
| ALCNet [34] | – | 44.90 | 15.14 | 12.18 | 24.48 | 49.70 | 13.12 | 71.05 | 7.01 |
| RDIAN [35] | – | 75.72 | 5.99 | 62.72 | 8.24 | 82.22 | 4.50 | 81.53 | 4.39 |
| DNA_Net [1] | – | 80.43 | 4.76 | 70.07 | 7.35 | 84.84 | 3.57 | 85.97 | 3.17 |
| ISTDU-Net [36] | – | 79.11 | 5.03 | 66.98 | 7.45 | 84.47 | 3.85 | 86.47 | 3.58 |
| UIUNet [12] | – | 77.86 | 3.96 | 66.17 | 5.63 | 82.52 | 3.07 | 84.64 | 3.27 |
| U-Net [37] | – | 79.69 | 4.54 | 64.62 | 7.14 | 84.31 | 3.74 | 85.05 | 3.38 |
| ResUNet [38] | – | 79.11 | 4.81 | 67.89 | 7.41 | 84.31 | 3.74 | 85.04 | 3.39 |
| DRPCA-Net [39] | – | 34.83 | 13.14 | 14.32 | 18.14 | 41.22 | 11.14 | 48.66 | 9.59 |
| DETR [27] | ResNet50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DN-DETR [30] | ResNet50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Deformable DETR [28] | ResNet50 | 5.49 | 9.37 | 6.42 | 8.09 | 5.48 | 9.43 | 5.25 | 10.92 |
| DETA [40] | ResNet50 | 63.70 | 1.15 | 54.23 | 1.57 | 64.15 | 1.12 | 70.28 | 0.93 |
| DINO [31] | ResNet50 | 86.44 | 1.30 | 85.55 | 1.56 | 87.43 | 1.26 | 86.84 | 1.26 |
| Deformable DETR+ | SDG-ResNet50 | 7.13 (+1.64) | 9.72 (+0.35) | 7.90 (+1.48) | 8.26 (+0.17) | 7.11 (+1.63) | 9.69 (+0.26) | 7.34 (+2.09) | 11.12 (+0.20) |
| DETA+ | SDG-ResNet50 | 64.75 (+1.05) | 1.08 (−0.07) | 55.76 (+1.53) | 1.42 (−0.15) | 65.49 (+1.34) | 1.03 (−0.09) | 71.55 (+1.27) | 0.86 (−0.07) |
| DINO+ | SDG-ResNet50 | 86.82 (+0.38) | 1.34 (+0.04) | 85.71 (+0.16) | 1.51 (−0.05) | 87.51 (+0.08) | 1.31 (+0.05) | 87.41 (+0.57) | 1.36 (+0.10) |
Table 3. PD/FA comparison under different IoU thresholds on IR-SatDense. Improvements over the baselines are shown in parentheses. Higher PD indicates better detection performance, while lower FA indicates fewer false alarms.

| Methods | IoU = 0.3: PD (%) | IoU = 0.3: FA ($10^{-4}$) | IoU = 0.4: PD (%) | IoU = 0.4: FA ($10^{-4}$) | IoU = 0.5: PD (%) | IoU = 0.5: FA ($10^{-4}$) | IoU = 0.6: PD (%) | IoU = 0.6: FA ($10^{-4}$) | IoU = 0.7: PD (%) | IoU = 0.7: FA ($10^{-4}$) |
|---|---|---|---|---|---|---|---|---|---|---|
| Deformable DETR | 26.93 | 5.85 | 13.09 | 7.99 | 5.49 | 9.37 | 2.16 | 10.07 | 0.66 | 10.41 |
| DETA | 75.02 | 0.34 | 70.26 | 0.64 | 63.70 | 1.15 | 50.99 | 2.44 | 29.90 | 5.21 |
| DINO | 94.57 | 0.44 | 92.45 | 0.66 | 86.44 | 1.30 | 71.35 | 3.14 | 42.96 | 7.08 |
| Deformable DETR+ | 30.38 (+3.45) | 5.99 (+0.14) | 15.72 (+2.63) | 8.25 (+0.26) | 7.13 (+1.64) | 9.72 (+0.35) | 2.93 (+0.77) | 10.52 (+0.45) | 0.98 (+0.32) | 10.94 (+0.53) |
| DETA+ | 76.12 (+1.10) | 0.32 (−0.02) | 71.48 (+1.22) | 0.57 (−0.07) | 64.75 (+1.05) | 1.08 (−0.07) | 51.87 (+0.88) | 2.39 (−0.05) | 30.47 (+0.57) | 5.14 (−0.07) |
| DINO+ | 94.93 (+0.36) | 0.48 (+0.04) | 92.80 (+0.35) | 0.69 (+0.03) | 86.82 (+0.38) | 1.34 (+0.04) | 71.44 (+0.09) | 3.24 (+0.10) | 43.10 (+0.14) | 7.20 (+0.12) |
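To make the dependence of PD on the IoU threshold in Table 3 concrete, the following minimal Python sketch counts detected targets by greedy one-to-one box matching. It is an illustrative assumption rather than the evaluation protocol used for the tables: boxes are taken as (x1, y1, x2, y2) tuples, detections are assumed to be sorted by descending confidence, and the sketch returns the raw number of unmatched detections because the normalization behind the reported FA values is not restated here. The names `box_iou` and `evaluate` are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evaluate(gt_boxes, det_boxes, iou_thr=0.5):
    """Return (PD, number of unmatched detections) for one image.

    det_boxes is assumed to be sorted by descending confidence.
    Each ground-truth box can be matched at most once.
    """
    matched = set()
    true_pos = 0
    for det in det_boxes:
        best_j, best_iou = None, iou_thr
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue
            value = box_iou(det, gt)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matched.add(best_j)
            true_pos += 1
    pd = true_pos / len(gt_boxes) if gt_boxes else 0.0
    return pd, len(det_boxes) - true_pos


gt = [(0, 0, 4, 4), (10, 10, 14, 14)]
det = [(0, 0, 4, 4), (11, 11, 15, 15), (30, 30, 34, 34)]
pd_strict, false_strict = evaluate(gt, det, iou_thr=0.5)
pd_loose, false_loose = evaluate(gt, det, iou_thr=0.3)
```

In this example the second detection is shifted by one pixel relative to a 4 × 4 target, which gives an IoU of 9/23 (about 0.39). It counts as a hit at IoU = 0.3 but as a miss and a false detection at IoU = 0.5, so PD drops from 1.0 to 0.5. Small localization errors on very small targets therefore have a large effect on PD at stricter thresholds, which is consistent with the trend across the columns of Table 3.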
Table 4. Comparison of different methods on the IR-SatDense dataset in terms of probability of detection (PD), false alarm rate (FA), number of parameters (Params), and computational complexity (FLOPs). PD is expressed in percent (%) and FA in units of $10^{-4}$. Params denotes the total number of learnable parameters, while FLOPs are computed for a 512 × 512 input image. “Deformable DETR +”, “DETA +” and “DINO +” denote the corresponding baseline detectors equipped with the proposed SDG-ResNet backbone; the values in parentheses indicate changes relative to the baselines on the IR-SatDense test set. Higher PD indicates better detection performance, while lower FA, Params, and FLOPs indicate better efficiency.

| Methods | PD (%) ↑ | FA ($10^{-4}$) ↓ | Params ↓ | FLOPs ↓ |
|---|---|---|---|---|
| ACM [13] | 28.96 | 22.01 | 0.4 M | 0.4 G |
| ALCNet [34] | 44.90 | 15.14 | 0.4 M | 0.4 G |
| RDIAN [35] | 75.72 | 5.99 | 0.2 M | 3.7 G |
| DNA_Net [1] | 80.43 | 4.76 | 4.7 M | 14.3 G |
| ISTDU-Net [36] | 79.11 | 5.03 | 2.8 M | 7.9 G |
| UIUNet [12] | 77.86 | 3.96 | 2.8 M | 7.9 G |
| U-Net [37] | 79.69 | 4.54 | 2.8 M | 7.9 G |
| ResUNet [38] | 79.11 | 4.81 | 2.8 M | 7.9 G |
| DRPCA-Net [39] | 34.83 | 13.14 | 1.17 M | 74.36 G |
| DETR [27] | 0.00 | 0.00 | 41.5 M | 60.5 G |
| DN-DETR [30] | 0.00 | 0.00 | 43.6 M | 65.3 G |
| Deformable DETR [28] | 5.49 | 9.37 | 40.0 M | 123.3 G |
| DETA [40] | 63.70 | 1.15 | 48.3 M | 182.0 G |
| DINO [31] | 86.44 | 1.30 | 47.5 M | 178.5 G |
| Deformable DETR+ | 7.13 (+1.64) | 9.72 (+0.35) | 42.4 M | 130.8 G |
| DETA+ | 64.75 (+1.05) | 1.08 (−0.07) | 50.6 M | 189.5 G |
| DINO+ | 86.82 (+0.38) | 1.34 (+0.04) | 49.8 M | 186.1 G |
Table 5. Ablation study of SDG-ResNet on the IR-SatDense dataset. Probability of detection (PD) and false alarm rate (FA) are reported at $T_{IoU} = 0.50$ for the whole test set (All) and for three density intervals defined by AMID. SDG consistently improves PD over the DINO baseline, especially in the most dense regime ($AMID \leq 1$), while keeping FA at a comparable level. ↑ indicates higher is better, and ↓ indicates lower is better.

| Methods | All: PD (%) ↑ | All: FA ($10^{-4}$) ↓ | $AMID \leq 1$: PD (%) ↑ | $AMID \leq 1$: FA ($10^{-4}$) ↓ | $1 < AMID \leq 2$: PD (%) ↑ | $1 < AMID \leq 2$: FA ($10^{-4}$) ↓ | $2 < AMID \leq 3$: PD (%) ↑ | $2 < AMID \leq 3$: FA ($10^{-4}$) ↓ |
|---|---|---|---|---|---|---|---|---|
| DINO (baseline) | 86.44 | 1.30 | 85.55 | 1.56 | 87.43 | 1.26 | 86.84 | 1.26 |
| SDG@Res4 | 86.50 | 1.26 | 85.96 | 1.44 | 87.47 | 1.20 | 86.97 | 1.20 |
| SDG@Res3 | 86.67 | 1.22 | 85.78 | 1.41 | 87.29 | 1.17 | 87.12 | 1.17 |
| SDG (full) | 86.82 | 1.34 | 85.71 | 1.51 | 87.51 | 1.31 | 87.41 | 1.36 |
Table 6. Comparison of detection performance on the sparse infrared small target dataset IRSTD-1K. Probability of detection (PD) and false alarm rate (FA) are reported on the full test set. “Deformable DETR +”, “DETA +” and “DINO +” denote the corresponding baseline detectors equipped with the proposed SDG-ResNet backbone; the values in parentheses indicate changes relative to the baselines on IRSTD-1K. ↑ indicates higher is better, and ↓ indicates lower is better.

| Methods | PD (%) ↑ | FA ($10^{-4}$) ↓ |
|---|---|---|
| ACM [13] | 82.49 | 0.97 |
| ALCNet [34] | 84.18 | 0.78 |
| RDIAN [35] | 80.81 | 0.30 |
| DNA_Net [1] | 82.83 | 0.26 |
| ISTDU-Net [36] | 87.21 | 0.44 |
| UIUNet [12] | 86.20 | 0.60 |
| U-Net [37] | 84.58 | 0.49 |
| ResUNet [38] | 82.83 | 0.32 |
| DETR [27] | 0.00 | 0.00 |
| DN-DETR [30] | 0.00 | 0.00 |
| Deformable DETR [28] | 77.18 | 0.18 |
| DETA [40] | 85.65 | 0.17 |
| DINO [31] | 85.46 | 0.17 |
| Deformable DETR+ | 78.46 (+1.28) | 0.21 (+0.03) |
| DETA+ | 86.22 (+0.57) | 0.10 (−0.07) |
| DINO+ | 85.81 (+0.35) | 0.17 (−0.00) |
Table 7. PD/FA comparison under different background complexity levels on IR-SatDense at IoU = 0.5. Improvements over the baselines are shown in parentheses. Higher PD indicates better detection performance, while lower FA indicates fewer false alarms.

| Methods | Easy: PD (%) | Easy: FA ($10^{-4}$) | Medium: PD (%) | Medium: FA ($10^{-4}$) | Complex: PD (%) | Complex: FA ($10^{-4}$) | Extreme: PD (%) | Extreme: FA ($10^{-4}$) |
|---|---|---|---|---|---|---|---|---|
| Deformable DETR | 4.80 | 9.17 | 5.28 | 9.71 | 5.67 | 9.20 | 6.21 | 9.41 |
| DETA | 58.22 | 1.17 | 63.34 | 1.27 | 65.12 | 1.16 | 67.65 | 1.04 |
| DINO | 80.75 | 1.48 | 86.76 | 1.35 | 88.05 | 1.17 | 89.92 | 1.21 |
| Deformable DETR + SDG | 6.90 (+2.10) | 9.40 (+0.23) | 7.44 (+2.16) | 9.80 (+0.08) | 6.98 (+1.30) | 9.67 (+0.47) | 7.22 (+1.01) | 10.05 (+0.64) |
| DETA + SDG | 59.21 (+0.99) | 1.18 (+0.01) | 64.86 (+1.53) | 1.12 (−0.15) | 66.63 (+1.50) | 1.06 (−0.09) | 67.96 (+0.31) | 0.99 (−0.05) |
| DINO + SDG | 81.74 (+0.98) | 1.50 (+0.02) | 86.98 (+0.22) | 1.33 (−0.02) | 88.59 (+0.54) | 1.22 (+0.05) | 89.74 (−0.18) | 1.35 (+0.14) |
Table 8. PD (%) comparison under different target size intervals on IR-SatDense at IoU = 0.5, where $A$ denotes the target area in pixels. The proportion of ground-truth targets in each size interval is given in parentheses in the column headers. Changes relative to the corresponding baselines are given in parentheses after the PD values. Higher PD indicates better detection performance.

| Methods | $A \leq 3 \times 3$ (38.16%) | $3 \times 3 < A \leq 4 \times 4$ (41.76%) | $A > 4 \times 4$ (20.08%) |
|---|---|---|---|
| Deformable DETR | 0.13 | 6.88 | 15.74 |
| DETA | 41.29 | 72.71 | 88.16 |
| DINO | 77.13 | 92.02 | 92.34 |
| Deformable DETR + SDG | 0.69 (+0.56) | 9.54 (+2.66) | 17.22 (+1.48) |
| DETA + SDG | 43.05 (+1.76) | 73.61 (+0.90) | 88.23 (+0.07) |
| DINO + SDG | 77.10 (−0.03) | 92.72 (+0.70) | 92.84 (+0.50) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, X.; An, W.; Ying, X.; Li, R.; Chen, N.; Li, B.; Xiao, C.; Li, M. Semantic Density-Guided ResNet for Dense Infrared Small Target Detection. Remote Sens. 2026, 18, 1397. https://doi.org/10.3390/rs18091397
