Semantic Density-Guided ResNet for Dense Infrared Small Target Detection

Zhang, Xin; An, Wei; Ying, Xinyi; Li, Ruojing; Chen, Nuo; Li, Boyang; Xiao, Chao; Li, Miao

doi:10.3390/rs18091397


by Xin Zhang, Wei An, Xinyi Ying, Ruojing Li, Nuo Chen, Boyang Li, Chao Xiao and Miao Li *

College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(9), 1397; https://doi.org/10.3390/rs18091397

Submission received: 2 February 2026 / Revised: 12 March 2026 / Accepted: 24 March 2026 / Published: 1 May 2026

(This article belongs to the Section Remote Sensing Image Processing)


Highlights

What are the main findings?

A Semantic Density-Guided ResNet (SDG-ResNet) is proposed to explicitly exploit high-level semantic density for improving infrared small target detection.
The proposed method consistently improves detection performance, especially in dense-target scenarios, while maintaining competitive results in sparse scenes.

What are the implications of the main findings?

High-level semantic density information can serve as an effective global prior to guide low-level feature refinement for dense infrared target detection.
The proposed SDG-ResNet can be seamlessly integrated into existing transformer-based detectors, offering a practical and lightweight solution for space-based remote sensing applications.

Abstract

Dense infrared small target detection (ISTD) in long-range remote sensing is critical for multi-target surveillance, yet existing benchmarks mostly contain only sparsely distributed targets and rarely reflect dense scenes. To address this limitation, we construct a new dense satellite ISTD dataset, IR-SatDense, by compositing small targets onto real satellite infrared backgrounds and partitioning it into subsets using the Average Minimum Inter-Target Distance (AMID) to explicitly control target density. By visualizing multi-stage backbone features, we observe that in dense scenes the deepest stage naturally forms compact, high-response target clusters in the semantic feature maps, while low- and middle-level features remain heavily cluttered. This motivates us to treat high-level semantic density as a global prior to guide low-level feature enhancement. Therefore, we propose Semantic Density-Guided ResNet (SDG-ResNet), a plug-in backbone that attaches a lightweight semantic density head to the deepest stage and injects the predicted density map into intermediate layers through Semantic Density-Guided Refine (SDGR) blocks with residual spatial gating. Integrated into representative transformer-based detectors, including Deformable DETR, DETA, and DINO, SDG-ResNet consistently improves the probability of detection (PD) at comparable false alarm (FA) levels on IR-SatDense while maintaining competitive performance on the sparse dataset IRSTD-1K.

Keywords:

infrared small target detection; dense infrared scenes; semantic density-guided ResNet; density-aware backbone

1. Introduction

Infrared imaging does not rely on external illumination, enabling all-day and all-weather operation, strong penetration through smoke and haze, and high robustness under complex lighting conditions. Owing to these advantages, it has been widely adopted in long-distance target detection, environmental perception, and animal protection [1,2,3,4]. Among these applications, small infrared target detection plays a critical role in remote sensing situational awareness, long-range object monitoring, and maritime search and rescue, and it has emerged as a core research topic in intelligent infrared image analysis [5,6].

In recent remote sensing missions, a single infrared frame often contains numerous small targets that are extremely tiny, exhibit very low signal-to-noise ratios (SNRs), and are densely distributed amid complex background clutter. Under such dense conditions, conventional infrared small target detection methods typically suffer from degraded detection accuracy, increased miss rates, and elevated false alarm levels [7,8,9]. In particular, when the minimum inter-target distance becomes comparable to or smaller than the target size, responses from adjacent targets tend to overlap, and fine-grained target details are progressively lost during backbone downsampling, making accurate detection and reliable separation of neighboring targets especially challenging [10].

To better understand how deep backbones behave in dense and sparse scenes, Figure 1 visualizes feature responses at low, middle, and high stages for a dense target cluster and for a single-target scene. In the dense case, the low- and middle-level feature maps contain both clutter and target responses, whereas the high-level feature map shows a clear, compact semantic focus around the dense target cluster. In contrast, for the single-target case, the high-level semantic responses are much weaker and more diffuse. This phenomenon suggests that, in dense infrared scenes, dense target clusters naturally generate strong, well-aggregated semantic information at high levels, which can serve as a reliable prior to enhance low-level target details while suppressing background noise. However, most existing infrared small target detection networks still rely on generic multi-scale fusion or attention mechanisms built on ResNet features [1,2,11,12], where all spatial locations are processed in a largely density-agnostic manner. Although these methods aggregate multi-level semantics, they do not explicitly construct or reuse a high-level semantic density map of dense target clusters as a global prior to guide the refinement of low-level features in dense regions.

Existing infrared small target detection models, therefore, exhibit several limitations in dense scenes:

(1) Local spatial saliency limitation: Most current approaches are built upon local spatial saliency, enhancing targets by exploiting intensity contrast between a target and its surrounding background. When multiple adjacent targets are densely distributed in the same region, these methods struggle to distinguish subtle energy differences between targets and background, resulting in target adhesion, blurred responses, and incomplete separation in the detection maps.

(2) Density-aware semantic guidance deficiency: Although recent deep models employ powerful backbones and transformer-based heads, the backbone feature extraction process is typically bottom-up and does not explicitly encode target density. Features at different spatial locations are processed in a homogeneous manner, regardless of whether they belong to dense target clusters or mostly background. As illustrated in Figure 1, dense target clusters already induce strong, aggregated responses in high-level semantic features, but these cues are not reused to guide lower layers. Consequently, mid- and low-level features in dense regions are easily dominated by clutter-like responses, causing a sharp drop in probability of detection (PD) and a rise in false alarms (FAs) when the Average Minimum Inter-Target Distance (AMID) becomes small.

(3) Dense-target structural limitation: Traditional sparse-target models do not explicitly model the structural hierarchy of densely distributed targets in the spatial domain. In multi-target scenes with high spatial density, they often produce ambiguous spatial structures and mutual feature interference, which leads to degraded localization accuracy and unstable detection performance.

To overcome the above limitations in dense-target detection, this paper proposes a Semantic Density-Guided (SDG) backbone that explicitly leverages high-level semantic density to guide low-level feature enhancement. Instead of introducing complex attention blocks or modifying the detection head, SDG estimates a semantic density map from the deepest backbone stage and reuses it as a global prior to refine intermediate features. Concretely, a lightweight semantic density head predicts a one-channel semantic density map from the highest-level feature, and a set of Semantic Density-Guided Refine (SDGR) blocks injects this prior into mid- and low-level feature maps via residual spatial gating. In this way, dense target clusters that are already prominent in high-level semantics are used to enhance fine-grained target details in dense regions while suppressing noise responses in complex background areas. As a result, the backbone can respond differently in dense and sparse regions while preserving the benefits of existing transformer-based detectors.

The main contributions of this paper can be summarized as follows:

(1) Algorithmic contribution: We propose a Semantic Density-Guided backbone (SDG-ResNet) that augments a standard ResNet with a semantic density head and lightweight SDGR blocks. The deepest-stage feature is compressed into a semantic density map, which is then reused to modulate intermediate features through residual spatial gating. This design explicitly exploits high-level semantic density from dense target clusters to enhance low-level target details, providing density-aware semantic guidance for dense infrared small target detection while introducing only negligible additional parameters and FLOPs.

(2) Dataset contribution: We construct a novel simulated dataset named IR-SatDense (Infrared Satellite-Based Dense Small Target Dataset). Built on real satellite-based infrared backgrounds, IR-SatDense comprehensively simulates diverse target densities, signal-to-noise ratios (SNRs), and morphologies and is organized into subsets according to the Average Minimum Inter-Target Distance (AMID). This dataset provides controllable and realistic experimental scenarios for systematically evaluating detection algorithms under complex dense infrared conditions.

(3) Experimental contribution: We integrate the proposed SDG-ResNet into several transformer-based detectors, including DINO, Deformable DETR, and DETA, and conduct extensive experiments on the IR-SatDense dataset as well as on the sparse-target benchmark IRSTD-1K. Experimental results demonstrate that SDG-based detectors consistently improve PD at comparable FA levels, with particularly large gains in small-AMID (high-density) regimes, while maintaining strong performance on sparse infrared small target datasets.

The structure of this paper is as follows. Section 2 reviews existing infrared small target detection datasets and methods. Section 3 describes the construction and statistical analysis of the proposed IR-SatDense dataset. Section 4 presents the Semantic Density-Guided ResNet (SDG-ResNet) backbone and its integration into DETR-style detectors. Section 5 reports implementation details, benchmark comparisons, ablation studies, and visual analyses on IR-SatDense and IRSTD-1K. Finally, Section 6 concludes the paper and discusses future research directions.

2. Related Works

2.1. Existing Infrared Small Target Detection Datasets

Most publicly available datasets for infrared small target detection (ISTD) have been primarily designed for sparse-target scenarios, where each frame contains only a few isolated targets and the minimum inter-target distance is relatively large. Although these datasets have significantly advanced early ISTD research, they still fail to represent the spatial distribution characteristics of dense targets under complex backgrounds. In recent years, researchers have increasingly shifted their attention toward dense infrared small target detection (dense ISTD), leading to the creation of several new datasets tailored for this task. Overall, existing datasets can be broadly divided into two categories: sparse-target datasets and dense-target datasets.

(1) Sparse infrared small target datasets: Representative sparse datasets include NUDT-SIRST [1], SIRST and its extensions (SIRSTv2, SIRST-AUG) [11,13,14], and IRSTD-1K [2]. These datasets mainly consist of single-frame images containing a few, independently distributed targets, with most images including only one or two targets. For instance, NUDT-SIRST contains approximately 1000 images with an average of 1.2 targets per frame, SIRST provides 427 images (about 90% single-target), and IRSTD-1K offers 1000 pixel-level annotated images with multiple target categories and higher annotation precision. However, these datasets remain focused on sparse-target conditions and do not reflect scenarios where multiple small targets appear densely within a single frame.

(2) Dense infrared small target datasets: To overcome the above limitations, several dense or multi-target datasets have been developed, including DMIST-60/100 [15] and DenseSIRST [16]. The DMIST series provides multi-frame sequences containing varying numbers of targets to simulate realistic detection and tracking scenarios, while DenseSIRST is a single-frame dataset featuring densely distributed targets and pixel-level semantic background annotations, enabling studies on how semantic priors contribute to dense ISTD. Nevertheless, even in these datasets, the Average Minimum Inter-Target Distance is still several times larger than the target size, indicating that their density is insufficient to represent highly crowded distributions where targets are almost contiguous.

Compared with the aforementioned datasets, our IR-SatDense dataset exhibits significantly higher target density and stronger realism. It is synthesized using the proposed Dense Single-Frame Target Dataset Generator (DSTDGen) based on real satellite infrared backgrounds and is organized into multiple subsets corresponding to different density levels measured by the Average Minimum Inter-Target Distance (AMID). This dataset provides a comprehensive and realistic benchmark for evaluating ISTD algorithms in complex and densely populated scenarios and is particularly suitable for analyzing how detection performance changes as AMID decreases from sparse to extremely dense regimes.

2.2. Infrared Small Target Detection Methods

Infrared small target detection (ISTD), as a core component of infrared sensing systems, remains a fundamental yet challenging research topic. According to their underlying mechanisms, existing ISTD approaches can be broadly classified into two categories: model-driven methods and data-driven methods.

(1) Model-driven methods: These methods rely on handcrafted features and prior assumptions to distinguish targets from background clutter. Feature-based approaches, including LCM, ILCM, NLCM, and RLCM [17,18,19,20], enhance local contrast between targets and background by exploiting intensity differences. Background modeling methods, such as top-hat filtering and max–median filtering [21,22], leverage local background consistency to suppress low-frequency components and highlight potential targets. Low-rank sparse decomposition (LRSD)-based models, including IPI and PSTNN [23,24], assume a low-rank background and sparse targets to achieve separation. Although these model-driven algorithms have achieved considerable progress, they often struggle in complex or dense-target scenes where the structural hierarchy of multiple targets and background interactions is not explicitly modeled.

(2) Data-driven methods: With the rapid development of deep learning, data-driven approaches have become the dominant paradigm in ISTD. These methods use neural networks to automatically learn multi-level features and complex mappings between targets and backgrounds, offering stronger robustness and adaptability than traditional handcrafted approaches. Recent works are mostly segmentation-based, where small targets are localized via pixel-wise prediction. Representative methods include SANet [25], ACM [13], AGPCNet [11], DNANet [1], IAANet [26], ISNet [2], and UIUNet [12]. Although these models achieve impressive performance in sparse-target scenes, they often fail in dense scenes where closely spaced targets tend to merge in segmentation masks, leading to degraded localization accuracy.

To address these challenges, anchor-based detection frameworks have been introduced as more explicit and discriminative alternatives. By directly regressing bounding-box coordinates and class labels, anchor-based detectors effectively separate adjacent targets and support both single-frame and multi-frame detection. Recently, transformer-based end-to-end detectors (e.g., DETR, Deformable DETR, DAB-DETR, DN-DETR, and DINO) [27,28,29,30,31] have further advanced this field by incorporating deformable attention, dynamic anchors, and denoising strategies to jointly improve accuracy and convergence speed. Overall, anchor-based and transformer-based detectors demonstrate superior discriminative capability and spatial precision in dense infrared small target detection. However, their backbones typically extract features in a purely bottom-up and density-agnostic manner, treating all spatial locations uniformly without explicit semantic modeling of target density. This limits their robustness when targets become extremely dense (small AMID) and motivates the semantic density-guided backbone proposed in this work.

3. IR-SatDense Dataset Synthesis

In this section, we describe the proposed IR-SatDense dataset in detail.

3.1. Dataset Construction

(1) Data Collection: To construct a dense infrared small target dataset that reflects realistic on-orbit imaging conditions with complex backgrounds, a total of 2154 background images were collected from public satellite infrared imagery sources. These images cover diverse Earth observation scenarios and represent various background characteristics commonly encountered in infrared imaging tasks. All background images were cropped and preprocessed to a uniform resolution of $512 \times 512$ pixels. According to background complexity, the dataset is categorized into four levels: (i) easy, (ii) medium, (iii) complex, and (iv) extremely complex scenes. Representative examples and the statistical distribution of each background level are shown in Figure 2.

(2) Targets and Annotations: To construct the proposed IR-SatDense dataset, a Dense Single-Frame Target Dataset Generator (DSTDGen) algorithm is developed. The pseudocode is presented in Algorithm 1, which consists of four main stages designed to automatically generate infrared small target samples with controllable density distributions and precise annotations on diverse backgrounds.

Step 1: Parameter and Template Initialization. The input includes a background infrared image $I \in \mathbb{R}^{H \times W}$ and a set of preselected small-target templates $\{T_{1}, T_{2}, \dots, T_{n}\}$. The mean target number $\mu_{a}$ and its variance $\sigma_{a}$ determine the number of targets N generated on each image. The mean nearest inter-target distance d controls spatial density, while the SNR range $[\mathrm{SNR}_{\min}, \mathrm{SNR}_{\max}]$ defines target intensity. Each image initializes a random start position $p_{1} = (y_{1}, x_{1})$ and selects N random templates for subsequent placement (corresponding to lines 1–5 in Algorithm 1).

Step 2: Target Placement under Spatial Constraints. Targets are iteratively placed on the background while satisfying the mean nearest distance d. The first target is randomly rotated and placed at $p_{1}$. For each subsequent target j, its position $p_{j}$ is sampled within the convex envelope of previously placed targets $\{P_{1}, \dots, P_{j-1}\}$. If the mean inter-target distance deviates from d by more than 0.5 pixels, a local coordinate adjustment $(\pm \Delta x, \pm \Delta y)$ is applied to fine-tune $p_{j}$. If a valid position is found, the image and statistics are updated; otherwise, a fail counter is increased to maintain stability. This process ensures controllable inter-target spacing and a stable dense distribution (corresponding to lines 6–21 in Algorithm 1).

Step 3: Target Composition with Controllable Signal-to-Noise Ratio (SNR). To simulate the radiometric properties of real infrared point targets, each template $T_{i}$ is normalized and enhanced according to a sampled SNR within $[\mathrm{SNR}_{\min}, \mathrm{SNR}_{\max}]$. For each target position, the local background mean $\mu_{bg}$ and standard deviation $\sigma_{bg}$ are calculated, and target brightness is adjusted as

$$ T_{i}' = \mathrm{SNR}_{i} \cdot \sigma_{bg} \cdot \frac{T_{i}}{\max(T_{i})}. \tag{1} $$

The adjusted target is then superimposed on the background image to form a composite result, while generating a binary mask and coordinate file to record precise pixel positions. This ensures that synthetic targets exhibit realistic contrast and radiometric characteristics consistent with real infrared imagery (corresponding to lines 9–15 in Algorithm 1).
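As a minimal sketch of Eq. (1), the snippet below scales a template so that its peak equals the sampled SNR times the local background standard deviation and then adds it to the background. It is illustrative only and not the DSTDGen implementation: the function names `scale_template_to_snr` and `composite`, the use of the template footprint as the window for the local statistics, and the rule that nonzero template pixels form the binary mask are all assumptions that the text does not specify.

```python
import numpy as np

def scale_template_to_snr(template, background_patch, snr):
    """Eq. (1): T' = SNR * sigma_bg * T / max(T)."""
    template = np.asarray(template, dtype=np.float64)
    sigma_bg = float(np.std(background_patch))  # local background standard deviation
    peak = template.max()
    if peak <= 0:
        raise ValueError("template must have a positive peak")
    return snr * sigma_bg * template / peak

def composite(background, template, top_left, snr):
    """Additively superimpose the scaled template; return (image, binary mask).

    Assumption: the local statistics window is the template footprint itself.
    """
    template = np.asarray(template, dtype=np.float64)
    y, x = top_left
    h, w = template.shape
    out = np.asarray(background, dtype=np.float64).copy()
    patch = out[y:y + h, x:x + w]
    scaled = scale_template_to_snr(template, patch, snr)
    out[y:y + h, x:x + w] = patch + scaled
    mask = np.zeros(out.shape, dtype=bool)
    mask[y:y + h, x:x + w] = scaled > 0  # assumption: nonzero pixels are target
    return out, mask

# Toy example: a 3x3 cross-shaped template on a noisy synthetic background.
rng = np.random.default_rng(0)
background = rng.normal(100.0, 5.0, size=(64, 64))
template = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]], dtype=float)
image, mask = composite(background, template, (30, 30), snr=3.0)
```

In this toy case the brightest added pixel sits exactly `snr * sigma_bg` above the background, which is the sense in which the SNR is controllable.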

Step 4: Output Generation and Statistical Annotation. After all N targets are added, the algorithm outputs four components: (1) the synthesized infrared image $I'$, (2) its corresponding binary mask $M'$, (3) the coordinate file $C'$ containing pixel positions, and (4) the statistical table $S'$ with $\{\mathrm{SNR}, \mu_{bg}, \sigma_{bg}, \mathrm{peak}, \mathrm{size}, d\}$. Together, these outputs constitute one complete IR-SatDense sample with controllable density, brightness, and SNR (corresponding to line 22 in Algorithm 1).

Algorithm 1 Pseudocode of DSTDGen Algorithm

Input: Background image $I \in \mathbb{R}^{H \times W}$, candidate target set $\{T_{1}, \dots, T_{n}\}$, mean target number $\mu_{a}$, variance $\sigma_{a}$, mean nearest distance d, SNR range $[\mathrm{SNR}_{\min}, \mathrm{SNR}_{\max}]$.

1: Initialize output folders and parameters.
2: for each background image I in IR-SatDense do
3:     $N \leftarrow \mathrm{random\_normal}(\mu_{a}, \sigma_{a})$
4:     Choose start position $p_{1}$, select N templates $\{T_{i}\}$
5:     Initialize $\mathrm{STATS} = \emptyset$, $\mathrm{output} = I$
6:     for $j = 1$ to N do
7:         $T_{j}' \leftarrow \mathrm{random\_rotate}(T_{j})$
8:         if $j = 1$ then
9:             $(I_{j}, P_{j}, \mathrm{info}) \leftarrow \mathrm{add\_target\_SNR}(\mathrm{output}, T_{j}', p_{j})$
10:            $V_{j} \leftarrow \mathrm{expand\_envelope}(P_{j}, d)$
11:        else
12:            $p_{j} \leftarrow \mathrm{random\_point}(V_{j-1})$
13:            $(I_{j}, P_{j}, \mathrm{info}) \leftarrow \mathrm{add\_target\_SNR}(\mathrm{output}, T_{j}', p_{j})$
14:            if $|\mathrm{mean\_distance}(\{P_{1}, \dots, P_{j}\}) - d| > 0.5$ then
15:                Adjust $p_{j}$ within $(\pm \Delta x, \pm \Delta y)$; update $I_{j}$ if valid, else increase fail counter.
16:            end if
17:            $V_{j} \leftarrow \mathrm{expand\_envelope}(P_{j}, d)$
18:        end if
19:        Append $\mathrm{info}$ to $\mathrm{STATS}$
20:        $\mathrm{output} \leftarrow I_{j}$
21:    end for
22:    Save $I'$, $M'$, $C'$, and $S'$
23: end for

Output: Synthetic image $I'$, mask $M'$, coordinate file $C'$, and statistics $S'$.

3.2. Statistical Analysis

To comprehensively evaluate the representativeness and effectiveness of the proposed IR-SatDense dataset, a statistical comparison is conducted against several widely used benchmark datasets for infrared small target detection (ISTD). In dense-target scenarios, conventional metrics such as target count or area are insufficient to describe the degree of spatial compactness. Therefore, we introduce the Average Minimum Inter-Target Distance (AMID) metric to quantitatively characterize the density of target distributions within each image.

Definition 1.

Average Minimum Inter-Target Distance (AMID). For the i-th image $I_{i}$ in the dataset containing $M_{i}$ targets $\{T_{j}\}_{j=1}^{M_{i}}$, the minimum Euclidean distance $d_{j}$ from the j-th target to all other targets is defined as

$$ d_{j} = \min_{k \neq j} \; \min_{y_{k} \in T_{k},\, y_{j} \in T_{j}} \lVert y_{k} - y_{j} \rVert_{2}. \tag{2} $$

The image-level average minimum distance is then given by

$$ \mathrm{AMID}_{i} = \frac{1}{M_{i}} \sum_{j=1}^{M_{i}} d_{j}. \tag{3} $$

Finally, the overall dataset-level AMID is computed as

$$ \mathrm{AMID}_{\mathrm{all}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AMID}_{i}, \tag{4} $$

where N denotes the total number of images in the dataset. A smaller AMID value indicates stronger spatial compactness among targets, corresponding to a higher-density and more challenging detection scenario. Representative visual examples for different AMID ranges are shown in Figure 3.
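Eqs. (2)–(4) can be sketched in a few lines, representing each target as a list of (row, column) pixel coordinates. The function names are illustrative, and rejecting images with fewer than two targets (for which Eq. (2) is undefined) is an assumption, since the text does not say how such images are handled.

```python
import numpy as np

def min_distances(targets):
    """Eq. (2): for each target, the smallest pixel-to-pixel distance to any other target."""
    pts = [np.asarray(t, dtype=np.float64).reshape(-1, 2) for t in targets]
    if len(pts) < 2:
        raise ValueError("Eq. (2) needs at least two targets in the image")
    result = []
    for j, tj in enumerate(pts):
        best = np.inf
        for k, tk in enumerate(pts):
            if k == j:
                continue
            diff = tj[:, None, :] - tk[None, :, :]  # all pixel pairs between targets j and k
            best = min(best, float(np.sqrt((diff ** 2).sum(axis=-1)).min()))
        result.append(best)
    return result

def amid_image(targets):
    """Eq. (3): mean of the per-target minimum distances in one image."""
    return float(np.mean(min_distances(targets)))

def amid_dataset(images):
    """Eq. (4): mean of the image-level AMID values."""
    return float(np.mean([amid_image(t) for t in images]))

# Three single-pixel targets: nearest-neighbour distances are 3, 3 and 4.
image_a = [[(0, 0)], [(0, 3)], [(4, 0)]]
# Two adjacent single-pixel targets: both distances are 1.
image_b = [[(0, 0)], [(0, 1)]]
print(amid_image(image_a))               # (3 + 3 + 4) / 3
print(amid_dataset([image_a, image_b]))  # mean of the two image-level values
```

Because the inner minimum runs over pixel pairs rather than centroids, two extended targets whose edges touch have a distance of about one pixel even if their centers are several pixels apart.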

As summarized in Table 1, the proposed IR-SatDense dataset contains 2154 images, divided into 50% for training, 25% for validation, and 25% for testing. To facilitate more detailed performance evaluation across different spatial density levels, the test subset is further partitioned based on the AMID metric, allowing a systematic analysis of model robustness under varying target compactness.

Compared with existing datasets, IR-SatDense exhibits substantially higher target density and smaller average target size, while simultaneously covering multiple background complexity levels. Specifically, its average target area is approximately 10.42 pixels (corresponding to an average target width of about 3 pixels), which accurately reflects the small-scale and low-intensity nature of real infrared small targets. Furthermore, the dataset’s AMID value of 1.51 is significantly smaller than that of previous dense-target datasets, indicating much closer inter-target spacing and a higher degree of detection difficulty. Figure 2b illustrates the proportional distribution of target density levels across images with different background complexities, demonstrating that IR-SatDense provides a comprehensive benchmark for dense-target infrared detection research and for studying how detection performance degrades as AMID decreases.

4. Proposed Baseline

In this section, we present the proposed Semantic Density-Guided ResNet (SDG-ResNet) backbone and its integration into a DINO-based detector for dense infrared small target detection. The core idea is to estimate a semantic density map from the deepest ResNet stage and use it as a global prior to refine mid- and low-level features via lightweight residual gating.

4.1. Motivation

To quantitatively assess the influence of target density on infrared small target detection, we employ DINO to derive PD and FA curves under varying Average Minimum Inter-Target Distance (AMID) and IoU thresholds. In this context, AMID characterizes the average distance to the nearest neighboring target in the image plane, where a smaller AMID corresponds to a denser target distribution.

As shown in Figure 4, when the AMID decreases from sparse to dense intervals, the PD of existing detectors consistently drops, especially under stricter IoU thresholds. At the same time, the FA curves show the opposite trend: dense scenes (small AMID) yield significantly more false alarms than sparse scenes. These observations clearly demonstrate that current architectures are not robust enough in dense infrared small-target scenarios, even if they perform well when targets are relatively sparse.

Although different detectors adopt different backbones and heads, most of them share a common design philosophy: backbone features are extracted in a purely bottom-up manner, and the notion of “how dense the targets are” is not explicitly encoded in the feature representation. All spatial locations are essentially treated in the same way, regardless of whether they belong to dense target regions or mostly background. As a result, when many small targets appear in close proximity, mid- and low-level features tend to be dominated by clutter-like responses, and the detector has difficulty maintaining high PD and low FA in such dense regimes.

Motivated by these observations, we aim to endow the backbone with an explicit awareness of target density and a simple mechanism to adapt its feature responses in dense regions. Instead of relying solely on the detection head, we introduce a Semantic Density-Guided ResNet (SDG-ResNet). In SDG-ResNet, the deepest ResNet stage is used to estimate a semantic density map that reflects the spatial distribution of potential target clusters, and this map is then employed to refine intermediate features through lightweight residual gating. In this way, the backbone can respond differently in dense and sparse areas, with the specific goal of improving PD and suppressing FA under small-AMID, dense infrared small-target scenarios.

4.2. Network Overview

Given an input infrared image

$$ I \in \mathbb{R}^{3 \times H \times W}, \tag{5} $$

a standard ResNet-50 backbone extracts three feature maps

$$ \left\{ \begin{aligned} X_{3} &\in \mathbb{R}^{C_{3} \times H_{3} \times W_{3}}, \\ X_{4} &\in \mathbb{R}^{C_{4} \times H_{4} \times W_{4}}, \\ X_{5} &\in \mathbb{R}^{C_{5} \times H_{5} \times W_{5}}, \end{aligned} \right. \tag{6} $$

where $(C_{3}, C_{4}, C_{5}) = (512, 1024, 2048)$ for ResNet-50. In dense infrared scenes, the deepest feature $X_{5}$ encodes strong semantic responses of target clusters, while $X_{3}$ and $X_{4}$ contain detailed structures but are heavily contaminated by background clutter.
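To make the shapes in Eq. (6) concrete, the short sketch below tabulates them for a given input size. The channel counts come from the text; the strides of 8, 16, and 32 for the three stages are an assumption based on the usual ResNet-50 layout and are not stated in this section. The function name `feature_shapes` is illustrative.

```python
CHANNELS = {3: 512, 4: 1024, 5: 2048}  # (C3, C4, C5) as given in the text
STRIDES = {3: 8, 4: 16, 5: 32}         # assumption: usual ResNet-50 stage strides

def feature_shapes(height, width):
    """Return {level: (C, H_l, W_l)} for an input of size height x width."""
    shapes = {}
    for level in (3, 4, 5):
        s = STRIDES[level]
        # Ceiling division mirrors stride-2 layers with "same"-style padding.
        shapes[level] = (CHANNELS[level], -(-height // s), -(-width // s))
    return shapes

for level, shape in feature_shapes(512, 512).items():
    print(f"X{level}: {shape}")
```

Under these assumptions, a $512 \times 512$ IR-SatDense image yields maps of $64 \times 64$, $32 \times 32$, and $16 \times 16$, so a target about 3 pixels wide occupies less than one cell at every stage.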

To exploit this property, we augment the backbone with two components:

A semantic density head attached to $X_{5}$ that predicts a one-channel semantic density map $D$ ;
Two Semantic Density-Guided Refine (SDGR) blocks that use $D$ to refine $X_{4}$ and $X_{3}$ in a residual manner.

As illustrated in Figure 5, the proposed SDG-ResNet consists of an overall detection framework, a semantic density head, and SDGR blocks for cross-stage feature refinement. Formally, the refined feature maps are given by

$$ \left\{ \begin{aligned} \tilde{X}_{4} &= \mathrm{SDGR}(X_{4}, D), \\ \tilde{X}_{3} &= \mathrm{SDGR}(X_{3}, D), \\ \tilde{X}_{5} &= X_{5}. \end{aligned} \right. \tag{7} $$

The set $\{\tilde{X}_{3}, \tilde{X}_{4}, \tilde{X}_{5}\}$ is then fed into a ChannelMapper neck and a DINO transformer head, which remain unchanged with respect to the baseline detector. We refer to the resulting detector as SDG DINO.

4.3. Semantic Density Head

4.3.1. Architecture

The goal of the semantic density head is to compress the high-level feature $X_{5}$ into a scalar field that reflects the spatial distribution of targets or target clusters. The head consists of a $1 \times 1$ convolution for channel reduction, followed by a $3 \times 3$ convolution for local context aggregation.

Given $X_{5} \in \mathbb{R}^{C_{5} \times H_{5} \times W_{5}}$, the intermediate feature $F_{5}$ and the semantic density map $D$ are jointly computed as

$$ \left\{ \begin{aligned} F_{5} &= \mathrm{ReLU}(\mathrm{BN}_{1}(W_{1} * X_{5})), \\ D &= \sigma(W_{2} * F_{5}), \end{aligned} \right. \tag{8} $$

where $W_{1} \in \mathbb{R}^{C_{m} \times C_{5} \times 1 \times 1}$ is a $1 \times 1$ convolution kernel with $C_{m} = 256$, $W_{2} \in \mathbb{R}^{1 \times C_{m} \times 3 \times 3}$ is a $3 \times 3$ convolution kernel, $*$ denotes convolution, $\mathrm{BN}_{1}(\cdot)$ is batch normalization, $\mathrm{ReLU}(\cdot)$ is the rectified linear unit, and $\sigma(\cdot)$ is the sigmoid function.

The output $D \in [0, 1]^{1 \times H_{5} \times W_{5}}$ can be interpreted as a cluster-aware objectness prior:

$$ D(u, v) \approx p_{\mathrm{cluster}}(u, v \mid X_{5}), \tag{9} $$

where $p_{\mathrm{cluster}}(\cdot)$ denotes the probability that location $(u, v)$ belongs to a target or target cluster.
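The forward pass of Eq. (8) can be sketched in plain NumPy as follows. This is a didactic stand-in rather than the released implementation: the helper names are illustrative, the $3 \times 3$ convolution is assumed to use zero padding of 1 so that $D$ keeps the $H_{5} \times W_{5}$ resolution, batch normalization is shown in inference mode with fixed statistics, and the weights are random with reduced channel counts to keep the example fast.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def conv3x3_same(x, w):
    """3x3 convolution with zero padding 1: x is (C_in, H, W), w is (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], xp[:, dy:dy + h, dx:dx + wd])
    return out

def batch_norm_eval(x, mean, var, gamma, beta, eps=1e-5):
    """Inference-mode batch normalization with fixed per-channel statistics."""
    s = (-1, 1, 1)
    x_hat = (x - mean.reshape(s)) / np.sqrt(var.reshape(s) + eps)
    return gamma.reshape(s) * x_hat + beta.reshape(s)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semantic_density_head(x5, w1, bn1, w2):
    """Eq. (8): F5 = ReLU(BN1(W1 * X5)), D = sigmoid(W2 * F5)."""
    f5 = np.maximum(batch_norm_eval(conv1x1(x5, w1), *bn1), 0.0)
    return sigmoid(conv3x3_same(f5, w2))

# Toy sizes for a quick check; the text uses C5 = 2048 and Cm = 256.
rng = np.random.default_rng(0)
c5, cm, h5, w5 = 32, 8, 16, 16
x5 = rng.normal(size=(c5, h5, w5))
w1 = rng.normal(scale=0.1, size=(cm, c5))
bn1 = (np.zeros(cm), np.ones(cm), np.ones(cm), np.zeros(cm))  # mean, var, gamma, beta
w2 = rng.normal(scale=0.1, size=(1, cm, 3, 3))
density = semantic_density_head(x5, w1, bn1, w2)
print(density.shape)  # (1, 16, 16)
```

The sigmoid guarantees that every entry of the map lies in $[0, 1]$, which is what allows it to be read as the per-location prior of Eq. (9).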

4.3.2. Design Rationale

The semantic density head is intentionally shallow and linear in the channel dimension. It does not attempt to re-learn complex patterns but rather projects the existing high-level semantics into a single-channel prior. This design has three advantages:

It preserves the original $X_{5}$ for the detection head, avoiding interference with high-level semantics.
It provides an interpretable, spatially dense prior that can be reused across multiple backbone stages.
It adds only a negligible number of parameters and FLOPs.
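A rough parameter count for the head follows directly from the kernel shapes in Eq. (8). The sketch assumes bias-free convolutions and two learnable parameters (scale and shift) per batch-normalization channel; neither detail is specified in the text, and the SDGR blocks are not included.

```python
def conv_params(c_in, c_out, kernel, bias=False):
    """Number of weights in a kernel x kernel convolution (bias-free by assumption)."""
    return c_out * c_in * kernel * kernel + (c_out if bias else 0)

C5, CM = 2048, 256               # channel sizes given in the text

w1_params = conv_params(C5, CM, 1)   # 1x1 channel reduction
bn1_params = 2 * CM                  # scale and shift per channel
w2_params = conv_params(CM, 1, 3)    # 3x3 projection to one channel
head_total = w1_params + bn1_params + w2_params

print(w1_params, bn1_params, w2_params, head_total)  # 524288 512 2304 527104
```

For reference, the standard ResNet-50 classifier is commonly quoted at about 25.6 million parameters, so under these assumptions the head alone adds on the order of 2% to the backbone, almost all of it in the $1 \times 1$ reduction.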

4.4. Semantic Density-Guided Refine Block

The Semantic Density-Guided Refine Block (SDGR) injects the semantic prior $D$ into a low-level feature map $X_{\ell}$ ($\ell \in \{3, 4\}$) to suppress background clutter and selectively enhance responses near dense semantic regions. Intuitively, $D$ plays the role of a high-level gating signal that indicates where small-target clusters are likely to appear, while $X_{\ell}$ provides fine-grained local texture and contrast information.

4.4.1. Density Upsampling and Embedding Alignment

Because

D

is defined at the spatial resolution of

X_{5}

, it is first upsampled to match the resolution of

X_{ℓ}

:

D_{ℓ} = U (D; H_{ℓ}, W_{ℓ}),

(10)

where

U (\cdot)

denotes bilinear interpolation and

D_{ℓ} \in {[0, 1]}^{1 \times H_{ℓ} \times W_{ℓ}}

.
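The upsampling in Equation (10) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function name `bilinear_upsample`, the half-pixel sampling convention with border clamping, and the toy 2 × 2 map are all assumptions, since the text only states that bilinear interpolation is used.

```python
def bilinear_upsample(d, out_h, out_w):
    """Resize a 2-D map (list of rows) to out_h x out_w by bilinear interpolation.

    Assumes half-pixel sample centres with clamping at the borders; the
    paper does not state which convention its implementation uses.
    """
    in_h, in_w = len(d), len(d[0])

    def source_coord(dst, in_size, out_size):
        # Map an output index to a clamped fractional input coordinate and
        # return the two neighbouring input indices plus the blend weight.
        src = (dst + 0.5) * in_size / out_size - 0.5
        src = min(max(src, 0.0), in_size - 1.0)
        lo = int(src)
        hi = min(lo + 1, in_size - 1)
        return lo, hi, src - lo

    out = []
    for i in range(out_h):
        y0, y1, wy = source_coord(i, in_h, out_h)
        row = []
        for j in range(out_w):
            x0, x1, wx = source_coord(j, in_w, out_w)
            top = d[y0][x0] * (1.0 - wx) + d[y0][x1] * wx
            bottom = d[y1][x0] * (1.0 - wx) + d[y1][x1] * wx
            row.append(top * (1.0 - wy) + bottom * wy)
        out.append(row)
    return out


# Toy density map at the X5 resolution (2 x 2), upsampled by a factor of 4.
D = [[0.0, 1.0],
     [0.2, 0.6]]
D_up = bilinear_upsample(D, 8, 8)
```

Because every output value is a convex combination of input values, the upsampled map stays within [0, 1] whenever the input does, which is what allows the upsampled density map to keep its interpretation as a soft prior.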

To reduce computational cost and to learn a compact joint representation, both

X_{ℓ}

and

D_{ℓ}

are projected into a shared low-dimensional embedding space with

r = C_{ℓ} / R

channels (we set

R = 4

):

\left\{ \begin{matrix} Z_{ℓ} & = W_{low}^{(ℓ)} * X_{ℓ}, \\ Z_{d} & = W_{d} * D_{ℓ}, \end{matrix} \right.

(11)

where

Z_{ℓ}, Z_{d} \in R^{r \times H_{ℓ} \times W_{ℓ}}

, and both

W_{low}^{(ℓ)}

and

W_{d}

are

1 \times 1

kernels corresponding to the low_proj and d_proj layers in the network implementation.

This step is analogous to the linear projections used in attention gates [32], where low-level and high-level features are first mapped into a common intermediate space before computing attention coefficients.

4.4.2. Feature–Density Fusion and Gate Prediction

The two embeddings are concatenated along the channel dimension and fused by a

3 \times 3

convolution with batch normalization and ReLU:

Z_{f} = ReLU ({BN}_{2} (W_{f} * [Z_{ℓ}, Z_{d}])),

(12)

where

[Z_{ℓ}, Z_{d}]

denotes channel-wise concatenation and

W_{f}

is a

3 \times 3

kernel. This fusion stage allows the network to jointly reason about local appearance (from

Z_{ℓ}

) and semantic density (from

Z_{d}

) within a

3 \times 3

neighborhood, instead of making gating decisions based on a single pixel.

A spatial gate is then generated by a

1 \times 1

convolution followed by a sigmoid function:

G_{ℓ} = σ (W_{g} * Z_{f}),

(13)

where

W_{g}

is a

1 \times 1

kernel and

G_{ℓ} \in {[0, 1]}^{1 \times H_{ℓ} \times W_{ℓ}}

. The gate value

G_{ℓ} (x, y)

measures how strongly the low-level feature at position

(x, y)

should be preserved or suppressed, conditioned jointly on local appearance and high-level semantic density.

Although the SDGR block produces a spatial gating map, it is fundamentally different from conventional spatial attention mechanisms. Typical attention modules (e.g., CBAM [33] or attention gates [32]) estimate attention weights directly from the same feature map that is being refined, focusing on local saliency or channel interactions. In contrast, our SDGR block is driven by an explicitly constructed semantic density prior derived from the deepest backbone stage. The gating signal is therefore not computed from the low-level feature itself but projected from high-level semantic clustering responses that encode global target-density information. This cross-stage prior injection enables density-aware modulation of intermediate features, rather than generic saliency reweighting. Consequently, SDG-ResNet explicitly models spatial target density as a structural property of dense scenes, instead of treating attention as a purely local feature recalibration mechanism.

4.4.3. Residual Refinement

Finally, we refine

X_{ℓ}

using a residual gating formulation:

\begin{aligned} {\tilde{X}}_{ℓ} & = X_{ℓ} + γ_{ℓ} (X_{ℓ} ⊙ G_{ℓ} - X_{ℓ}) \\ & = X_{ℓ} ⊙ (\mathbf{1} + γ_{ℓ} (G_{ℓ} - \mathbf{1})), \end{aligned}

(14)

where ⊙ denotes element-wise multiplication,

\mathbf{1}

is an all-ones map broadcastable to the shape of

G_{ℓ}

, and

γ_{ℓ}

is a learnable scalar parameter initialized to

10^{- 2}

.

At the beginning of training,

γ_{ℓ}

is close to zero, and the SDGR block behaves almost as an identity mapping, which stabilizes optimization and preserves the benefits of ImageNet pre-training. As training proceeds,

γ_{ℓ}

is automatically adjusted such that features in high-density regions (where

G_{ℓ}

is close to 1) are preserved or slightly enhanced, while responses in low-density regions (where

G_{ℓ}

tends to be smaller) are progressively suppressed. This residual formulation thus realizes a semantic density-guided, spatially adaptive modulation of low-level features while avoiding aggressive modifications that could harm the backbone representation in ambiguous areas.
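The behaviour of Equation (14) can be checked numerically with a small sketch. The function names, the toy feature and gate values, and the later value of γ (0.8) are hypothetical; only the two algebraic forms and the initial value of 0.01 come from the text.

```python
def residual_gate(x, g, gamma):
    """Eq. (14), first form: x + gamma * (x * g - x), element-wise."""
    return [xi + gamma * (xi * gi - xi) for xi, gi in zip(x, g)]


def residual_gate_factored(x, g, gamma):
    """Eq. (14), second form: x * (1 + gamma * (g - 1)), element-wise."""
    return [xi * (1.0 + gamma * (gi - 1.0)) for xi, gi in zip(x, g)]


x = [2.0, 2.0, 2.0]   # toy low-level responses (hypothetical values)
g = [1.0, 0.5, 0.0]   # gate at high-, mid-, and low-density locations

at_init = residual_gate(x, g, gamma=0.01)  # gamma at its initial value
later = residual_gate(x, g, gamma=0.8)     # hypothetical larger gamma

print("at init:", at_init)
print("later:  ", later)
```

With γ = 0.01 the output differs from the input by at most 1%, which is the near-identity behaviour described above. Note also that for γ in [0, 1] and a gate in [0, 1] the multiplier lies in [1 − γ, 1], so under this formula the block can only preserve or attenuate a response; any relative emphasis on high-density regions comes from attenuating their surroundings.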

It should be clarified that the semantic density head is not designed as an independent density regression branch. Instead, the predicted density map serves as an intermediate prior for feature modulation and is optimized implicitly through the overall detection objective. During backpropagation, gradients from the detection loss propagate through the SDGR blocks to the density head, enabling it to learn density-aware representations without requiring explicit density annotations. This implicit supervision mechanism is consistent with many attention-based modules that are trained end-to-end without auxiliary losses.

4.5. Integration with DINO and Complexity Analysis

The proposed SDG-ResNet is integrated into a DINO-style transformer detector without modifying the detection head. The refined feature maps

\{{\tilde{X}}_{3}, {\tilde{X}}_{4}, {\tilde{X}}_{5}\}

are first converted by a ChannelMapper to a unified channel dimension, and then fed into the DINO encoder–decoder. Let

L_{\det}

denote the original DINO detection loss, which combines classification, bounding-box regression, and IoU/GIoU terms, including auxiliary losses for intermediate layers. We do not introduce any extra loss terms, and the overall training objective is

L_{total} = L_{\det} .

(15)

Therefore, the semantic density head and the SDGR blocks are supervised implicitly through the detection objective.

In terms of complexity, SDG-ResNet introduces:

One $1 \times 1$ and one $3 \times 3$ convolution on $X_{5}$ for semantic density estimation;
For each of $X_{3}$ and $X_{4}$ , two $1 \times 1$ convolutions, one $3 \times 3$ convolution, and one $1 \times 1$ gating convolution;
Two scalar parameters $γ_{3}$ and $γ_{4}$ .

All the added operations act on feature maps that are already computed by the backbone, and the extra FLOPs are negligible compared with the ResNet and transformer encoder–decoder. This makes SDG-ResNet a practical and efficient backbone for dense infrared small target detection.
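A rough parameter count for the layers listed above can be sketched as follows. Several quantities are assumptions rather than facts from the text: the standard ResNet-50 stage widths (512, 1024, and 2048 channels for the three stages), bias-free convolutions, a fusion convolution that outputs r channels, and two learnable vectors per batch-normalization layer. The helper names are hypothetical.

```python
def conv_params(c_in, c_out, k, bias=False):
    """Number of weights in a k x k convolution (bias-free by default)."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def bn_params(c):
    """Learnable scale and shift of a batch-normalization layer."""
    return 2 * c


def density_head_params(c5, c_mid=256):
    # 1x1 reduction (W_1) + BN_1 + 3x3 projection to one channel (W_2).
    return conv_params(c5, c_mid, 1) + bn_params(c_mid) + conv_params(c_mid, 1, 3)


def sdgr_params(c_low, reduction=4):
    r = c_low // reduction
    low_proj = conv_params(c_low, r, 1)  # W_low
    d_proj = conv_params(1, r, 1)        # W_d
    fuse = conv_params(2 * r, r, 3)      # W_f; output width r is an assumption
    gate = conv_params(r, 1, 1)          # W_g
    gamma = 1                            # learnable scalar
    return low_proj + d_proj + fuse + bn_params(r) + gate + gamma


# Assumed ResNet-50 stage widths: C3 = 512, C4 = 1024, C5 = 2048.
head = density_head_params(2048)
res3 = sdgr_params(512)
res4 = sdgr_params(1024)
total = head + res3 + res4
print(f"head={head:,}  res3={res3:,}  res4={res4:,}  total={total:,}")
```

Under these assumptions the total is about 2.33 M parameters, which is in line with the increase from 47.5 M to 49.8 M reported for DINO in Section 5.3. Most of the cost sits in the 3 × 3 fusion convolution of the Res4 block.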

5. Experiments

In this section, we introduce the evaluation metrics, experimental settings, comparisons with state-of-the-art (SOTA) methods, and ablation studies.

5.1. Implementation Details

(1) Dataset: Experiments are conducted on the IR-SatDense dataset, which contains a large number of small infrared targets distributed across diverse background scenes. The targets are very small (average width about 3 pixels) and have low signal-to-noise ratios, enabling evaluation under complex and cluttered infrared conditions. According to the Average Minimum Inter-Target Distance (AMID), the test set is divided into multiple subsets to assess detection performance under varying density levels, with particular focus on the challenging dense regime (

AMID \leq 3

).

(2) Implementation: All detectors are implemented within the DINO framework using ResNet-50 or SDG-ResNet as the backbone. For the proposed variants (SDG Deformable DETR, SDG DETA, SDG DINO), we simply replace the standard ResNet-50 backbone with SDG-ResNet while keeping all other hyperparameters unchanged to ensure a fair comparison. The AdamW optimizer is adopted with a learning rate of

1 \times 10^{- 4}

and a batch size of 2 for 180,000 iterations. Input images are normalized to match DINO’s default preprocessing pipeline. All experiments are conducted on a single NVIDIA RTX 4090 GPU.

(3) Evaluation Metrics: To comprehensively evaluate detection performance, we report the probability of detection (PD) and the false alarm rate (FA), together with model complexity in terms of FLOPs and parameter count (Params). PD and FA measure detection capability and robustness, while FLOPs and Params characterize computational complexity.

All methods are evaluated under a unified box-level protocol. For the DETR-style detectors (DETR, Deformable DETR, DETA, DINO, and our SDG variants), the network directly predicts a set of bounding boxes

\{B_{p}\}

. For segmentation-based ISTD methods (e.g., ACM, ALCNet, RDIAN, DNA_Net, ISTDU-Net, UIUNet, U-Net, ResUNet), the network outputs a binary mask for each test image. We first extract all connected components from the predicted mask and, for each component, compute its tight axis-aligned enclosing rectangle. These rectangles are treated as the predicted boxes

B_{p}

. Ground-truth annotations are also represented as axis-aligned bounding boxes. In this way, both detection and segmentation methods are evaluated with exactly the same box-based criteria.
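The mask-to-box conversion described above can be sketched with a breadth-first flood fill. The function name `mask_to_boxes`, the use of 8-connectivity, and the inclusive pixel-index box format are assumptions; the text does not specify the connectivity or the coordinate convention.

```python
from collections import deque


def mask_to_boxes(mask):
    """Return tight boxes (x_min, y_min, x_max, y_max), as inclusive pixel
    indices, one per 8-connected component of a binary mask."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over one connected component.
            queue = deque([(y, x)])
            seen[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cy, cx = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1, 1],
]
boxes = mask_to_boxes(mask)
print(boxes)
```

In dense scenes, adjacent targets whose masks touch would merge into a single component and hence a single box, so the choice of connectivity can matter for the segmentation baselines.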

The matching between a predicted box

B_{p}

and a ground-truth box

B_{g}

is determined by the intersection over union (IoU):

\mathrm{IoU} = \frac{| B_{p} \cap B_{g} |}{| B_{p} \cup B_{g} |} .

(16)

A prediction is counted as a true positive (TP) if

\mathrm{IoU} \geq T_{\mathrm{IoU}}

. A one-to-one matching strategy is adopted: each ground-truth target is matched to at most one prediction (the one with the highest IoU), and unmatched predictions are treated as false alarms. For small targets, IoU is highly sensitive to positional offsets, so the IoU threshold is uniformly set to

T_{\mathrm{IoU}} = 0.50,

(17)

which provides a reasonable balance between localization precision and tolerance.
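As a worked illustration with assumed sizes (not taken from the experiments), consider a 3 × 3-pixel ground-truth box and an equally sized prediction shifted by one pixel horizontally. The overlap is 2 × 3 = 6 pixels, so

\mathrm{IoU} = \frac{6}{9 + 9 - 6} = 0.50,

which sits exactly at the threshold. A one-pixel diagonal shift leaves only a 2 × 2 overlap and gives

\mathrm{IoU} = \frac{4}{9 + 9 - 4} \approx 0.29,

which is already counted as a miss.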

All predicted boxes whose confidence scores exceed a fixed threshold are counted as detections. Let

N_{gt}

denote the total number of ground-truth targets in the test set, and let

N_{\det}

denote the total number of predicted boxes. Among all detections,

N_{TP}

are matched as true positives, and the rest belong to the false alarm set

F

. The probability of detection (PD) and the false alarm rate (FA) are defined as

\mathrm{PD} = \frac{N_{TP}}{N_{gt}}, \quad \mathrm{FA} = \frac{\sum_{k \in F} | B_{k} |}{A_{total}},

(18)

where

| B_{k} |

is the area (in pixels) of the k-th false-alarm box and

A_{total}

is the total image area over the whole test set (i.e., the sum of the pixel numbers of all test images).

In other words, PD measures the fraction of correctly detected targets among all ground-truth targets, while FA measures the proportion of image area occupied by false-alarm boxes.
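The box-level protocol of Equations (16) to (18) can be sketched as below. This is an illustration under stated assumptions, not the authors' evaluation script: boxes are (x1, y1, x2, y2) with exclusive upper bounds, predictions are assumed to be already filtered by the confidence threshold, matching is done per image in a greedy order over ground truths, and all function names and the toy boxes are hypothetical.

```python
def area(box):
    """Pixel area of a box (x1, y1, x2, y2) with exclusive x2, y2."""
    return (box[2] - box[0]) * (box[3] - box[1])


def iou(a, b):
    """Intersection over union of two boxes, Eq. (16)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (area(a) + area(b) - inter)


def match_image(preds, gts, t_iou=0.5):
    """Greedy one-to-one matching for one image.

    Returns (number of true positives, total area of unmatched predictions).
    """
    free = list(range(len(preds)))
    n_tp = 0
    for gt in gts:
        if not free:
            break
        # Each ground truth takes the still-unmatched prediction with highest IoU.
        best = max(free, key=lambda k: iou(preds[k], gt))
        if iou(preds[best], gt) >= t_iou:
            free.remove(best)
            n_tp += 1
    return n_tp, sum(area(preds[k]) for k in free)


def pd_fa(images, t_iou=0.5):
    """images: list of (preds, gts, (height, width)) tuples; Eq. (18)."""
    n_tp = n_gt = fa_area = total_area = 0
    for preds, gts, (h, w) in images:
        tp, fa = match_image(preds, gts, t_iou)
        n_tp += tp
        n_gt += len(gts)
        fa_area += fa
        total_area += h * w
    return n_tp / n_gt, fa_area / total_area


# One hypothetical 64 x 64 image with three targets and three detections.
gts = [(10, 10, 13, 13), (30, 30, 33, 33), (40, 40, 43, 43)]
preds = [(10, 10, 13, 13), (31, 30, 34, 33), (50, 50, 52, 52)]
pd, fa = pd_fa([(preds, gts, (64, 64))])
print(f"PD = {pd:.3f}, FA = {fa:.2e}")
```

In this toy case two of the three targets are matched (the second one exactly at IoU = 0.50), and the remaining 2 × 2 prediction contributes 4 pixels of false-alarm area out of 4096.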

5.2. Benchmark Results

Table 2 reports the detection performance at

T_{\mathrm{IoU}} = 0.50

on the IR-SatDense test set, including both the overall results (All) and the three density intervals defined by AMID. Overall, integrating the proposed SDG-ResNet backbone into DETR-style detectors leads to consistent PD improvements across most density regimes while keeping FA at a comparable or even lower level than the corresponding baselines.

It is worth noting that the Average Minimum Inter-Target Distance (AMID) is inversely related to the target density in the scene. A smaller AMID value indicates that targets are more densely distributed, while a larger AMID corresponds to relatively sparse target configurations. Therefore, the AMID-based evaluation provides a quantitative analysis of the detector performance under different target density conditions.
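The inverse relation between AMID and density can be illustrated with a short sketch. The formal definition of AMID is given earlier in the paper and is not repeated here, so the reading below (the mean, over targets in one image, of the Euclidean distance to the nearest other target) is an assumption, as are the function name and the toy coordinates.

```python
import math


def amid(centers):
    """Average, over targets, of the distance to the nearest other target.

    Assumed reading of AMID; `centers` is a list of (x, y) target centres
    in one image. Returns math.inf when fewer than two targets are present.
    """
    if len(centers) < 2:
        return math.inf
    nearest = []
    for i, (xi, yi) in enumerate(centers):
        nearest.append(min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(centers) if j != i
        ))
    return sum(nearest) / len(nearest)


dense = [(10, 10), (11, 10), (10, 11), (11, 11)]  # tight 2 x 2 cluster
sparse = [(10, 10), (40, 10), (10, 50)]           # widely separated targets

print(amid(dense), amid(sparse))
```

The tightly clustered layout yields a much smaller value than the widely separated one, matching the statement that a smaller AMID corresponds to a denser scene.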

For DINO, replacing the vanilla ResNet-50 with SDG-ResNet yields a PD increase from 86.44% to 86.82% on the whole test set (+0.38%), with FA changing only slightly from

1.30

to

1.34 \times 10^{- 4}

. In terms of density-specific results, SDG DINO improves PD in all AMID intervals: from 85.55% to 85.71% (+0.16%) for

AMID \leq 1

, from 87.43% to 87.51% (+0.08%) for

1 < AMID \leq 2

, and from 86.84% to 87.41% (+0.57%) for

2 < AMID \leq 3

.

For Deformable DETR, the baseline model performs poorly on IR-SatDense, achieving only 5.49% PD overall. After introducing SDG-ResNet, SDG Deformable DETR improves the overall PD to 7.13% (+1.64%), with similar trends across all three AMID intervals (e.g., from 6.42% to 7.90% for

AMID \leq 1

). Although the absolute PD values remain modest, this relative gain demonstrates that SDG can noticeably strengthen the detection capability of weaker DETR-style baselines.

Table 3 further shows that the SDG-equipped detectors consistently achieve higher PD across IoU thresholds from 0.3 to 0.7 while maintaining comparable FA. This confirms that the performance gain is not limited to IoU = 0.50 but reflects improved localization robustness.

In summary, the benchmark results confirm that semantic density guidance provides clear and consistent improvements for DETR-style detectors on IR-SatDense under different density conditions.

5.3. Comparison with State-of-the-Art Methods

To further evaluate the effectiveness and efficiency of the proposed SDG-ResNet, we compare SDG-enhanced detectors with representative segmentation-based ISTD networks and DETR-style detectors on IR-SatDense. Table 4 summarizes the probability of detection (PD), false alarm rate (FA), and the model complexity in terms of parameters and FLOPs.

Among all compared methods, DINO already provides a very strong baseline, achieving 86.44% PD and

1.30 \times 10^{- 4}

FA with 47.5M parameters and 178.5G FLOPs. After inserting SDG-ResNet, SDG DINO further improves PD to 86.82% (+0.38%) with only a small increase in model size and computation (49.8M params, +4.8%; 186.1G FLOPs, +4.3%), while FA remains at a similar level (

1.34 \times 10^{- 4}

, +0.04). This shows that SDG brings measurable accuracy gains at a very modest additional cost.

For Deformable DETR, the baseline obtains 5.49% PD and

9.37 \times 10^{- 4}

FA with 40.0M parameters and 123.3G FLOPs. The SDG version, SDG D-DETR, increases PD to 7.13% (+1.64%) with 42.4M parameters (+6.0%) and 130.8G FLOPs (+6.1%), while FA remains on the same order (

9.72 \times 10^{- 4}

). Although the absolute performance is still lower than that of DINO, this relative improvement verifies that SDG can noticeably strengthen the detection capability of weaker DETR-style baselines in dense small-target scenarios.

For DETA, introducing SDG-ResNet brings clear gains in both accuracy and robustness. The overall PD increases from 63.70% to 64.75% (+1.05%), while FA is reduced from

1.15

to

1.08 \times 10^{- 4}

(

- 0.07

). This improvement is achieved with a moderate overhead in complexity: the number of parameters grows from 48.3 M to 50.6 M (about +4.8%), and FLOPs from 182.0 G to 189.5 G (about +4.1%). These results indicate that even for a head-optimized detector like DETA, semantic density guidance at the backbone level can still provide a favorable accuracy–complexity trade-off.
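The relative overheads quoted in this section follow directly from the reported parameter and FLOP counts. The sketch below recomputes them; the helper name `relative_increase` and the dictionary layout are hypothetical, while the numbers are copied from the text above.

```python
def relative_increase(before, after):
    """Percentage increase from `before` to `after`."""
    return (after - before) / before * 100.0


# (params in M, FLOPs in G) before and after inserting SDG-ResNet,
# copied from the figures quoted in the text.
reported = {
    "DINO": ((47.5, 178.5), (49.8, 186.1)),
    "Deformable DETR": ((40.0, 123.3), (42.4, 130.8)),
    "DETA": ((48.3, 182.0), (50.6, 189.5)),
}

overhead = {
    name: (round(relative_increase(b[0], a[0]), 1),
           round(relative_increase(b[1], a[1]), 1))
    for name, (b, a) in reported.items()
}

for name, (params_pct, flops_pct) in overhead.items():
    print(f"{name}: +{params_pct}% params, +{flops_pct}% FLOPs")
```

Across the three detectors the added cost stays between roughly 4% and 6% in both parameters and FLOPs.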

Compared with the segmentation-based ISTD methods (e.g., DNA_Net, ISTDU-Net, UIUNet), SDG DINO achieves the highest PD on IR-SatDense while maintaining a competitive FA, despite having a larger model size. Taken together, these results indicate that the proposed SDG-ResNet is an effective plug-in backbone for dense infrared small target detection: it yields consistent PD improvements for DETR-based detectors at a modest cost of roughly 4–6% more parameters and FLOPs, and it can be integrated into existing architectures without modifying the detection head.

5.4. Ablation Study

We conduct ablation experiments on IR-SatDense based on the DINO detector at

T_{\mathrm{IoU}} = 0.50

. The test set is divided into three density ranges according to AMID (

AMID \leq 1

,

1 < AMID \leq 2

,

2 < AMID \leq 3

) plus the overall set (All). We compare four variants: the original DINO (baseline), SDG@Res4 (only an SDGR block on Res4), SDG@Res3 (only on Res3), and SDG (full), which inserts SDGR blocks at both Res3 and Res4.

As shown in Table 5, all SDG variants improve PD over the DINO baseline (86.44% PD,

1.30 \times 10^{- 4}

FA) on the whole test set. SDG@Res4 and SDG@Res3 increase PD to 86.50% (+0.06%) and 86.67% (+0.23%), while slightly reducing FA to

1.26

and

1.22 \times 10^{- 4}

, respectively. The full SDG configuration achieves the highest PD of 86.82% (+0.38%) with a marginal FA change to

1.34 \times 10^{- 4}

(+0.04).

In the densest regime (

AMID \leq 1

), where many targets are tightly clustered, PD increases from 85.55% (baseline) to 85.96%, 85.78%, and 85.71% for SDG@Res4, SDG@Res3, and full SDG, respectively, with FA staying around the baseline level. In the least dense of the three intervals,

2 < AMID \leq 3

, PD improves from 86.84% to 86.97%, 87.12%, and 87.41% (+0.57% for full SDG). Overall, these results show that injecting a shared semantic density prior into Res3/Res4 consistently enhances detection performance, and that jointly refining both stages (full SDG) yields the highest overall PD at a marginally higher FA.

Although SDG-ResNet introduces additional parameters and computational overhead, the increase remains modest relative to the baseline backbone. For example, when integrated into DINO, the number of parameters increases from 47.5M to 49.8M (approximately +4.8%), and the FLOPs increase from 178.5G to 186.1G (approximately +4.3%). Compared with the overall computational scale of transformer-based detectors, this additional cost is relatively small and does not affect practical deployment feasibility. Meanwhile, SDG-ResNet consistently improves PD across dense scenarios. These results indicate a favorable performance–efficiency trade-off for dense infrared small target detection.

To examine whether the proposed SDG module affects optimization stability, we compare the training loss curves between baseline detectors and their SDG-equipped variants. As shown in Figure 6, all SDG-equipped models exhibit smooth convergence behavior that closely follows their corresponding baselines. No noticeable oscillation or divergence is observed during training. Moreover, the convergence speed remains comparable across all models, indicating that the introduced density-guided refinement does not adversely impact training stability.

5.5. Comparison Results on Sparse Target Dataset IRSTD-1K

To further verify the generalization ability of SDG on conventional sparse infrared small targets, we also conduct experiments on the IRSTD-1K dataset [2]. The quantitative results are summarized in Table 6. We can observe that classical segmentation-based methods (ISTDU-Net, UIUNet, U-Net, etc.) already achieve very high PD values above 80% on this sparse benchmark, while DINO attains a strong trade-off with 85.46% PD and the lowest FA of

0.17 \times 10^{- 4}

. Introducing SDG-ResNet into DINO further improves PD slightly to 85.81% while keeping FA unchanged.

These results indicate that IRSTD-1K is relatively easy in terms of target density and that transformer-based detectors such as DINO remain highly competitive even without explicit density modeling. The performance improvement on IRSTD-1K is relatively modest compared with that on IR-SatDense. This is expected because IRSTD-1K is primarily a sparse-target dataset, where most images contain only one or a few isolated targets with relatively large inter-target distances. In such scenarios, dense semantic clustering rarely occurs at high-level feature maps, and the predicted density prior tends to be spatially diffuse. As a result, the SDGR blocks behave close to identity mappings, leading to stable but limited gains. Importantly, SDG-ResNet does not degrade performance in sparse scenes, indicating that the density-guided mechanism remains compatible with conventional sparse-target detection settings.

5.6. Performance Under Different Background Complexity

To further analyze the robustness of the proposed semantic density guidance (SDG) mechanism under different background conditions, we evaluate the detection performance on the predefined complexity subsets of the IR-SatDense dataset.

The results are summarized in Table 7. Overall, the proposed SDG brings consistent improvements for Deformable DETR and DETA across all background complexity levels. For DINO, SDG also improves performance in most cases, namely in the easy, medium, and complex scenes, while only a marginal fluctuation is observed in the most challenging, extremely complex scenes.

In particular, the improvements are more noticeable for relatively weaker baselines such as Deformable DETR and DETA, indicating that the proposed semantic density guidance effectively enhances the robustness of dense infrared small target detection under varying background complexities.

5.7. Performance Under Different Target Sizes

To further analyze the detection capability for extremely small infrared targets, we evaluate the detection probability (PD) under different target size intervals. The bounding-box areas are divided into three groups:

A \leq 3 \times 3

,

3 \times 3 < A \leq 4 \times 4

, and

A > 4 \times 4

pixels. The proportions of ground-truth targets in these intervals are 38.16%, 41.76%, and 20.08%, respectively.
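The size-based grouping can be sketched as follows. The function names, the bin labels, and the per-target record format (width, height, detected flag) are hypothetical; only the three area intervals come from the text.

```python
def size_bin(width, height):
    """Assign a ground-truth box to one of the three area intervals."""
    a = width * height
    if a <= 3 * 3:
        return "A<=3x3"
    if a <= 4 * 4:
        return "3x3<A<=4x4"
    return "A>4x4"


def pd_by_size(records):
    """records: iterable of (width, height, detected) per ground-truth target.

    Returns the fraction of detected targets in each non-empty size bin.
    """
    hits, totals = {}, {}
    for w, h, detected in records:
        key = size_bin(w, h)
        totals[key] = totals.get(key, 0) + 1
        hits[key] = hits.get(key, 0) + int(detected)
    return {k: hits[k] / totals[k] for k in totals}


# Hypothetical per-target outcomes, for illustration only.
records = [(3, 3, True), (2, 3, False), (4, 4, True), (5, 5, True)]
print(pd_by_size(records))
```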

As shown in Table 8, detection performance increases with target size for all methods, indicating that extremely small targets remain the most challenging scenario in infrared imagery. Nevertheless, the proposed SDG module consistently improves detection performance across different size intervals. In particular, more noticeable improvements are observed for extremely small targets (

A \leq 3 \times 3

), demonstrating that the semantic density guidance mechanism effectively enhances the representation of weak target signals.

5.8. Visual Analysis

To further illustrate the detection effectiveness of the proposed SDG-ResNet, we conduct a qualitative comparison on the four background complexity levels defined in Figure 2, namely easy, medium, complex, and extremely complex on-orbit scenes. The visualization results are shown in Figure 7, where each column corresponds to one background level, and the rows compare the baseline DINO with DINO + (equipped with SDG-ResNet).

Across all four background types, the baseline DINO either misses part of the densely distributed targets or produces spurious responses in cluttered non-target regions. By contrast, DINO + detects more true targets within dense clusters and effectively suppresses false alarms on background structures. This confirms that introducing SDG-ResNet can simultaneously enhance dense-target detection and reduce false alarms under diverse on-orbit background conditions.

6. Conclusions and Further Analysis

In this paper, we addressed the challenging problem of dense infrared small target detection, where tiny low-SNR targets appear in highly crowded configurations. We first constructed a new satellite dense infrared dataset, IR-SatDense, in which target density, inter-target spacing, and SNR can be flexibly controlled. Based on the proposed AMID metric, IR-SatDense reveals that the probability of detection (PD) of existing detectors degrades sharply while the false alarm rate (FA) increases as targets become more densely packed.

To mitigate this density-induced degradation, we proposed a Semantic Density-Guided ResNet (SDG-ResNet) backbone. SDG-ResNet predicts a semantic density map from the deepest ResNet stage and reuses it as a global prior to refine mid- and low-level features via lightweight Semantic Density-Guided Refine (SDGR) blocks. Integrated into representative DETR-like detectors such as Deformable DETR, DETA, and DINO, SDG consistently improves PD at comparable FA, especially in the most challenging small-AMID regime, while adding only about 4–6% more parameters and FLOPs.

Experiments on both the dense IR-SatDense and the sparse IRSTD-1K datasets demonstrate that SDG-ResNet enhances robustness in dense-target regimes without sacrificing performance in sparse scenarios. In future work, we plan to extend semantic density guidance to multi-frame and multi-scale settings and to explore joint modeling of temporal density evolution and long-range motion patterns in satellite infrared sensing.

Author Contributions

Conceptualization, X.Z. and W.A.; methodology, X.Z. and X.Y.; software, X.Z.; validation, X.Z., X.Y. and N.C.; formal analysis, X.Z., X.Y. and B.L.; investigation, B.L., R.L., C.X. and M.L.; resources, W.A.; writing—original draft preparation, X.Z.; writing—review and editing, X.Z., X.Y., N.C., B.L. and W.A.; visualization, X.Z.; supervision, M.L. and W.A.; project administration, M.L. and W.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant No. 12503098.

Data Availability Statement

The proposed IR-SatDense dataset and implementation code are publicly available at https://github.com/Lucifer094/SDG (accessed on 20 March 2026). All other datasets used in this study (e.g., NUDT-SIRST, SIRST, IRSTD-1K, DenseSIRST, DMIST) are public and cited in the corresponding references.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758.
2. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 877–886.
3. Lin, J.; Li, S.; Zhang, L.; Yang, X.; Yan, B.; Meng, Z. IR-TransDet: Infrared dim and small target detection with IR-transformer. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5004813.
4. Chen, N.; Li, B.; Wang, Y.; Ying, X.; Wang, L.; Zhang, C.; Guo, Y.; Li, M.; An, W. Motion and Appearance Decoupling Representation for Event Cameras. IEEE Trans. Image Process. 2025, 34, 5964–5977.
5. Li, R.; An, W.; Xiao, C.; Li, B.; Wang, Y.; Li, M.; Guo, Y. Direction-coded temporal U-shape module for multiframe infrared small target detection. IEEE Trans. Neural Netw. Learn. Syst. 2023, 36, 555–568.
6. Li, R.; An, W.; Ying, X.; Wang, Y.; Dai, Y.; Wang, L.; Li, M.; Guo, Y.; Liu, L. Probing Deep into Temporal Profile Makes the Infrared Small Target Detector Much Better. arXiv 2025, arXiv:2506.12766.
7. Ying, X.; Xiao, C.; An, W.; Li, R.; He, X.; Li, B.; Cao, X.; Li, Z.; Wang, Y.; Hu, M.; et al. Visible-thermal tiny object detection: A benchmark dataset and baselines. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 6088–6096.
8. Ying, X.; Liu, L.; Lin, Z.; Shi, Y.; Wang, Y.; Li, R.; Cao, X.; Li, B.; Zhou, S.; An, W. Infrared small target detection in satellite videos: A new dataset and a novel recurrent feature refinement framework. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5002818.
9. Ying, X.; Liu, L.; Wang, Y.; Li, R.; Chen, N.; Lin, Z.; Sheng, W.; Zhou, S. Mapping degeneration meets label evolution: Learning infrared small target detection with single point supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; p. 15528.
10. Li, B.; Wang, L.; Wang, Y.; Wu, T.; Lin, Z.; Li, M.; An, W.; Guo, Y. Mixed-precision network quantization for infrared small target segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5000812.
11. Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention-guided pyramid context networks for detecting infrared small target under complex background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261.
12. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376.
13. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 950–959.
14. Dai, Y.; Li, X.; Zhou, F.; Qian, Y.; Chen, Y.; Yang, J. One-stage cascade refinement networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000917.
15. Chen, S.; Ji, L.; Zhu, S.; Ye, M.; Ren, H.; Sang, Y. Towards dense moving infrared small target detection: New datasets and baseline. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5005513.
16. Xiao, M.; Dai, Q.; Zhu, Y.; Guo, K.; Wang, H.; Shu, X.; Yang, J.; Dai, Y. Background semantics matter: Cross-task feature exchange network for clustered infrared small target detection with sky-annotated dataset. arXiv 2024, arXiv:2407.20078.
17. Chen, C.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581.
18. Han, J.; Ma, Y.; Zhou, B.; Fan, F.; Liang, K.; Fang, Y. A robust infrared small target detection algorithm based on human visual system. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2168–2172.
19. Qin, Y.; Li, B. Effective infrared small target detection utilizing a novel local contrast method. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1890–1894.
20. Han, J.; Liang, K.; Zhou, B.; Zhu, X.; Zhao, J.; Zhao, L. Infrared small target detection utilizing the multiscale relative local contrast measure. IEEE Geosci. Remote Sens. Lett. 2018, 15, 612–616.
21. Bai, X.; Zhou, F. Analysis of new top-hat transformation and the application for infrared dim small target detection. Pattern Recognit. 2010, 43, 2145–2156.
22. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the Signal and Data Processing of Small Targets 1999; SPIE: Bellingham, WA, USA, 1999; Volume 3809, pp. 74–83.
23. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch-image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009.
24. Zhang, L.; Peng, Z. Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sens. 2019, 11, 382.
25. Zhu, J.; Chen, S.; Li, L.; Ji, L. Sanet: Spatial attention network with global average contrast learning for infrared small target detection. In Proceedings of the ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
26. Wang, K.; Du, S.; Liu, C.; Cao, Z. Interior attention-aware network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5002013.
27. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
28. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
29. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv 2022, arXiv:2201.12329.
30. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627.
31. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605.
32. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999.
33. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
34. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824.
35. Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000513.
36. Hou, Q.; Zhang, L.; Tan, F.; Xi, Y.; Zheng, H.; Li, N. ISTDU-Net: Infrared Small-Target Detection U-Net. IEEE Geosci. Remote Sens. Lett. 2022, 19, 7506205.
37. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
38. Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted res-unet for high-quality retina vessel segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19–21 October 2018; pp. 327–331.
39. Xiong, Z.; Zhou, F.; Wu, F.; Yuan, S.; Fu, M.; Peng, Z.; Yang, J.; Dai, Y. DRPCA-Net: Make robust PCA great again for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5005516.
40. Ouyang-Zhang, J.; Cho, J.H.; Zhou, X.; Krähenbühl, P. Nms strikes back. arXiv 2022, arXiv:2212.06137.

Figure 1. Visualization of backbone feature responses at low-, middle-, and high-level stages for dense and single-target infrared scenes. The left column shows a dense target cluster, where the high-level feature map exhibits strong and compact semantic responses. The right column shows a single-target scene, where the high-level semantic responses are weaker and more diffuse. This observation suggests that dense target clusters naturally concentrate semantic information at high levels, which can be exploited to guide low-level feature enhancement and suppress noise in non-target regions.

Figure 2. Different background complexities under realistic on-orbit imaging conditions. (a) Representative examples of four levels: (i) easy, (ii) medium, (iii) complex, and (iv) extremely complex scenes; (b) Statistical distribution of each background level in the dataset.

Figure 3. Visualization examples of infrared images with different target density intervals based on the Average Minimum Inter-Target Distance (AMID). From left to right: (a) AMID $\leq 1$, (b) 1 < AMID $\leq 2$, and (c) 2 < AMID $\leq 3$. The red boxes indicate annotated target regions.
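
The AMID statistic used to define these intervals can be illustrated with a short sketch. This section does not restate the exact definition, so the code below assumes Euclidean distance between target centers, expressed in the same units as the interval thresholds; the function names `amid` and `density_interval` are hypothetical and not taken from the paper.

```python
import math


def amid(centers):
    """Average of each target's distance to its nearest neighbour.

    `centers` is a list of (x, y) target centres (an assumed representation).
    Returns None when fewer than two targets are present, because no
    inter-target distance exists in that case.
    """
    if len(centers) < 2:
        return None
    nearest = []
    for i, (xi, yi) in enumerate(centers):
        # Distance from target i to every other target; keep the smallest.
        d_min = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(centers)
            if j != i
        )
        nearest.append(d_min)
    return sum(nearest) / len(nearest)


def density_interval(value):
    """Map an AMID value to one of the three intervals shown in the figure."""
    if value is None:
        return None
    if value <= 1:
        return "AMID <= 1"
    if value <= 2:
        return "1 < AMID <= 2"
    if value <= 3:
        return "2 < AMID <= 3"
    return None  # outside the three intervals shown


# Toy example: nearest-neighbour distances are 1.5, 1.5 and 2.0.
centers = [(0.0, 0.0), (1.5, 0.0), (1.5, 2.0)]
value = amid(centers)
print(value, density_interval(value))
```

In this toy scene the three nearest-neighbour distances average to 5/3, so the image would fall into the middle interval. If the paper measures distances between box edges or normalizes by target size, only the distance computation inside `amid` would change.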

Figure 4. Experimental results of the DINO model trained on the IR-SatDense dataset and evaluated on subsets with different target density intervals. (a) Probability of detection (PD) curves with respect to the intersection over union (IoU) threshold under different AMID levels; (b) False alarm rate (FA) curves with respect to the IoU threshold under different AMID levels.
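
To make the dependence of PD on the IoU threshold concrete, the following is a minimal sketch of an IoU-thresholded detection count. The matching protocol is not described in this section, so greedy one-to-one matching between ground-truth and detected boxes is an assumption, as are the `(x1, y1, x2, y2)` box format and the function names. FA is left out because its exact definition is not given here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def probability_of_detection(gt_boxes, det_boxes, iou_thr):
    """Fraction of ground-truth boxes matched by a not-yet-used detection.

    Greedy matching (an assumption): each ground-truth box takes the unused
    detection with the highest IoU, and counts as detected only if that IoU
    reaches the threshold.
    """
    if not gt_boxes:
        return 0.0
    used = set()
    hits = 0
    for gt in gt_boxes:
        best_j, best_iou = None, 0.0
        for j, det in enumerate(det_boxes):
            if j in used:
                continue
            v = iou(gt, det)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j is not None and best_iou >= iou_thr:
            used.add(best_j)
            hits += 1
    return hits / len(gt_boxes)


# Two 4x4 targets; one detection is slightly short, the other is shifted.
gt = [(0, 0, 4, 4), (10, 10, 14, 14)]
det = [(0, 0, 4, 3), (11, 11, 15, 15)]
for thr in (0.3, 0.5, 0.8):
    print(thr, probability_of_detection(gt, det, thr))
```

The two detections overlap their targets with IoU 0.75 and 9/23, so raising the threshold removes first the shifted detection and then the short one. For targets only a few pixels wide, a one-pixel offset already costs a large share of the IoU, which is consistent with PD falling as the threshold increases.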

Figure 5. Overview of the proposed SDG-ResNet and its integration into a DINO-based detector. (a) Overall framework: The deepest feature $X_{5}$ is converted by the semantic density head into a density map $D$, which guides SDGR blocks to refine $X_{4}$ and $X_{3}$; all three features are then fed to the DETR-style head for final detection. (b) Semantic density head: A lightweight $1 \times 1$–$3 \times 3$ convolution stack compresses $X_{5}$ into a single-channel semantic density map $D$. (c) Semantic Density-Guided Refine Block (SDGR): The upsampled density map is fused with intermediate features to predict a spatial gate, and the gated response is added residually to obtain refined features.
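
The residual spatial gating described in panel (c) can be sketched on a toy single-channel feature map. This is an illustration under strong simplifying assumptions, not the authors' block: the scalar weights `w_x`, `w_d` and `bias` stand in for the learned fusion convolutions, nearest-neighbour upsampling is assumed, and the names `upsample_nearest` and `sdgr_refine` are hypothetical.

```python
import math


def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of a 2-D list by an integer factor."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sdgr_refine(x, density, w_x=1.0, w_d=1.0, bias=0.0):
    """Gate a feature map with an upsampled density map, then add residually.

    x       : 2-D list, intermediate feature map (single channel for brevity)
    density : 2-D list, coarser density map from the deepest stage
    """
    factor = len(x) // len(density)
    d_up = upsample_nearest(density, factor)
    refined = []
    for x_row, d_row in zip(x, d_up):
        out_row = []
        for xv, dv in zip(x_row, d_row):
            # Fuse feature and density, squash to a spatial gate in (0, 1).
            gate = sigmoid(w_x * xv + w_d * dv + bias)
            # Residual form: identity path plus the gated response.
            out_row.append(xv + gate * xv)
        refined.append(out_row)
    return refined


# 4x4 uniform feature map; the density map is high only in the top-left cell.
x = [[1.0] * 4 for _ in range(4)]
density = [[4.0, -4.0], [-4.0, -4.0]]
refined = sdgr_refine(x, density, w_x=0.0, w_d=1.0)
for row in refined:
    print([round(v, 3) for v in row])
```

Positions under the high-density cell are amplified to almost twice their input value, while the rest stay close to the identity path. The residual form means the original feature is never suppressed below its input, so a poor density estimate cannot erase low-level evidence.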

Figure 6. Training loss curves of baseline detectors and their SDG-equipped variants. All models exhibit smooth and stable convergence behavior, and the convergence speed remains comparable after introducing the SDG module.

Figure 7. Qualitative comparison of baseline detectors and SDG-ResNet variants on four background complexity levels in IR-SatDense. From left to right, each column corresponds to one background type: (i) easy, (ii) medium, (iii) complex, and (iv) extremely complex scenes under realistic on-orbit imaging conditions. For each detector, the first row shows the detection results of the baseline model, and the second row shows the corresponding results when equipped with the SDG-ResNet backbone. The detectors include Deformable DETR, DETA, and DINO. Red bounding boxes indicate ground-truth targets, and green bounding boxes indicate detected targets. Compared with their respective baselines, the SDG-equipped variants recover more densely distributed targets and suppress false alarms in non-target regions across all background levels, leading to clearer and more reliable detection in dense infrared scenes.

Table 1. Main characteristics of representative infrared small target detection datasets. Dens.: indicates whether the dataset includes explicit stratification of target density levels. AutoG.: automatically generated dataset using algorithmic synthesis. Pix.: pixel-level annotations. BBox: bounding-box annotations. Pt.: point-level annotations. ImgNum: number of images in the dataset. TgtAvg.: average number of targets per image.

Dataset	AMID	Area	Dens.	AutoG.	Pix.	BBox	Pt.	ImgNum	TgtAvg.
NUDT-SIRST [1]	–	30.95	No	No	Yes	No	No	1000	1.40
SIRST [13]	–	33.03	No	No	Yes	No	No	427	1.25
IRSTD-1K [2]	–	23.00	No	No	Yes	No	No	1000	1.50
SIRST-AUG [11]	–	83.00	No	No	Yes	No	No	9070	1.02
SIRSTv2 [14]	–	19.00	No	No	Yes	Yes	Yes	1024	0.68
DMIST-60 [15]	16.01	36.46	No	Yes	No	Yes	No	13,779	61.0
DMIST-100 [15]	12.26	36.28	No	Yes	No	Yes	No	13,779	101.0
DenseSIRST [16]	11.01	11.52	No	Yes	Yes	Yes	Yes	1024	13.38
IR-SatDense (Ours)	1.51	10.42	Yes	Yes	Yes	Yes	Yes	2154	58.04

Table 2. Quantitative comparison of detection performance at IoU = 0.50 across different density intervals. The probability of detection (PD) and false alarm rate (FA) are reported for the entire test set (All) and for subsets divided by AMID ranges. “Deformable DETR+”, “DETA+” and “DINO+” denote the corresponding baseline detectors equipped with the proposed SDG-ResNet backbone; the values in parentheses indicate changes in PD and FA relative to the corresponding baseline on the same subset. ↑ indicates higher is better, and ↓ indicates lower is better.

Methods	Backbone	All: PD (%) ↑	All: FA ($10^{-4}$) ↓	AMID $\leq 1$: PD (%) ↑	AMID $\leq 1$: FA ($10^{-4}$) ↓	1 < AMID $\leq 2$: PD (%) ↑	1 < AMID $\leq 2$: FA ($10^{-4}$) ↓	2 < AMID $\leq 3$: PD (%) ↑	2 < AMID $\leq 3$: FA ($10^{-4}$) ↓
ACM [13]	–	28.96	22.01	5.89	30.53	26.54	22.29	52.83	12.65
ALCNet [34]	–	44.90	15.14	12.18	24.48	49.70	13.12	71.05	7.01
RDIAN [35]	–	75.72	5.99	62.72	8.24	82.22	4.50	81.53	4.39
DNA_Net [1]	–	80.43	4.76	70.07	7.35	84.84	3.57	85.97	3.17
ISTDU-Net [36]	–	79.11	5.03	66.98	7.45	84.47	3.85	86.47	3.58
UIUNet [12]	–	77.86	3.96	66.17	5.63	82.52	3.07	84.64	3.27
U-Net [37]	–	79.69	4.54	64.62	7.14	84.31	3.74	85.05	3.38
ResUNet [38]	–	79.11	4.81	67.89	7.41	84.31	3.74	85.04	3.39
DRPCA-Net [39]	–	34.83	13.14	14.32	18.14	41.22	11.14	48.66	9.59
DETR [27]	ResNet50	0	0	0	0	0	0	0	0
DN-DETR [30]	ResNet50	0	0	0	0	0	0	0	0
Deformable DETR [28]	ResNet50	5.49	9.37	6.42	8.09	5.48	9.43	5.25	10.92
DETA [40]	ResNet50	63.70	1.15	54.23	1.57	64.15	1.12	70.28	0.93
DINO [31]	ResNet50	86.44	1.30	85.55	1.56	87.43	1.26	86.84	1.26
Deformable DETR+	SDG-ResNet50	7.13 (+1.64)	9.72 (+0.35)	7.90 (+1.48)	8.26 (+0.17)	7.11 (+1.63)	9.69 (+0.26)	7.34 (+2.09)	11.12 (+0.20)
DETA+	SDG-ResNet50	64.75 (+1.05)	1.08 (−0.07)	55.76 (+1.53)	1.42 (−0.15)	65.49 (+1.34)	1.03 (−0.09)	71.55 (+1.27)	0.86 (−0.07)
DINO+	SDG-ResNet50	86.82 (+0.38)	1.34 (+0.04)	85.71 (+0.16)	1.51 (−0.05)	87.51 (+0.08)	1.31 (+0.05)	87.41 (+0.57)	1.36 (+0.10)

Table 3. PD/FA comparison under different IoU thresholds on IR-SatDense. Improvements over the baselines are shown in parentheses. Higher PD indicates better detection performance, while lower FA indicates fewer false alarms.

Methods	IoU = 0.3: PD (%)	IoU = 0.3: FA ($10^{-4}$)	IoU = 0.4: PD (%)	IoU = 0.4: FA ($10^{-4}$)	IoU = 0.5: PD (%)	IoU = 0.5: FA ($10^{-4}$)	IoU = 0.6: PD (%)	IoU = 0.6: FA ($10^{-4}$)	IoU = 0.7: PD (%)	IoU = 0.7: FA ($10^{-4}$)
Deformable DETR	26.93	5.85	13.09	7.99	5.49	9.37	2.16	10.07	0.66	10.41
DETA	75.02	0.34	70.26	0.64	63.70	1.15	50.99	2.44	29.90	5.21
DINO	94.57	0.44	92.45	0.66	86.44	1.30	71.35	3.14	42.96	7.08
Deformable DETR+	30.38 (+3.45)	5.99 (+0.14)	15.72 (+2.63)	8.25 (+0.26)	7.13 (+1.64)	9.72 (+0.35)	2.93 (+0.77)	10.52 (+0.45)	0.98 (+0.32)	10.94 (+0.53)
DETA+	76.12 (+1.10)	0.32 (−0.02)	71.48 (+1.22)	0.57 (−0.07)	64.75 (+1.05)	1.08 (−0.07)	51.87 (+0.88)	2.39 (−0.05)	30.47 (+0.57)	5.14 (−0.07)
DINO+	94.93 (+0.36)	0.48 (+0.04)	92.80 (+0.35)	0.69 (+0.03)	86.82 (+0.38)	1.34 (+0.04)	71.44 (+0.09)	3.24 (+0.10)	43.10 (+0.14)	7.20 (+0.12)

Table 4. Comparison of different methods on the IR-SatDense dataset in terms of probability of detection (PD), false alarm rate (FA), number of parameters (Params), and computational complexity (FLOPs). PD and FA are expressed in percentage (%) and on a $10^{-4}$ scale, respectively. Params denotes the total number of learnable parameters, while FLOPs are computed for a $512 \times 512$ input image. “Deformable DETR+”, “DETA+” and “DINO+” denote the corresponding baseline detectors equipped with the proposed SDG-ResNet backbone; the values in parentheses indicate changes in PD and FA relative to the baselines on the IR-SatDense test set. Higher PD indicates better detection performance, lower FA indicates fewer false alarms, and lower Params and FLOPs indicate better efficiency.

Methods	PD (%) ↑	FA ($10^{-4}$) ↓	Params ↓	FLOPs ↓
ACM [13]	28.96	22.01	0.4 M	0.4 G
ALCNet [34]	44.90	15.14	0.4 M	0.4 G
RDIAN [35]	75.72	5.99	0.2 M	3.7 G
DNA_Net [1]	80.43	4.76	4.7 M	14.3 G
ISTDU-Net [36]	79.11	5.03	2.8 M	7.9 G
UIUNet [12]	77.86	3.96	2.8 M	7.9 G
U-Net [37]	79.69	4.54	2.8 M	7.9 G
ResUNet [38]	79.11	4.81	2.8 M	7.9 G
DRPCA-Net [39]	34.83	13.14	1.17 M	74.36 G
DETR [27]	0.00	0.00	41.5 M	60.5 G
DN-DETR [30]	0.00	0.00	43.6 M	65.3 G
Deformable DETR [28]	5.49	9.37	40.0 M	123.3 G
DETA [40]	63.70	1.15	48.3 M	182.0 G
DINO [31]	86.44	1.30	47.5 M	178.5 G
Deformable DETR+	7.13 (+1.64)	9.72 (+0.35)	42.4 M	130.8 G
DETA+	64.75 (+1.05)	1.08 (−0.07)	50.6 M	189.5 G
DINO+	86.82 (+0.38)	1.34 (+0.04)	49.8 M	186.1 G
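
The cost of the SDG-ResNet backbone can be read off the last six rows of Table 4. The sketch below only repeats that arithmetic: the numbers are copied from the table, while the dictionary layout and key names are illustrative choices.

```python
# (Params in M, FLOPs in G), copied from Table 4.
baselines = {
    "Deformable DETR": (40.0, 123.3),
    "DETA": (48.3, 182.0),
    "DINO": (47.5, 178.5),
}
with_sdg = {
    "Deformable DETR": (42.4, 130.8),
    "DETA": (50.6, 189.5),
    "DINO": (49.8, 186.1),
}

overhead = {}
for name, (p0, f0) in baselines.items():
    p1, f1 = with_sdg[name]
    overhead[name] = {
        # Absolute increase, rounded to the table's one-decimal precision.
        "params_M": round(p1 - p0, 1),
        "flops_G": round(f1 - f0, 1),
        # Relative increase with respect to the baseline detector.
        "params_pct": round(100 * (p1 - p0) / p0, 1),
        "flops_pct": round(100 * (f1 - f0) / f0, 1),
    }

for name, cost in overhead.items():
    print(name, cost)
```

Across the three detectors the backbone adds 2.3 to 2.4 M parameters and 7.5 to 7.6 G FLOPs, which is roughly 4 to 6% of each baseline's cost.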

Table 5. Ablation study of SDG-ResNet on the IR-SatDense dataset. Probability of detection (PD) and false alarm rate (FA) are reported at $T_{IoU} = 0.50$ for the whole test set (All) and for three density intervals defined by AMID. Every SDG variant improves PD over the DINO baseline on the whole test set and in the densest regime (AMID $\leq 1$), while keeping FA at a comparable level. ↑ indicates higher is better, and ↓ indicates lower is better.

Methods	All: PD (%) ↑	All: FA ($10^{-4}$) ↓	AMID $\leq 1$: PD (%) ↑	AMID $\leq 1$: FA ($10^{-4}$) ↓	1 < AMID $\leq 2$: PD (%) ↑	1 < AMID $\leq 2$: FA ($10^{-4}$) ↓	2 < AMID $\leq 3$: PD (%) ↑	2 < AMID $\leq 3$: FA ($10^{-4}$) ↓
DINO (baseline)	86.44	1.30	85.55	1.56	87.43	1.26	86.84	1.26
SDG@Res4	86.50	1.26	85.96	1.44	87.47	1.20	86.97	1.20
SDG@Res3	86.67	1.22	85.78	1.41	87.29	1.17	87.12	1.17
SDG (full)	86.82	1.34	85.71	1.51	87.51	1.31	87.41	1.36

Table 6. Comparison of detection performance on the sparse infrared small target dataset IRSTD-1K. Probability of detection (PD) and false alarm rate (FA) are reported on the full test set. “Deformable DETR+”, “DETA+” and “DINO+” denote the corresponding baseline detectors equipped with the proposed SDG-ResNet backbone; the values in parentheses indicate changes in PD and FA relative to the baselines on IRSTD-1K. ↑ indicates higher is better, and ↓ indicates lower is better.

Methods	PD (%) ↑	FA ($10^{-4}$) ↓
ACM [13]	82.49	0.97
ALCNet [34]	84.18	0.78
RDIAN [35]	80.81	0.30
DNA_Net [1]	82.83	0.26
ISTDU-Net [36]	87.21	0.44
UIUNet [12]	86.20	0.60
U-Net [37]	84.58	0.49
ResUNet [38]	82.83	0.32
DETR [27]	0.00	0.00
DN-DETR [30]	0.00	0.00
Deformable DETR [28]	77.18	0.18
DETA [40]	85.65	0.17
DINO [31]	85.46	0.17
Deformable DETR+	78.46 (+1.28)	0.21 (+0.03)
DETA+	86.22 (+0.57)	0.10 (−0.07)
DINO+	85.81 (+0.35)	0.17 (−0.00)

Table 7. PD/FA comparison under different background complexity levels on IR-SatDense at IoU = 0.5. Improvements over the baselines are shown in parentheses. Higher PD indicates better detection performance, while lower FA indicates fewer false alarms.

Methods	Easy: PD (%)	Easy: FA ($10^{-4}$)	Medium: PD (%)	Medium: FA ($10^{-4}$)	Complex: PD (%)	Complex: FA ($10^{-4}$)	Extreme: PD (%)	Extreme: FA ($10^{-4}$)
Deformable DETR	4.80	9.17	5.28	9.71	5.67	9.20	6.21	9.41
DETA	58.22	1.17	63.34	1.27	65.12	1.16	67.65	1.04
DINO	80.75	1.48	86.76	1.35	88.05	1.17	89.92	1.21
Deformable DETR + SDG	6.90 (+2.10)	9.40 (+0.23)	7.44 (+2.16)	9.80 (+0.08)	6.98 (+1.30)	9.67 (+0.47)	7.22 (+1.01)	10.05 (+0.64)
DETA + SDG	59.21 (+0.99)	1.18 (+0.01)	64.86 (+1.53)	1.12 (−0.15)	66.63 (+1.50)	1.06 (−0.09)	67.96 (+0.31)	0.99 (−0.05)
DINO + SDG	81.74 (+0.98)	1.50 (+0.02)	86.98 (+0.22)	1.33 (−0.02)	88.59 (+0.54)	1.22 (+0.05)	89.74 (−0.18)	1.35 (+0.14)

Table 8. PD comparison under different target size intervals on IR-SatDense at IoU = 0.5. The proportion of ground-truth targets in each size interval is given in parentheses in the column header, and the improvement over the corresponding baseline is given in parentheses after each PD value. Higher PD indicates better detection performance.

Methods	$A \leq 3 \times 3$ (38.16%)	$3 \times 3 < A \leq 4 \times 4$ (41.76%)	$A > 4 \times 4$ (20.08%)
Deformable DETR	0.13	6.88	15.74
DETA	41.29	72.71	88.16
DINO	77.13	92.02	92.34
Deformable DETR + SDG	0.69 (+0.56)	9.54 (+2.66)	17.22 (+1.48)
DETA + SDG	43.05 (+1.76)	73.61 (+0.90)	88.23 (+0.07)
DINO + SDG	77.10 (−0.03)	92.72 (+0.70)	92.84 (+0.50)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, X.; An, W.; Ying, X.; Li, R.; Chen, N.; Li, B.; Xiao, C.; Li, M. Semantic Density-Guided ResNet for Dense Infrared Small Target Detection. Remote Sens. 2026, 18, 1397. https://doi.org/10.3390/rs18091397

