1. Introduction
Infrared imaging does not rely on external illumination, enabling all-day and all-weather operation, strong penetration through smoke and haze, and high robustness under complex lighting conditions. Owing to these advantages, it has been widely adopted in long-distance target detection, environmental perception, and animal protection [1,2,3,4]. Among these applications, small infrared target detection plays a critical role in remote sensing situational awareness, long-range object monitoring, and maritime search and rescue and has emerged as a core research topic in intelligent infrared image analysis [5,6].
In recent remote sensing missions, a single infrared frame often contains numerous small targets that are extremely tiny, exhibit very low signal-to-noise ratios (SNRs), and are densely distributed amid complex background clutter. Under such dense conditions, conventional infrared small target detection methods typically suffer from degraded detection accuracy, increased miss rates, and elevated false alarm levels [7,8,9]. In particular, when the minimum inter-target distance becomes comparable to or smaller than the target size, responses from adjacent targets tend to overlap, and fine-grained target details are progressively lost during backbone downsampling, making accurate detection and reliable separation of neighboring targets especially challenging [10].
To better understand how deep backbones behave in dense and sparse scenes, Figure 1 visualizes feature responses at low, middle, and high stages for a dense target cluster and for a single-target scene. In the dense case, the low- and middle-level feature maps contain both clutter and target responses, whereas the high-level feature map shows a clear, compact semantic focus around the dense target cluster. In contrast, for the single-target case, the high-level semantic responses are much weaker and more diffuse. This phenomenon suggests that, in dense infrared scenes, dense target clusters naturally generate strong, well-aggregated semantic information at high levels, which can serve as a reliable prior to enhance low-level target details while suppressing background noise. However, most existing infrared small target detection networks still rely on generic multi-scale fusion or attention mechanisms built on ResNet features [1,2,11,12], where all spatial locations are processed in a largely density-agnostic manner. Although these methods aggregate multi-level semantics, they do not explicitly construct or reuse a high-level semantic density map of dense target clusters as a global prior to guide the refinement of low-level features in dense regions.
Existing infrared small target detection models, therefore, exhibit several limitations in dense scenes:
(1) Local spatial saliency limitation: Most current approaches are built upon local spatial saliency, enhancing targets by exploiting intensity contrast between a target and its surrounding background. When multiple adjacent targets are densely distributed in the same region, these methods struggle to distinguish subtle energy differences between targets and background, resulting in target adhesion, blurred responses, and incomplete separation in the detection maps.
(2) Density-aware semantic guidance deficiency: Although recent deep models employ powerful backbones and transformer-based heads, the backbone feature extraction process is typically bottom-up and does not explicitly encode target density. Features at different spatial locations are processed in a homogeneous manner, regardless of whether they belong to dense target clusters or mostly background. As illustrated in Figure 1, dense target clusters already induce strong, aggregated responses in high-level semantic features, but these cues are not reused to guide lower layers. Consequently, mid- and low-level features in dense regions are easily dominated by clutter-like responses, causing a sharp drop in probability of detection (PD) and a rise in false alarms (FAs) when the Average Minimum Inter-Target Distance (AMID) becomes small.
(3) Dense-target structural limitation: Traditional sparse-target models do not explicitly model the structural hierarchy of densely distributed targets in the spatial domain. In multi-target scenes with high spatial density, they often produce ambiguous spatial structures and mutual feature interference, which leads to degraded localization accuracy and unstable detection performance.
To overcome the above limitations in dense-target detection, this paper proposes a Semantic Density-Guided (SDG) backbone that explicitly leverages high-level semantic density to guide low-level feature enhancement. Instead of introducing complex attention blocks or modifying the detection head, SDG estimates a semantic density map from the deepest backbone stage and reuses it as a global prior to refine intermediate features. Concretely, a lightweight semantic density head predicts a one-channel semantic density map from the highest-level feature, and a set of Semantic Density-Guided Refine (SDGR) blocks injects this prior into mid- and low-level feature maps via residual spatial gating. In this way, dense target clusters that are already prominent in high-level semantics are used to enhance fine-grained target details in dense regions while suppressing noise responses in complex background areas. As a result, the backbone can respond differently in dense and sparse regions while preserving the benefits of existing transformer-based detectors.
The main contributions of this paper can be summarized as follows:
(1) Algorithmic contribution: We propose a Semantic Density-Guided backbone (SDG-ResNet) that augments a standard ResNet with a semantic density head and lightweight SDGR blocks. The deepest-stage feature is compressed into a semantic density map, which is then reused to modulate intermediate features through residual spatial gating. This design explicitly exploits high-level semantic density from dense target clusters to enhance low-level target details, providing density-aware semantic guidance for dense infrared small target detection while introducing only negligible additional parameters and FLOPs.
(2) Dataset contribution: We construct a novel simulated dataset named IR-SatDense (Infrared Satellite-Based Dense Small Target Dataset). Built on real satellite-based infrared backgrounds, IR-SatDense comprehensively simulates diverse target densities, signal-to-noise ratios (SNRs), and morphologies and is organized into subsets according to the Average Minimum Inter-Target Distance (AMID). This dataset provides controllable and realistic experimental scenarios for systematically evaluating detection algorithms under complex dense infrared conditions.
(3) Experimental contribution: We integrate the proposed SDG-ResNet into several transformer-based detectors, including DINO, Deformable DETR, and DETA, and conduct extensive experiments on the IR-SatDense dataset as well as on the sparse-target benchmark IRSTD-1K. Experimental results demonstrate that SDG-based detectors consistently improve PD at comparable FA levels, with particularly large gains in small-AMID (high-density) regimes, while maintaining strong performance on sparse infrared small target datasets.
The structure of this paper is as follows. Section 2 reviews existing infrared small target detection datasets and methods. Section 3 describes the construction and statistical analysis of the proposed IR-SatDense dataset. Section 4 presents the Semantic Density-Guided ResNet (SDG-ResNet) backbone and its integration into DETR-style detectors. Section 5 reports implementation details, benchmark comparisons, ablation studies, and visual analyses on IR-SatDense and IRSTD-1K. Finally, Section 6 concludes the paper and discusses future research directions.
3. IR-SatDense Dataset Synthesis
In this section, we describe the proposed IR-SatDense dataset in detail.
3.1. Dataset Construction
(1) Data Collection: To construct a dense infrared small target dataset that reflects realistic on-orbit imaging conditions with complex backgrounds, a total of 2154 background images were collected from public satellite infrared imagery sources. These images cover diverse Earth observation scenarios and represent various background characteristics commonly encountered in infrared imaging tasks. All background images were cropped and preprocessed to a uniform resolution. According to background complexity, the images are categorized into four levels: (i) easy, (ii) medium, (iii) complex, and (iv) extremely complex scenes. Representative examples and the statistical distribution of each background level are shown in Figure 2.
(2) Targets and Annotations: To construct the proposed IR-SatDense dataset, a Dense Single-Frame Target Dataset Generator (DSTDGen) algorithm is developed. The pseudocode is presented in Algorithm 1, which consists of four main stages designed to automatically generate infrared small target samples with controllable density distributions and precise annotations on diverse backgrounds.
Step 1: Parameter and Template Initialization. The input includes a background infrared image and a set of preselected small-target templates. The mean target number and its variance determine the number of targets N generated on each image. The mean nearest inter-target distance d controls spatial density, while the SNR range defines target intensity. For each image, a random start position is initialized and N templates are randomly selected for subsequent placement (corresponding to lines 1–5 in Algorithm 1).
Step 2: Target Placement under Spatial Constraints. Targets are iteratively placed on the background while satisfying the mean nearest distance d. The first target is randomly rotated and placed at the start position. For each subsequent target j, its position is sampled within the convex envelope of the previously placed targets. If the mean inter-target distance deviates from d by more than 0.5 pixels, a local coordinate adjustment is applied to fine-tune the sampled position. If a valid position is found, the image and statistics are updated; otherwise, a fail counter is increased to maintain stability. This process ensures controllable inter-target spacing and a stable dense distribution (corresponding to lines 6–22 in Algorithm 1).
Step 3: Target Composition with Controllable Signal-to-Noise Ratio (SNR). To simulate the radiometric properties of real infrared point targets, each template is normalized and enhanced according to an SNR sampled within the specified SNR range. For each target position, the local background mean and standard deviation are calculated, and the target brightness is adjusted relative to these local statistics to reach the sampled SNR. The adjusted target is then superimposed on the background image to form a composite result, and a binary mask and a coordinate file are generated to record precise pixel positions. This ensures that synthetic targets exhibit realistic contrast and radiometric characteristics consistent with real infrared imagery (corresponding to lines 9–15 in Algorithm 1).
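The compositing in Step 3 can be illustrated with a minimal NumPy sketch. The scaling rule used here is an assumption, not the authors' formula: the peak-normalized template is scaled so that its peak lies `snr` local standard deviations above the background and is then added to the background patch. The function name `composite_target`, its arguments, and the use of the covered patch as the "local" region are all hypothetical choices made for illustration; the target is also assumed to lie fully inside the image.

```python
import numpy as np

def composite_target(background, template, center, snr, eps=1e-6):
    """Add one small-target template to a background at a requested SNR.

    Assumed rule (not from the paper): scale the peak-normalized template so
    its peak sits `snr` local standard deviations above the local background.
    """
    img = background.astype(np.float64).copy()
    th, tw = template.shape
    top, left = center[0] - th // 2, center[1] - tw // 2
    patch = img[top:top + th, left:left + tw]  # a view into img

    # Local background statistics, measured on the patch the target covers.
    mu_b, sigma_b = patch.mean(), patch.std()

    # Normalize the template to [0, 1] so that its peak is (almost exactly) 1.
    t = template.astype(np.float64)
    t_norm = (t - t.min()) / (t.max() - t.min() + eps)

    # Add the scaled target energy in place (this modifies img via the view).
    patch += snr * sigma_b * t_norm

    # Binary mask marking the pixels that received target energy.
    mask = np.zeros(background.shape, dtype=np.uint8)
    mask[top:top + th, left:left + tw] = (t_norm > 0).astype(np.uint8)
    return img, mask, (mu_b, sigma_b)
```

Under this assumed rule, the peak contrast of the inserted target divided by the local standard deviation equals the sampled SNR, which is one common way to define SNR for point-like infrared targets.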
Step 4: Output Generation and Statistical Annotation. After all N targets are added, the algorithm outputs four components: (1) the synthesized infrared image, (2) its corresponding binary mask, (3) the coordinate file containing pixel positions, and (4) the statistical table. Together, these outputs constitute one complete IR-SatDense sample with controllable density, brightness, and SNR (corresponding to line 23 in Algorithm 1).
| Algorithm 1 Pseudocode of DSTDGen Algorithm |
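Input: Background image, candidate target set, mean target number, variance, mean nearest distance d, SNR range.
1: Initialize output folders and parameters.
2: for each background image I in IR-SatDense do
3:   Sample the target number N from the mean target number and variance
4:   Choose start position, select N templates
5:   Initialize the coordinate list and statistics
6:   for j = 1 to N do
7:
8:     if j = 1 then
9:       Randomly rotate the template
10:      Place it at the start position
11:    else
12:      Sample a position within the convex envelope of previously placed targets
13:
14:      if the mean inter-target distance deviates from d by more than 0.5 pixels then
15:        Adjust the position locally; update if valid, else increase fail counter.
16:      end if
17:
18:    end if
19:    Append the target position to the coordinate list
20:
21:  end for
22:  Save the synthesized image, binary mask, coordinate file, and statistical table
23: end for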
|
3.2. Statistical Analysis
To comprehensively evaluate the representativeness and effectiveness of the proposed IR-SatDense dataset, a statistical comparison is conducted against several widely used benchmark datasets for infrared small target detection (ISTD). In dense-target scenarios, conventional metrics such as target count or area are insufficient to describe the degree of spatial compactness. Therefore, we introduce the Average Minimum Inter-Target Distance (AMID) metric to quantitatively characterize the density of target distributions within each image.
Definition 1. Average Minimum Inter-Target Distance (AMID). For the i-th image in the dataset containing $n_i$ targets $\{t_{i,1}, \dots, t_{i,n_i}\}$, the minimum Euclidean distance from the j-th target to all other targets is defined as
$$d_{i,j}^{\min} = \min_{k \neq j} \lVert t_{i,j} - t_{i,k} \rVert_2 .$$
The image-level average minimum distance is then given by
$$\mathrm{AMID}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} d_{i,j}^{\min} .$$
Finally, the overall dataset-level AMID is computed as
$$\mathrm{AMID} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AMID}_i ,$$
where N denotes the total number of images in the dataset. A smaller AMID value indicates stronger spatial compactness among targets, corresponding to a higher-density and more challenging detection scenario. Representative visual examples for different AMID ranges are shown in Figure 3.
As summarized in Table 1, the proposed IR-SatDense dataset contains 2154 images, divided into 50% for training, 25% for validation, and 25% for testing. To facilitate more detailed performance evaluation across different spatial density levels, the test subset is further partitioned based on the AMID metric, allowing a systematic analysis of model robustness under varying target compactness.
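The AMID definition can be sketched in a few lines of plain Python. This is an illustrative sketch under two assumptions that the text does not state: each target is represented by a single (x, y) position, and images with fewer than two targets (for which the nearest-neighbor distance is undefined) are skipped when averaging. The function names `image_amid` and `dataset_amid` are hypothetical.

```python
import math

def image_amid(points):
    """Mean nearest-neighbor distance over the targets of one image."""
    if len(points) < 2:
        return None  # undefined for fewer than two targets (assumption: skip)
    nearest = []
    for j, (xj, yj) in enumerate(points):
        # Distance from target j to its closest other target.
        nearest.append(min(math.hypot(xj - xk, yj - yk)
                           for k, (xk, yk) in enumerate(points) if k != j))
    return sum(nearest) / len(nearest)

def dataset_amid(images):
    """Average of the image-level values over all images where it is defined."""
    per_image = [a for a in map(image_amid, images) if a is not None]
    return sum(per_image) / len(per_image)

# Three targets: nearest-neighbor distances are 3, 4, and 3 pixels.
print(image_amid([(0, 0), (3, 4), (3, 0)]))
```

The same per-image value can also serve to assign test images to AMID intervals, which is how the density-specific subsets are described.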
Compared with existing datasets, IR-SatDense exhibits substantially higher target density and smaller average target size, while simultaneously covering multiple background complexity levels. Specifically, its average target area is approximately 10.42 pixels (corresponding to an average target width of about 3 pixels), which accurately reflects the small-scale and low-intensity nature of real infrared small targets. Furthermore, the dataset’s AMID value of 1.51 is significantly smaller than that of previous dense-target datasets, indicating much closer inter-target spacing and a higher degree of detection difficulty.
Figure 2b illustrates the proportional distribution of target density levels across images with different background complexities, demonstrating that IR-SatDense provides a comprehensive benchmark for dense-target infrared detection research and for studying how detection performance degrades as AMID decreases.
4. Proposed Baseline
In this section, we present the proposed Semantic Density-Guided ResNet (SDG-ResNet) backbone and its integration into a DINO-based detector for dense infrared small target detection. The core idea is to estimate a semantic density map from the deepest ResNet stage and use it as a global prior to refine mid- and low-level features via lightweight residual gating.
4.1. Motivation
To quantitatively assess the influence of target density on infrared small target detection, we employ DINO to derive PD and FA curves under varying Average Minimum Inter-Target Distance (AMID) and IoU thresholds. In this context, AMID characterizes the average distance to the nearest neighboring target in the image plane, where a smaller AMID corresponds to a denser target distribution.
As shown in Figure 4, when the AMID decreases from sparse to dense intervals, the PD of existing detectors consistently drops, especially under stricter IoU thresholds. At the same time, the FA curves show the opposite trend: dense scenes (small AMID) yield significantly more false alarms than sparse scenes. These observations clearly demonstrate that current architectures are not robust enough in dense infrared small-target scenarios, even if they perform well when targets are relatively sparse.
Although different detectors adopt different backbones and heads, most of them share a common design philosophy: backbone features are extracted in a purely bottom-up manner, and the notion of “how dense the targets are” is not explicitly encoded in the feature representation. All spatial locations are essentially treated in the same way, regardless of whether they belong to dense target regions or mostly background. As a result, when many small targets appear in close proximity, mid- and low-level features tend to be dominated by clutter-like responses, and the detector has difficulty maintaining high PD and low FA in such dense regimes.
Motivated by these observations, we aim to endow the backbone with an explicit awareness of target density and a simple mechanism to adapt its feature responses in dense regions. Instead of relying solely on the detection head, we introduce a Semantic Density-Guided ResNet (SDG-ResNet). In SDG-ResNet, the deepest ResNet stage is used to estimate a semantic density map that reflects the spatial distribution of potential target clusters, and this map is then employed to refine intermediate features through lightweight residual gating. In this way, the backbone can respond differently in dense and sparse areas, with the specific goal of improving PD and suppressing FA under small-AMID, dense infrared small-target scenarios.
4.2. Network Overview
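Given an input infrared image $I$, a standard ResNet-50 backbone extracts three feature maps
$$\{C_3, C_4, C_5\} = \mathrm{ResNet}(I), \qquad C_l \in \mathbb{R}^{c_l \times H_l \times W_l},$$
where $(c_3, c_4, c_5) = (512, 1024, 2048)$ for ResNet-50. In dense infrared scenes, the deepest feature $C_5$ encodes strong semantic responses of target clusters, while $C_3$ and $C_4$ contain detailed structures but are heavily contaminated by background clutter.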
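To exploit this property, we augment the backbone with two components: (1) a semantic density head that predicts a one-channel semantic density map $D$ from the deepest feature, and (2) Semantic Density-Guided Refine (SDGR) blocks that inject this map into the mid- and low-level features via residual spatial gating. As illustrated in Figure 5, the proposed SDG-ResNet consists of an overall detection framework, a semantic density head, and SDGR blocks for cross-stage feature refinement. Formally, the refined feature maps are given by
$$D = \mathcal{H}(C_5), \qquad \tilde{C}_l = \mathrm{SDGR}_l(C_l, D), \quad l \in \{3, 4\},$$
where $C_3$, $C_4$, and $C_5$ denote the three backbone feature maps from shallow to deep and $\mathcal{H}$ is the semantic density head. The set $\{\tilde{C}_3, \tilde{C}_4, C_5\}$ is then fed into a ChannelMapper neck and a DINO transformer head, which remain unchanged with respect to the baseline detector. We refer to the resulting detector as SDG DINO.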
4.3. Semantic Density Head
4.3.1. Architecture
The goal of the semantic density head is to compress the high-level feature $C_5$ into a scalar field that reflects the spatial distribution of targets or target clusters. The head consists of a pointwise convolution for channel reduction, followed by a spatial convolution for local context aggregation.
Given the deepest backbone feature $C_5$, the intermediate feature $F_d$ and the semantic density map $D$ are jointly computed as
$$F_d = \delta(\mathrm{BN}(W_1 * C_5)), \qquad D = \sigma(W_2 * F_d),$$
where $W_1$ is the channel-reduction convolution kernel, $W_2$ is the context-aggregation convolution kernel that outputs a single channel, * denotes convolution, $\mathrm{BN}$ is batch normalization, $\delta$ is the rectified linear unit, and $\sigma$ is the sigmoid function.
The output $D$ can be interpreted as a cluster-aware objectness prior:
$$D(x, y) \in (0, 1),$$
where $D(x, y)$ denotes the probability that location $(x, y)$ belongs to a target or target cluster.
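A minimal NumPy sketch of such a density head is given below for intuition. Several details are assumptions rather than facts from the text: the kernel sizes (1×1 for channel reduction, 3×3 for context aggregation), the toy channel counts, and the omission of batch normalization. The helper names `conv2d` and `semantic_density_head` are hypothetical.

```python
import numpy as np

def conv2d(x, w):
    """Zero-padded 'same' convolution (cross-correlation, as in deep learning).

    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k.
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for dy in range(k):
        for dx in range(k):
            # Contribution of one kernel tap, accumulated over all pixels.
            out += np.einsum('oc,chw->ohw', w[:, :, dy, dx],
                             xp[:, dy:dy + h, dx:dx + wd])
    return out

def semantic_density_head(c5, w_reduce, w_context):
    """Deepest feature -> one-channel density map in (0, 1)."""
    hidden = np.maximum(conv2d(c5, w_reduce), 0.0)  # assumed 1x1 conv + ReLU (BN omitted)
    logits = conv2d(hidden, w_context)              # assumed 3x3 conv to a single channel
    return 1.0 / (1.0 + np.exp(-logits))            # sigmoid

rng = np.random.default_rng(0)
c5 = rng.normal(size=(8, 4, 5))                     # toy deepest feature (8 channels)
density = semantic_density_head(c5,
                                0.1 * rng.normal(size=(4, 8, 1, 1)),
                                0.1 * rng.normal(size=(1, 4, 3, 3)))
print(density.shape)                                # (1, 4, 5)
```

Because the output passes through a sigmoid, every value of the map lies strictly between 0 and 1, which is what allows it to be read as a per-location prior.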
4.3.2. Design Rationale
The semantic density head is intentionally shallow and linear in the channel dimension. It does not attempt to re-learn complex patterns but rather projects the existing high-level semantics into a single-channel prior. This design has three advantages:
It preserves the original deepest-stage feature for the detection head, avoiding interference with high-level semantics.
It provides an interpretable, spatially dense prior that can be reused across multiple backbone stages.
It adds only a negligible number of parameters and FLOPs.
4.4. Semantic Density-Guided Refine Block
The Semantic Density-Guided Refine Block (SDGR) injects the semantic prior $D$ into a low-level feature map $C_l$ ($l \in \{3, 4\}$) to suppress background clutter and selectively enhance responses near dense semantic regions. Intuitively, $D$ plays the role of a high-level gating signal that indicates where small-target clusters are likely to appear, while $C_l$ provides fine-grained local texture and contrast information.
4.4.1. Density Upsampling and Embedding Alignment
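Because the semantic density map $D$ is defined at the spatial resolution of the deepest feature $C_5$, it is first upsampled to match the resolution of the low-level feature $C_l$:
$$D_l = \mathrm{Up}(D),$$
where $\mathrm{Up}(\cdot)$ denotes bilinear interpolation and $D_l \in \mathbb{R}^{1 \times H_l \times W_l}$.
To reduce computational cost and to learn a compact joint representation, both $C_l$ and $D_l$ are projected into a shared low-dimensional embedding space with $c_e$ channels:
$$E_{low} = W_{low} * C_l, \qquad E_d = W_d * D_l,$$
where $E_{low}, E_d \in \mathbb{R}^{c_e \times H_l \times W_l}$, and both $W_{low}$ and $W_d$ are pointwise convolution kernels corresponding to the low_proj and d_proj layers in the network implementation.
This step is analogous to the linear projections used in attention gates [32], where low-level and high-level features are first mapped into a common intermediate space before computing attention coefficients.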
4.4.2. Feature–Density Fusion and Gate Prediction
The two embeddings are concatenated along the channel dimension and fused by a spatial convolution with batch normalization and ReLU:
$$U = \delta(\mathrm{BN}(W_f * [E_{low}; E_d])),$$
where $[\cdot\,;\cdot]$ denotes channel-wise concatenation of the feature embedding $E_{low}$ and the density embedding $E_d$, $\delta$ is the ReLU, and $W_f$ is the fusion kernel. This fusion stage allows the network to jointly reason about local appearance (from $E_{low}$) and semantic density (from $E_d$) within a local neighborhood, instead of making gating decisions based on a single pixel.
A spatial gate is then generated by a convolution followed by a sigmoid function:
$$G = \sigma(W_g * U),$$
where $W_g$ is the gating kernel and $G \in (0, 1)^{1 \times H_l \times W_l}$. The gate value $G(x, y)$ measures how strongly the low-level feature at position $(x, y)$ should be preserved or suppressed, conditioned jointly on local appearance and high-level semantic density.
Although the SDGR block produces a spatial gating map, it is fundamentally different from conventional spatial attention mechanisms. Typical attention modules (e.g., CBAM [33] or attention gates [32]) estimate attention weights directly from the same feature map that is being refined, focusing on local saliency or channel interactions. In contrast, our SDGR block is driven by an explicitly constructed semantic density prior derived from the deepest backbone stage. The gating signal is therefore not computed from the low-level feature itself but projected from high-level semantic clustering responses that encode global target-density information. This cross-stage prior injection enables density-aware modulation of intermediate features, rather than generic saliency reweighting. Consequently, SDG-ResNet explicitly models spatial target density as a structural property of dense scenes, instead of treating attention as a purely local feature recalibration mechanism.
4.4.3. Residual Refinement
Finally, we refine the low-level feature using a residual gating formulation. In this formulation, ⊙ denotes element-wise multiplication, $\mathbf{1}$ is an all-ones map broadcastable to the shape of the gate, and $\alpha$ is a learnable scalar parameter whose initial value is close to zero.
At the beginning of training, $\alpha$ is close to zero, and the SDGR block behaves almost as an identity mapping, which stabilizes optimization and preserves the benefits of ImageNet pre-training. As training proceeds, $\alpha$ is automatically adjusted such that features in high-density regions (where the gate is close to 1) are preserved or slightly enhanced, while responses in low-density regions (where the gate tends to be smaller) are progressively suppressed. This residual formulation thus realizes a semantic density-guided, spatially adaptive modulation of low-level features while avoiding aggressive modifications that could harm the backbone representation in ambiguous areas.
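The behavior described above can be illustrated with a small NumPy sketch. The exact modulation rule is an assumption: the sketch multiplies the feature by `1 + alpha * (gate - 1)`, which is only one form consistent with the stated properties (identity when alpha is zero, preservation where the gate is 1, suppression where the gate is small) and is not claimed to be the authors' formula. The function name `residual_gate` is hypothetical.

```python
import numpy as np

def residual_gate(feat, gate, alpha):
    """Residual spatial gating (assumed form, for illustration only).

    feat:  (C, H, W) low-level feature map
    gate:  (1, H, W) spatial gate with values in [0, 1]
    alpha: scalar controlling how strongly the gate modulates the feature
    """
    # The gate broadcasts over the channel dimension of feat.
    return feat * (1.0 + alpha * (gate - 1.0))

feat = np.ones((2, 2, 2))
gate = np.array([[[1.0, 0.0],
                  [0.5, 1.0]]])

identity = residual_gate(feat, gate, alpha=0.0)   # alpha = 0: feature unchanged
modulated = residual_gate(feat, gate, alpha=0.5)  # low-gate positions are damped
print(modulated[0])
```

With this assumed form, a small `alpha` keeps the block near the identity at the start of training, and increasing `alpha` progressively damps positions where the gate is low while leaving high-gate positions untouched.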
It should be clarified that the semantic density head is not designed as an independent density regression branch. Instead, the predicted density map serves as an intermediate prior for feature modulation and is optimized implicitly through the overall detection objective. During backpropagation, gradients from the detection loss propagate through the SDGR blocks to the density head, enabling it to learn density-aware representations without requiring explicit density annotations. This implicit supervision mechanism is consistent with many attention-based modules that are trained end-to-end without auxiliary losses.
4.5. Integration with DINO and Complexity Analysis
The proposed SDG-ResNet is integrated into a DINO-style transformer detector without modifying the detection head. The refined feature maps are first converted by a ChannelMapper to a unified channel dimension and then fed into the DINO encoder–decoder. Let $\mathcal{L}_{\mathrm{DINO}}$ denote the original DINO detection loss, which combines classification, bounding-box regression, and IoU/GIoU terms, including auxiliary losses for intermediate layers. We do not introduce any extra loss terms, and the overall training objective is
$$\mathcal{L} = \mathcal{L}_{\mathrm{DINO}}.$$
Therefore, the semantic density head and the SDGR blocks are supervised implicitly through the detection objective.
In terms of complexity, SDG-ResNet introduces:
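Two convolutions on the deepest feature for semantic density estimation (one for channel reduction and one for local context aggregation);
For each of the two refined stages (Res3 and Res4), two projection convolutions, one fusion convolution, and one gating convolution;
Two learnable scalar parameters, one per SDGR block.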
All the added operations act on feature maps that are already computed by the backbone, and the extra FLOPs are negligible compared with the ResNet and transformer encoder–decoder. This makes SDG-ResNet a practical and efficient backbone for dense infrared small target detection.
5. Experiments
In this section, we introduce the evaluation metrics, experimental settings, comparisons with state-of-the-art (SOTA) methods, and ablation studies.
5.1. Implementation Details
(1) Dataset: Experiments are conducted on the IR-SatDense dataset, which contains a large number of small infrared targets distributed across diverse background scenes. The targets are very small (average width about 3 pixels) and have low signal-to-noise ratios, enabling evaluation under complex and cluttered infrared conditions. According to the Average Minimum Inter-Target Distance (AMID), the test set is divided into multiple subsets to assess detection performance under varying density levels, with particular focus on the challenging dense (small-AMID) regime.
(2) Implementation: All detectors are implemented within the DINO framework using ResNet-50 or SDG-ResNet as the backbone. For the proposed variants (SDG Deformable DETR, SDG DETA, SDG DINO), we simply replace the standard ResNet-50 backbone with SDG-ResNet while keeping all other hyperparameters unchanged to ensure a fair comparison. The AdamW optimizer is adopted with a learning rate of and a batch size of 2 for 180,000 iterations. Input images are normalized to match DINO’s default preprocessing pipeline. All experiments are conducted on a single NVIDIA RTX 4090 GPU.
(3) Evaluation Metrics: To comprehensively evaluate detection performance, we adopt three metrics: probability of detection (PD), false alarm rate (FA), and FLOPs/Params. PD and FA measure detection capability and robustness, while FLOPs and Params characterize computational complexity.
All methods are evaluated under a unified box-level protocol. For DETR-style detectors (DETR, Deformable DETR, DETA, DINO, and our SDG variants), the network directly predicts a set of bounding boxes. For segmentation-based ISTD methods (e.g., ACM, ALCNet, RDIAN, DNA_Net, ISTDU-Net, UIUNet, U-Net, ResUNet), the network outputs a binary mask for each test image. We first extract all connected components from the predicted mask and, for each component, compute its tight axis-aligned enclosing rectangle. These rectangles are treated as the predicted boxes. Ground-truth annotations are also represented as axis-aligned bounding boxes. In this way, both detection and segmentation methods are evaluated with exactly the same box-based criteria.
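The mask-to-box conversion used for segmentation-based methods can be sketched in plain Python as follows. The connectivity rule is not stated in the text, so 8-connectivity is an assumption here, as are the function name `mask_to_boxes` and the box convention `(x0, y0, x1, y1)` with exclusive upper bounds.

```python
from collections import deque

def mask_to_boxes(mask):
    """Tight axis-aligned boxes of the connected components of a binary mask.

    mask: list of rows containing 0/1 values.
    Returns a list of (x0, y0, x1, y1) with exclusive x1 and y1.
    Assumes 8-connectivity (not specified in the text).
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over one component.
            queue = deque([(y, x)])
            seen[y][x] = True
            x0, y0, x1, y1 = x, y, x, y
            while queue:
                cy, cx = queue.popleft()
                x0, y0 = min(x0, cx), min(y0, cy)
                x1, y1 = max(x1, cx), max(y1, cy)
                for ny in range(cy - 1, cy + 2):
                    for nx in range(cx - 1, cx + 2):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((x0, y0, x1 + 1, y1 + 1))
    return boxes

example = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mask_to_boxes(example))
```

In dense scenes the choice of connectivity matters: with 8-connectivity, diagonally touching targets merge into one component and therefore one box, which is one way adjacent targets can be counted as a single detection.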
The matching between a predicted box $B_p$ and a ground-truth box $B_g$ is determined by the intersection over union (IoU):
$$\mathrm{IoU}(B_p, B_g) = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}.$$
A prediction is counted as a true positive (TP) if $\mathrm{IoU}(B_p, B_g) \geq \tau$, where $\tau$ is the IoU threshold. A one-to-one matching strategy is adopted: each ground-truth target is matched to at most one prediction (the one with the highest IoU), and unmatched predictions are treated as false alarms. For small targets, IoU is highly sensitive to positional offsets, so the IoU threshold is uniformly set to $\tau = 0.5$, which provides a reasonable balance between localization precision and tolerance.
All predicted boxes whose confidence scores exceed a fixed threshold are counted as detections. Let $N_{gt}$ denote the total number of ground-truth targets in the test set, and let $N_{pred}$ denote the total number of predicted boxes. Among all detections, $N_{TP}$ are matched as true positives, and the rest belong to the false alarm set $\mathcal{F}$. The probability of detection (PD) and the false alarm rate (FA) are defined as
$$\mathrm{PD} = \frac{N_{TP}}{N_{gt}}, \qquad \mathrm{FA} = \frac{\sum_{k \in \mathcal{F}} A_k}{A_{total}},$$
where $A_k$ is the area (in pixels) of the k-th false-alarm box and $A_{total}$ is the total image area over the whole test set (i.e., the sum of the pixel numbers of all test images).
In other words, PD measures the fraction of correctly detected targets among all ground-truth targets, while FA measures the proportion of image area occupied by false-alarm boxes.
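The box-level protocol can be sketched in plain Python as follows. The matching order is an assumption: ground-truth boxes are processed in list order and each takes the unused prediction with the highest IoU, which is a simple greedy reading of the one-to-one rule. The names `iou` and `pd_fa`, and the `(x0, y0, x1, y1)` box format with exclusive upper bounds, are hypothetical.

```python
def area(box):
    """Pixel area of an (x0, y0, x1, y1) box with exclusive upper bounds."""
    return (box[2] - box[0]) * (box[3] - box[1])

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def pd_fa(preds, gts, total_image_area, thr=0.5):
    """PD = matched targets / all targets; FA = false-alarm box area / image area."""
    used = set()
    true_positives = 0
    for gt in gts:
        best, best_iou = None, 0.0
        for i, pred in enumerate(preds):
            if i in used:
                continue  # each prediction can match at most one target
            value = iou(pred, gt)
            if value > best_iou:
                best, best_iou = i, value
        if best is not None and best_iou >= thr:
            used.add(best)
            true_positives += 1
    # Every prediction left unmatched is counted as a false alarm.
    false_alarm_area = sum(area(p) for i, p in enumerate(preds) if i not in used)
    pd = true_positives / len(gts) if gts else 0.0
    return pd, false_alarm_area / total_image_area

gts = [(0, 0, 4, 4), (10, 10, 14, 14)]
preds = [(0, 0, 4, 4), (11, 11, 15, 15), (20, 20, 22, 22)]
print(pd_fa(preds, gts, total_image_area=100 * 100))
```

In this toy example the second prediction is shifted by a single pixel from a 4×4 target and already falls to an IoU of 9/23, below the 0.5 threshold, which illustrates why IoU is so sensitive to positional offsets for small targets.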
5.2. Benchmark Results
Table 2 reports the detection performance at an IoU threshold of 0.5 on the IR-SatDense test set, including both the overall results (All) and the three density intervals defined by AMID. Overall, integrating the proposed SDG-ResNet backbone into DETR-style detectors leads to consistent PD improvements across most density regimes while keeping FA at a comparable or even lower level than the corresponding baselines.
It is worth noting that the Average Minimum Inter-Target Distance (AMID) is inversely related to the target density in the scene. A smaller AMID value indicates that targets are more densely distributed, while a larger AMID corresponds to relatively sparse target configurations. Therefore, the AMID-based evaluation provides a quantitative analysis of the detector performance under different target density conditions.
For DINO, replacing the vanilla ResNet-50 with SDG-ResNet yields a PD increase from 86.44% to 86.82% on the whole test set (+0.38%), with only a slight change in FA. In terms of density-specific results, SDG DINO improves PD in all three AMID intervals: from 85.55% to 85.71% (+0.16%), from 87.43% to 87.51% (+0.08%), and from 86.84% to 87.41% (+0.57%), respectively.
For Deformable DETR, the baseline model performs poorly on IR-SatDense, achieving only 5.49% PD overall. After introducing SDG-ResNet, SDG Deformable DETR improves the overall PD to 7.13% (+1.64%), with similar trends across all three AMID intervals (e.g., from 6.42% to 7.90% in one interval). Although the absolute PD values remain modest, this relative gain demonstrates that SDG can noticeably strengthen the detection capability of weaker DETR-style baselines.
Table 3 further shows that the SDG-equipped detectors consistently achieve higher PD across IoU thresholds from 0.3 to 0.7 while maintaining comparable FA. This confirms that the performance gain is not limited to IoU = 0.50 but reflects improved localization robustness.
In summary, the benchmark results confirm that semantic density guidance provides clear and consistent improvements for DETR-style detectors on IR-SatDense under different density conditions.
5.3. Comparison with State-of-the-Art Methods
To further evaluate the effectiveness and efficiency of the proposed SDG-ResNet, we compare SDG-enhanced detectors with representative segmentation-based ISTD networks and DETR-style detectors on IR-SatDense.
Table 4 summarizes the probability of detection (PD), false alarm rate (FA), and the model complexity in terms of parameters and FLOPs.
Among all compared methods, DINO already provides a very strong baseline, achieving 86.44% PD with 47.5M parameters and 178.5G FLOPs. After inserting SDG-ResNet, SDG DINO further improves PD to 86.82% (+0.38%) with only a small increase in model size and computation (49.8M params, +4.8%; 186.1G FLOPs, +4.3%), while FA remains at a similar level (+0.04). This shows that SDG brings measurable accuracy gains at a very modest additional cost.
For Deformable DETR, the baseline obtains 5.49% PD with 40.0M parameters and 123.3G FLOPs. The SDG version, SDG D-DETR, increases PD to 7.13% (+1.64%) with 42.4M parameters (+6.0%) and 130.8G FLOPs (+6.1%), while FA remains on the same order of magnitude. Although the absolute performance is still lower than that of DINO, this relative improvement verifies that SDG can noticeably strengthen the detection capability of weaker DETR-style baselines in dense small-target scenarios.
For DETA, introducing SDG-ResNet brings clear gains in both accuracy and robustness. The overall PD increases from 63.70% to 64.75% (+1.05%), while FA is reduced. This improvement is achieved with a moderate overhead in complexity: the number of parameters grows from 48.3M to 50.6M (about +4.8%), and FLOPs from 182.0G to 189.5G (about +4.1%). These results indicate that even for a head-optimized detector like DETA, semantic density guidance at the backbone level can still provide a favorable accuracy–complexity trade-off.
Compared with the segmentation-based ISTD methods (e.g., DNA_Net, ISTDU-Net, UIUNet), SDG DINO achieves the highest PD on IR-SatDense while maintaining a competitive FA, despite having a larger model size. Taken together, these results indicate that the proposed SDG-ResNet is a lightweight yet effective plug-in backbone for dense infrared small target detection: it yields consistent PD improvements for DETR-based detectors at a modest cost of roughly 4 to 6% more parameters and FLOPs, and it can be readily integrated into existing architectures.
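The relative overheads quoted in this subsection can be re-derived from the absolute parameter counts and FLOPs given in the text. The following is a minimal sketch of that arithmetic; the dictionary layout and the function names are illustrative choices, not part of the original evaluation code.

```python
def relative_increase(base, new):
    """Return the relative increase of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# (parameters in M, FLOPs in G) for each baseline and its SDG variant,
# copied from the figures reported in the text.
COSTS = {
    "DINO": ((47.5, 178.5), (49.8, 186.1)),
    "Deformable DETR": ((40.0, 123.3), (42.4, 130.8)),
    "DETA": ((48.3, 182.0), (50.6, 189.5)),
}


def overhead_table(costs):
    """Map each detector to (parameter overhead %, FLOPs overhead %), rounded to 0.1."""
    table = {}
    for name, ((p0, f0), (p1, f1)) in costs.items():
        table[name] = (
            round(relative_increase(p0, p1), 1),
            round(relative_increase(f0, f1), 1),
        )
    return table


if __name__ == "__main__":
    for name, (d_params, d_flops) in overhead_table(COSTS).items():
        print(f"{name}: params +{d_params}%, FLOPs +{d_flops}%")
```

Under these inputs, every overhead falls between about 4% and 6%, which matches the percentages stated for the three detectors.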
5.4. Ablation Study
We conduct ablation experiments on IR-SatDense based on the DINO detector at . The test set is divided into three density ranges according to AMID (, , ) plus the overall set (All). We compare four variants: the original DINO (baseline), SDG@Res4 (only an SDGR block on Res4), SDG@Res3 (only on Res3), and SDG (full), which inserts SDGR blocks at both Res3 and Res4.
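The density-wise split used in this ablation can be illustrated with a short sketch. It rests on several assumptions that are not taken from the paper: AMID is read here as the mean nearest-neighbor distance between target centers in one image, the three ranges are treated as disjoint, the bin edges in `HYPOTHETICAL_EDGES` are placeholders, and the per-target `detected` flags are assumed to come from a separate matching step.

```python
import math


def amid(centers):
    """Mean nearest-neighbor distance between target centers (assumed reading of AMID)."""
    if len(centers) < 2:
        return math.inf  # a lone target has no neighbor, so treat it as maximally sparse
    total = 0.0
    for i, (xi, yi) in enumerate(centers):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(centers)
            if j != i
        )
    return total / len(centers)


def density_bin(value, edges):
    """Index of the first range whose upper edge exceeds `value`; the last range is open-ended."""
    for k, edge in enumerate(edges):
        if value < edge:
            return k
    return len(edges)


def pd_by_density(images, edges):
    """Return (PD per density range, overall PD) from (centers, detected_flags) pairs."""
    hits = [0] * (len(edges) + 1)
    totals = [0] * (len(edges) + 1)
    for centers, detected in images:
        k = density_bin(amid(centers), edges)
        hits[k] += sum(detected)
        totals[k] += len(detected)
    per_bin = [h / t if t else None for h, t in zip(hits, totals)]
    overall = sum(hits) / sum(totals) if sum(totals) else None
    return per_bin, overall


# Placeholder edges in pixels; the thresholds used in the paper are not reproduced here.
HYPOTHETICAL_EDGES = (4.0, 8.0)
toy_images = [
    ([(0, 0), (3, 4)], [True, False]),              # AMID = 5.0
    ([(0, 0), (0, 2), (0, 4)], [True, True, True]), # AMID = 2.0
    ([(0, 0), (30, 40)], [True, True]),             # AMID = 50.0
]
per_bin, overall = pd_by_density(toy_images, HYPOTHETICAL_EDGES)
```

Here PD is accumulated over targets rather than averaged over images, so a crowded image contributes more to its range than a sparse one; whether the paper aggregates in the same way is not stated in this section.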
As shown in Table 5, all SDG variants improve PD over the DINO baseline (86.44% PD,
FA) on the whole test set. SDG@Res4 and SDG@Res3 increase PD to 86.50% (+0.06%) and 86.67% (+0.23%), while slightly reducing FA to
and
, respectively. The full SDG configuration achieves the highest PD of 86.82% (+0.38%) with a marginal FA change to
(+0.04).
The gains are most evident in the dense regime (), where many targets are tightly clustered: PD increases from 85.55% (baseline) to 85.96%, 85.78%, and 85.71% for SDG@Res4, SDG@Res3, and full SDG, respectively, with FA staying around the baseline level. In the densest practical regime , PD improves from 86.84% to 86.97%, 87.12%, and 87.41% (+0.57% for full SDG). Overall, these results show that injecting a shared semantic density prior into Res3/Res4 consistently enhances detection performance, and that jointly refining both stages (full SDG) provides the best PD–FA trade-off.
Although SDG-ResNet introduces additional parameters and computational overhead, the increase remains modest relative to the baseline backbone. For example, when integrated into DINO, the number of parameters increases from 47.5M to 49.8M (approximately +4.8%), and the FLOPs increase from 178.5G to 186.1G (approximately +4.3%). Compared with the overall computational scale of transformer-based detectors, this additional cost is small and is unlikely to hinder practical deployment. Meanwhile, SDG-ResNet consistently improves PD across dense scenarios. These results indicate a favorable performance–efficiency trade-off for dense infrared small target detection.
To examine whether the proposed SDG module affects optimization stability, we compare the training loss curves between baseline detectors and their SDG-equipped variants. As shown in Figure 6, all SDG-equipped models exhibit smooth convergence behavior that closely follows their corresponding baselines. No noticeable oscillation or divergence is observed during training. Moreover, the convergence speed remains comparable across all models, indicating that the introduced density-guided refinement does not adversely impact training stability.
5.5. Comparison Results on Sparse Target Dataset IRSTD-1K
To further verify the generalization ability of SDG on conventional sparse infrared small targets, we also conduct experiments on the IRSTD-1K dataset [2]. The quantitative results are summarized in Table 6. Classical segmentation-based methods (ISTDU-Net, UIUNet, U-Net, etc.) already achieve PD values above 80% on this sparse benchmark, while DINO attains a strong trade-off with 85.46% PD and the lowest FA of
. Introducing SDG-ResNet into DINO further improves PD slightly to 85.81% while keeping FA unchanged.
These results indicate that IRSTD-1K is relatively easy in terms of target density and that transformer-based detectors such as DINO remain highly competitive even without explicit density modeling. The performance improvement on IRSTD-1K is modest compared with that on IR-SatDense. This is expected because IRSTD-1K is primarily a sparse-target dataset, where most images contain only one or a few isolated targets with relatively large inter-target distances. In such scenarios, dense semantic clustering rarely occurs in high-level feature maps, and the predicted density prior tends to be spatially diffuse. As a result, the SDGR blocks behave almost like identity mappings, leading to stable but limited gains. Importantly, SDG-ResNet does not degrade performance in sparse scenes, indicating that the density-guided mechanism remains compatible with conventional sparse-target detection settings.
5.6. Performance Under Different Background Complexity
To further analyze the robustness of the proposed semantic density guidance (SDG) mechanism under different background conditions, we evaluate the detection performance on the predefined complexity subsets of the IR-SatDense dataset.
The results are summarized in Table 7. Overall, the proposed SDG brings consistent improvements for Deformable DETR and DETA across all background complexity levels. For DINO, SDG also improves performance in most cases, namely under the easy, medium, and complex scene conditions, while only a marginal fluctuation is observed under the most challenging condition, the extremely complex scene.
In particular, the improvements are more noticeable for relatively weaker baselines such as Deformable DETR and DETA, indicating that the proposed semantic density guidance effectively enhances the robustness of dense infrared small target detection under varying background complexities.
5.7. Performance Under Different Target Sizes
To further analyze the detection capability for extremely small infrared targets, we evaluate the detection probability (PD) under different target size intervals. The bounding-box areas are divided into three groups: , , and pixels. The proportions of ground-truth targets in these intervals are 38.16%, 41.76%, and 20.08%, respectively.
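The size-wise breakdown described above amounts to grouping ground-truth boxes by area and computing, for each group, its share of all targets and its PD. The sketch below shows one way to do this; the area edges in `HYPOTHETICAL_AREA_EDGES` are placeholders rather than the paper's intervals, and the `detected` flag of each target is assumed to be supplied by a separate matching step.

```python
def size_group(width, height, area_edges):
    """Return the index of the area interval that a width x height box falls into.

    Group k covers areas up to and including area_edges[k]; the last group is open-ended.
    """
    area = width * height
    for k, edge in enumerate(area_edges):
        if area <= edge:
            return k
    return len(area_edges)


def summarize_by_size(targets, area_edges):
    """Compute the share of ground-truth targets and the PD for each size group.

    `targets` holds (width, height, detected) tuples, one per ground-truth target.
    """
    n_groups = len(area_edges) + 1
    counts = [0] * n_groups
    hits = [0] * n_groups
    for width, height, detected in targets:
        k = size_group(width, height, area_edges)
        counts[k] += 1
        hits[k] += int(detected)
    total = sum(counts)
    shares = [c / total if total else 0.0 for c in counts]
    pds = [h / c if c else None for h, c in zip(hits, counts)]
    return shares, pds


# Placeholder area edges in square pixels; not the intervals used in the paper.
HYPOTHETICAL_AREA_EDGES = (4, 25)
toy_targets = [
    (2, 2, True),   # area 4  -> group 0
    (3, 3, False),  # area 9  -> group 1
    (3, 3, True),   # area 9  -> group 1
    (5, 5, True),   # area 25 -> group 1
    (8, 8, True),   # area 64 -> group 2
]
shares, pds = summarize_by_size(toy_targets, HYPOTHETICAL_AREA_EDGES)
```

The three proportions reported in the text (38.16%, 41.76%, and 20.08%) sum to 100%, which is consistent with non-overlapping intervals of this kind.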
As shown in Table 8, detection performance increases with target size for all methods, indicating that extremely small targets remain the most challenging scenario in infrared imagery. Nevertheless, the proposed SDG module consistently improves detection performance across different size intervals. In particular, more noticeable improvements are observed for extremely small targets (
), demonstrating that the semantic density guidance mechanism effectively enhances the representation of weak target signals.
5.8. Visual Analysis
To further illustrate the detection effectiveness of the proposed SDG-ResNet, we conduct a qualitative comparison on the four background complexity levels defined in Figure 2, namely easy, medium, complex, and extremely complex on-orbit scenes. The visualization results are shown in Figure 7, where each column corresponds to one background level, and the rows compare the baseline DINO with DINO + (equipped with SDG-ResNet).
Across all four background types, the baseline DINO either misses part of the densely distributed targets or produces spurious responses in cluttered non-target regions. By contrast, DINO + detects more true targets within dense clusters and effectively suppresses false alarms on background structures. This confirms that introducing SDG-ResNet can simultaneously enhance dense-target detection and reduce false alarms under diverse on-orbit background conditions.
6. Conclusions and Further Analysis
In this paper, we addressed the challenging problem of dense infrared small target detection, where tiny low-SNR targets appear in highly crowded configurations. We first constructed a new satellite dense infrared dataset, IR-SatDense, in which target density, inter-target spacing, and SNR can be flexibly controlled. Based on the proposed AMID metric, IR-SatDense reveals that the probability of detection (PD) of existing detectors degrades sharply while the false alarm rate (FA) increases as targets become more densely packed.
To mitigate this density-induced degradation, we proposed a Semantic Density-Guided ResNet (SDG-ResNet) backbone. SDG-ResNet predicts a semantic density map from the deepest ResNet stage and reuses it as a global prior to refine mid- and low-level features via lightweight Semantic Density-Guided Refine (SDGR) blocks. Integrated into representative DETR-like detectors such as Deformable DETR, DETA, and DINO, SDG consistently improves PD at comparable FA, especially in the most challenging small-AMID regime, while adding only a modest overhead of roughly 4 to 6% in parameters and FLOPs.
Experiments on both the dense IR-SatDense and the sparse IRSTD-1K datasets demonstrate that SDG-ResNet enhances robustness in dense-target regimes without sacrificing performance in sparse scenarios. In future work, we plan to extend semantic density guidance to multi-frame and multi-scale settings and to explore joint modeling of temporal density evolution and long-range motion patterns in satellite infrared sensing.