Article

IR-ADMDet: An Anisotropic Dynamic-Aware Multi-Scale Network for Infrared Small Target Detection

Air Defense and Antimissile School, Air Force Engineering University, Xi’an 710051, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2025, 17(10), 1694; https://doi.org/10.3390/rs17101694
Submission received: 1 April 2025 / Revised: 5 May 2025 / Accepted: 9 May 2025 / Published: 12 May 2025

Abstract

Infrared small target detection in complex environments remains challenging due to low signal-to-noise ratios (SNRs), background clutter, and target scale variations. To address these issues, we propose an Anisotropic Dynamic-aware Multi-scale Network for Infrared Small Target Detection (IR-ADMDet). At its core, a Dual-Path Hybrid Feature Extractor Network (DPHFENet) synergizes local residual learning with global context modeling, enhancing faint target signatures while suppressing interference. A Hierarchical Adaptive Fusion Framework (HAFF) then integrates bidirectional gating, recursive graph enhancement, and interlink fusion to refine features across multiple scales, and the entire architecture is kept efficient through dynamic feature recalibration. Extensive experiments on the SIRSTv2, IRSTD-1k, and NUDT-SIRST benchmark datasets demonstrate the superiority of IR-ADMDet: it achieves state-of-the-art (SOTA) results, such as 0.96 AP50 and a 0.95 F1-score on SIRSTv2, with significantly fewer parameters (only 5.77 M) than existing methods, shows remarkable robustness in low-contrast, high-noise scenarios, and outperforms contemporary segmentation-based approaches.

1. Introduction

Infrared imaging represents a versatile sensing technology with far-reaching applications across numerous disciplines. Beyond its well-established role in defense systems, infrared imaging has become indispensable in healthcare for non-invasive diagnostics, building inspection for energy efficiency assessment, firefighting for locating victims in smoke-filled environments, wildlife conservation for nocturnal species monitoring, industrial process control for equipment fault detection, and astronomical observation for celestial body identification [1,2]. Within this broad technological landscape, the detection of small targets in infrared imagery constitutes a particularly crucial capability with extensive applications spanning military surveillance, security monitoring, search and rescue operations, environmental conservation, and early warning systems. Despite significant advancements, this field continues to encounter formidable challenges stemming from the inherent characteristics of small infrared targets. These challenges manifest when targets appear at considerable distances from sensors, resulting in minimal pixel representation—typically under 9 × 9 pixels in standard 256 × 256 imagery, sometimes reduced to a single pixel in extreme cases. The sparse distribution of these targets creates a substantial quantitative imbalance between target pixels (positive samples) and background pixels (negative samples). Furthermore, the operational environment introduces additional complexities: fixed field-of-view sensors produce imagery plagued by substantial background noise, where targets exhibit minimal shape and texture differentiation, possess low signal-to-noise ratios (SNRs), and frequently become obscured within background clutter [3]. The minimal contrast between targets and their immediate surroundings, coupled with interference elements that mimic target characteristics, further complicates detection processes, as illustrated in Figure 1. Consequently, advancing methodologies for infrared small target detection holds paramount importance for enhancing detection system capabilities across multiple domains, from disaster response to environmental monitoring and beyond.
The research community has developed two primary approach categories to address these challenges: model-driven and data-driven methodologies. Model-driven approaches encompass filtering techniques, local contrast mechanisms, and tensor representation strategies. Filtering approaches enhance target features while suppressing background elements, but their efficacy diminishes in complex backgrounds, revealing limitations in robustness. Local contrast mechanisms function by evaluating differential characteristics between targets and surrounding regions, yet struggle with weak target detection when target–background differentiation is minimal. Tensor representation approaches simultaneously reconstruct background and target imagery through tensor decomposition, though they frequently generate false alarms when confronted with morphological variations in background or target elements.
These model-driven approaches predominantly rely on manual feature extraction, which introduces inherent limitations: insensitivity to scale and texture variations in infrared small targets leads to detection instability. Additionally, long-range infrared imaging produces substantial noise clutter, complicating effective background extraction with traditional methods and resulting in elevated false alarm rates.
In contrast, data-driven deep learning methodologies facilitate surface feature learning, progressively transforming raw data into high-level semantic information through multi-layered feature extraction. Unlike their model-driven counterparts, these approaches eliminate the necessity for manually designed features, instead autonomously extracting relevant characteristics with enhanced efficiency and accuracy, thereby elevating target identification capabilities through complex feature learning from available datasets.
The introduction of Region-based Convolutional Neural Networks (R-CNNs) [4] pioneered data-driven object detection for visible light imagery, establishing a foundation for methods that have since demonstrated exceptional performance across benchmark datasets, including ImageNet, COCO, and PASCAL VOC. The emergence of infrared small target datasets has prompted increased exploration of deep learning applications in this specialized field, with researchers developing architectures specifically optimized for infrared small target detection.
Contemporary research has yielded several innovative approaches addressing infrared small target detection challenges. Hao et al. [1] integrated super-resolution techniques with deep learning, developing YOLO-SR with Transformer blocks to capture long-range dependencies, though challenges persist with extremely low-contrast targets. Tong et al. [2] proposed the Target-focused Enhancement Network (TENet) featuring dense long-distance constraint modules to manage complex environments with pronounced background effects, despite limitations in thermal crossover scenarios. Ma et al. [3] approached complex environment challenges through high- and low-frequency semantic reconstruction (HLSR-net), effectively mitigating background clutter through frequency characteristic differentiation, achieving notable accuracy improvements over previous methods.
Addressing domain disparities between synthetic and real infrared target data, Chi et al. [5] developed a semantic domain adaptive framework (SDAISTD) that effectively reduces domain gaps and improves training outcomes. Concurrently, Lin et al. [6] focused on contrast-enhanced shape-biased representations through specialized encoder–decoder architectures, successfully addressing challenges related to diminished textures, reduced contrast, and morphological variations in infrared small targets.
Broader object detection research in challenging imaging conditions offers valuable insights for infrared small target detection. For SAR imagery, Sun et al. [7] introduced the Multi-Scale Dynamic Feature Fusion Network (MSDFF-Net), addressing multi-scale imbalance and irregular scattering characteristics. Zhang et al. [8] proposed a framework incorporating dynamic feature discrimination and center-aware calibration for cross-sensor scenarios to manage feature distribution variations.
Remote sensing applications with comparable challenges have inspired several methodological innovations. Li et al. [9] developed synergistic attention modules for concurrent modeling of spatial and channel affinity, enhancing feature discrimination. Li et al. [10] introduced attention-attended modules refining multi-head self-attention mechanisms by eliminating irrelevant contextual information. Li et al. [11] created geometric prior-guided interactive networks, capturing local and global contexts through dual-branch structures. Li et al. [12] proposed cross-domain coupling networks for simultaneous refinement of frequency and spatial domain representations, while Li et al. [13] developed frequency decoupling networks that enhance feature representation through independent refinement of high- and low-frequency components.
Despite their effective capture of target details and contextual information through efficient feature extraction, deep learning methodologies face significant challenges when applied directly from visible light domains to infrared target detection contexts [14]. Infrared imagery exhibits reduced color information and diminished target texture characteristics compared to visible light imagery, limiting extractable features for target recognition [3]. The combination of low resolution, minimal target dimensions, and the absence of distinctive shape features further complicates target recognition and localization [1,15]. Complex background interference, reduced target gray values, and occlusion phenomena introduce additional complications to the infrared target detection process [2].
A particularly significant limitation in current methodologies lies in their difficulty distinguishing small targets from background clutter. This challenge emerges from inadequate association between shallow–deep, low–high, and local–global features, facilitating false alarm generation [5,6]. Many contemporary approaches either concentrate exclusively on spatial domain features [13] or implement feature fusion sequentially rather than through integrated mechanisms [9]. Furthermore, existing methodologies frequently struggle with computational efficiency, limiting their suitability for real-time applications in resource-constrained operational environments [12].
To address these multifaceted challenges, we introduce IR-ADMDet, a novel one-stage detector specifically engineered for infrared small target detection in single-frame imagery with irregular background noise. Our research contributions are summarized as follows:
(1)
We propose IR-ADMDet, a specialized one-stage detector for small target detection in infrared imagery characterized by irregular background noise, designed to maximize detection accuracy while minimizing missed detection instances.
(2)
The backbone DPHFENet incorporates Hybrid Feature Extractor Block (HFEBlock), combining CNN and Transformer architectural strengths to enhance feature extraction precision and efficiency, reduce parametric and computational requirements, and enrich contextual information representation.
(3)
The Neck HAFF implements the Bidirectional Gated Feature Symbiosis (BGFS) module for effective integration of local and global contextual features, complemented by Recurrent Graph-Enhanced Fusion (RGEF) and Interlink Fusion Core (IFC) modules that optimize model complexity and enhance detection performance through lightweight convolutions, reparameterization techniques, and attention mechanisms.
(4)
Comprehensive experimental evaluation demonstrated the superior performance of IR-ADMDet compared to state-of-the-art object detectors across benchmark datasets, including SIRSTv2, IRSTD-1k, and NUDT-SIRST, validating its enhanced detection capabilities.

2. Related Work

2.1. Bounding Box-Based Infrared Small Target Detection Methods

Redmon et al. [16] first proposed the end-to-end single-stage object detection network YOLO (You Only Look Once). It achieved excellent detection speed and accuracy on the VOC dataset. With the evolution of the YOLO series from YOLOv2 to YOLOv12, detection accuracy, speed, and network efficiency have been significantly improved. However, these bounding box-based methods mainly target larger or medium-sized objects in natural images. They have limitations in detecting small targets in infrared images with low contrast, high noise, and complex backgrounds. Qiu et al. [17] reconstructed the YOLOv2 backbone network for anti-interference detection of infrared imaging missiles. Their method achieved high accuracy in jamming environments but lacked robustness when processing low-contrast small targets. Zhang et al. [18] enhanced YOLOv3 by introducing a 4× downsampled residual block, which improved recall and detection accuracy for small targets. However, its stability in complex backgrounds needs further improvement. Lin et al. [19] reduced model weight using GhostNet and combined it with a coordinate attention mechanism and depth-wise separable convolutions to design a YOLOv4-based network. Although this significantly reduced the number of parameters, its localization ability in extremely low-contrast infrared images remains limited. Li et al. [20] proposed YOLO-FIR and YOLO-FIRI, which achieved multi-scale detection by extending the CSP module and improving the attention mechanism. They showed some improvement on the KAIST and FLIR datasets, but they still do not capture the edge features of infrared small targets sufficiently. Ciocarlan et al. [21] introduced an inverse decision criterion in YOLOv7-tiny to exploit the unexpected nature of small targets and improve accuracy. However, there is still a risk of missed detections in some complex scenarios. Cao et al. [22] proposed YOLO-TSL based on YOLOv8n. This method uses a triple attention mechanism and a Slim-Neck architecture for lightweight detection, yet there remains room for improvement in the fine recognition of extremely small targets. Betti et al. [23] developed the YOLO-S network, which performs well in terms of speed and accuracy on mobile and edge devices, but its overall architecture is still limited by the inherent drawbacks of bounding box-based methods. Hou et al. [24] proposed YOLO-B, which improved detection performance through enhanced feature extraction and bounding box regression. However, its fine-grained localization of infrared small targets is still insufficient. Jawaharlalnehru et al. [25] improved the YOLO algorithm for UAV aerial images, increasing accuracy and reducing false detection rates, but its performance in low signal-to-noise infrared images was limited. Liu et al. [26] introduced a deformable convolution module based on YOLOv8. Although this improved detection speed and accuracy, it did not fully address the challenges of detecting infrared small targets in complex backgrounds.
In summary, although bounding box-based detection methods have achieved initial success in some infrared target detection tasks, most are not specifically designed for infrared small targets. Under conditions of low contrast, noise interference, and complex backgrounds, these methods still suffer from missed detections and false positives. This calls for more targeted improvements.

2.2. Segmentation-Based Infrared Small Target Detection Methods

The development of segmentation-based methods for infrared small target detection has progressed through several key innovations addressing three fundamental challenges: multi-scale feature representation, computational efficiency, and noise robustness. Early work by Dai et al. established attention mechanisms through ACM [27] networks, demonstrating effective background suppression but revealing computational complexity limitations. Subsequent research introduced asymmetric local–global contexts in ALCNet [28] and dense nested interactions in DNANet [29], improving multi-scale adaptation while highlighting the need for extensive annotated training data.
The introduction of SIRST datasets marked a significant advancement, enabling data-driven approaches like ISTDU-Net [30] to implement novel feature map grouping and global perception fields within U-Net architectures. These methods enhanced target saliency through weighted feature allocation but maintained computational redundancy from multi-branch processing. More recent work has focused on optimizing the efficiency–accuracy balance, with RDIAN [31] achieving noise robustness through multi-receptive field extraction, and OSCAR [32] introducing cascaded refinement with adaptive pseudo-boxes for precise localization.
Despite these advancements, three critical limitations persist across current segmentation-based methods. First, the inherent conflict between model complexity and real-time processing requirements remains unresolved, particularly in dual-path architectures. Second, excessive reliance on synthetic training data continues to limit generalization to real-world operational scenarios. Third, performance degradation under extreme noise conditions persists due to insufficient noise-adaptive learning mechanisms.

2.3. Feature Fusion Network

In deep neural networks, features from different layers provide multi-scale information. Deep features contain high-level semantics and spatial details, which are crucial for the precise localization of small targets. In contrast, shallow features include edges, textures, and local information. By fusing these features, the network can integrate this information to enhance the representation of infrared images, improve detection accuracy, capture image complexity and diversity, and boost robustness against input variations while maintaining computational efficiency.
In recent years, many researchers have proposed efficient feature fusion methods for infrared small target detection. Tong et al. [33] used ASPPM and DAM for feature extraction and enhancement, respectively, and then fused them through concatenation. However, their fixed fusion strategy may not adapt well to dynamic changes in local details. Shi et al. [34] proposed CAFFNet, which captures both low-level and high-level features and employs a coordinate attention mechanism to reduce false positives, although there remains some risk of missed detections in complex backgrounds. Xu et al. [35] introduced MMRFF-Net, which utilizes a feature extraction module and grid resampling to achieve multi-scale fusion. Despite its effectiveness, the method has high computational complexity, and its real-time performance needs improvement.
Fang et al. [36] and Zhang et al. [37] used depth-wise separable residual blocks, multi-scale fusion, and dynamic weight convolution to enhance detection accuracy and reduce false positives. However, in practice, some loss of detail may affect overall performance. Zhang et al. [38] enhanced boundary attention by leveraging image curvature features and half-level fusion blocks, but their adaptive fusion capability for different scale features still requires improvement. Zuo et al. [39] designed AFFPN, which uses both top-down and bottom-up attention mechanisms for feature fusion. Although it improves detection performance to a certain extent, its ability to restore fine details of extremely small targets is still limited.
Liu et al. [40] developed an image-enhancement network that achieves gradual interaction between high-level and low-level features; however, its robustness under complex backgrounds is only modestly improved. Chen et al. [41] proposed MTUNet, which employs dilated convolution and adaptive global average pooling for multi-scale feature fusion, yet it faces some limitations in lightweight design. Zhao et al. [42] introduced TBCNet, which leverages a lightweight design and a joint loss function to enhance detection accuracy, but the overall fusion effect and adaptive capability still need further validation.
In summary, although multi-scale feature fusion and attention mechanisms have shown significant potential in enhancing infrared small target detection, current methods still exhibit clear deficiencies when handling low contrast, severe noise, and complex backgrounds. These challenges provide a direction for further research.

3. Method

3.1. Overall Architecture

The overall architecture is shown in Figure 2. In the backbone DPHFENet, the input infrared image $X \in \mathbb{R}^{H \times W \times 1}$ is first downsampled by two successive 3 × 3 convolutions with a stride of 2, producing feature maps S1 and S2 at $(H/2, W/2)$ and $(H/4, W/4)$ resolutions. S2 is then processed by a C2f module repeated three times to extract refined features. A subsequent convolutional layer with a stride of 2 further downsamples the output to an $(H/8, W/8)$ resolution feature map S3, which is enhanced by another C2f module, repeated six times. Then, a convolutional layer downsamples the output to an $(H/16, W/16)$ resolution feature map S4, which is processed by an HFEBlock repeated six times to yield enhanced features. This is followed by an additional convolutional layer with a stride of 2 that generates an $(H/32, W/32)$ resolution feature map S5, which is processed by another HFEBlock, repeated three times, and finalized with an SPPF module that outputs the backbone’s final feature representation.
In the Neck HAFF, a BGFS module first fuses the output P4 from the HFEBlock at $(H/16, W/16)$ resolution with the output from the $(H/32, W/32)$ branch. The result is processed by an IFC module, which outputs refined features that are later merged with outputs from a shallower backbone branch using another BGFS module. This fusion is followed by an RGEF module. Similarly, another BGFS module fuses features from an even shallower backbone branch, which are then refined by another RGEF module. The outputs of these sequential BGFS and RGEF operations continue in parallel branches until four distinct feature maps are generated, corresponding to the features originally processed from layers P2, P3, P4, and P5. Finally, these four feature maps are combined in the Detect Head to produce the final detection results. Notably, we add an additional detection head to the original three, specifically designed to process the shallow feature map at $(H/4, W/4)$ resolution, which contains rich information about small targets, thereby enhancing the model’s capability to detect small objects.
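For clarity, the backbone stage layout described above can be summarized as a simple configuration list (an illustrative sketch only; channel widths are listed in Table 1 and omitted here):

```python
# DPHFENet stage layout as described above (illustrative summary only;
# channel widths are listed in Table 1 and omitted here).
DPHFENET_STAGES = [
    # (module,              repeats, output resolution)
    ("Conv 3x3, stride 2",  1,       "H/2  x W/2"),   # -> S1
    ("Conv 3x3, stride 2",  1,       "H/4  x W/4"),   # -> S2
    ("C2f",                 3,       "H/4  x W/4"),
    ("Conv, stride 2",      1,       "H/8  x W/8"),   # -> S3
    ("C2f",                 6,       "H/8  x W/8"),
    ("Conv, stride 2",      1,       "H/16 x W/16"),  # -> S4
    ("HFEBlock",            6,       "H/16 x W/16"),
    ("Conv, stride 2",      1,       "H/32 x W/32"),  # -> S5
    ("HFEBlock",            3,       "H/32 x W/32"),
    ("SPPF",                1,       "H/32 x W/32"),
]
```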

3.2. Dual-Path Hybrid Feature Extractor Network

In infrared images, small targets often have low contrast and noise interference. This causes traditional CNNs to ignore important target details. Liu et al. [43] have demonstrated that in infrared small target detection, global contextual feature information is key. However, using only Transformers to extract global context information brings a high computational burden. To address this, we propose DPHFENet. DPHFENet is the backbone network in Figure 2. The key component of DPHFENet is HFEBlock, based on the Dual-Path Feature Enhancer (DPFE). It achieves accurate detection of small targets in infrared images while maintaining computational efficiency. The structure of HFEBlock is shown in Figure 3. DPHFENet utilizes the C2f and SPPF structures from YOLOv8, as shown in Figure 4.
First, let the input feature map of HFEBlock be denoted as $X_1 \in \mathbb{R}^{c_{1in} \times H_1 \times W_1}$ and the output feature map as $O_1 \in \mathbb{R}^{c_{1out} \times H_1 \times W_1}$. HFEBlock maps the feature map to a higher-dimensional feature space using an initial convolution layer, as shown in Equation (1).
$F = C(c_1, 2c;\, 1, 1)(X_1)$
Here, $C(\cdot)$ represents the convolution operation, which integrates batch normalization (BN) and an activation function to better capture the initial information of the image, as shown in Equation (2). The number of output channels is $c = 0.5\, c_{1out}$.
$C(x) = \sigma\big(\mathrm{BN}\big(\mathrm{Conv2D}(c_{in}, c_{out}, k, s)(x)\big)\big)$
Here, σ represents the SiLU activation function.
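As a minimal PyTorch sketch (the class name ConvBNSiLU is ours, not the paper's), the operator $C(\cdot)$ of Equation (2) corresponds to:

```python
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Sketch of the operator C(c_in, c_out; k, s) in Equation (2):
    Conv2D -> BatchNorm -> SiLU. The class name is illustrative."""
    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```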
Then, the feature map F is evenly split into two parts, as shown in Equation (3).
$\{F_1, F_2\} = \mathrm{Split}(F)$
Here, $F_1$ is reserved as the skip connection for later fusion, while $F_2$ is used for deep feature enhancement. Next, $F_2$ is fed into multiple DPFE modules. In each DPFE module, the input feature $\xi$ is split into a local path and a global path for parallel processing. In the Local Residual Learning Path (LRLP), let the local branch input be $\xi_{loc}$; the intermediate number of channels is first calculated, as shown in Equation (4).
$\tilde{c} = (c - \tau) \cdot e, \quad e = 0.5$
Subsequently, local features are extracted using a convolution operation and enhanced through residual connections, as shown in Equations (5) and (6).
$\zeta = C(c - \tau, \tilde{c};\, k_1, 1)(\xi_{loc})$
$\Phi_{loc} = \xi_{loc} + C(\tilde{c}, c - \tau;\, k_2, 1)(\zeta)$
Here, $k_1 = 3$ and $k_2 = 3$ represent the sizes of the convolution kernels, $\tau$ denotes the number of channels in the global path, and $\Phi_{loc}$ denotes the output of the local path. The value of $\tau$ is determined by the channel factor $s$. Kou et al. [14] found that as the number of network layers increases, background information related to small targets gradually disappears. Therefore, in the HFEBlock of the sixth layer of the backbone, $\tau = 0.25c$ and $s = 0.25$ were chosen, and in the HFEBlock of the eighth layer, $\tau = 0.5c$ and $s = 0.5$ were selected.
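A possible PyTorch realization of the local residual learning path in Equations (4)–(6) is sketched below (reusing the hypothetical ConvBNSiLU block defined above; the channel bookkeeping follows the text):

```python
import torch.nn as nn

class LocalResidualPath(nn.Module):
    """Sketch of the LRLP, Equations (4)-(6): a two-convolution bottleneck
    with a residual connection. Uses the ConvBNSiLU sketch defined above."""
    def __init__(self, c: int, tau: int, e: float = 0.5, k1: int = 3, k2: int = 3):
        super().__init__()
        c_loc = c - tau              # channels of the local branch input
        c_mid = int(c_loc * e)       # intermediate channels, Eq. (4)
        self.conv1 = ConvBNSiLU(c_loc, c_mid, k1, 1)
        self.conv2 = ConvBNSiLU(c_mid, c_loc, k2, 1)

    def forward(self, x_loc):
        return x_loc + self.conv2(self.conv1(x_loc))   # Eqs. (5)-(6)
```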
Correspondingly, in the Global Context-aware Path (GCP), let the input of the global branch be ξ glob . First, an Adaptive Gating Mechanism (AGM) is employed to dynamically adjust the global features. The calculation process is shown in Equations (7)–(11).
$\eta = C(\tau, 2\tilde{\tau};\, 1, 1)(\xi_{glob})$
$\{\eta^{(1)}, \eta^{(2)}\} = \mathrm{Split}(\eta)$
$\hat{\eta}^{(1)} = \mathrm{DWConv}(\eta^{(1)})$
$\gamma = \hat{\eta}^{(1)} \odot \eta^{(2)}$
$\mathrm{AGM}(\xi_{glob}) = \xi_{glob} + C(\tilde{\tau}, \tau;\, 1, 1)(\gamma)$
Here, $\odot$ denotes element-wise multiplication, and $\tilde{\tau}$ represents the intermediate number of channels. Next, a Self-Attention Interaction Mechanism (SAIM) is used to capture long-range dependencies, as shown in Equations (12)–(17).
$\mathrm{QKV} = C(\tau, H;\, 1, 1)(\xi_{glob})$
$\{Q, K, V\} = \mathrm{Reshape}(\mathrm{QKV})$
$A = \mathrm{softmax}\!\left(\dfrac{Q K}{\delta}\right)$
$\nu_{attn} = V A$
$\nu_{loc} = C(\tau, \tau;\, 3, 1)(V)$
$\mathrm{SAIM}(\xi_{glob}) = C(\tau, \tau;\, 1, 1)(\nu_{attn} + \nu_{loc})$
where H represents the expanded number of channels, and δ is the scaling factor. Finally, the global path is fused using Equation (18) to obtain the enhanced output.
$\Phi_{glob} = \xi_{glob} + D\!\left(C(\tau, \tau;\, 1, 1)\big(\mathrm{SAIM}(\xi_{glob}) + \mathrm{AGM}(\xi_{glob})\big)\right)$
where D denotes the DropPath regularization operation. After processing through both the local and global paths, the two sets of features are fused via channel-level concatenation, and then a convolution operation is applied to obtain the output of the current DPFE module, as shown in Equation (19).
$\mathrm{DPFE}(\xi) = C(c, c;\, 1, 1)\big(\mathrm{Concat}(\Phi_{loc}, \Phi_{glob})\big)$
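Within each DPFE module, the adaptive gating mechanism of Equations (7)–(11) can be sketched in PyTorch as follows (a simplified illustration reusing the hypothetical ConvBNSiLU block; the self-attention interaction of Equations (12)–(17) and the DropPath fusion of Equation (18) are omitted, and the depth-wise kernel size is our choice):

```python
import torch.nn as nn

class AdaptiveGate(nn.Module):
    """Sketch of the AGM, Equations (7)-(11): expand, split, depth-wise
    convolve one branch, gate the other by element-wise multiplication,
    and project back with a residual connection."""
    def __init__(self, tau: int, tau_mid: int):
        super().__init__()
        self.expand = ConvBNSiLU(tau, 2 * tau_mid, 1, 1)                    # Eq. (7)
        self.dwconv = nn.Conv2d(tau_mid, tau_mid, 3, 1, 1, groups=tau_mid)  # Eq. (9)
        self.project = ConvBNSiLU(tau_mid, tau, 1, 1)                       # Eq. (11)

    def forward(self, x_glob):
        eta1, eta2 = self.expand(x_glob).chunk(2, dim=1)   # Eq. (8)
        gamma = self.dwconv(eta1) * eta2                   # Eq. (10)
        return x_glob + self.project(gamma)                # Eq. (11)
```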
After processing through $n$ consecutive DPFE modules, multiple enhanced features $V_1, \ldots, V_n$ are generated. Finally, the initial branches $F_1$ and $F_2$ are concatenated with the features output by each DPFE module, yielding Equation (20).
$Y = \mathrm{Concat}(F_1, F_2, V_1, \ldots, V_n)$
The number of channels in the concatenated feature map is $(2 + n) \cdot c$. Subsequently, a convolution layer maps the concatenated result to the final output channel number, as shown in Equation (21).
$O_1 = \mathrm{HFEBlock}(X_1) = C\big((2 + n)c, c_{1out};\, 1, 1\big)(Y)$
The final output feature map $O_1 \in \mathbb{R}^{c_{1out} \times H_1 \times W_1}$ not only contains fine-grained local information but also preserves global semantics, providing high-quality feature support for subsequent infrared small target detection tasks.
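Putting the pieces together, the split–enhance–concatenate skeleton of HFEBlock (Equations (1), (3), (20), and (21)) can be sketched as follows; the DPFE internals are abstracted behind a placeholder factory in this illustration:

```python
import torch
import torch.nn as nn

class HFEBlockSketch(nn.Module):
    """Illustrative skeleton of HFEBlock: 1x1 expansion, even split, n stacked
    DPFE modules on one half, and a final fusion convolution (Eqs. 1, 3, 20, 21).
    The DPFE factory below is a placeholder for the real dual-path module."""
    def __init__(self, c_in: int, c_out: int, n: int = 3, dpfe_factory=None):
        super().__init__()
        c = c_out // 2                                     # c = 0.5 * c_out
        dpfe_factory = dpfe_factory or (lambda ch: ConvBNSiLU(ch, ch, 3, 1))
        self.stem = ConvBNSiLU(c_in, 2 * c, 1, 1)          # Eq. (1)
        self.dpfes = nn.ModuleList(dpfe_factory(c) for _ in range(n))
        self.fuse = ConvBNSiLU((2 + n) * c, c_out, 1, 1)   # Eq. (21)

    def forward(self, x):
        f1, f2 = self.stem(x).chunk(2, dim=1)              # Eq. (3)
        feats, v = [f1, f2], f2
        for dpfe in self.dpfes:
            v = dpfe(v)
            feats.append(v)                                # V_1 ... V_n
        return self.fuse(torch.cat(feats, dim=1))          # Eqs. (20)-(21)
```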
Detailed information about the backbone is shown in Table 1.
In Table 1, Number indicates the number of module repetitions, and Conv refers to a convolution operation consisting of a standard 2D convolution, a BN layer, and an activation function.

3.3. Hierarchical Adaptive Fusion Framework

Infrared small targets in nighttime scenes are vulnerable to feature degradation during multi-scale fusion. This issue arises from two main factors. First, redundant features from different scales are aggregated indiscriminately, which amplifies background noise. Second, fixed fusion strategies cannot adaptively preserve critical shallow details or suppress directional interference. To address these issues, we propose the HAFF, which consists of three coordinated modules. The first module, BGFS, dynamically filters cross-scale features using learnable gating weights to reduce redundancy. The second module, RGEF, employs recursive graph convolutions to model directional noise patterns in high-resolution feature maps. The third module, IFC, enhances mid-level semantic contrast through spatial-channel co-attention mechanisms. Together, these modules progressively refine target features across hierarchical scales.

3.3.1. Bidirectional Gated Feature Symbiosis Module

To reduce feature redundancy and inconsistency during the fusion of mixed features for infrared small targets, enhance the information exchange between adjacent scale features, and achieve bidirectional fusion of deep and shallow feature maps, we propose the BGFS module. The structure of BGFS is shown in Figure 5.
Let $H \in \mathbb{R}^{c_H \times h_H \times w_H}$ denote the high-resolution input feature and $L \in \mathbb{R}^{c_L \times h_L \times w_L}$ denote the low-resolution input feature. Here, $c_H$ and $c_L$ represent the numbers of channels of the high-resolution and low-resolution features, respectively, and $(h_H, w_H)$ and $(h_L, w_L)$ denote their spatial dimensions, with $h_H = 2 h_L$ and $w_H = 2 w_L$. First, standard convolutions project them to a common intermediate channel number $c_{mid} = c_{out}/2$, as shown in Equation (22).
$H = \mathrm{Conv2D}(c_{in}, c_{mid}, 1, 1)(H), \quad L = \mathrm{Conv2D}(c_{in}, c_{mid}, 1, 1)(L)$
Next, apply a Sigmoid activation function to the projected features separately to generate dynamic gating weights, as shown in Equation (23).
$G_H = \sigma(H), \quad G_L = \sigma(L)$
Here, σ denotes the Sigmoid activation function.
Based on this, a bidirectional residual mechanism is used to achieve complementary fusion of high-resolution and low-resolution features. Specifically, the enhanced representation of the low-resolution feature, $L_{enh}$, is given in Equation (24).
$L_{enh} = L + L \odot G_L + (1 - G_L) \odot I(G_H \odot H)$
The enhanced representation of the high-resolution feature, $H_{enh}$, is given in Equation (25).
$H_{enh} = H + H \odot G_H + (1 - G_H) \odot I(G_L \odot L)$
Here, I ( · ) denotes the bilinear interpolation upsampling operation with a scale factor of 2, which is used to match the spatial resolution.
To achieve final feature aggregation, the upsampled high-resolution features are first adjusted to the same spatial scale as the low-resolution features. Then, the two are concatenated along the channel dimension, as shown in Equation (26).
$X_{cat} = \mathrm{Concat}\big(I(H_{enh}), L_{enh}\big)$
Finally, a convolution operation is applied to map the concatenated fused features to the output channel number c out , yielding the final output features, as shown in Equation (27).
$O_f = C(c_{out}, c_{out};\, 3, 1)(X_{cat})$
The BGFS module not only achieves adaptive fusion of features at different scales but also employs a bidirectional gating mechanism to balance high-level abstract information with low-level detail features. This design ensures stability during gradient backpropagation and enhances the overall robustness and discriminative ability of the feature representation.
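A minimal sketch of the bidirectional gated fusion in Equations (22)–(27) is shown below (module and argument names are ours, ConvBNSiLU is the hypothetical block from Section 3.2, and spatial resampling here simply matches shapes via bilinear interpolation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BGFSSketch(nn.Module):
    """Sketch of BGFS (Eqs. 22-27): 1x1 projections, sigmoid gates, a
    bidirectional gated exchange, then concatenation and a 3x3 fusion conv."""
    def __init__(self, c_h: int, c_l: int, c_out: int):
        super().__init__()
        c_mid = c_out // 2
        self.proj_h = nn.Conv2d(c_h, c_mid, 1)
        self.proj_l = nn.Conv2d(c_l, c_mid, 1)                  # Eq. (22)
        self.fuse = ConvBNSiLU(2 * c_mid, c_out, 3, 1)          # Eq. (27)

    def forward(self, high, low):
        h, l = self.proj_h(high), self.proj_l(low)              # Eq. (22)
        g_h, g_l = torch.sigmoid(h), torch.sigmoid(l)           # Eq. (23)
        to_l = lambda x: F.interpolate(x, size=l.shape[-2:], mode="bilinear")
        to_h = lambda x: F.interpolate(x, size=h.shape[-2:], mode="bilinear")
        l_enh = l + l * g_l + (1 - g_l) * to_l(g_h * h)         # Eq. (24)
        h_enh = h + h * g_h + (1 - g_h) * to_h(g_l * l)         # Eq. (25)
        x_cat = torch.cat([to_l(h_enh), l_enh], dim=1)          # Eq. (26)
        return self.fuse(x_cat)                                 # Eq. (27)
```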

3.3.2. Recurrent Graph-Enhanced Fusion Module

To address the challenge of the foreground–background feature imbalance in infrared small target detection, we propose a new lightweight hybrid feature extraction module called the RGEF module. The structure of the RGEF module is shown in Figure 6.
Given the input feature $X_2 \in \mathbb{R}^{c_{2in} \times H_2 \times W_2}$, the RGEF module first maps $X_2$ to twice the hidden channels via a standard convolution operation, as defined in Equation (28). The output feature map of the RGEF module is $O_2 \in \mathbb{R}^{c_{2out} \times H_2 \times W_2}$:
$z = C(c_{2in}, 2c;\, 1, 1)(X_2)$
where $c = 0.5\, c_{2out}$.
Next, the module evenly splits $z$ along the channel dimension into two parts, denoted as $z^{(1)}$ and $z^{(2)}$. Here, $z^{(1)}$ retains the original fine-grained information, while $z^{(2)}$ proceeds into the subsequent recursive feature extraction process. For $z^{(2)}$, an intermediate feature $\tilde{z}^{(2)}$ is obtained via the re-parameterization convolution operation, as shown in Equation (29). The structure of the re-parameterization convolution is illustrated in Figure 6.
$\tilde{z}^{(2)} = \mathrm{Repconv}(z^{(2)})$
Here, the output channel number of Repconv is m.
Subsequently, $\tilde{z}^{(2)}$ is processed through $n - 1$ consecutive recursive convolution layers, with the output of each layer obtained as specified by Equation (30).
$z_i = C(m, m;\, 3, 1)(z_{i-1}), \quad i = 3, 4, \ldots, n+1$
Here, the initial output is $z_2 = \tilde{z}^{(2)}$. After completing these recursive convolution layers, the module further processes the final output with a 1 × 1 convolution, as shown in Equation (31).
$z_{n+2} = C(m, m;\, 1, 1)(z_{n+1})$
Subsequently, RGEF concatenates the initially preserved information $z^{(1)}$ with all the features obtained through recursive processing, $\tilde{z}^{(2)}, z_3, \ldots, z_{n+2}$, along the channel dimension to form the concatenated feature $Z$, as shown in Equation (32). The number of feature channels in $Z$ is $c + m \cdot (n + 1)$.
$Z = \mathrm{Concat}\big(z^{(1)}, \tilde{z}^{(2)}, z_3, \ldots, z_{n+2}\big)$
Finally, a mapping convolution maps the fused features to the target channel number $c_{2out}$, as shown in Equation (33), thereby providing an enhanced, multi-scale information representation for subsequent detection modules.
$O_2 = C\big(c + m \cdot (n + 1), c_{2out};\, 1, 1\big)(Z)$
RGEF achieves efficient integration of fine-grained spatial information through initial feature transformation, recursive extraction, and multi-branch fusion, significantly enhancing the robustness of small target detection in complex backgrounds.
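The recursive split–extract–concatenate pattern of Equations (28)–(33) can be sketched as follows, reusing the hypothetical ConvBNSiLU block from Section 3.2 and approximating RepConv by a plain 3 × 3 convolution (re-parameterization details omitted):

```python
import torch
import torch.nn as nn

class RGEFSketch(nn.Module):
    """Sketch of RGEF (Eqs. 28-33): expand and split, one RepConv (stood in
    for here by a plain 3x3 conv), n-1 recursive 3x3 convs, a 1x1 conv, then
    concatenate all intermediate features and project to the output channels."""
    def __init__(self, c_in: int, c_out: int, n: int = 3, m: int = None):
        super().__init__()
        c = c_out // 2                                         # c = 0.5 * c_out
        m = m or c
        self.stem = ConvBNSiLU(c_in, 2 * c, 1, 1)              # Eq. (28)
        self.repconv = ConvBNSiLU(c, m, 3, 1)                  # Eq. (29), stand-in
        self.recursive = nn.ModuleList(
            ConvBNSiLU(m, m, 3, 1) for _ in range(n - 1))      # Eq. (30)
        self.mid = ConvBNSiLU(m, m, 1, 1)                      # Eq. (31)
        self.out = ConvBNSiLU(c + m * (n + 1), c_out, 1, 1)    # Eq. (33)

    def forward(self, x):
        z1, z2 = self.stem(x).chunk(2, dim=1)
        z = self.repconv(z2)
        feats = [z1, z]
        for conv in self.recursive:
            z = conv(z)
            feats.append(z)                                    # z_3 ... z_{n+1}
        feats.append(self.mid(z))                              # z_{n+2}
        return self.out(torch.cat(feats, dim=1))               # Eqs. (32)-(33)
```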

3.3.3. Interlink Fusion Core Module

The P4 feature map loses some low-level fine-grained directional information. As a result, the RGEF module is less sensitive in capturing directional background noise, which affects the separation between small targets and the background. Inspired by Ma et al. [44], we utilize channel-spatial collaborative attention on intermediate-level feature maps. This approach efficiently fuses global and local information, preserving small target details while enhancing the contrast between targets and the background. We designed the IFC module to address these issues. To balance performance and computational cost, we only replace the RGEF module at the P4 level with this new module. The IFC module consists of two key components: Interlink Fusion Block (IFBlock) and Bifocal Fusion Attention (BFA), as shown in Figure 7.
Assume the input feature map is $X_3 \in \mathbb{R}^{c_{3in} \times H_3 \times W_3}$ and the output feature map is $O_3 \in \mathbb{R}^{c_{3out} \times H_3 \times W_3}$. First, $X_3$ is split into two parts by an initial convolution. Specifically, the convolution operation $C(c_{3in}, 2c;\, 1, 1)$ maps the input to $2c$ channels, and the output is evenly split along the channel dimension into $X_{31}$ and $X_{32}$, each with $c$ channels, where $c = 0.5\, c_{3out}$.
The branch X 31 is further processed by IFBlock. In each IFBlock, the input X 31 is first processed by a depth-wise separable convolution (DWConv) with a kernel size of 7 and the number of groups equal to the channel number c, as shown in Equation (34).
$X_{dw} = \mathrm{DWConv}(X_{31};\, k = 7, g = c)$
Subsequently, two parallel convolution operations $C(c, 3c;\, 1, 1)$ are applied to $X_{dw}$ to obtain two feature branches, $X_{dw1}$ and $X_{dw2}$. After applying the ReLU6 activation, these two branches are fused by element-wise multiplication, as shown in Equation (35).
$X_{int} = \mathrm{ReLU6}(X_{dw1}) \odot X_{dw2}$
On this basis, the BFA module processes the features. First, a 7 × 7 average pooling is applied to $X_{int}$. Then, the pooled features are passed through the convolution $C(3c, 3c;\, 1, 1)$ followed by sequential horizontal and vertical convolution operations to obtain the channel attention factor $A_{ch}$, as shown in Equations (36) and (37).
$x = C(3c, 3c;\, 1, 1)\big(\mathrm{AvgPool2d}(7, 1, 3)(X_{int})\big)$
$A_{ch} = \sigma\Big(C(3c, 3c;\, 1, 1)\big(\mathrm{Conv2D}(\mathrm{Conv2D}(x))\big)\Big)$
Meanwhile, the spatial attention factor $A_{sp}$ is obtained via adaptive average pooling to capture global spatial information, computed using a 1 × 1 convolution $\mathrm{Conv2D}(3c, 1;\, 1, 1)$ and a Sigmoid activation function, as shown in Equation (38).
$A_{sp} = \sigma\big(\mathrm{Conv2D}(3c, 1;\, 1, 1)(\Gamma(X_{int}))\big)$
Here, $\Gamma(\cdot)$ denotes the adaptive average pooling operation. These two attention factors are applied to weight $X_{int}$ along the channel and spatial dimensions, respectively, and then added together to form the final attention features, as shown in Equation (39).
$X_{att} = (A_{sp} \odot X_{int}) + (A_{ch} \odot X_{int})$
The receptive field of IFC under different parameters is shown in Figure 8. It can be observed that when k = 5 , the distribution of high-contribution pixels in the IFC module is more uniform and the effective receptive field is larger. Therefore, we chose k = 5 .
The final output $X_{out}$ of the IFBlock is obtained from Equations (40) and (41).
$\tilde{X}_{31} = \mathrm{DWConv}\big(\mathrm{BN}\big(\mathrm{Conv2D}(1, 1)(X_{att})\big)\big)$
$X_{out} = X_{31} + D(\tilde{X}_{31})$
After all IFBlocks are completed, the output features are concatenated with the other branch, X 32 .
Then, these concatenated features are processed by $C(3c, c_{3out};\, 1, 1)$ to obtain the final output feature $O_3$, as shown in Equation (42).
$O_3 = C\big(3c, c_{3out};\, 1, 1\big)\big(\mathrm{Concat}(X_{31}, X_{32}, X_{out})\big)$
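A condensed PyTorch sketch of one IFBlock with its BFA attention is given below (a sketch under assumptions: the strip-convolution kernel size, the exact channel mapping of the output projection, and the omission of DropPath are our illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IFBlockSketch(nn.Module):
    """Sketch of one IFBlock with Bifocal Fusion Attention (Eqs. 34-41)."""
    def __init__(self, c: int, k: int = 5):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 7, 1, 3, groups=c)                   # Eq. (34)
        self.branch1 = nn.Conv2d(c, 3 * c, 1)
        self.branch2 = nn.Conv2d(c, 3 * c, 1)                          # Eq. (35)
        # Channel attention path of BFA, Eqs. (36)-(37)
        self.pool = nn.AvgPool2d(7, 1, 3)
        self.ch_pre = nn.Conv2d(3 * c, 3 * c, 1)
        self.ch_h = nn.Conv2d(3 * c, 3 * c, (1, k), 1, (0, k // 2))
        self.ch_v = nn.Conv2d(3 * c, 3 * c, (k, 1), 1, (k // 2, 0))
        self.ch_post = nn.Conv2d(3 * c, 3 * c, 1)
        # Spatial attention path of BFA, Eq. (38)
        self.sp = nn.Conv2d(3 * c, 1, 1)
        # Output projection and residual, Eqs. (40)-(41)
        self.proj = nn.Sequential(nn.Conv2d(3 * c, c, 1), nn.BatchNorm2d(c))
        self.dw_out = nn.Conv2d(c, c, 3, 1, 1, groups=c)

    def forward(self, x):
        x_dw = self.dw(x)
        x_int = F.relu6(self.branch1(x_dw)) * self.branch2(x_dw)       # Eq. (35)
        a_ch = torch.sigmoid(self.ch_post(
            self.ch_v(self.ch_h(self.ch_pre(self.pool(x_int))))))      # Eqs. (36)-(37)
        a_sp = torch.sigmoid(self.sp(F.adaptive_avg_pool2d(x_int, 1)))  # Eq. (38)
        x_att = a_sp * x_int + a_ch * x_int                            # Eq. (39)
        return x + self.dw_out(self.proj(x_att))                       # Eqs. (40)-(41)
```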

3.4. Loss Function

The complete intersection over union (CIoU) loss is used as the loss function:
$CIoU = 1 - IoU + \dfrac{\rho_2^2\big(p^{P}, p^{gt}\big)}{l^2} + \varphi M$
$M = \dfrac{4}{\pi^2}\left(\arctan\dfrac{w^{gt}}{h^{gt}} - \arctan\dfrac{w^{p}}{h^{p}}\right)^{2}$
$\varphi = \begin{cases} 0, & \text{if } IoU < 0.5 \\ \dfrac{M}{(1 - IoU) + M}, & \text{if } IoU \geq 0.5 \end{cases}$
$IoU = \dfrac{\left|U^{gt} \cap U^{p}\right|}{\left|U^{gt} \cup U^{p}\right|}$
where $p^{P} = [x^{P}, y^{P}]^{T}$ and $p^{gt} = [x^{gt}, y^{gt}]^{T}$ are the center points of the predicted and ground truth boxes, respectively; $l$ is the diagonal length of the smallest rectangle that encloses both the predicted box and the ground truth box; $\rho_2(\cdot, \cdot)$ denotes the Euclidean distance; $w$ and $h$ are the width and height of a box; and the intersection over union ($IoU$) denotes the degree of overlap between a predicted box $U^{p}$ and a ground truth box $U^{gt}$.
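For illustration, the loss above can be computed for axis-aligned boxes given in (x1, y1, x2, y2) form as follows (a minimal sketch, not the training implementation; the small eps terms are ours for numerical stability):

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Sketch of the CIoU loss above for boxes given as (x1, y1, x2, y2)."""
    # IoU between the predicted and ground truth boxes
    ix1, iy1 = torch.max(pred[..., 0], gt[..., 0]), torch.max(pred[..., 1], gt[..., 1])
    ix2, iy2 = torch.min(pred[..., 2], gt[..., 2]), torch.min(pred[..., 3], gt[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    iou = inter / (area_p + area_g - inter + eps)
    # Squared center distance normalized by the squared diagonal l^2 of the
    # smallest rectangle enclosing both boxes
    cpx, cpy = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cgx, cgy = (gt[..., 0] + gt[..., 2]) / 2, (gt[..., 1] + gt[..., 3]) / 2
    ex1, ey1 = torch.min(pred[..., 0], gt[..., 0]), torch.min(pred[..., 1], gt[..., 1])
    ex2, ey2 = torch.max(pred[..., 2], gt[..., 2]), torch.max(pred[..., 3], gt[..., 3])
    l2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps
    center_term = ((cpx - cgx) ** 2 + (cpy - cgy) ** 2) / l2
    # Aspect-ratio consistency term M and its weight phi
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wg, hg = gt[..., 2] - gt[..., 0], gt[..., 3] - gt[..., 1]
    m = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps)) - torch.atan(wp / (hp + eps))) ** 2
    phi = torch.where(iou >= 0.5, m / ((1 - iou) + m + eps), torch.zeros_like(iou))
    return 1 - iou + center_term + phi * m
```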

4. Experiments and Analysis

In this section, we verify the effectiveness and reliability of the proposed method through a series of carefully designed experiments. First, we describe the experimental settings, covering the datasets, the evaluation metrics, the implementation details, and the comparison methods. Next, ablation studies are conducted on each network module to verify its effectiveness and necessity. Finally, we compare the proposed method with current mainstream infrared small target detection methods through quantitative and qualitative analyses, demonstrating its advantages in detection accuracy and robustness. Together, these experiments provide comprehensive evidence of the proposed method's performance and practical value for infrared small target detection.

4.1. Experimental Setup

4.1.1. Dataset and Training Settings

In this study, we evaluate the proposed detector on three open datasets: SIRSTv2 [32], IRSTD-1k [45], and NUDT-SIRST [29].
The SIRSTv2 dataset, proposed by Dai Yimian from Nanjing University of Aeronautics and Astronautics, is a publicly available, single-frame infrared small object detection dataset. This dataset contains 1024 typical infrared images, most of which have a resolution of 1280 × 1024, making it one of the highest-resolution datasets available for infrared small target detection. The images in SIRSTv2 are extracted from real-world video sequences, capturing various complex scenarios. These include urban environments where background interference, such as cranes and streetlights, resembles the appearance of targets, making detection more challenging.
The dataset reflects the difficulties of infrared small object detection in realistic settings, where most targets are dim and difficult to distinguish from cluttered backgrounds due to low contrast and high noise. Unlike traditional approaches that rely on target saliency or low-rank sparse decomposition, SIRSTv2 requires a more advanced method that can leverage high-level semantic understanding of the entire image to effectively distinguish targets from non-target interference. The distribution of objects in the images is random, and many of the targets are not easily separable from the background, requiring robust contextual information for accurate detection.
The IRSTD-1k dataset, proposed by Zhang Mingjin and colleagues from Xidian University, is a publicly available benchmark dataset for infrared small object detection. The dataset covers a broader range of extreme conditions and challenging environments by integrating multi-source infrared imaging equipment and dynamic acquisitions from real-world scenarios.
IRSTD-1k contains 1000 infrared images with a resolution of 512 × 512, annotated with 1520 infrared small objects. The data are sourced from both static and dynamic acquisitions across four major environmental categories: sky backgrounds (static cloud layers; dynamic flying birds), urban backgrounds (building thermal radiation; vehicle exhaust), natural backgrounds (forests; deserts), and extreme weather conditions (haze; rain; snow). The imaging spectrum spans short-wave, mid-wave, and long-wave infrared bands to simulate the imaging characteristics of different sensors. To enhance data diversity, synthetic noise (Gaussian noise and Poisson noise) is added to a subset of images, reducing the SNR. Approximately 40% of the images exhibit an SNR below 2 dB, where background noise nearly overwhelms object signals. The dataset highlights multi-scale object characteristics: approximately 30% of objects are smaller than 3 × 3 pixels (occupying less than 0.01% of the 512 × 512 image area), with some objects being single-pixel points (pixel coverage: 0.02%). The minimum object size is 1 × 1 pixels, while the maximum reaches 15 × 15 pixels. Regarding intensity distribution, only 20% of the objects are the brightest regions globally, whereas 80% exhibit grayscale values similar to or lower than the background. The grayscale difference between objects and backgrounds is often below 10 in haze and desert scenarios.
The NUDT-SIRST dataset, developed by researchers from the National University of Defense Technology, is a publicly available synthesized single-frame infrared small target dataset. Unlike traditional datasets built from object motion sequences, NUDT-SIRST was generated using simulation techniques that ensure highly accurate pixel-level annotations while offering extensive control over target characteristics and background complexity. The dataset comprises 1327 infrared images of 256 × 256 resolution, encompassing a diverse range of target categories and sizes embedded in richly cluttered and noisy backgrounds. Similar to real infrared scenarios, the images contain significant noise and low SNR, with the small targets often occupying less than 0.02% of the total image area—that is, in a standard 256 × 256 image, the target pixels typically cover an area smaller than 4 × 4 pixels. Furthermore, only about 35% of the objects appear as the brightest regions in the image, while the remaining 65% exhibit grayscale values similar to or even lower than their surroundings.
This characteristic renders simple saliency-based approaches ineffective and necessitates the use of both global and local contextual information for reliable detection.
The experimental hardware configuration is shown in Table 2.
The proposed method was implemented using PyTorch 1.13.1. The training parameters were set as follows: the image size was uniformly resized to 640 × 640. The YOLO-series networks were trained with the SGD optimizer for its computational efficiency and fast convergence, while the RT-DETR series used the AdamW optimizer. The momentum was set to 0.937, the initial learning rate to 0.01, the decay factor to 0.0001, and the number of training epochs to 500. A learning rate warmup strategy was used to improve training stability and reduce the risk of oscillations and gradient explosion at the beginning of training; the warmup period was set to 3 epochs, the warmup momentum to 0.8, and the warmup bias learning rate to 0.1.
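For reference, these hyperparameters can be collected into a plain configuration dictionary (key names are ours; values follow the text):

```python
# Training hyperparameters described above, collected as a plain dictionary
# (key names are illustrative; values follow the text).
TRAIN_HYPERPARAMS = {
    "img_size": 640,
    "optimizer": "SGD",        # AdamW for the RT-DETR series
    "momentum": 0.937,
    "lr0": 0.01,               # initial learning rate
    "weight_decay": 0.0001,
    "epochs": 500,
    "warmup_epochs": 3,
    "warmup_momentum": 0.8,
    "warmup_bias_lr": 0.1,
}
```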
The bounding box-based object detection method TOOD uses the SGD optimizer with an initial learning rate of 0.0001, a final learning rate of 0.0001, a decay factor of 0.0001, and a momentum of 0.9. Sparse R-CNN uses the AdamW optimizer with an initial learning rate of 0.000025, a final learning rate of 0.0001, and a decay factor of 0.0001. Mask R-CNN uses the SGD optimizer with an initial learning rate of 0.0002, a final learning rate of 0.0001, a decay factor of 0.0001, and a momentum of 0.9. DINO uses the AdamW optimizer with an initial learning rate of 0.0001, a final learning rate of 0.0001, and a decay factor of 0.0001. All four methods use the MultiStepLR learning rate scheduler.
The segmentation-based detection methods ACM, ALCNet, DNANet, ISTDU-Net, RDIAN, and OSCAR use the default parameters from the original code. ACM, ALCNet, ISTDU-Net, and RDIAN use the Adam optimizer with a learning rate of 0.0005, and the learning rate scheduler is MultiStepLR for all of them. DNANet uses the Adagrad optimizer with a learning rate of 0.05 and a learning rate scheduler of CosineAnnealingLR for training. OSCAR uses the AdamW optimizer with a learning rate of 0.002 and a decay factor of 0.05.

4.1.2. Evaluation Metrics

The SIRSTv2, IRSTD-1k, and NUDT-SIRST datasets contain only one infrared small target category. Therefore, we used AP50 as the indicator of the accuracy of the network model, the number of parameters as an indicator of model complexity, and F1 as an indicator balancing recall (R) and precision (P) to evaluate the overall performance of the model. In addition, precision–recall (PR) curves were used to compare the performance of different detection methods. AP indicates the average precision, and AP50 indicates the average precision at an IoU threshold of 0.5; the AP value is obtained from the area under the PR curve. Model computation is reported in GFLOPs and the number of parameters in M. In general, the lower the computation and the fewer the parameters, the higher the FPS of the detection network, although this is not absolute. The F1-score is an important performance evaluation tool, especially in military early warning, where equal emphasis is placed on P and R.
We also used the target-level evaluation indicators, probability of detection ($P_d$) and false alarm rate ($F_a$), to comprehensively evaluate the detector's performance from multiple aspects. Formally, the probability of detection is obtained using Equation (47):
$P_d = \dfrac{L_{tp}}{L_{all}}$
where $L_{tp}$ is the number of correctly predicted targets, and $L_{all}$ is the number of all targets.
The false alarm rate is obtained using Equation (48):
$F_a = \dfrac{P_f}{P_{all}}$
where $P_f$ is the number of falsely predicted targets, and $P_{all}$ is the total number of predicted targets.
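As a small sketch, the two target-level metrics can be computed directly from counts (function and variable names are illustrative):

```python
def detection_probability(num_true_detections: int, num_targets: int) -> float:
    """P_d = L_tp / L_all: fraction of ground-truth targets correctly detected."""
    return num_true_detections / num_targets if num_targets else 0.0

def false_alarm_rate(num_false_predictions: int, num_predictions: int) -> float:
    """F_a = P_f / P_all: fraction of predicted targets that are false alarms."""
    return num_false_predictions / num_predictions if num_predictions else 0.0
```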

4.2. Ablation Study

IR-ADMDet contains four key modules: the backbone HFEBlock module, the Neck feature fusion BGFS module, the Neck lightweight fusion feature extraction RGEF module, and the Neck fusion feature extraction IFC module. Ablation studies of the modules were conducted on the SIRSTv2 dataset to verify the effectiveness of the proposed method. The effectiveness of the combined modules of BGFS and RGEF, as well as the HFEBlock and IFC modules, was analyzed. YOLOv8s was used as the baseline model. The experimental results are shown in Table 3. The method with the best performance is highlighted in bold.
  • Group 1: YOLOv8s (Baseline).
  • Group 2: YOLOv8s+P2.
  • Group 3: YOLOv8s+P2+HFEBlock.
  • Group 4: YOLOv8s+P2+HFEBlock+BGFS+RGEF.
  • Group 5: YOLOv8s+P2+HFEBlock+BGFS+RGEF+IFC (Ours).
Table 3. Test results for different modules. ✓ and × indicate the presence or absence of the corresponding module.
Group | P2 | HFEBlock | BGFS+RGEF | IFC | F1 | AP50 | Parameters/M
1 (Baseline) | × | × | × | × | 0.851 | 0.881 | 11.125971
2 | ✓ | × | × | × | 0.852 | 0.895 | 6.947563
3 | ✓ | ✓ | × | × | 0.894 | 0.937 | 6.273452
4 | ✓ | ✓ | ✓ | × | 0.912 | 0.936 | 5.462284
5 (Ours) | ✓ | ✓ | ✓ | ✓ | 0.95 | 0.96 | 5.767821
The ablation study reveals progressive performance improvements with each added module. The baseline YOLOv8s achieves 0.851 F1 and 0.881 AP50 with 11.13M parameters. The addition of the P2 layer, combined with channel pruning, reduced parameters by 37.6% to 6.95 M while increasing AP50 to 0.895, demonstrating effective preservation of shallow features.
Incorporating HFEBlock brings the most significant boost, improving F1 to 0.894 and AP50 to 0.937 with 6.27M parameters. This confirms the effectiveness of anisotropic convolutions for weak target enhancement. The dual-path design successfully combines local detail extraction with global context modeling.
The BGFS and RGEF modules, together, achieve further optimization. They maintain high accuracy at 0.912 F1 and 0.936 AP50 while reducing parameters to 5.46 M. The dynamic gating mechanism enables efficient multi-scale fusion without substantial performance loss.
The complete model with the IFC module reaches a peak performance of 0.95 F1 and 0.96 AP50 at 5.77 M parameters. The attention-based feature recalibration proves particularly effective for handling complex multi-target scenarios. The progressive architecture demonstrates consistent improvements across all evaluation metrics.
These results validate the complementary nature of the proposed modules. HFEBlock provides fundamental feature enhancement, and the BGFS and RGEF modules enable efficient multi-scale processing. The IFC module offers final refinement for challenging cases. Together, they form a robust solution for infrared small target detection that balances accuracy and efficiency.
To provide a comprehensive analysis of the parameter k in the IFC module, we conducted experiments on the SIRSTv2 dataset to examine how different receptive field sizes affect detection performance. Table 4 presents the experimental results with varying k values.
As shown in Table 4, the value of parameter k significantly impacts detection performance. When k = 5 , the model achieves the highest F1-score (0.950) and AP50 (0.960), indicating this is the optimal parameter choice for our architecture. The model with k = 7 also performs well, with an F1-score of 0.943 and an AP50 of 0.957, close to the optimal values.
We observe that when k is too small ( k = 3 ), although the precision is high (0.959), the recall is relatively low (0.886). This suggests that a small receptive field is insufficient to capture adequate contextual information for detecting infrared small targets. Conversely, when k exceeds 7 ( k 9 ), performance begins to decline, particularly at k = 13 , where AP50 drops to 0.941. This performance degradation likely occurs because excessively large receptive fields introduce irrelevant background interference.
These experimental results demonstrate that the receptive field size of the IFC module significantly influences detection performance. The value k = 5 provides the best balance between precision and recall in our experimental setup, validating that an appropriately sized receptive field better captures the features of infrared small targets while avoiding excessive background interference. Therefore, we selected k = 5 as the default parameter in our proposed method.

4.3. Comparative Experiment

Compared with model-based methods (such as WSLCM, NRAM, and RIPT), deep learning methods show significant advantages in robustness and accuracy, a conclusion that has been fully verified in [21]. Therefore, this study did not use model-based methods for comparison; instead, we selected several typical, publicly available, advanced deep learning methods, including the YOLOv6s to YOLOv11s series, TOOD, Sparse R-CNN, Mask R-CNN, DINO, and RT-DETR.

4.3.1. Quantitative Analysis

The quantitative analysis of the experimental results is shown in Table 5. Table 5 highlights the best method in bold, and the second-best method is marked with bold italics.
The experimental results show that IR-ADMDet has better detection capabilities than the other methods on the SIRSTv2 and IRSTD-1k datasets. On the SIRSTv2 dataset, IR-ADMDet achieved F1 and AP50 metrics of 95% and 96%, respectively. On the IRSTD-1k dataset, IR-ADMDet achieved F1 and AP50 metrics of 82.7% and 85.2%, respectively. On the NUDT-SIRST dataset, the F1-score of IR-ADMDet is 2.4% lower than that of RT-DETR, and its AP50 is 0.2% lower than that of RT-DETR. However, IR-ADMDet has 71.4% fewer parameters than RT-DETR, making it considerably more lightweight. In addition, the proposed method performs well on limited-sample learning tasks.
On the larger NUDT-SIRST dataset, IR-ADMDet performs less well than RT-DETR and DINO. The DINO variant within the DETR family has demonstrated impressive performance on this dataset, with its F1-score trailing RT-DETR by just 0.1%. Notably, DINO achieves an AP50 equivalent to that of IR-ADMDet. However, a significant drawback of DINO lies in its substantial parameter count, which is 8.24 times larger than that of IR-ADMDet. This result can be explained as follows: infrared small targets are characterized by a low SNR, indistinct target–background contrast, and weak infrared radiation, and RT-DETR and DINO are Transformer-based object detection models that can efficiently extract global feature information from objects. In contrast, the IR-ADMDet detector, which relies on local and global contextual information, is slightly less effective when faced with these infrared small targets. In addition, the targets in the NUDT-SIRST dataset are larger than those in the SIRSTv2 and IRSTD-1k datasets, and the AP50 values of all detection methods on this dataset exceed 94%.
Figure 9 and Figure 10 show the PR and ROC curves of the three methods on the three datasets. The following conclusions can be drawn from the curve charts: (1) IR-ADMDet obtains the best results relatively quickly on all three datasets. (2) Lightweight IR-ADMDet achieves the best balance between model parameters, precision, and recall on all three datasets. (3) On all three datasets, lightweight IR-ADMDet can correctly detect more infrared small targets with less computational power.

4.3.2. Qualitative Analysis

In this section, we analyze detection results across various challenging scenarios, organized according to three fundamental detection challenges: target characteristic challenges, background complexity challenges, and SNR challenges.
In response to the challenges posed by target characteristics, we selected the Dim target scene, the Blurring scene, and the Multiple targets scene for analysis. Figure 11 illustrates the detection outcomes in scenarios featuring targets with extremely low illumination.
Our IR-ADMDet demonstrates remarkable effectiveness when detecting poorly illuminated small targets. This advantage stems primarily from the incorporated dual-path feature enhancement mechanism, which effectively combines local residual learning with broader contextual modeling. This approach not only preserves delicate edge details but also maintains comprehensive feature representation. The system generates distinct activation signals within target regions even under challenging low signal-to-noise ratio (SNR) conditions. Despite reaching a relatively modest confidence level of 0.32, this value provides sufficient discrimination capability under low-light circumstances. Additionally, IR-ADMDet employs dynamic gating to attenuate background interference, substantially improving target–background contrast and ensuring reliable preservation of weak target signals.
By comparison, RT-DETR incorrectly identified non-target elements with 0.36 confidence. This error can be attributed to its Transformer architecture’s vulnerability to low-frequency background noise in low SNR environments, where self-attention mechanisms inadvertently generate elevated responses in non-target areas.
DINO successfully located the target with higher confidence (0.827), but its pre-trained model exhibits limited adaptability to thermal imagery characteristics. This mismatch leads to overfitting during feature alignment, producing excessive activation (0.721) in non-target regions and resulting in false positives.
Sparse R-CNN exhibited diminished sensitivity to faint targets, with its query initialization strategy generating false detections (0.511 confidence) in background regions. Mask R-CNN incorrectly identified two non-targets (with 0.949 and 0.579 confidence), revealing inadequate capability for fine-grained feature discrimination against complex backgrounds.
Although YOLOv7 detected the target with 0.31 confidence, it simultaneously produced false positives (0.38 confidence), indicating substantial information loss during downsampling. The remaining YOLO variants and TOOD completely failed to detect the Dim target, demonstrating insufficient feature extraction capabilities under extremely challenging low-light and low-contrast conditions.
Detection outcomes for scenarios with motion and defocus blur are presented in Figure 12.
The examined methods exhibit notable performance differences when processing blurred imagery. IR-ADMDet successfully identifies the target with 0.89 confidence, showcasing the effectiveness of its blur-resistant feature enhancement approach. By integrating localized details with broader contextual information, this method enhances edge definition while minimizing interference from blurred backgrounds, enabling accurate identification despite combined motion and defocus blur effects.
In this scenario, RT-DETR erroneously identifies non-targets with 0.7 confidence, exposing a weakness in its global attention mechanism when handling blurred boundaries. The architecture proves susceptible to background interference when processing indistinct edges. DINO maintains strong performance with 0.889 confidence, benefiting from effective multi-scale feature processing. Similarly, Sparse R-CNN delivers capable results with 0.863 confidence.
Mask R-CNN displays remarkably high confidence of 0.997 in correct identifications, though such extreme values might obscure potential vulnerabilities in more challenging situations. TOOD achieves target detection with a relatively modest 0.411 confidence, suggesting limitations in its feature alignment capability under blurry conditions.
The YOLO family demonstrates varying performance levels, with confidence scores ranging from 0.71 to 0.88. Specifically, YOLOv7 and YOLOv10s reach 0.88 confidence, YOLOv9s achieves 0.82, while YOLOv6s and YOLOv11s register 0.79 and 0.78, respectively. YOLOv8s records the lowest score at 0.71. These variations reflect different trade-offs between lightweight architecture and detail preservation, though all variants demonstrate some capability in detecting blurred targets.
Figure 13 presents detection outcomes for environments containing multiple targets of varying sizes.
The analysis of multi-target detection reveals considerable variations in how different algorithms handle objects of varying dimensions. IR-ADMDet successfully located all three targets, achieving confidence scores of 0.82, 0.51, and 0.46, respectively. This comprehensive detection capability demonstrates how the model’s dual-path enhancement and adaptive fusion effectively capture both prominent and subtle targets, maintaining detection sensitivity even for smaller objects with weaker signal characteristics.
Conversely, RT-DETR only identified the largest target (0.72 confidence), overlooking smaller objects. This limitation suggests its Transformer-based global attention may prioritize more prominent features while disregarding less salient targets due to insufficient spatial resolution for weaker signals. DINO exhibited strong multi-scale capabilities, successfully detecting all three targets with confidence values of 0.783, 0.673, and 0.656, confirming its effective balance in processing differently sized objects.
Sparse R-CNN also detected all targets but displayed pronounced scale sensitivity—registering 0.925 confidence for the largest target versus substantially lower values of 0.385 and 0.456 for smaller objects. This pattern indicates its sparse query mechanism favors larger, more prominent features. Both Mask R-CNN (0.962 confidence) and TOOD (0.646) exclusively detected the largest target, highlighting limitations in their feature alignment and task-decoupling approaches when handling smaller objects in multi-target environments.
Among YOLO variants, all versions except YOLOv9s identified only the largest target (confidence range: 0.74–0.83), attributable to feature degradation during downsampling in these lightweight architectures. YOLOv9s showed moderate improvement by detecting two targets (0.83 and 0.32 confidence) but still missed the smallest object, indicating persistent challenges in preserving small object information despite architectural refinements.
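The information loss during downsampling mentioned above for the YOLO variants can be illustrated with a toy experiment, sketched below on synthetic data (not any particular detector's backbone): a 3 × 3 hot spot is quickly diluted by successive 2× average pooling, whereas a 15 × 15 target keeps a strong response.

```python
import torch
import torch.nn.functional as F

# Synthetic 256x256 infrared-like frame: flat background plus two bright targets.
img = torch.full((1, 1, 256, 256), 0.2)
img[..., 100:103, 100:103] = 1.0   # small 3x3 target
img[..., 180:195, 180:195] = 1.0   # larger 15x15 target

x = img
for level in range(1, 5):                      # four stride-2 stages
    x = F.avg_pool2d(x, kernel_size=2)
    s = 2 ** level                             # cumulative stride so far
    small = x[..., 100 // s - 1:103 // s + 2, 100 // s - 1:103 // s + 2].max().item()
    large = x[..., 180 // s - 1:195 // s + 2, 180 // s - 1:195 // s + 2].max().item()
    print(f"stride {s:>2}: small-target peak {small:.2f}, large-target peak {large:.2f}")

# The 3x3 target's peak decays toward the 0.2 background level after a few
# stages, while the 15x15 target stays near saturation, which is consistent
# with small targets vanishing in heavily downsampled feature maps.
```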
In response to the background complexity challenges, we selected the Building scene, Cloud scene, and Bright Clutters scene for analysis. The detection results in environments featuring architectural structures are presented in Figure 14.
When examining performance against architectural backgrounds, we observed universal target detection across all methods, indicating that contemporary detectors can generally extract sufficient target information in high-contrast environments with well-defined boundaries. However, TOOD generated two false detections near structural elements, revealing vulnerabilities when processing complex architectural textures and geometric details.
Compared to scenes dominated by cloud formations (characterized by low contrast and indistinct edges), building environments present more pronounced boundaries between targets and backgrounds, enabling most methods to achieve accurate detection by leveraging these clear delineations. Nevertheless, intricate architectural edges and localized texture variations can introduce interference in feature alignment and region segmentation processes.
Specifically, the task-decoupled design of TOOD appears susceptible to misclassifying certain architectural features or decorative elements as potential targets. In contrast, methods such as IR-ADMDet, DINO, and Sparse R-CNN successfully avoid such misidentifications through more robust feature integration and effective background suppression mechanisms.
The detection challenges presented by building scenes differ fundamentally from those encountered in cloud environments. While cloud backgrounds typically feature attenuated target signals due to illumination deficiencies and edge indistinctness (making signal enhancement and noise reduction primary concerns), building environments present difficulties in accurately distinguishing targets from backgrounds within geometrically and texturally complex settings. These inherent environmental differences produce distinct performance patterns across the evaluation methods.
Figure 15 illustrates detection performance in environments characterized by cloud formations.
Analyzing detection against cloud backgrounds reveals varying capabilities in handling low-contrast environments. IR-ADMDet maintains stable detection performance with 0.8 confidence, demonstrating how its dual-path enhancement approach effectively suppresses background interference while preserving target features, even when contrast between targets and cloud backgrounds is minimal.
RT-DETR identifies the target with moderate 0.67 confidence, indicating certain limitations in its Transformer architecture’s ability to filter background noise when processing low-contrast cloud textures. DINO detects the target with higher 0.896 confidence but simultaneously generates false detections (0.745 confidence), suggesting that despite strong multi-scale capabilities, complex cloud patterns may lead to overfitting and erroneous background responses.
Sparse R-CNN maintains solid performance with 0.888 confidence, with its query-based approach effectively balancing target enhancement and background suppression. Mask R-CNN correctly identifies the target with high 0.984 confidence but produces two false detections (0.925 and 0.326 confidence), indicating potential overconfidence in boundary determination within cloud environments.
TOOD achieves only 0.427 confidence, reflecting adaptation difficulties in its feature alignment mechanisms when handling complex backgrounds. The YOLO family demonstrates confidence levels between 0.67 and 0.77. While all variants successfully detect the target, their relatively modest confidence scores may result from information degradation in complex, low-contrast scenarios, attributable to their efficient but information-compressing architectures.
The detection results in environments containing high-intensity background elements are presented in Figure 16.
All evaluated methods successfully detected the single target without false positives or missed detections in bright clutter environments, though their varying confidence levels reveal distinct capabilities in target discrimination and interference management under high-intensity conditions. IR-ADMDet achieves effective detection with 0.86 confidence, illustrating how its dynamic gating effectively filters localized high-intensity interference while preserving essential target characteristics, enabling reliable performance against uniformly bright backgrounds.
RT-DETR maintains adequate detection performance with 0.76 confidence, confirming its global attention mechanism can sufficiently distinguish target features in such challenging environments. DINO demonstrates particularly robust performance with 0.926 confidence, highlighting the effectiveness of its multi-scale attention approach in separating targets from bright background clutter. Sparse R-CNN shows strong results with 0.822 confidence, with its query-based design maintaining accurate localization despite intense brightness variations.
The exceptionally high 0.999 confidence of Mask R-CNN reflects the precise boundary modeling of its segmentation component, though such extreme values may mask potential vulnerabilities when handling noisy or ambiguous boundaries. Conversely, the substantially lower 0.337 confidence of TOOD indicates significant challenges in feature alignment under bright clutter conditions.
The YOLO variants (v6s–v11s) register confidence scores of 0.77, 0.8, 0.74, 0.88, 0.91, and 0.78, respectively, revealing architectural differences in their lightweight designs. YOLOv10s (0.91) and YOLOv9s (0.88) demonstrate superior feature preservation after network processing, while other versions appear more susceptible to high-intensity interference due to differences in downsampling strategies and parameter optimization approaches.
In response to the SNR challenges, we selected the High-contrast Boundary scene and Noise scene for analysis. Figure 17 presents detection outcomes in environments featuring sharp transitions between regions of significantly different intensities.
The High-contrast Boundary scene reveals substantial differences in how detection models handle pronounced edge features and suppress interference. IR-ADMDet achieves 0.74 confidence, showing reasonable capability to differentiate targets from backgrounds with sharp transitions. Its feature enhancement approach extracts critical edge information, though the moderate confidence suggests that abrupt boundary changes may still affect consistent feature extraction. By comparison, DINO and Sparse R-CNN demonstrate superior performance with confidence scores of 0.89 and 0.869, respectively. Their architectural designs, incorporating multi-scale attention and sparse query mechanisms, enable effective separation of target features from distracting high-contrast elements, successfully highlighting both overall shape and finer details despite challenging boundary conditions.
RT-DETR achieves only 0.46 confidence, indicating difficulties with significant edge interference, potentially due to its reliance on global attention that may inadequately focus on critical localized features around target boundaries. Similarly, the low 0.409 confidence of TOOD suggests challenges in reconciling its decoupled classification and regression tasks when boundary signals are intense and potentially misleading. The YOLO family shows diverse performance: variants including YOLOv6s, YOLOv7, YOLOv8s, YOLOv9s, and YOLOv10s register confidence between 0.3 and 0.58, suggesting their efficient designs and aggressive downsampling may sacrifice high-frequency edge details that are crucial in high-contrast environments. Notably, YOLOv11s fails entirely to detect the target, potentially due to architectural modifications that excessively compromise fine boundary information.
Mask R-CNN, with 0.821 confidence, benefits from its segmentation component’s ability to model target boundaries accurately. However, its confidence remains below that of DINO or Sparse R-CNN, indicating that even precise segmentation approaches face challenges from strong edge contrast that may introduce ambiguity in target–background differentiation.
Detection performance in environments characterized by significant signal noise is presented in Figure 18.
The Noise scene, containing three targets, reveals marked differences in how detection methods handle background interference. IR-ADMDet successfully identifies all three targets with confidence scores of 0.72, 0.54, and 0.49, respectively, highlighting its effective noise-suppression and feature-preservation capabilities that maintain comprehensive detection even in challenging conditions. YOLOv7 also detects all targets but with notably lower confidence values of 0.57, 0.33, and 0.27, indicating reduced feature discrimination under noisy circumstances.
RT-DETR and DINO each detect only the two smaller targets. RT-DETR achieves 0.53 and 0.31 confidence, while DINO demonstrates higher performance with 0.875 and 0.871 confidence. These results reveal the enhanced sensitivity of DINO for smaller targets in noisy environments, though both methods miss the largest target, suggesting limitations in their global feature integration under noise interference.
Several methods—including Sparse R-CNN, TOOD, and YOLO variants from v6s to v11s—detect only a single target, with confidence values ranging from 0.327 to 0.899. This collective limitation demonstrates their difficulties in suppressing background noise while capturing multiple targets. Mask R-CNN fails completely to detect any targets, further confirming inadequate robustness against noise interference.
The comparative analysis highlights the superior feature preservation of IR-ADMDet and YOLOv7, maintaining detection of all targets despite reduced confidence levels. RT-DETR and DINO show selective sensitivity to specific targets, while most other methods experience significant performance degradation when attempting to identify multiple targets under noisy conditions. These findings emphasize the critical importance of effective noise suppression and feature enhancement for reliable multi-target detection in challenging environments.

4.3.3. Comparison with Segmentation-Based Methods

In this section, we present a comprehensive comparison between the bounding box-based detection method and segmentation-based approaches for infrared small target detection. We first offer a quantitative evaluation of performance metrics across multiple challenging datasets and then provide qualitative visualizations that further illustrate detection outcomes under diverse conditions. The quantitative results are reported in Table 6, and the corresponding detection visualizations are shown in Figure 19.
As shown in Table 6, IR-ADMDet demonstrates significant advantages on the SIRSTv2 dataset. It achieves superior performance metrics with a precision of 0.948, a recall of 0.952, and an F1-score of 0.950—all representing the highest values among all compared methods. Specifically, the precision of IR-ADMDet surpasses the second-best segmentation method, RDIAN (0.899), by approximately 5.4%, while its recall exceeds the second-best performance (0.863) of DNANet by about 10.3%. Most notably, its comprehensive F1-score outperforms DNANet (0.869) by roughly 9.3%.
These results indicate that IR-ADMDet not only identifies targets more accurately (high precision) but also detects them more comprehensively (high recall), with overall performance substantially exceeding traditional segmentation-based approaches. While some segmentation methods like ALCNet maintain acceptable precision (0.838), their significantly lower recall (0.665) reveals inherent difficulties in balancing these two critical metrics.
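Because the comparison repeatedly hinges on the precision–recall balance, it is worth recalling that the F1-score is the harmonic mean of the two, F1 = 2PR/(P + R), so a method with one weak metric is penalized even if the other is high. A quick check against the SIRSTv2 values in Table 6 (reproduced up to rounding of the reported precision and recall):

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# SIRSTv2 values from Table 6.
print(f"ALCNet    F1 = {f1(0.838, 0.665):.3f}")  # ~0.74: low recall drags F1 down
print(f"DNANet    F1 = {f1(0.876, 0.863):.3f}")  # ~0.869
print(f"IR-ADMDet F1 = {f1(0.948, 0.952):.3f}")  # ~0.950: balanced P and R
```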
On the IRSTD-1k dataset, IR-ADMDet again demonstrates superior comprehensive performance, achieving leading metrics with 0.831 precision, 0.823 recall, and 0.827 F1-score. While RDIAN (0.828 precision) and DNANet (0.820 precision) approach the precision of IR-ADMDet, and ALCNet matches its recall (0.820), IR-ADMDet maintains a clear F1-score advantage of 6.7% over the second-best performer, ISTDU-Net (0.775), and of 7.4% over DNANet (0.770). These results confirm the robust performance balance of IR-ADMDet when handling the dataset's more complex backgrounds and noise interference.
The comparative analysis reveals limitations in segmentation methods: DNANet shows significantly lower recall (0.726), while the relatively balanced precision (0.769) and recall (0.760) of OSCAR still yield a substantially inferior F1-score (0.764) compared to IR-ADMDet. This performance gap highlights the advanced capability of IR-ADMDet in maintaining metric equilibrium under challenging detection conditions.
On the NUDT-SIRST dataset, while the competition is more intense, IR-ADMDet still delivers outstanding performance. It achieves the highest precision among all methods at 0.963, outperforming DNANet (0.954) by approximately 0.9%. Although its recall of 0.938 slightly trails DNANet (0.959), IR-ADMDet maintains near-SOTA performance with an F1-score of 0.950, closely approaching DNANet (0.956). These results demonstrate the top-tier capabilities of IR-ADMDet even when handling extreme interference and specific challenges characteristic of this dataset.
The comparative analysis reveals an interesting trade-off: while DNANet shows marginally better recall, IR-ADMDet compensates with superior precision, resulting in nearly identical F1-scores that reflect excellent precision–recall balance. Other segmentation methods, including ISTDU-Net (0.944 F1-score) and OSCAR (0.913 F1-score), although performing respectably, fail to match the leading performance of both IR-ADMDet and DNANet. This pattern confirms the advanced capability of IR-ADMDet to maintain optimal equilibrium between detection accuracy and completeness across varying dataset challenges.
The comprehensive quantitative results across all three datasets demonstrate that IR-ADMDet, as a bounding box-based detection approach, consistently outperforms or matches SOTA segmentation-based methods in precision, recall, and F1-score metrics. This robust performance provides strong evidence for the effectiveness and reliability of the IR-ADMDet framework.
Compared with pixel-level segmentation methods, the bounding box-based detection paradigm appears to offer distinct advantages in several critical aspects: (1) better utilization of global structural information about targets, (2) more effective suppression of background noise, and (3) superior balance between precision and recall. These inherent strengths collectively contribute to the exceptional overall performance of IR-ADMDet in infrared small target detection tasks.
The findings suggest that the box-based detection paradigm, as implemented in IR-ADMDet, represents a more suitable approach for this challenging computer vision task, particularly when dealing with small targets in complex infrared scenarios where traditional segmentation methods often struggle to maintain optimal performance balance.
The visualization results in Figure 19 show that both IR-ADMDet and ISTDU-Net successfully detected all targets across all test scenarios. These methods demonstrate excellent detection robustness.
In Dim target scenes, they effectively extracted faint target signals through feature enhancement mechanisms. For Multiple Targets scenes, they accurately distinguished between different objects. In Noise environments, they successfully suppressed background interference.
By contrast, other methods exhibited clear limitations under varying scenarios.
The ACM method failed to detect any targets in Dim scenes. In Noise environments, it only detected two targets with one false alarm. These results indicate its weak feature extraction capability for faint targets.
ALCNet missed the real targets while generating false detections in Dim scenes. This demonstrates deficiencies in its noise suppression mechanism.
Although DNANet detected targets in Dim scenes, it produced false alarms. It also missed one target in Multiple Targets scenarios. These findings suggest its feature discrimination needs improvement in complex environments.
Both RDIAN and OSCAR missed targets in Dim scenes. They also failed to detect one target in Multiple Targets scenarios. This reveals their limited sensitivity for small and multiple target detection.
It is noteworthy that the outstanding performance of IR-ADMDet and ISTDU-Net validates the effectiveness of feature enhancement strategies.
IR-ADMDet employs a dual-path feature fusion mechanism. This approach combines local detail extraction with global context modeling. It maintains high detection rates while effectively suppressing false alarms.
ISTDU-Net likely enhances target saliency through multi-scale feature fusion or attention mechanisms.
These successful cases provide important references for infrared small target detection. They demonstrate that design principles balancing local feature enhancement and global context understanding can significantly improve detection performance.
The limitations of other methods primarily originate from three key constraints: insufficient response to weak targets in feature extraction layers, excessively aggressive noise suppression strategies that inadvertently eliminate target signals, and inadequate feature discrimination mechanisms for multiple-target scenarios. These visualization results exhibit strong consistency with quantitative performance metrics, collectively demonstrating that effective infrared small target detection in complex environments fundamentally depends on achieving an optimal balance between feature enhancement and noise suppression while simultaneously establishing robust multi-scale feature representation mechanisms.

5. Discussion

IR-ADMDet addresses fundamental challenges in infrared small target detection through two key innovations. First, the integration of anisotropic feature extraction with dynamic-aware processing enables adaptation to varying target characteristics across different scenes, significantly improving performance in complex backgrounds. Second, the dual-path architecture of DPHFENet effectively balances local and global information processing, addressing limitations in previous works where methods either lost sensitivity to small targets or generated excessive false positives.
Despite these advances, our approach shows limitations in extreme noise conditions, where enhancement mechanisms may amplify noise artifacts. Future research should explore incorporating temporal information across sequential frames to distinguish genuine targets from noise.
The lightweight nature of IR-ADMDet (5.77M parameters) offers significant practical advantages for deployment in resource-constrained environments without sacrificing performance. This efficiency–accuracy balance makes it particularly valuable for real-time applications in defense, humanitarian efforts such as disaster response, and environmental monitoring, including wildlife conservation and early forest fire detection.

6. Conclusions

IR-ADMDet presents an anisotropic dynamic-aware multi-scale network for infrared small target detection that effectively addresses the challenges of low SNR, background clutter, and scale variation. Through our novel Dual-Path Hybrid Feature Extractor Network (DPHFENet) and Hierarchical Adaptive Fusion Framework (HAFF), we achieve state-of-the-art results (0.96 AP50 and 0.95 F1-score on SIRSTv2) with only 5.77M parameters.
The model demonstrates exceptional robustness in challenging low-contrast and cluttered environments while maintaining computational efficiency. This makes it suitable for applications beyond defense, including search-and-rescue operations, wildlife monitoring, and industrial surveillance. While showing limitations in extremely noisy conditions, the framework establishes a foundation for future spectral–temporal enhancements, representing a significant advancement in infrared surveillance technology with broad cross-domain applications.

Author Contributions

Conceptualization, N.L. and D.W.; Methodology, N.L. and D.W.; Software, N.L.; Validation, N.L.; Formal analysis, D.W.; Investigation, D.W.; Data curation, D.W.; Supervision, D.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author due to privacy restrictions and the large volume of the datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hao, X.; Luo, S.; Chen, M.; He, C.; Wang, T.; Wu, H. Infrared small target detection with super-resolution and YOLO. Opt. Laser Technol. 2024, 177, 111221. [Google Scholar] [CrossRef]
  2. Tong, Y.; Leng, Y.; Yang, H.; Wang, Z. Target-Focused Enhancement Network for Distant Infrared Dim and Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4109711. [Google Scholar] [CrossRef]
  3. Ma, T.; Guo, G.; Li, Z.; Yang, Z. Infrared Small Target Detection Method Based on High-Low-Frequency Semantic Reconstruction. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6012505. [Google Scholar] [CrossRef]
  4. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 24–27 June 2014; pp. 580–587. [Google Scholar]
  5. Chi, W.; Liu, J.; Wang, X.; Ni, Y.; Feng, R. A Semantic Domain Adaption Framework for Cross-Domain Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  6. Lin, F.; Bao, K.; Li, Y.; Zeng, D.; Ge, S. Learning Contrast-Enhanced Shape-Biased Representations for Infrared Small Target Detection. IEEE Trans. Image Process. 2024, 33, 3047–3058. [Google Scholar] [CrossRef] [PubMed]
  7. Sun, Z.; Leng, X.; Zhang, X.; Zhou, Z.; Xiong, B.; Ji, K.; Kuang, G. Arbitrary-Direction SAR Ship Detection Method for Multiscale Imbalance. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5208921. [Google Scholar] [CrossRef]
  8. Zhang, X.; Zhang, S.; Sun, Z.; Liu, C.; Sun, Y.; Ji, K.; Kuang, G. Cross-Sensor SAR Image Target Detection Based on Dynamic Feature Discrimination and Center-Aware Calibration. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5209417. [Google Scholar] [CrossRef]
  9. Li, X.; Xu, F.; Liu, F.; Lyu, X.; Tong, Y.; Xu, Z.; Zhou, J. A Synergistical Attention Model for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5400916. [Google Scholar] [CrossRef]
  10. Li, X.; Xu, F.; Li, L.; Xu, N.; Liu, F.; Yuan, C.; Chen, Z.; Lyu, X. AAFormer: Attention-Attended Transformer for Semantic Segmentation of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5002805. [Google Scholar] [CrossRef]
  11. Li, X.; Xu, F.; Liu, F.; Tong, Y.; Lyu, X.; Zhou, J. Semantic Segmentation of Remote Sensing Images by Interactive Representation Refinement and Geometric Prior-Guided Inference. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5400318. [Google Scholar] [CrossRef]
  12. Li, X.; Xu, F.; Tao, F.; Tong, Y.; Gao, H.; Liu, F.; Chen, Z.; Lyu, X. A Cross-Domain Coupling Network for Semantic Segmentation of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5005105. [Google Scholar] [CrossRef]
  13. Li, X.; Xu, F.; Yu, A.; Lyu, X.; Gao, H.; Zhou, J. A Frequency Decoupling Network for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5607921. [Google Scholar] [CrossRef]
  14. Kou, R.; Wang, C.; Peng, Z.; Zhao, Z.; Chen, Y.; Han, J.; Huang, F.; Yu, Y.; Fu, Q. Infrared small target segmentation networks: A survey. Pattern Recognit. 2023, 143, 109788. [Google Scholar] [CrossRef]
  15. Yao, R.; Li, W.; Zhou, Y.; Sun, J.; Yin, Z.; Zhao, J. Dual-Stream Edge-Target Learning Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5007314. [Google Scholar] [CrossRef]
  16. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  17. Qiu, W.; Wang, K.; Li, S.; Zhang, K. YOLO-based Detection Technology for Aerial Infrared Targets. In Proceedings of the 2019 IEEE 9th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Suzhou, China, 29 July–2 August 2019; pp. 1115–1119. [Google Scholar] [CrossRef]
  18. Gongguo, Z.; Junhao, W. An improved small target detection method based on Yolo V3. In Proceedings of the 2021 International Conference on Electronics, Circuits and Information Engineering (ECIE), Zhengzhou, China, 22–24 January 2021; pp. 220–223. [Google Scholar] [CrossRef]
  19. Lin, Z.; Huang, M.; Zhou, Q. Infrared small target detection based on YOLO v4. J. Phys. Conf. Ser. 2023, 2450, 012019. [Google Scholar] [CrossRef]
  20. Li, S.; Li, Y.; Li, Y.; Li, M.; Xu, X. YOLO-FIRI: Improved YOLOv5 for Infrared Image Object Detection. IEEE Access 2021, 9, 141861–141875. [Google Scholar] [CrossRef]
  21. Ciocarlan, A.; Le Hegarat-Mascle, S.; Lefebvre, S.; Woiselle, A.; Barbanson, C. A Contrario Paradigm for Yolo-Based Infrared Small Target Detection. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 5630–5634. [Google Scholar] [CrossRef]
  22. Cao, L.; Wang, Q.; Luo, Y.; Hou, Y.; Cao, J.; Zheng, W. YOLO-TSL: A lightweight target detection algorithm for UAV infrared images based on Triplet attention and Slim-neck. Infrared Phys. Technol. 2024, 141, 105487. [Google Scholar] [CrossRef]
  23. Betti, A.; Tucci, M. YOLO-S: A Lightweight and Accurate YOLO-like Network for Small Target Detection in Aerial Imagery. Sensors 2023, 23, 1865. [Google Scholar] [CrossRef]
  24. Hou, Y.; Tang, B.; Ma, Z.; Wang, J.; Liang, B.; Zhang, Y. YOLO-B: An infrared target detection algorithm based on bi-fusion and efficient decoupled. PLoS ONE 2024, 19, e0298677. [Google Scholar] [CrossRef]
  25. Jawaharlalnehru, A.; Sambandham, T.; Sekar, V.; Ravikumar, D.; Loganathan, V.; Kannadasan, R.; Khan, A.A.; Wechtaisong, C.; Haq, M.A.; Alhussen, A.; et al. Target Object Detection from Unmanned Aerial Vehicle (UAV) Images Based on Improved YOLO Algorithm. Electronics 2022, 11, 2343. [Google Scholar] [CrossRef]
  26. Liu, Y.; Li, N.; Cao, L.; Zhang, Y.; Ni, X.; Han, X.; Dai, D. Research on Infrared Dim Target Detection Based on Improved YOLOv8. Remote Sens. 2024, 16, 2878. [Google Scholar] [CrossRef]
  27. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric Contextual Modulation for Infrared Small Target Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 1–6 January 2021; pp. 950–959. [Google Scholar]
  28. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional Local Contrast Networks for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824. [Google Scholar] [CrossRef]
  29. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense Nested Attention Network for Infrared Small Target Detection. IEEE Trans. Image Process. 2023, 32, 1745–1758. [Google Scholar] [CrossRef]
  30. Hou, Q.; Zhang, L.; Tan, F.; Xi, Y.; Zheng, H.; Li, N. ISTDU-Net: Infrared Small-Target Detection U-Net. IEEE Geosci. Remote Sens. Lett. 2022, 19, 7506205. [Google Scholar] [CrossRef]
  31. Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-Field and Direction Induced Attention Network for Infrared Dim Small Target Detection with a Large-Scale Dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000513. [Google Scholar] [CrossRef]
  32. Dai, Y.; Li, X.; Zhou, F.; Qian, Y.; Chen, Y.; Yang, J. One-Stage Cascade Refinement Networks for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000917. [Google Scholar] [CrossRef]
  33. Tong, X.; Su, S.; Wu, P.; Guo, R.; Wei, J.; Zuo, Z.; Sun, B. MSAFFNet: A Multiscale Label-Supervised Attention Feature Fusion Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5002616. [Google Scholar] [CrossRef]
  34. Shi, Q.; Zhang, C.; Chen, Z.; Lu, F.; Ge, L.; Wei, S. An infrared small target detection method using coordinate attention and feature fusion. Infrared Phys. Technol. 2023, 131, 104614. [Google Scholar] [CrossRef]
  35. Xu, H.; Zhong, S.; Zhang, T.; Zou, X. Multiscale Multilevel Residual Feature Fusion for Real-Time Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5002116. [Google Scholar] [CrossRef]
  36. Fang, H.; Ding, L.; Wang, L.; Chang, Y.; Yan, L.; Han, J. Infrared Small UAV Target Detection Based on Depthwise Separable Residual Dense Network and Multiscale Feature Fusion. IEEE Trans. Instrum. Meas. 2022, 71, 5019120. [Google Scholar] [CrossRef]
  37. Zhang, P.; Wang, Z.; Bao, G.; Hu, J.; Shi, T.; Sun, G.; Gong, J. Multiscale Progressive Fusion Filter Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5602314. [Google Scholar] [CrossRef]
  38. Zhang, M.; Li, B.; Wang, T.; Bai, H.; Yue, K.; Li, Y. CHFNet: Curvature Half-Level Fusion Network for Single-Frame Infrared Small Target Detection. Remote Sens. 2023, 15, 1573. [Google Scholar] [CrossRef]
  39. Zuo, Z.; Tong, X.; Wei, J.; Su, S.; Wu, P.; Guo, R.; Sun, B. AFFPN: Attention Fusion Feature Pyramid Network for Small Infrared Target Detection. Remote Sens. 2022, 14, 3412. [Google Scholar] [CrossRef]
  40. Liu, S.; Chen, P.; Woźniak, M. Image Enhancement-Based Detection with Small Infrared Targets. Remote Sens. 2022, 14, 3232. [Google Scholar] [CrossRef]
  41. Chen, Y.; Li, L.; Liu, X.; Su, X. A Multi-Task Framework for Infrared Small Target Detection and Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5003109. [Google Scholar] [CrossRef]
  42. Zhao, M.; Cheng, L.; Yang, X.; Feng, P.; Liu, L.; Wu, N. TBC-Net: A real-time detector for infrared small target detection using semantic constraint. arXiv 2019, arXiv:2001.05852. [Google Scholar]
  43. Liu, F.; Gao, C.; Chen, F.; Meng, D.; Zuo, W.; Gao, X. Infrared Small and Dim Target Detection with Transformer Under Complex Backgrounds. IEEE Trans. Image Process. 2023, 32, 5921–5932. [Google Scholar] [CrossRef] [PubMed]
  44. Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the Stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5694–5703. [Google Scholar]
  45. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape Matters for Infrared Small Target Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 877–886. [Google Scholar]
  46. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  47. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  48. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-End Object Detection with Learnable Proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
  49. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  50. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned One-stage Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Los Alamitos, CA, USA, 10–17 October 2021; pp. 3490–3499. [Google Scholar] [CrossRef]
  51. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  52. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  53. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision—ECCV 2024, Proceedings of the 18th European Conference, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; pp. 1–21. [Google Scholar]
  54. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; Curran Associates, Inc.: Newry, UK, 2024; Volume 37, pp. 107984–108011. [Google Scholar]
  55. Mou, X.; Lei, S.; Zhou, X. YOLO-FR: A YOLOv5 Infrared Small Target Detection Algorithm Based on Feature Reassembly Sampling Method. Sensors 2023, 23, 2710. [Google Scholar] [CrossRef]
  56. Yue, T.; Lu, X.; Cai, J.; Chen, Y.; Chu, S. YOLO-MST: Multiscale deep learning method for infrared small target detection based on super-resolution and YOLO. Opt. Laser Technol. 2025, 187, 112835. [Google Scholar] [CrossRef]
  57. Ren, D.; Li, J.; Han, M.; Shu, M. DNANet: Dense Nested Attention Network for Single Image Dehazing. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2035–2039. [Google Scholar] [CrossRef]
Figure 1. A 3D grid diagram of typical infrared small targets.
Figure 2. Overall structure of IR-ADMDet. DPHFENet serves as the Backbone of IR-ADMDet, and HAFF functions as the Neck.
Figure 3. Structure of DPHFENet. (a) Left: the overall architecture of DPHFENet; middle: structure of the HFEBlock module; right: detailed structures of the DPFE module and its internal LRLP component. (b) Center: structure of the GCP module within DPFE; left and right: architectures of the AGM and SAIM sub-modules within GCP.
Figure 4. The structure of C2f and SPPF in DPHFENet.
Figure 5. Structure diagram of BGFS.
Figure 6. Illustration of the RGEF structural framework. (Left) The generalized RGEF structure with N convolutional modules. (Middle) A specialized RGEF configuration implementing two convolutional modules. (Right) RepConv structural transformation between training and deployment stages.
Figure 7. IFC structure diagram.
Figure 8. IFC receptive fields for different k values.
Figure 9. PR curves of different target detection methods. (a) SIRSTv2 dataset; (b) IRSTD-1k dataset; (c) NUDT-SIRST dataset.
Figure 10. ROC curves of different target detection methods. (a) SIRSTv2 dataset; (b) IRSTD-1k dataset; (c) NUDT-SIRST dataset.
Figure 11. Results of different methods in the Dim target scene.
Figure 12. Results of different methods in the Blurring scene.
Figure 13. Results of different methods in the Multiple Targets scene.
Figure 14. Results of different methods in the Building scene.
Figure 15. Results of different methods in the Cloud scene.
Figure 16. Results of different methods in the Bright Clutters scene.
Figure 17. Results of different methods in the High-contrast Boundary scene.
Figure 18. Results of different methods in the Noise scene.
Figure 19. The detection results of segmentation-based methods and IR-ADMDet across eight different scenarios are presented. Dashed circles indicate missed detections, and solid ellipses mark false detections.
Table 1. Configuration of the backbone. The original input infrared image has a resolution of 640 × 640 pixels.
Module | Number | Output Resolution (Pixels) | Output Channels | –
Conv | 1 | 320 × 320 | 32 | –
Conv | 1 | 160 × 160 | 32 | –
C2f | 3 | 160 × 160 | 32 | –
Conv | 1 | 80 × 80 | 64 | –
C2f | 6 | 80 × 80 | 64 | –
Conv | 1 | 40 × 40 | 128 | –
HFEBlock | 6 | 40 × 40 | 128 | 0.25
Conv | 1 | 20 × 20 | 256 | –
HFEBlock | 3 | 20 × 20 | 256 | 0.5
SPPF | 1 | 20 × 20 | 256 | –
Table 2. Experimental hardware parameters.
Name | Configuration
Operating system | Win11
Computing platform | CUDA 11.7
CPU | AMD Ryzen 7 5800H
GPU | NVIDIA GeForce RTX 3060
GPU memory size | 6 GB
Table 4. Impact of different k values in the IFC module on detection performance.
k | P | R | F1 | AP50
3 | 0.959 | 0.886 | 0.921 | 0.94
5 | 0.948 | 0.952 | 0.95 | 0.96
7 | 0.96 | 0.926 | 0.943 | 0.957
9 | 0.946 | 0.9 | 0.922 | 0.953
11 | 0.941 | 0.906 | 0.923 | 0.943
13 | 0.945 | 0.909 | 0.927 | 0.941
Table 5. Quantitative analysis results. Highlighted in bold black is the best-performing method, and the second-best is denoted by italic black.
Methods | SIRSTv2 (P / R / F1 / AP50) | IRSTD-1k (P / R / F1 / AP50) | NUDT-SIRST (P / R / F1 / AP50) | Parameters/M
RT-DETR [46] | 0.958 / 0.911 / 0.934 / 0.94 | 0.824 / 0.827 / 0.825 / 0.83 | 0.99 / 0.959 / 0.974 / 0.98 | 20.184
DINO [47] | 0.927 / 0.923 / 0.924 / 0.948 | 0.836 / 0.816 / 0.826 / 0.826 | 0.983 / 0.964 / 0.973 / 0.978 | 47.54
Sparse R-CNN [48] | 0.897 / 0.863 / 0.88 / 0.888 | 0.826 / 0.743 / 0.782 / 0.81 | 0.986 / 0.91 / 0.946 / 0.944 | 77.8
Mask R-CNN [49] | 0.923 / 0.79 / 0.851 / 0.888 | 0.807 / 0.561 / 0.662 / 0.691 | 0.811 / 0.814 / 0.812 / 0.877 | 43.991
TOOD [50] | 0.689 / 0.661 / 0.675 / 0.704 | 0.839 / 0.745 / 0.789 / 0.809 | 0.952 / 0.925 / 0.938 / 0.958 | 32.018
YOLOv6s [51] | 0.911 / 0.798 / 0.851 / 0.886 | 0.847 / 0.718 / 0.777 / 0.818 | 0.926 / 0.938 / 0.932 / 0.963 | 16.298
YOLOv7 [52] | 0.898 / 0.71 / 0.793 / 0.792 | 0.796 / 0.704 / 0.747 / 0.749 | 0.945 / 0.894 / 0.919 / 0.941 | 6.195
YOLOv8s | 0.908 / 0.801 / 0.851 / 0.881 | 0.826 / 0.743 / 0.782 / 0.81 | 0.948 / 0.893 / 0.92 / 0.962 | 11.126
YOLOv9s [53] | 0.92 / 0.75 / 0.826 / 0.873 | 0.797 / 0.773 / 0.785 / 0.805 | 0.927 / 0.952 / 0.939 / 0.965 | 7.167
YOLOv10s [54] | 0.881 / 0.798 / 0.837 / 0.885 | 0.829 / 0.711 / 0.765 / 0.817 | 0.927 / 0.899 / 0.913 / 0.965 | 7.218
YOLOv11s | 0.908 / 0.797 / 0.857 / 0.888 | 0.8 / 0.728 / 0.762 / 0.801 | 0.849 / 0.902 / 0.925 / 0.964 | 9.413
YOLO-FR [55] | 0.933 / 0.912 / 0.922 / 0.923 | 0.812 / 0.811 / 0.811 / 0.815 | 0.954 / 0.908 / 0.93 / 0.933 | 8.336
YOLO-MST [56] | 0.941 / 0.925 / 0.933 / 0.935 | 0.825 / 0.819 / 0.822 / 0.831 | 0.971 / 0.911 / 0.94 / 0.947 | 12.7
IR-ADMDet (Ours) | 0.948 / 0.952 / 0.95 / 0.96 | 0.831 / 0.823 / 0.827 / 0.852 | 0.963 / 0.938 / 0.95 / 0.978 | 5.768
Table 6. Our model versus SOTAs: Comparison of P, R, and F1 values on the SIRSTv2, IRSTD-1k, and NUDT-SIRST datasets. The best performers are in bold.
Methods | SIRSTv2 (P / R / F1) | IRSTD-1k (P / R / F1) | NUDT-SIRST (P / R / F1)
ACM [27] | 0.721 / 0.777 / 0.748 | 0.679 / 0.757 / 0.716 | 0.706 / 0.869 / 0.779
ALCNet [28] | 0.838 / 0.665 / 0.741 | 0.700 / 0.820 / 0.755 | 0.809 / 0.797 / 0.803
DNANet [57] | 0.876 / 0.863 / 0.869 | 0.820 / 0.726 / 0.770 | 0.954 / 0.959 / 0.956
ISTDU-Net [30] | 0.852 / 0.796 / 0.823 | 0.780 / 0.770 / 0.775 | 0.947 / 0.941 / 0.944
RDIAN [31] | 0.899 / 0.720 / 0.800 | 0.828 / 0.670 / 0.741 | 0.917 / 0.882 / 0.900
OSCAR [32] | 0.873 / 0.742 / 0.802 | 0.769 / 0.760 / 0.764 | 0.900 / 0.927 / 0.913
IR-ADMDet (Ours) | 0.948 / 0.952 / 0.950 | 0.831 / 0.823 / 0.827 | 0.963 / 0.938 / 0.950
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
