Article

Feature Enhancement Network for Infrared Small Target Detection in Complex Backgrounds Based on Multi-Scale Attention Mechanism

1 Key Laboratory of Infrared System Detection and Imaging Technology, Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 4966; https://doi.org/10.3390/app15094966
Submission received: 1 April 2025 / Revised: 28 April 2025 / Accepted: 28 April 2025 / Published: 30 April 2025

Abstract

Detecting small targets in single-frame infrared imagery remains a significant challenge in computer vision, primarily due to large variations in target size, cluttered backgrounds, low signal-to-noise ratios, the sensitivity of bounding box regression to target size, and partial occlusion. To address these challenges, we propose a robust feature enhancement network for infrared small target detection based on a multi-scale attention mechanism. Specifically, an Iterative Attentional Feature Fusion (iAFF) module is introduced at the neck of the detection network to prevent small target features from being overwhelmed during cross-scale feature fusion. We also present an Occlusion-Aware Attention Module (OAAM), which is more tolerant of target localization errors in regions where local features are missing due to partial occlusion. By combining the scale and spatial attention mechanisms of the Dynamic Head, our approach adaptively learns the relative importance of different semantic layers. Furthermore, the Normalized Wasserstein–Gaussian Distance (NWD) is integrated to address the convergence problems caused by the high sensitivity of bounding box regression for small infrared targets. To evaluate our method, we present a new benchmark dataset, IRMT-UAV, characterized by large variations in target size, complex backgrounds, and substantial variations in signal-to-noise ratio. Experiments on the public IRSTD-1k dataset and the self-built IRMT-UAV dataset show that our method surpasses state-of-the-art (SOTA) methods, with mAP50 improvements of 1.4% and 4.9%, respectively, demonstrating its effectiveness and robustness.

1. Introduction

The detection of small targets using infrared technology is vital in multiple fields, such as terrestrial surveillance [1], early-warning systems [2], and precision guidance [3], among others. Compared to generic object detection, small object detection has distinct characteristics. First, owing to the small size of the objects or long imaging distances, the proportion of image pixels occupied by the target is minimal and, in severe cases, may amount to only a handful of pixels [4,5]. Second, because of the considerable disparity in pixel area between the target and background regions, small infrared targets exhibit low signal-to-noise and signal-to-clutter ratios against densely cluttered backdrops [6]. Third, when local aliasing occurs between the target and the background, local features can be lost [7]. These characteristics make small object detection particularly challenging.
Notable limitations persist in detecting dim and small infrared targets against complex backgrounds, which can be summarized as follows. First, the uneven distribution of cluttered backgrounds and small targets means that small target information is easily lost during downsampling, and shallow features carrying more target detail are frequently discarded. Second, partial occlusion of targets leads to problems such as the loss of local information and shape distortion, complicating target detection, recognition, and localization. Third, as the imaging distance changes, the pixel extent of a target can vary from a few pixels to thousands. For small targets, localization loss functions based on the IoU and its variants, such as the GIoU [8], DIoU [9], and CIoU [10], are highly sensitive to prediction deviations, making bounding box regression difficult to converge.
To address the challenges outlined above, this paper proposes a robust feature enhancement network for detecting dim and small targets against complex backgrounds. Specifically, an Iterative Attentional Feature Fusion (iAFF) module is introduced at the neck of the detection network, employing a multi-scale channel attention mechanism (MS-CAM) to prevent small target features from being overwhelmed during cross-scale attention feature fusion. Furthermore, we propose an Occlusion-Aware Attention Module (OAAM) and combine it with the scale and spatial attention of the Dynamic Head, which adaptively learns the importance of each semantic layer and mitigates problems such as local aliasing and the loss of target features. To address the convergence difficulties of localization losses for small targets in standard detection algorithms, we model bounding boxes as two-dimensional Gaussian distributions and apply the Normalized Wasserstein–Gaussian Distance (NWD) to reduce the sensitivity of the localization loss for small targets. Finally, comprehensive ablation experiments with the proposed modules are conducted on RetinaNet, YOLOv8m, and YOLOv11m, and the results are compared with other SOTA algorithms to confirm the effectiveness of the proposed technique. A brief summary of the paper’s contributions follows:
(1)
Comparative experiments are conducted to evaluate multi-scale feature map fusion in the FPN and PAN against the introduced iAFF module. The findings confirm that, by aggregating multi-scale contextual information along the channel dimension, our network can emphasize both large, globally distributed targets and small, locally distributed targets, enhancing its ability to detect and recognize objects under significant scale variation.
(2)
After fusing features from different scales, the proposed OAAM dynamically strengthens the interdependencies among feature channels. Furthermore, we integrate the scale and spatial attention of the Dynamic Head, employing deformable convolutions to focus on target shapes, thereby addressing local occlusion and the loss of object features.
(3)
We select the NWD as the loss function for small object bounding box regression. We further evaluate the impact of employing the IoU, GIoU, CIoU, and NWD as localization loss functions for infrared small target bounding box regression. The results show that the NWD loss markedly reduces the false alarms and missed detections caused by the difficulty of localizing and recognizing small infrared targets in cluttered environments, improving the overall robustness of the detection model.
(4)
Utilizing a long-wave infrared camera, we capture images of drones in various situations and over various distances (100–1200 m), resulting in the formation of a comprehensive infrared target dataset called IRMT-UAV. This dataset covers a wide variety of application scenarios, including those with highly cluttered backgrounds and local occlusion between objects and their surroundings. The targets in the dataset vary from bright, medium-sized objects at close range to dim, small objects at extended distances. The varied nature of this dataset bolsters its resilience and flexibility, rendering it ideal for the training and assessment of object detection algorithms.

2. Related Work

Various techniques for detecting small infrared targets based on mathematical models have been proposed, as shown in Table 1, including filter-based strategies [11,12], local contrast methods [13,14], and low-rank decomposition techniques [15,16]. Filter-based methods separate targets through background estimation and target enhancement; however, they are limited to uniform backgrounds and are not robust against complex scenes. Local contrast-based techniques locate targets by measuring the intensity difference between the target and its neighborhood, but they struggle to detect dim targets. Low-rank decomposition techniques exploit the sparsity of the target and the low-rank property of the background to separate the target's structural components from the background; however, for images with complex backgrounds and diverse target shapes, these techniques produce a significant number of false alarms and missed detections. Under real-world conditions, infrared imagery frequently presents complex backgrounds, dim targets, and considerable scale variation, which limits the effectiveness of the aforementioned techniques in detecting dim and small infrared targets.
Lately, the field of deep learning has seen remarkable advancements and major breakthroughs across various disciplines. Differing from traditional techniques for identifying tiny infrared objects, deep learning utilizes an all-encompassing, data-focused learning strategy to dynamically acquire characteristics for these targets, thus obviating the necessity for manually generated features. Wang and others endeavored to balance the metrics for detecting missed and false alarms, and proposed the development of a conditional generative adversarial network (GAN) comprising two generators and one discriminator [17]. Utilizing a context aggregation network, the pair of generators utilizes receptive fields varying in size to achieve both local and global segmentation outcomes for the target. The discriminator differentiates between the three segmentation results obtained from the pair of generators and the genuine annotations of the image. The technique presents an innovative deep learning structure aimed at lowering the rate of both missed and false alarms in detection algorithms, yet it is unsuitable for situations characterized by intensely cluttered backgrounds and localized aliasing between targets and backgrounds.
Dai and others proposed an asymmetric contextual modulation module to detect small infrared targets, enhancing the exchange between high-level semantics and fine details via point-wise channel attention [18]. Hou and others developed a feature extraction framework that merges handcrafted feature techniques with convolutional neural networks [19]. Chen and others developed a hierarchical overlapped small patch transformer (HOSPT) as an alternative to the CNN for encoding multi-scale features from a single-frame image [20]. Dai and others introduced a one-stage cascade refinement network (OSCAR) to address the issue of small targets being misclassified as background [21]. Yao and others proposed an efficient one-stage infrared small target detection method based on improved FCOS (Fully Convolutional One-Stage Object Detection) to exploit spatio-temporal correlations in image sequences [22]. Ju and others employed a multi-task loss function for end-to-end training of ISTDet [28]. Du and others proposed interframe energy accumulation (IFEA) enhancement to boost target energy, suppress strong spatially non-stationary clutter, and detect dim, small targets [29]. In addition, segmentation-based algorithms [23,24,25,26,27] achieve pixel-level segmentation of dim, small infrared targets and complex background regions by fusing and interacting features across scales.

3. Methodology

3.1. Overall Architecture

The workflow of our proposed network is depicted in Figure 1. First, the infrared image is fed into the backbone network to extract key features. The iAFF module is then integrated at the network's neck to perform cross-scale feature fusion, and the OAAM dynamically strengthens the interdependencies between feature channels. The output features are subsequently passed to the dynamic detection head, which adaptively learns the importance of each semantic layer. Finally, the NWD and DFL evaluate the detection outputs, computing the loss and guiding the optimization of the model.

3.2. Iterative Attentional Feature Fusion

In practical applications, targets may appear at different scales due to varying viewpoints, imaging distances, or camera parameters [30]. Multi-scale feature fusion improves the model's adaptability to targets of different sizes by combining outputs from different levels and leveraging both global context and local detail. The Feature Pyramid Network (FPN) [31] and Path Aggregation Network (PAN) [32] are widely adopted in models that generate pyramid-shaped multi-scale features by fusing adjacent layers of the backbone network. This fusion combines deep, abstract semantic features with shallow, high-resolution features, enriching the feature representation at different scales. However, conventional fusion in the FPN-PAN typically concatenates or adds neighboring layers after upsampling the low-resolution features and then applies an ordinary convolution to match the output channels. The problem with this simple fusion is that the merged feature maps are weighted equally, so in dense background clutter, large-scale background features dominate the features of small targets [33,34]. To address this issue, we introduce Iterative Attentional Feature Fusion (iAFF) in the network neck to fuse semantically and scale-inconsistent features, as shown in Figure 2. More precisely, iAFF employs a multi-scale channel attention mechanism (MS-CAM) to prevent small target features from being overwhelmed during cross-scale attention feature fusion. By aggregating multi-scale contextual information along the channel dimension, the MS-CAM can emphasize both large, globally distributed targets and small, locally distributed targets, helping the network detect and recognize objects with significant scale differences. The channel weights of the global feature context in the MS-CAM are computed as follows:
w = \sigma(g(X)) = \sigma\left(\mathcal{B}\left(W_2\,\delta\left(\mathcal{B}\left(W_1\,G(X)\right)\right)\right)\right)
where g(X) ∈ R^C denotes the global feature context and G(X) denotes global average pooling over the spatial dimensions. δ denotes the ReLU activation function, B the batch normalization (BN) layer, and σ the Sigmoid activation function. In practice, this is implemented with two fully connected layers combined with activation functions, where W_1 ∈ R^{(C/r) × C} is the channel reduction layer, W_2 ∈ R^{C × (C/r)} is the channel expansion layer, and r is the channel reduction ratio.
The MS-CAM incorporates local contextual information into the global context of the attention module, providing feature attention across different scales. Point-wise convolution (PWConv) [35] is employed to aggregate local context, exploiting point-wise channel interactions at each spatial position. The local context channel weights are computed as follows:
L(X) = \mathcal{B}\left(\mathrm{PWConv}_2\left(\delta\left(\mathcal{B}\left(\mathrm{PWConv}_1(X)\right)\right)\right)\right)
The convolution kernel sizes of PWConv_1 and PWConv_2 are (C/r) × C × 1 × 1 and C × (C/r) × 1 × 1, respectively. L(X) has the same shape as the input features, enabling it to preserve and emphasize fine details in the low-level features. Given the global context channel weights g(X) and the local context channel weights L(X), the features refined by the MS-CAM are obtained as follows:
X' = X \otimes M(X) = X \otimes \sigma\left(L(X) \oplus g(X)\right)
where M(X) ∈ R^{C × H × W} denotes the attention weights produced by the MS-CAM. The symbol ⊕ denotes broadcast addition in tensor computation, while ⊗ denotes element-wise multiplication. The following describes how iAFF uses the multi-scale channel attention mechanism to fuse features across scales. Given two feature maps X, Y ∈ R^{C × H × W} from different scales, with Y having the larger receptive field, the first stage of iAFF fuses X and Y using the MS-CAM:
X \uplus Y = M(X + Y) \otimes X + \left(1 - M(X + Y)\right) \otimes Y
where M(X + Y) denotes the attention weights produced by the MS-CAM module, ranging between 0 and 1, and X ⊎ Y denotes the initial result of the cross-scale feature fusion. A second, analogous step then fuses the cross-scale features further, yielding the final fusion result Z ∈ R^{C × H × W}:
Z = M(X \uplus Y) \otimes X + \left(1 - M(X \uplus Y)\right) \otimes Y
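To make the fusion procedure concrete, the following PyTorch sketch implements an MS-CAM block and the two-stage iAFF fusion following the formulas above; the channel reduction ratio r = 4 and the class names are illustrative assumptions rather than the exact configuration used in our network.

```python
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Multi-scale channel attention (MS-CAM) sketch: a global branch (GAP + bottleneck)
    and a local branch (point-wise bottleneck) are added and squashed to (0, 1)."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        inter = max(channels // r, 1)
        self.global_att = nn.Sequential(      # G(X) -> W1 -> BN -> ReLU -> W2 -> BN
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter, 1), nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, channels, 1), nn.BatchNorm2d(channels),
        )
        self.local_att = nn.Sequential(       # PWConv1 -> BN -> ReLU -> PWConv2 -> BN
            nn.Conv2d(channels, inter, 1), nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcast-add the global descriptor onto the local map, then apply the Sigmoid.
        return torch.sigmoid(self.local_att(x) + self.global_att(x))   # attention weights M(X)

class IAFF(nn.Module):
    """Iterative attentional feature fusion: a first MS-CAM weighs X + Y, a second refines."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.att1 = MSCAM(channels, r)
        self.att2 = MSCAM(channels, r)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        w1 = self.att1(x + y)                  # M(X + Y)
        fused = w1 * x + (1.0 - w1) * y        # initial cross-scale fusion
        w2 = self.att2(fused)                  # M(X ⊎ Y)
        return w2 * x + (1.0 - w2) * y         # final fusion Z

# Example: fuse a shallow map with a deep map of the same shape.
x = torch.randn(1, 256, 40, 40)
y = torch.randn(1, 256, 40, 40)
z = IAFF(256)(x, y)                            # -> (1, 256, 40, 40)
```

In the actual neck, Y would first be upsampled and channel-aligned to match X before the fusion, as described above.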

3.3. Occlusion-Aware Attention Module

When the target partially overlaps with its surroundings, local features are lost, complicating target detection, recognition, and localization. To address this problem, the OAAM is designed to handle local aliasing and the loss of target features under intense background clutter and partial occlusion. As shown in Figure 3, the first part of the OAAM integrates output features from patches of different sizes to adapt to varying degrees of local overlap. Specifically, three 3 × 3 convolutions are applied serially to integrate network output features over effective patch sizes of 3 × 3, 5 × 5, and 7 × 7. Motivated by [36], our method uses depthwise separable convolutions [37] with residual connections to learn the importance of different channels. Although depthwise separable convolutions can learn channel importance with few parameters, they do not model the informational interconnections among channels. To remedy this, the outputs of the different depthwise convolutions are merged by a pointwise (1 × 1) convolution, as depicted in Figure 3. A two-layer fully connected network then aggregates information across channels, strengthening the inter-channel connections. The fully connected layers pass their outputs through an exponential function, expanding the value range from [0, 1] to [1, e]. This exponential normalization creates a stable mapping, improving the robustness of the detection output to positional errors caused by local overlap. Finally, the OAAM output is combined with the original features as attention, enhancing the model's ability to handle local overlap and the loss of target features under considerable background interference and partial occlusion.
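The description above leaves some implementation details open; the sketch below is one plausible PyTorch reading of the OAAM, assuming stacked depthwise 3 × 3 convolutions with residual links for the 3 × 3 / 5 × 5 / 7 × 7 effective patch sizes, a point-wise merge, and a two-layer fully connected branch with exponential normalization. The layer widths and reduction ratio are assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class OAAM(nn.Module):
    """Occlusion-aware attention sketch: multi-scale depthwise features are merged
    point-wise, and a two-layer FC branch produces channel weights mapped to [1, e]."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.dw1 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)  # depthwise 3x3
        self.dw2 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.dw3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw = nn.Conv2d(3 * channels, channels, 1)          # point-wise merge of the 3 scales
        self.fc = nn.Sequential(                                # two-layer FC, output in [0, 1]
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.dw1(x) + x        # ~3x3 effective patch, residual link
        f2 = self.dw2(f1) + f1      # ~5x5
        f3 = self.dw3(f2) + f2      # ~7x7
        merged = self.pw(torch.cat([f1, f2, f3], dim=1))
        w = self.fc(merged.mean(dim=(2, 3)))                    # per-channel descriptor
        w = torch.exp(w)[:, :, None, None]                      # exponential normalization to [1, e]
        return x * w                                            # re-weighted features
```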

3.4. Dynamic Head

Within the feature maps produced by the Feature Pyramid Network (FPN), deep feature maps of low resolution hold advanced semantic data, yet there is a risk of losing minor object details in the downsampling phase. On the other hand, high-resolution shallow feature maps preserve the spatial location and detail information of small objects but fail to capture complex semantic information. They also exhibit lower robustness and are more susceptible to local variations such as noise and lighting changes. Additionally, differences in imaging angles and specific task variances can result in various characteristics and constraints on targets, complicating the detection of smaller objects. The dynamic detection head, as depicted in Figure 4, tackles these issues by harmonizing and merging diverse attention processes across various levels of feature layers, spatial locations, and output pathways. This technique allows the model to focus dynamically on data pertinent to scale-space activities, thus effectively identifying the importance of semantic strata and the spatial specifics of the target. This technique markedly improves the ability of the detection head to represent features for targets across various scales while reducing computational load.
Given a feature tensor F ∈ R^{L × S × C}, where L denotes the number of scale levels of the feature pyramid, S = H × W the spatial positions, and C the channels, the dynamic detection head can be expressed as follows [38]:
W(F) = \pi_C\left(\pi_S\left(\pi_L(F) \cdot F\right) \cdot F\right) \cdot F
where π_L(·), π_S(·), and π_C(·) denote the three attention functions applied along the L, S, and C dimensions, respectively. The scale-aware attention mechanism π_L(·) dynamically computes weights for feature maps at different scales to enable dynamic feature fusion:
\pi_L(F) \cdot F = \sigma\left(f\left(\frac{1}{SC}\sum_{S,C} F\right)\right) \cdot F
where f(·) is a point-wise (1 × 1) convolution. The spatial-aware attention mechanism π_S(·) focuses on discriminative spatial regions that co-occur across feature levels of different scales. Specifically, deformable convolutions [39] are first used to make the attention sparse over spatial locations, and features from different levels are then aggregated at the same spatial positions:
\pi_S(F) \cdot F = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} \omega_{l,k} \cdot F\left(l;\, p_k + \Delta p_k;\, c\right) \cdot \Delta m_k
where K denotes the number of sparse spatial samples, Δp_k the spatial offsets learned by the network, and Δm_k the scale factors learned at the sampled locations (p_k + Δp_k) of the input feature level. The task-aware attention mechanism π_C(·) dynamically adjusts channel activations to accommodate different detection tasks:
\pi_C(F) \cdot F = \max\left(\alpha^1(F) \cdot F_C + \beta^1(F),\ \alpha^2(F) \cdot F_C + \beta^2(F)\right)
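In the original Dynamic Head [38], the coefficients (α¹, β¹, α², β²) are produced by a small hyper-function (global pooling followed by fully connected layers). For illustration, the sketch below applies the three attentions to a tensor reshaped to (B, L, S, C); the spatial-aware step of the original design uses deformable convolution, which is replaced here by a plain 3 × 3 convolution over the (L, S) grid purely to keep the example short, and the hyper-function is a simple pooled linear layer.

```python
import torch
import torch.nn as nn

class DynamicHeadBlock(nn.Module):
    """Simplified Dynamic Head block over a tensor of shape (B, L, S, C):
    scale-aware, (simplified) spatial-aware, then task-aware attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.scale_fc = nn.Linear(channels, 1)                      # f(.) of the scale-aware attention
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for deformable convolution
        self.task_fc = nn.Linear(channels, 4 * channels)            # produces (alpha1, beta1, alpha2, beta2)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, l, s, c = feat.shape
        # Scale-aware: average over S and C, then weight every pyramid level.
        level_w = torch.sigmoid(self.scale_fc(feat.mean(dim=2)))    # (B, L, 1)
        feat = feat * level_w.unsqueeze(2)
        # Spatial-aware (simplified): convolve over the (L, S) grid in channel-first layout.
        feat = self.spatial(feat.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        # Task-aware: channel-wise maximum of two learned affine responses.
        theta = self.task_fc(feat.mean(dim=(1, 2)))                 # (B, 4C)
        a1, b1, a2, b2 = [t.view(b, 1, 1, c) for t in theta.chunk(4, dim=-1)]
        return torch.maximum(a1 * feat + b1, a2 * feat + b2)

# Example: 3 pyramid levels, a 40x40 spatial grid flattened to S = 1600, 256 channels.
out = DynamicHeadBlock(256)(torch.randn(2, 3, 1600, 256))
```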

3.5. Normalized Wasserstein–Gaussian Distance

The IoU metric, commonly employed in general object detection, is highly sensitive for small infrared targets. Figure 5 illustrates this: as the predicted position of a small target shifts from 1 to 3 pixels, the IoU drops from 0.47 to 0.08. Alternative evaluation metrics are therefore needed to assess infrared small targets more reliably, and the Normalized Wasserstein–Gaussian Distance is introduced for this purpose. The Wasserstein distance can measure the similarity between distributions with little or no overlap and is insensitive to object scale. As a result, it alleviates the problems of indistinguishable positive and negative samples and the scarcity of positive samples during training on infrared small targets.
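The effect is easy to reproduce. The short sketch below computes the IoU of a hypothetical 6 × 6-pixel target after 1- and 3-pixel diagonal shifts; the exact numbers differ from Figure 5, which uses a different target size and shift, but the steep drop under a small positional error is the same.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

gt = (0, 0, 6, 6)                     # hypothetical 6 x 6 pixel target
print(iou(gt, (1, 1, 7, 7)))          # 1-pixel diagonal shift -> ~0.53
print(iou(gt, (3, 3, 9, 9)))          # 3-pixel diagonal shift -> ~0.14
```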
Owing to their distinctive intensity distribution, the bounding boxes of infrared small targets are modeled as two-dimensional Gaussian distributions:
f(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)}{2\pi\,\lvert\Sigma\rvert^{1/2}}
where x, μ, and Σ denote the two-dimensional coordinates, the mean vector, and the covariance matrix of the Gaussian distribution, respectively. For small targets, the two-dimensional distribution of pixel values approximately satisfies
\boldsymbol{\mu} = \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix}
The Wasserstein distance between two two-dimensional Gaussian distributions μ_1 = N(m_1, Σ_1) and μ_2 = N(m_2, Σ_2) can be defined as follows:
W_2^2(\mu_1, \mu_2) = \lVert m_1 - m_2 \rVert_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2}\right)^{1/2}\right)
This formula can be approximated as follows:
W_2^2(\mu_1, \mu_2) = \lVert m_1 - m_2 \rVert_2^2 + \left\lVert \Sigma_1^{1/2} - \Sigma_2^{1/2} \right\rVert_F^2
Then, the distance between the Gaussian distributions N_a and N_b corresponding to the bounding boxes A = (cx_a, cy_a, w_a, h_a) and B = (cx_b, cy_b, w_b, h_b) can be expressed as follows:
W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \left\lVert \left[cx_a,\ cy_a,\ \frac{w_a}{2},\ \frac{h_a}{2}\right]^{T} - \left[cx_b,\ cy_b,\ \frac{w_b}{2},\ \frac{h_b}{2}\right]^{T} \right\rVert_2^2
Applying the exponential form of the expression above bounds the value to the range 0–1, where C is a dataset-dependent constant [40]:
\mathrm{NWD}(\mathcal{N}_a, \mathcal{N}_b) = \exp\left(-\frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C}\right)
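A direct translation of the formulas above into code looks as follows; the boxes are given as (cx, cy, w, h), and the constant C is dataset dependent, so the value 12.8 used here is only an illustrative placeholder rather than the value tuned for our datasets.

```python
import math

def nwd(box_a, box_b, c: float = 12.8):
    """Normalized Wasserstein distance between two boxes (cx, cy, w, h), each modeled
    as a 2-D Gaussian with mean (cx, cy) and covariance diag(w^2/4, h^2/4)."""
    (cxa, cya, wa, ha), (cxb, cyb, wb, hb) = box_a, box_b
    # Closed-form squared 2-Wasserstein distance between the two Gaussians.
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + (wa - wb) ** 2 / 4.0 + (ha - hb) ** 2 / 4.0)
    return math.exp(-math.sqrt(w2_sq) / c)

# A 4 x 4-pixel target shifted by 1 pixel still scores high under the NWD,
# whereas its IoU has already dropped sharply (see Figure 5).
print(nwd((10.0, 10.0, 4.0, 4.0), (11.0, 11.0, 4.0, 4.0)))   # ~0.90
```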

4. Experiments and Analysis

This section uses the widely adopted benchmark dataset IRSTD-1k [41] together with our IRMT-UAV dataset to verify the effectiveness of the proposed method. We compare the proposed algorithm with other state-of-the-art (SOTA) algorithms and conduct extensive ablation experiments to evaluate it thoroughly.

4.1. Dataset and Evaluation Metrics

The IRSTD-1k dataset is designed specifically for identifying and dividing minor infrared objects. Offering 1000 authentic images, the IRSTD-1k dataset displays a range of target forms and sizes and dense, cluttered backdrops marked by “weak” and “small” features. Within this framework, the term “weak” refers to the low signal-to-noise ratio, faint contrast in the background, and weak infrared radiation intensity of the targets, while “small” indicates a lack of target pixels, making the collection of texture data during detection more complex.
We constructed the IRMT-UAV dataset using infrared images captured by a mid-wave infrared camera within the 3–5 μ m wavelength range with a resolution of 640 × 512 pixels and shooting distances ranging from 100 to 1200 m. This dataset primarily features small targets and consists of 3500 infrared images of drones collected over more than a year across different seasons, weather conditions, and complex backgrounds. The challenges posed by this dataset include varying target orientations, scales, and occlusions, with a high proportion of small targets, many of which are almost invisible to the naked eye. The dataset contains 4697 bounding box labels for drones. As shown in Figure 6, the scale variation of drone targets in our dataset is significant. With the increase in imaging distance, the normalized height and width of the targets can be less than 1%, meaning the targets occupy only a few pixels. Our dataset includes both surface targets occupying thousands of pixels and point targets occupying only a few pixels. Additionally, we analyzed the signal-to-noise ratio (SNR) distribution for 4697 drone instances. Due to changes in the background and imaging distance, the SNR at the drone targets varies significantly, with an average SNR of around 60.
To evaluate the effectiveness of our proposed algorithm relative to other advanced (SOTA) algorithms, measures including precision, recall, and mAP50 are employed. Precision denotes the proportion of correctly identified positives (TP) out of the predicted positive cases. The computation proceeds in the following manner:
\mathrm{Precision} = \frac{TP}{TP + FP}
Recall represents the proportion of actual positive samples that are correctly predicted as positive. The computation proceeds in this manner:
\mathrm{Recall} = \frac{TP}{TP + FN}
mAP50 represents the mean average precision of the model at an IoU threshold of 0.5. The computation proceeds in this manner:
\mathrm{mAP}_{50} = \frac{1}{C}\sum_{i=1}^{C} AP_{i,50}
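As a minimal illustration of these three definitions, the following snippet computes precision and recall from hypothetical TP/FP/FN counts and averages per-class AP50 values; the numbers are made up for the example and are not taken from our experiments.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def map50(ap50_per_class):
    """Mean of the per-class average precision values at an IoU threshold of 0.5."""
    return sum(ap50_per_class) / len(ap50_per_class)

print(precision_recall(tp=880, fp=70, fn=100))   # hypothetical counts -> (0.926..., 0.897...)
print(map50([0.917]))                            # a single (drone) class, so mAP50 = AP50
```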

4.2. Experiment Setup

The experiments are conducted on a Windows 11 system with a 13th-generation Intel(R) Core(TM) i9-13900K 3.00 GHz CPU, 128 GB of RAM, and two NVIDIA GeForce RTX 3090 GPUs. PyTorch 2.5.1 serves as the deep learning framework, together with CUDA 12.1 and Python 3.10. For the comparative experiments with other algorithms, YOLOv11m is used as the backbone, while the ablation experiments use RetinaNet, YOLOv8m, and YOLOv11m as backbones. The improved network is not initialized with pretrained weights. Training ends once the loss curve has stabilized. The input image size is set to 640 × 640, the batch size to 64, and the number of epochs to 300. The initial learning rate is set to 0.001. Table 2 lists the exact hyperparameters used for training.
To prevent model overfitting, we implemented a series of measures. Firstly, we employed data augmentation techniques such as scaling, translation, and Mosaic methods to enhance the validity and diversity of the dataset. Secondly, we applied L2 regularization to suppress excessively large weights in the model, thereby reducing the model’s dependency on these weights. Additionally, we adjusted training parameters to address potential instability and oscillation during the training process. Specifically, we set the batch size to 64, utilizing a larger batch size to reduce noise in the training data. We adopted the Stochastic Gradient Descent (SGD) algorithm to optimize the model weights, with an initial learning rate of 0.001, a momentum parameter of 0.9, and a weight decay coefficient of 0.0005. By introducing a regularization term in the loss function, we further mitigated the risk of overfitting.
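For reproducibility, the training configuration in Table 2 can be expressed roughly as the following call, assuming an Ultralytics-style training API; the model and dataset YAML file names are hypothetical, and the stock package does not include the iAFF, OAAM, Dy-Head, or NWD modifications described above.

```python
from ultralytics import YOLO

# Hypothetical model definition that registers the modified modules; not shipped with the package.
model = YOLO("yolo11m-iaff-oaam.yaml")

model.train(
    data="irmt_uav.yaml",      # hypothetical dataset description file
    epochs=300,
    imgsz=640,
    batch=64,
    optimizer="SGD",
    lr0=0.001,                 # initial learning rate
    momentum=0.937,            # momentum (Table 2)
    weight_decay=0.0005,       # L2 regularization
    warmup_epochs=3,
    warmup_momentum=0.8,
    warmup_bias_lr=0.1,
    patience=100,
    workers=16,
    pretrained=False,          # no pretrained weights at initialization
)
```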

4.3. Comparison with Other Methods

4.3.1. Quantitative Results

We utilized YOLOv11 as the backbone of our algorithm. YOLOv11 is the latest generation of object detection models introduced by Ultralytics. Compared to previous versions, YOLOv11 incorporates several improvements in architecture design and training processes, significantly enhancing the model's performance and flexibility. The YOLOv11 network primarily consists of the Backbone, Neck, and Head. The Backbone employs the C3k2 module and enhances feature representation through Bottleneck, SPPF, and C2PSA modules. The Neck, situated between the Backbone and Head, is responsible for feature fusion and enhancement. The Head is the decision-making component of the detection model, tasked with producing the final detection results. YOLOv11 offers model variants designated n, s, m, l, and x, which differ in parameter count and computational complexity. In this paper, we use the YOLOv11m model as the baseline, which comprises 409 layers, 20,114,688 parameters, 20,114,672 gradients, and a theoretical computational complexity of 68.5 GFLOPs.
We report a quantitative comparison of the performance of the proposed algorithm with other state-of-the-art algorithms on the IRSTD-1k and IRMT-UAV datasets, including EFLNet [42], MDvsFA [17], YOLOSR-IST [4], ACM [18], ISNet [41], ALCNet [33], and DNANet [43], as shown in Table 3. On the IRSTD-1k dataset, compared to the SOTA algorithms, our precision improved by 2.4%, recall increased by 1.9%, and mAP50 improved by 1.4%.
Additionally, we carried out comparative studies on our self-built IRMT-UAV dataset, which covers challenging situations including intensely cluttered backgrounds, dim and small targets, and partial occlusion. Our method, which applies multi-scale channel attention to dim, small targets against complex backgrounds, yields a notable improvement in performance: on the IRMT-UAV dataset, precision improved by 3.7%, recall increased by 4.3%, and mAP50 improved by 4.9%.

4.3.2. Visual Results

Our presentation includes a variety of application situations featuring variations in UAV target scales, intense clutter, and partial occlusion, accompanied by visual outcomes from diverse techniques, performed on the IRSTD-1k and IRMT-UAV datasets, as depicted in Figure 7. GT indicates the annotated true position of the drone in the captured images, with the top-right corner showing a detailed local view of the drone target. It can be observed that the scale of the drone changes significantly with variations in imaging distance. We collected drone images against various backgrounds, such as clouds, buildings, rural vegetation, and urban backgrounds. In these settings, the wide range of grayscale variations in the images resulted in strong background clutter and very weak visual features of the drone targets, further increasing the difficulty for the detection algorithm. Circles of varied hues are used to emphasize accurately identified targets, incorrect alerts, and undetected targets. Orange circles represent correctly detected targets, purple circles indicate false alarms, and blue circles denote missed detections. Initially, we introduced a juxtaposition of our suggested algorithm against alternative techniques amidst intense clutter environments. For medium-sized targets with scales of tens of pixels, all algorithms successfully detected the targets, although a few algorithms exhibited false alarms. With the escalation of the clutter backdrop and the diminishing scale of the target, the target becomes more obscured in the background, leading to other algorithms’ inability to accurately identify the target. Conversely, the algorithm we suggest, utilizing a multi-scale channel attention approach, successfully integrates features across different scales, enhancing the detection accuracy for minor targets amidst intricate backgrounds. The subsequent comparative experiments indicate that other algorithms experience various missed detection issues in scenes involving local feature loss, whereas our algorithm, based on the OAAM and dynamic detection head, strengthens the connections between channel relationships in partially occluded regions, achieving better detection performance across all scenarios.

4.4. Ablation Experiments

For a deeper assessment of how the suggested modules affect the algorithm’s efficiency, RetinaNet, YOLOv8m, and YOLOv11m were chosen as the core networks, and we executed evaluation activities on the dataset we proposed. Table 4 shows the changes in various detection metrics after adding different modules to the algorithms. Our suggested modules are observed to markedly enhance performance in all three baseline scenarios, with the YOLOv11m algorithm showing the most significant enhancement, as shown in Figure 8. Employing a multi-scale channel attention system, the iAFF module focuses on both large targets distributed globally and small targets distributed locally, leading to a notable enhancement in precision (an increase of 6.8%). The OAAM enhances the connections between channels in locally overlapping regions, making the detection output more tolerant of positional prediction deviations in such scenarios, leading to the most significant improvement in recall (an increase of 4.2%). Utilizing two-dimensional Gaussian models, the NWD component replicates tiny target bounding boxes and uses the normalized Wasserstein distance to assess the similarity between real and predicted boundaries, addressing the variability in the convergence of IoU-based loss functions during training for smaller targets. Utilizing spatial and scale attention mechanisms, the dynamic detection head markedly improves its ability to represent features for targets across various scales, leading to a notable enhancement in mAP50 (an increase of 6.1%).

4.4.1. The Impact of NWD on Network

To validate the effectiveness of NWD for small object bounding box regression, we conducted the following ablation study: keeping other parts of the model unchanged, we replaced the localization loss function with the IoU, GIoU, DIoU, CIoU, and NWD, respectively. Table 5 shows the changes in the algorithm’s detection metrics when using different localization loss functions. Findings show that employing NWD as the loss function markedly enhances the algorithm’s accuracy, retrieval, and mAP50 relative to loss functions based on the IoU.
Additionally, we found in our experiments that using NWD as the loss function resulted in an AP of 17.8% at 20 epochs and 19.7% at 40 epochs, whereas using the CIoU as the loss function resulted in an AP of 11.1% at 20 epochs and 12.6% at 40 epochs. Utilizing NWD as the loss function resulted in stabilization of the loss, leading to the model’s convergence at approximately 200 epochs. Conversely, employing the CIoU as the loss function resulted in the model reaching convergence only after 300 epochs. This indicates that the loss function, based on NWD, adeptly grasps the spatial variations among bounding boxes, enhancing its resilience to alterations in bounding box sizes, thus boosting the precision of detecting small objects.

4.4.2. The Impact of iAFF on the Network

Comparative analyses were conducted to assess iAFF's impact on cross-scale feature fusion, employing both the original feature fusion (direct concatenation) and iAFF for multi-scale feature fusion in the FPN-PAN, as illustrated in Figure 9. The results indicate that simple channel concatenation allows large-scale background features to obscure small-scale target features, further complicating small-target detection and recognition. Conversely, our approach aggregates contextual information across multiple scales along the channel dimension, highlighting both large, globally distributed targets and small, locally distributed targets, facilitating target detection under significant scale variation.

4.4.3. The Impact of the OAAM and Dy-Head on the Network

In order to examine how the OAAM and Dy-Head affect the recall rate in the detection models, we executed four series of comparative tests by progressively incorporating these two modules into the foundational model. The first group used only the basic detection model. The second group added the OAAM, the third group added the Dy-Head, and the fourth group added both the OAAM and the Dy-Head. The selected application scenario included the detection of faint targets in semi-occluded scenes, and the results are shown in Figure 10. Findings from the experiments reveal that, in the chosen situations, the fundamental model invariably showed instances of undetected events. The OAAM, through feature learning in locally occluded regions, demonstrated a higher tolerance for positional deviations in targets with missing local features, enabling the detection of targets in semi-occluded scenes. However, as the target scale further decreased, missed detections also occurred. The Dy-Head was able to accurately detect targets with significant scale variation but also showed a certain degree of missed detection. When both the Dy-Head and OAAM were incorporated into the algorithm, the model achieved significantly improved detection performance across all scenarios, leading to a substantial increase in recall rate.
Our model currently operates at 70.8 GFLOPS, with an inference time of 12.8 ms on a single NVIDIA RTX 3090 GPU. Including preprocessing and postprocessing steps, the overall algorithm achieves a frame rate of 48 FPS. These metrics reflect the computational demands of our approach, which prioritizes enhancing detection accuracy and reducing false alarm rates in complex infrared scenarios. Our primary focus has been on improving the model’s performance in challenging environments rather than optimizing for lightweight deployment. We acknowledge that reporting on model speed and resource usage is valuable for assessing practical applicability and robustness. We plan to explore avenues for reducing the theoretical computational load of our model to achieve frame rates above 50 FPS. Additionally, we are considering deploying our model on embedded hardware systems to further evaluate its efficiency and suitability for real-world applications. Our current research involves infrared imaging of drones maneuvering through occluding elements like branches and mesh structures at long distances. The difficulty of precisely controlling the pixel-level movement of drones in the image makes it challenging to establish controlled occlusion percentages. We acknowledge the importance of studying the OAAM under such conditions and are actively exploring ways to construct datasets that allow for precise control over occlusion percentages in future work.

5. Conclusions

We propose a novel neural network design that strengthens feature learning for infrared small targets, extracting multi-scale features and performing bounding box regression for UAV targets against complex backgrounds. The iAFF module aggregates multi-scale contextual information along the channel dimension, attending to both large, globally distributed targets and small, locally distributed targets. The proposed OAAM strengthens the connections between channels in partially occluded regions, enabling the model to better handle local occlusion and the loss of target features under partial occlusion. The scale and spatial attention of the dynamic detection head allow the network to learn the importance of each semantic layer and to focus on the shallow features of infrared small targets. Replacing the localization loss with the NWD yields higher-quality positive and negative samples and effectively resolves the sensitivity of the IoU metric for infrared small targets. Experiments on the public IRSTD-1k dataset and our custom IRMT-UAV dataset show that our technique surpasses state-of-the-art (SOTA) methods, exhibiting strong precision and robustness in diverse application scenarios, including intense clutter, small-scale targets, and partial occlusion.
Our proposed algorithm can identify and locate small targets with lower false-positive and false-negative rates and higher accuracy against complex backgrounds and in adverse conditions, such as conditions with low contrast and high noise. Enhancing the detection capability of small targets in dynamic scenes is crucial in many application domains, such as drone surveillance and autonomous driving. Our next goal is to reduce the model’s parameter size and computational complexity while maintaining the current algorithm’s performance, thereby lowering hardware costs and energy consumption, which is critical for spaceborne and airborne devices.

Author Contributions

Conceptualization: Z.L. and S.Z.; methodology: S.Z.; software: S.Z.; validation: S.Z.; formal analysis: S.Z.; resources: W.D., Y.L., N.Z. and Z.L.; data curation: W.D. and S.Z.; writing—original draft preparation: S.Z.; writing—review and editing: W.D., Y.L. and S.Z.; supervision: Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because they are confidential.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FPN	Feature Pyramid Network
PAN	Path Aggregation Network
iAFF	Iterative Attentional Feature Fusion
OAAM	Occlusion-Aware Attention Module
NWD	Normalized Wasserstein–Gaussian Distance
MS-CAM	Multi-Scale Channel Attention Mechanism

References

  1. Wang, K.; Du, S.; Liu, C.; Cao, Z. Interior attention-aware network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5002013. [Google Scholar] [CrossRef]
  2. Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Infrared small target detection based on the weighted strengthened local contrast measure. IEEE Geosci. Remote Sens. Lett. 2020, 18, 1670–1674. [Google Scholar] [CrossRef]
  3. Sun, Y.; Yang, J.; An, W. Infrared dim and small target detection via multiple subspace learning and spatial-temporal patch-tensor model. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3737–3752. [Google Scholar] [CrossRef]
  4. Li, R.; Shen, Y. YOLOSR-IST: A deep learning method for small target detection in infrared remote sensing images based on super-resolution and YOLO. Signal Process. 2023, 208, 108962. [Google Scholar] [CrossRef]
  5. Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. Hcf-net: Hierarchical context fusion network for infrared small object detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
  6. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped Convolution and Scale-based Dynamic Loss for Infrared Small Target Detection. arXiv 2024, arXiv:2412.16986. [Google Scholar] [CrossRef]
  7. Zhao, M.; Li, W.; Li, L.; Hu, J.; Ma, P.; Tao, R. Single-frame infrared small-target detection: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 87–119. [Google Scholar] [CrossRef]
  8. Qian, X.; Zhang, N.; Wang, W. Smooth giou loss for oriented object detection in remote sensing images. Remote Sens. 2023, 15, 1259. [Google Scholar] [CrossRef]
  9. Wang, M.; Fu, B.; Fan, J.; Wang, Y.; Zhang, L.; Xia, C. Sweet potato leaf detection in a natural scene based on faster R-CNN with a visual attention mechanism and DIoU-NMS. Ecol. Inf. 2023, 73, 101931. [Google Scholar] [CrossRef]
  10. Liu, X.-B.; Yang, X.-Z.; Yang, C.; Zhang, S.-T. Object detection method based on CIoU improved bounding box loss function. Chin. J. Liq. Cryst. Displays 2023, 38, 656–665. [Google Scholar] [CrossRef]
  11. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the Signal and Data Processing of Small Targets 1999, Denver, CO, USA, 4 October 1999; Volume 3809, pp. 74–83. [Google Scholar]
  12. Rivest, J.F.; Fortin, R. Detection of dim targets in digital infrared imagery by morphological image processing. Opt. Eng. 1996, 35, 1886–1893. [Google Scholar] [CrossRef]
  13. Han, J.; Ma, Y.; Zhou, B.; Fan, F.; Liang, K.; Fang, Y. A robust infrared small target detection algorithm based on human visual system. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2168–2172. [Google Scholar]
  14. Han, J.; Liang, K.; Zhou, B.; Zhu, X.; Zhao, J.; Zhao, L. Infrared small target detection utilizing the multiscale relative local contrast measure. IEEE Geosci. Remote Sens. Lett. 2018, 15, 612–616. [Google Scholar] [CrossRef]
  15. Zhang, L.; Peng, L.; Zhang, T.; Cao, S.; Peng, Z. Infrared small target detection via non-convex rank approximation minimization joint l 2, 1 norm. Remote Sens. 2018, 10, 1821. [Google Scholar] [CrossRef]
  16. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch-image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009. [Google Scholar] [CrossRef]
  17. Wang, H.; Zhou, L.; Wang, L. Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8509–8518. [Google Scholar]
  18. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 950–959. [Google Scholar]
  19. Hou, Q.; Wang, Z.; Tan, F.; Zhao, Y.; Zheng, H.; Zhang, W. RISTDnet: Robust infrared small target detection network. IEEE Geosci. Remote Sens. Lett. 2021, 19, 7000805. [Google Scholar] [CrossRef]
  20. Chen, G.; Wang, W.; Tan, S. Irstformer: A hierarchical vision transformer for infrared small target detection. Remote Sens. 2022, 14, 3258. [Google Scholar] [CrossRef]
  21. Dai, Y.; Li, X.; Zhou, F.; Qian, Y.; Chen, Y.; Yang, J. One-stage cascade refinement networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000917. [Google Scholar] [CrossRef]
  22. Yao, S.; Zhu, Q.; Zhang, T.; Cui, W.; Yan, P. Infrared image small-target detection based on improved FCOS and spatio-temporal features. Electronics 2022, 11, 933. [Google Scholar] [CrossRef]
  23. Li, R.; An, W.; Xiao, C.; Li, B.; Wang, Y.; Li, M.; Guo, Y. Direction-coded temporal U-shape module for multiframe infrared small target detection. IEEE Trans. Neural Netw. Learn. Syst. 2023, 36, 555–568. [Google Scholar] [CrossRef]
  24. Ying, X.; Liu, L.; Wang, Y.; Li, R.; Chen, N.; Lin, Z.; Sheng, W.; Zhou, S. Mapping degeneration meets label evolution: Learning infrared small target detection with single point supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15528–15538. [Google Scholar]
  25. Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000513. [Google Scholar] [CrossRef]
  26. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376. [Google Scholar] [CrossRef] [PubMed]
  27. Xu, R.; Wang, C.; Zhang, J.; Xu, S.; Meng, W.; Zhang, X. RSSFormer: Foreground saliency enhancement for remote sensing land-cover segmentation. IEEE Trans. Image Process. 2023, 32, 1052–1064. [Google Scholar] [CrossRef]
  28. Ju, M.; Luo, J.; Liu, G.; Luo, H. ISTDet: An efficient end-to-end neural network for infrared small target detection. Infrared Phys. Technol. 2021, 114, 103659. [Google Scholar] [CrossRef]
  29. Du, J.; Lu, H.; Zhang, L.; Hu, M.; Chen, S.; Deng, Y.; Shen, X.; Zhang, Y. A spatial-temporal feature-based detection framework for infrared dim small target. IEEE Trans. Geosci. Remote Sens. 2021, 60, 3000412. [Google Scholar] [CrossRef]
  30. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 16794–16805. [Google Scholar]
  31. Gong, Y.; Yu, X.; Ding, Y.; Peng, X.; Zhao, J.; Han, Z. Effective fusion factor in FPN for tiny object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1160–1168. [Google Scholar]
  32. Hu, M.; Li, Y.; Fang, L.; Wang, S. A2-FPN: Attention aggregation based feature pyramid network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15343–15352. [Google Scholar]
  33. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3560–3569. [Google Scholar]
  34. Sunkara, R.; Luo, T. YOGA: Deep object detection in the wild with lightweight feature learning and multiscale attention. Pattern Recognit. 2023, 139, 109451. [Google Scholar] [CrossRef]
  35. Zhang, K.; Wang, W.; Lv, Z.; Feng, J.; Li, H.; Zhang, C. LKDPNet: Large-Kernel Depthwise-Pointwise convolution neural network in estimating coal ash content via data augmentation. Appl. Soft Comput. 2023, 144, 110471. [Google Scholar] [CrossRef]
  36. Yu, Z.; Huang, H.; Chen, W.; Su, Y.; Liu, Y.; Wang, X. Yolo-facev2: A scale and occlusion aware face detector. Pattern Recognit. 2024, 155, 110714. [Google Scholar] [CrossRef]
  37. Schuler, J.P.S.; Romani, S.; Abdel-Nasser, M.; Rashwan, H.; Puig, D. Grouped pointwise convolutions reduce parameters in convolutional neural networks. Mendel 2022, 28, 23–31. [Google Scholar] [CrossRef]
  38. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382. [Google Scholar]
  39. Chen, F.; Wu, F.; Xu, J.; Gao, G.; Ge, Q.; Jing, X.Y. Adaptive deformable convolutional network. Neurocomputing 2021, 453, 853–864. [Google Scholar] [CrossRef]
  40. Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389. [Google Scholar]
  41. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18– 24 June 2022; pp. 877–886. [Google Scholar]
  42. Yang, B.; Zhang, X.; Zhang, J.; Luo, J.; Zhou, M.; Pi, Y. EFLNet: Enhancing feature learning network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5906511. [Google Scholar] [CrossRef]
  43. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Schematic diagram of the overall algorithm process.
Figure 2. Schematic diagram of the iAFF. Application considerations for the MS-CAM block: the solid arrow path on the left logically represents the application of the weight M(·), while the dashed arrow path on the right logically represents the application of the weight (1−M(·)).
Figure 3. Schematic diagram of the OAAM.
Figure 4. Schematic diagram of the Dynamic Head.
Figure 5. The motivation for introducing the Normalized Wasserstein–Gaussian Distance: (a) sensitivity analysis of the IoU for tiny-scale objects, where A represents the ground-truth bounding box, and B and C represent predicted bounding boxes with deviations of 1 pixel and 3 pixels, respectively; (b) three-dimensional intensity distribution of infrared small targets.
Figure 6. Description of the IRMT-UAV dataset: (a) The distribution of height and width of bounding boxes for drone targets. The dimensions have been normalized, meaning that the height and width of the targets are divided by the height and width of the image, respectively. (b) The distribution of the signal-to-noise ratio (SNR) in the dataset. We have compiled the distribution statistics of the SNR for 4697 instances in the dataset, where density represents the probability density.
Figure 7. Comparison of the performance of different algorithms for target detection in highly cluttered backgrounds. The blue patch in the upper right corner of the first row represents a magnified detail of the drone target in the local image. Orange circles represent correctly detected targets, purple circles indicate false alarms, and blue circles denote missed detections.
Figure 8. The variation in detection metrics after inserting different modules into YOLOv11.
Figure 9. The feature maps generated by different feature fusion methods: “baseline” refers to the direct concatenation of feature maps from different scales, while the last column indicates cross-scale feature fusion using the iAFF.
Figure 10. The study investigates the impact of the OAAM and Dy-Head on recall performance. “GT” represents the ground truth labeled images. “Baseline” refers to the model without using either the OAAM or the Dy-Head. “OAAM” indicates the model with only the OAAM applied. “Dy-Head” refers to the model with only the Dy-Head module applied. “OAAM&Dy-Head” represents the model with both modules applied. Orange circles represent correctly detected targets, purple circles indicate false alarms, and blue circles denote missed detections.
Table 1. Related work list.
Category | Mathematical Models Based | Deep Learning Based
Box regression | MMF [11] | MDvsFA [17]
 | MIP [12] | ACM [18]
 | IRHVS [13] | ISTDNet [19]
 | RLCM [14] | IRSTFormer [20]
 | NRAM [15] | OCRNet [21]
 | IPM [16] | FCOS [22]
Segmentation | | IFEA [23]
 | | DTUNet [24]
 | | MDLE [25]
 | | RDANet [26]
 | | FSENet [27]
Table 2. The specific hyperparameter information of the model.
Hyperparameter | Value
Learning Rate | 0.001
Momentum | 0.937
Epochs | 300
Batch Size | 64
Image Size | 640 × 640
Optimizer | SGD
Weight Decay | 0.0005
Patience | 100
Workers | 16
Pretrained | False
Warmup Epochs | 3
Warmup Momentum | 0.8
Warmup Bias Learning Rate | 0.1
Table 3. A numerical analysis comparing the performance of various techniques on the IRMT-UAV and IRSTD-1k data collections.
Method | IRMT-UAV Precision | IRMT-UAV Recall | IRMT-UAV mAP50 | IRSTD-1k Precision | IRSTD-1k Recall | IRSTD-1k mAP50
MDvsFA | 60.8% | 50.7% | 59.7% | 55.0% | 48.3% | 47.5%
YOLOSR-IST | 56.8% | 68.4% | 41.5% | 41.5% | 47.0% | 44.1%
ACM | 76.5% | 75.2% | 74.7% | 67.9% | 60.5% | 64.0%
ISNet | 82.0% | 84.7% | 83.4% | 71.8% | 64.1% | 65.8%
ALCNet | 84.8% | 78.0% | 81.3% | 83.9% | 65.6% | 72.9%
EFLNet | 88.2% | 84.7% | 86.8% | 85.4% | 70.8% | 73.8%
DNANet | 89.1% | 85.2% | 87.9% | 84.3% | 72.1% | 74.4%
Ours | 92.8% | 89.5% | 91.7% | 86.7% | 74.0% | 75.8%
Table 4. Ablation study on different modules. The baseline models use the backbones of RetinaNet, YOLOv8m, and YOLOv11m, respectively. A check mark indicates that the module is used, while an empty cell means the module is not used.
Backbone | iAFF | OAAM | Dy-Head | NWD | Precision | Recall | mAP50
YOLOv11m | | | | | 81.9% | 69.5% | 74.4%
 | | | | | 88.7% | 74.5% | 82.3%
 | | | | | 89.7% | 82.3% | 84.5%
 | | | | | 91.3% | 88.1% | 85.6%
 | | | | | 92.8% | 89.5% | 91.7%
YOLOv8m | | | | | 79.5% | 68.2% | 73.7%
 | | | | | 86.1% | 74.1% | 81.9%
 | | | | | 87.3% | 80.9% | 83.7%
 | | | | | 88.6% | 82.8% | 84.9%
 | | | | | 91.7% | 87.4% | 89.5%
RetinaNet | | | | | 78.4% | 67.8% | 72.6%
 | | | | | 84.3% | 76.3% | 78.9%
 | | | | | 85.9% | 78.7% | 80.6%
 | | | | | 86.3% | 80.5% | 81.4%
 | | | | | 90.1% | 87.4% | 88.9%
Table 5. Quantitative analysis comparing various loss functions.
Loss | IRMT-UAV Precision | IRMT-UAV Recall | IRMT-UAV mAP50 | IRSTD-1k Precision | IRSTD-1k Recall | IRSTD-1k mAP50
IoU | 87.8% | 86.4% | 87.2% | 86.3% | 83.7% | 85.4%
GIoU | 87.9% | 86.3% | 88.3% | 86.7% | 84.2% | 85.9%
DIoU | 88.9% | 87.2% | 89.3% | 87.9% | 85.3% | 86.2%
CIoU | 90.1% | 88.4% | 89.1% | 88.4% | 85.2% | 86.9%
NWD | 92.8% | 89.5% | 91.7% | 90.3% | 89.1% | 88.2%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

