Article

Global and Local Context-Aware Detection for Infrared Small UAV Targets

College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Drones 2025, 9(11), 804; https://doi.org/10.3390/drones9110804
Submission received: 4 September 2025 / Revised: 7 November 2025 / Accepted: 10 November 2025 / Published: 18 November 2025

Highlights

What are the main findings?
  • The feature extraction backbone is re-designed for infrared unmanned aerial vehicle targets.
  • The designed algorithm demonstrates superior performance on two datasets with different target scales.
What are the implications of the main findings?
  • The method is suitable for detecting targets of different scales, including small unmanned aircraft.
  • A single-stage framework for infrared small target detection is proposed.

Abstract

The widespread adoption of small unmanned aerial vehicles poses increasing challenges to public safety. Compared with visible-light sensors, infrared imaging offers excellent nighttime observation capabilities and strong robustness against interference, enabling all-weather UAV surveillance. However, detecting small UAVs in infrared imagery remains challenging due to low target contrast and weak texture features. To address these challenges, we propose IUAV-YOLO, a context-aware detection framework built upon YOLOv10. Specifically, inspired by the receptive field mechanism in human vision, the backbone network is re-designed with a multi-branch structure to improve sensitivity to small targets. Additionally, a Pyramid Global Attention Module is incorporated to strengthen target–background associations, while a Spatial Context-Aware Module is developed to integrate spatial contextual cues and enhance target-background discrimination. Extensive experiments demonstrate that, compared with the baseline model, IUAV-YOLO achieves performance gains of 4.3% in AP0.5 and 2.6% in AP0.5–0.95 on the self-built IRSUAV dataset, with a reduction of 0.7M parameters. On the public SIRST-UAVB dataset, IUAV-YOLO attains improvements of 29.7% in AP0.5 and 16.3% in AP0.5–0.95. Compared with other advanced object detection algorithms, IUAV-YOLO demonstrates a superior accuracy-efficiency trade-off, highlighting its potential for practical infrared UAV surveillance applications.

1. Introduction

With the rapid advancement of information technology and the continuous reduction of manufacturing costs, low-altitude unmanned platforms—represented by small-scale unmanned aerial vehicles (UAVs)—have experienced swift development. Owing to their ease of operation, low cost, and high efficiency, they have been widely applied in diverse fields such as news reporting, geographic surveying, search and rescue, equipment inspection, fire patrol, and environmental monitoring. However, the widespread adoption of small UAVs has also raised significant challenges in terms of security and privacy. For instance, non-cooperative UAVs may infringe upon citizens’ privacy or cause secondary harm due to improper operation; “illegal flights” may threaten aviation safety around airports; and malicious actors may exploit UAVs for unlawful surveillance or destructive activities. Consequently, effective detection of small UAVs has become increasingly critical for public safety.
Infrared imaging detection systems, benefiting from strong anti-interference capability and all-weather target recognition, are gradually becoming a core component of low-altitude detection systems. Nevertheless, in complex environments, detecting small infrared UAV targets remains challenging due to weakened target features and high similarity between targets and background. Previous studies have treated infrared small unmanned aerial vehicle detection as a branch of infrared small target detection [1,2]. Research on infrared small targets has mainly focused on two categories of solutions: model-driven heuristic methodologies and deep learning-based data-driven approaches. Heuristic strategies typically incorporate thresholding, filtering, and contrast optimization techniques. Among threshold-based segmentation techniques, fixed-threshold approaches process the entire image with a single predefined grayscale cutoff to differentiate targets from the background. In multi-threshold segmentation, multiple gray-level cutoff values are employed to segment objects across distinct intensity ranges within the image, assigning uniform intensity values to pixels within each partitioned region [3]. In adaptive thresholding, threshold parameters are dynamically calculated based on local statistical characteristics within image subregions, as exemplified by Otsu’s inter-class variance maximization method [4]. Regarding filtering techniques, directional filtering applies mean or median operations along horizontal, vertical, and diagonal orientations, with subsequent maximum value selection for background suppression, as demonstrated by Deshpande et al. [5]. Advanced filtering approaches include bilateral filtering, which combines spatial proximity with intensity similarity for edge-preserving image enhancement and improved target-background differentiation [6]. Contrast optimization methods, such as the multi-scale contrast enhancement framework proposed by Wei et al. [7], utilize variable-sized image blocks to extract scale-adaptive grayscale features and suit a variety of object detection scenarios. Dai et al. [8] developed a weighted image block framework that enhances the sparse characteristics of diminutive targets while suppressing background noise via prioritized weighting. However, such conventional approaches rely extensively on predefined assumptions and manually engineered features, potentially leading to elevated computational demands. Beyond these limitations (inflexible adaptation and suboptimal clutter performance), heuristic techniques necessitate substantial manual intervention in feature design. Data-driven detection paradigms mitigate reliance on manual feature engineering by employing machine learning algorithms that autonomously derive feature representations from datasets. Among these, deep learning, grounded in artificial neural networks, emerges as a prominent technique for hierarchical feature extraction and representation, employing multi-layered architectures to automatically uncover latent patterns within data. Contemporary object detection methodologies have demonstrated enhanced robustness and generalization capabilities compared with conventional approaches, as evidenced by comprehensive experimental data [9].
These detection frameworks primarily exist in two architectural paradigms: dual-phase detectors exemplified by Faster R-CNN [10] and its subsequent variants [11,12], and unified detection systems including YOLO [13] and SSD [14]. The former architecture operates through sequential region proposal generation followed by precise classification and bounding box regression, while the latter paradigm integrates localization and classification into a single network pass. This streamlined approach offers distinct advantages in computational efficiency and real-time processing capabilities, particularly beneficial for time-sensitive applications.
To address the challenges posed by small target size, sparse features, and low target-background contrast in long-range UAV infrared imaging, this paper proposes IUAV-YOLO (Infrared Unmanned Aerial Vehicle-You Only Look Once), an infrared weak and small UAV target detection algorithm built upon the YOLOv10 framework [15] and enhanced with global-local contextual awareness. First, in long-range imaging scenarios, UAV targets typically occupy only a few pixels, causing their features to be easily lost during convolution and downsampling. To mitigate this issue, we re-design the backbone network and introduce a Feature Enhancement Module (FEM). By incorporating a multi-branch architecture and dilated convolutions, FEM expands the receptive field while reducing parameter overhead, thereby strengthening local contextual representation and significantly improving the fidelity of small-target features. Second, UAV targets inevitably exhibit scale variability due to differences in flight altitude and posture, leading to substantial fluctuations in their apparent size within the image. To accommodate this effect, we design a Pyramid Global Attention (PGA) module in the deeper multi-scale feature extraction stage. Built upon the standard SPPF structure, PGA incorporates global average pooling and global max pooling to capture background-aware contextual cues and combines them with attention mechanisms for adaptive feature weighting. This enhances the network’s capability to jointly model global and local context, enabling robust detection across multiple target scales. Finally, complex backgrounds often introduce false positives and pseudo-target interference. To address this, we integrate a Spatial Context Awareness Module (SCAM) into the network neck. By jointly modeling channel-wise and spatial dependencies, SCAM explicitly suppresses irrelevant background responses while highlighting target saliency, thereby improving the discriminability between UAV targets and cluttered environments.
The primary innovations presented in this study can be outlined as follows:
(1) Development of an advanced infrared detection framework for small UAVs in cluttered environments. The backbone architecture integrates re-designed cascaded multi-branch dilated convolutions combined with conventional convolution layers, enabling the acquisition of more comprehensive local characteristics for improved infrared UAV identification.
(2) Introduction of an innovative PGA architecture for enhanced feature representation. The spatial pyramid pooling fusion module strategically combines global maximum and average pooling operations, integrating superior background and boundary contextual information into feature maps. A refined attention mechanism is implemented to selectively emphasize multi-scale features, optimizing both target positioning precision and classification accuracy.
(3) Validation through comprehensive experimental assessments, which demonstrate superior performance in complex infrared environments with improved detection accuracy. The proposed solution achieves significant gains over existing state-of-the-art approaches in small target recognition tasks.

2. Related Work

2.1. Segmentation-Based Infrared Small Target Detection Algorithms

Segmentation-based infrared small target detection methods treat the original image as pixels, performing binary classification for each pixel to distinguish between foreground and background, outputting a binary segmentation result, and locating the target based on a threshold. Compared with model-driven approaches, such algorithms, through pixel-level annotations, can precisely capture the morphology and boundaries of small targets at the spatial level, thereby achieving higher detection accuracy. Early segmentation networks primarily relied on the U-Net [16] framework and its variants [17,18,19]. The RISTDnet framework [20] developed by Hou et al. integrates manually designed features with deep convolutional representations from CNNs to establish a comprehensive feature mapping architecture. This system learns to correlate hybrid feature representations with target probability estimates, enabling detection through threshold-based selection on derived likelihood distributions. Subsequent improvements emerged through ISTDU-Net [21], which employs feature grouping mechanisms to prioritize small-target characteristics, thereby optimizing the transformation process of infrared inputs into probabilistic outputs. Dai et al. [22] enhanced the ability to extract target information through asymmetric network structure design and attention to target contrast. Zhang et al. [23] employed shape modeling techniques alongside super-resolution approaches to better identify and localize sparsely distributed targets. Li et al. [24] developed a densely interconnected framework to maintain consistent feature responses during target detection. Wang et al. [25] created MDvsFA, a GAN-based solution that optimizes the equilibrium between undetected targets and false positives during segmentation. While these advancements addressed certain challenges, the inherent limitations of sparse input data and low-contrast characteristics of small targets continue to hinder effective feature extraction. Consequently, ISNet [26] and MSAFFNet [27] incorporated boundary refinement techniques to enhance segmentation accuracy. Parallel developments have seen Transformer architectures adapted for IRSTD applications, such as IAANet [28], which employs Transformer encoders to analyze attention patterns between target areas and background regions. Yang et al. [29] subsequently introduced the PBT methodology. The framework employs an asymmetrical encoder-decoder design incorporating a gradually adaptive background perception strategy through self-attention mechanisms. Hu et al. [30] proposed the Dynamic Attention Transformer Network (DATransNet), which utilizes a Dynamic Attention Transformer (DATrans) mechanism to simulate central difference convolution for gradient feature extraction, aiming to capture and preserve critical details of small targets. Additionally, a Global Feature Extraction Module (GFEM) is introduced to prevent the network from focusing solely on local details while neglecting global information. This design enhances the model’s ability to represent both local and global features, thereby improving its robustness in complex backgrounds and multi-target scenarios.
Although the detection accuracy of the aforementioned methods is significantly superior to that of model-driven approaches, modeling infrared small UAV target detection as a semantic segmentation task still presents substantial challenges. Firstly, there is a lack of datasets and difficulties in annotation. Taking mainstream segmentation datasets as an example, most existing public data samples contain a large number of commercial aircraft, ships, floating objects, and other targets, with a relatively small number of samples for low, slow, and small UAV targets. Moreover, the pixel-level semantic segmentation task significantly increases the cost of manual annotation. Additionally, pixel-level segmentation requires high-resolution feature maps, and high-precision model architectures are complex and computationally expensive [31]. Secondly, segmentation algorithms tend to have low real-time performance and high computational resource demands. While segmentation models generally achieve good accuracy, their structures are usually complex, with a large number of parameters, making them inadequate for tasks requiring real-time responses or where computational resources may be limited.

2.2. Bounding Box-Based Infrared Small Target Detection Algorithms

Early researchers attempted to fine-tune general-purpose object detection frameworks to meet the requirements of this task. McIntosh et al. adjusted the parameters of Faster R-CNN [12] and YOLOv3 [32], improving detection performance by optimizing the input feature vectors. However, such simple model modifications often suffer from low recognition rates and high false alarm rates when dealing with small target detection. Therefore, subsequent research gradually shifted towards combining the characteristics of small targets and making more targeted improvements to the network structure. Drawing on the design concept of residual networks, Zhang et al. [33] proposed a multi-scale lateral connection structure from the perspective of deep and shallow layer feature fusion and introduced a predefined training strategy for small target prior boxes, achieving good results in the real-time detection of infrared unmanned aerial vehicles. Inspired by the candidate region mechanism, Ou et al. [34] proposed an infrared unmanned aerial vehicle detection method combining shape prior segmentation and multi-scale feature aggregation, but it is still prone to false detections in high-contrast areas of complex backgrounds. Misbah et al. [35], building on an improved YOLOv5 framework, optimized the feature extraction and fusion structure to achieve UAV detection in night infrared environments. Although the results were promising, the dataset used contained relatively large targets and simple backgrounds, leaving a noticeable gap from real-world scenarios. To address the problem of insufficient representation of small target features, Hao et al. [1] developed the YOLO-ISTD network, enhancing the feature extraction ability and designing a specialized small target detection head, thereby reducing missed detections and false alarms. Yue et al. [36] constructed the YOLO-MST network, substituting the SPPF module in the backbone network with a custom MSFA module, optimizing the neck structure, and introducing a multi-scale dynamic detection head in the prediction stage, improving detection robustness in complex scenes through multi-scale feature fusion. Liu et al. [37] proposed the IRMSD-YOLO network, designing a multi-scale dilated attention module based on the inverted residual structure, enhancing small target detection ability, and reducing the misjudgment rate in complex backgrounds with the support of efficient feature transformation and multi-scale perception. Zhu et al. [38] proposed the lightweight network YOLO-SDLUWD based on YOLOv7, reducing redundant parameters and the loss of intermediate feature information to overcome the inference-speed bottleneck and achieve a balance between detection accuracy and speed. Zhou et al. [39] introduced a super-resolution convolutional neural network in the YOLOv8 framework to improve the clarity and resolution of infrared images, thereby enhancing the accuracy of unmanned aerial vehicle detection. Nguyen et al. [40] designed a new model for small unmanned aerial vehicle detection in complex mountainous environments based on YOLOv10 and introduced a generative adversarial network (GAN) to enhance small target samples, effectively improving detection performance. Tang et al. [41] constructed the IRSTD-YOLO network, introducing an infrared small target enhancement module on the basis of YOLOv11s and combining a multi-branch strategy of local, global, and large-scale context information, significantly improving the representation and detection of small targets in low-contrast environments. In summary, these studies provide diverse and practical technical support for the anti-unmanned aerial vehicle task in infrared scenarios.
In response to the challenges of infrared UAV target detection, the above-mentioned research mainly focused on the following aspects: regarding target feature enhancement, the introduction of feature enhancement modules, improved feature fusion structures, and carefully designed attention mechanisms effectively mitigate missed detections and false alarms caused by weak feature representations in the initial framework. For complex background suppression, several studies have designed attention or perception modules to suppress background noise or incorporated self-attention–based global perception structures to enhance the discrimination of small targets under low-contrast and cluttered backgrounds. With respect to target classification and prediction, research has primarily considered the sensitivity to target size and position, designing tailored loss functions accordingly. At the data and image quality level, researchers have constructed synthetic infrared UAV datasets, employed GAN-based models to enrich sample diversity, and adopted super-resolution reconstruction techniques to improve infrared image quality, thereby further enhancing detection accuracy. Nevertheless, due to the varying degrees of feature loss of small infrared targets within deep neural networks, current single-stage, bounding box-based detection algorithms still exhibit certain limitations in detecting weak and small UAV targets.

3. Proposed Methods

The architecture of our IUAV-YOLO model is illustrated in Figure 1. Backbone: Unlike the original YOLOv10, our feature extraction backbone employs three standard convolutions, one Spatial-Channel Decoupled Convolution, and two FEM modules [42]. This configuration significantly reduces model parameters while expanding the receptive field and enhancing local feature perception capabilities. Furthermore, we introduce the PGA module to capture global background and edge information, with enhanced focus on multi-scale features to strengthen global-local contextual feature extraction. Neck: Building upon the PAN (Path Aggregation Network) structure, we incorporate upsampling operations for multi-scale feature fusion (denoted by solid red lines in Figure 1), thereby improving multi-scale detection accuracy. The SCAM [42] is subsequently integrated to reinforce cross-channel and spatial information perception. Prediction Head: Input feature maps with dimensions 160 × 160 × 64, 80 × 80 × 128, and 40 × 40 × 256 are processed through YOLOv10’s detection head. Unlike previous YOLO series heads, this architecture adopts a consistent dual-assignment strategy for NMS-free training, thereby improving detection stability and eliminating the need for post-processing.

3.1. Feature Enhancement Module

Distant UAV targets occupy only a few pixels in the image, exhibiting classic small-target characteristics. Moreover, YOLO-series backbones have an inherent limitation: successive standard convolutions and pooling operations cause progressive feature degradation in deep layers, so small target representations become severely diluted. To address this fundamental challenge, we introduce the lightweight Feature Enhancement Module (FEM) after restructuring the backbone. As illustrated in Figure 2, the FEM employs a multi-branch architecture in which property-specific convolutions enable richer local context learning, while strategically incorporated dilated convolutions expand the receptive field to strengthen small-object feature representation. The formulation of the FEM module is expressed as
$B_1 = f_{\mathrm{atconv}3\times3}\{ f_{\mathrm{conv}3\times1}\{ f_{\mathrm{conv}1\times3}[ f_{\mathrm{conv}1\times1}(F) ] \} \}$
$B_2 = f_{\mathrm{atconv}3\times3}\{ f_{\mathrm{conv}1\times3}\{ f_{\mathrm{conv}3\times1}[ f_{\mathrm{conv}1\times1}(F) ] \} \}$
$B_3 = f_{\mathrm{conv}3\times3}[ f_{\mathrm{conv}1\times1}(F) ]$
$Out = f_{\mathrm{conv}1\times1}(F) \oplus \mathrm{Cat}(B_1, B_2, B_3)$
where $F$ denotes the input feature map; $f_{\mathrm{conv}1\times1}$, $f_{\mathrm{conv}3\times1}$, $f_{\mathrm{conv}1\times3}$, and $f_{\mathrm{conv}3\times3}$ represent standard convolution operations with kernel sizes of 1 × 1, 3 × 1, 1 × 3, and 3 × 3; $f_{\mathrm{atconv}3\times3}$ indicates a 3 × 3 dilated convolution with dilation rate 5; $\oplus$ signifies element-wise addition; $\mathrm{Cat}(\cdot)$ denotes concatenation of feature maps; $B_1$, $B_2$, and $B_3$ correspond to the output feature maps of the three processing branches; and $Out$ is the final output of the FEM module.
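A minimal PyTorch sketch of the FEM branches defined above is given below. The channel layout (each branch producing mid_ch channels, with the 1 × 1 shortcut projected to 3 × mid_ch so the element-wise addition is shape-consistent) and the BatchNorm/SiLU choices are our assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def cba(ci, co, k, d=1):
    """Convolution + BatchNorm + SiLU; padding chosen so spatial size is preserved."""
    k = k if isinstance(k, tuple) else (k, k)
    p = tuple(d * (kk // 2) for kk in k)
    return nn.Sequential(nn.Conv2d(ci, co, k, padding=p, dilation=d),
                         nn.BatchNorm2d(co), nn.SiLU())

class FEM(nn.Module):
    """Sketch of the Feature Enhancement Module (equations B1, B2, B3, Out)."""
    def __init__(self, in_ch, mid_ch):
        super().__init__()
        # Branch 1: 1x1 -> 1x3 -> 3x1 -> dilated 3x3 (dilation rate 5)
        self.b1 = nn.Sequential(cba(in_ch, mid_ch, 1), cba(mid_ch, mid_ch, (1, 3)),
                                cba(mid_ch, mid_ch, (3, 1)), cba(mid_ch, mid_ch, 3, d=5))
        # Branch 2: 1x1 -> 3x1 -> 1x3 -> dilated 3x3 (dilation rate 5)
        self.b2 = nn.Sequential(cba(in_ch, mid_ch, 1), cba(mid_ch, mid_ch, (3, 1)),
                                cba(mid_ch, mid_ch, (1, 3)), cba(mid_ch, mid_ch, 3, d=5))
        # Branch 3: 1x1 -> 3x3
        self.b3 = nn.Sequential(cba(in_ch, mid_ch, 1), cba(mid_ch, mid_ch, 3))
        # 1x1 shortcut projected to 3*mid_ch so it can be added to the concatenation
        self.shortcut = cba(in_ch, 3 * mid_ch, 1)

    def forward(self, x):
        cat = torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)
        return self.shortcut(x) + cat  # element-wise addition (the "⊕" in Out)

fem = FEM(64, 32)
print(fem(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 96, 80, 80])
```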

3.2. Pyramid Global Attention Module

To address multi-scale UAV targets in real-world scenarios, we designed the PGA module, which incorporates partial global background and edge information to acquire a global perspective while mitigating scale variations. An attention mechanism then strengthens the focus on multi-scale features, enhancing global-local feature capture. As shown in Figure 3, based on the SPPF structure, we integrated global average pooling and global max pooling layers to enhance global feature extraction capability. These were then concatenated with the original pooling feature layers and fed into the Attention Context Aggregation Module. Through dynamic feature weighting, this module enhances responses in important feature regions, achieving multi-scale context information aggregation and improving detection accuracy. Within the Attention module, the feature map is split into three branches (from left to right). Branch 1 is processed through convolutional layers, while Branches 2 and 3 undergo activation functions, forming three sub-branches: the attention coefficient Branch A, the spatial weight Branch K, and the feature extraction Branch V. The K and V branches are aggregated via matrix multiplication, followed by channel adjustment through convolutional layers; the result is then combined with the attention Branch A for dynamic feature-weighted fusion. Finally, a residual connection with the main branch preserves critical target details, yielding the final output. The formulation of the Attention module is as follows:
$A = \sigma(\mathrm{Conv}_a(X))$
$K = \mathrm{Softmax}(\mathrm{reshape}(\mathrm{Conv}_k(X)))$
$V = \mathrm{reshape}(\mathrm{Conv}_v(X))$
$Y = \mathrm{Conv}_m\left(\sum_{i=1}^{HW} V_i K_i\right) \cdot A$
$\bar{X} = X + Y$
where $X$ denotes the input feature map; $H \times W$ represents the spatial dimensions (height and width) of the feature map; $\mathrm{Conv}_a$, $\mathrm{Conv}_k$, $\mathrm{Conv}_v$, and $\mathrm{Conv}_m$ are 1 × 1 convolution operations; $\sigma$ is the sigmoid function used for attention map generation; $\mathrm{reshape}$ denotes tensor transformation; and $\bar{X}$ denotes the final output.
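The attention computation above can be sketched in PyTorch as follows. Treating K as a spatial softmax map, V as per-pixel features, and A as a sigmoid gate is one plausible reading of the equations, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionContextAggregation(nn.Module):
    """Sketch of the attention block inside PGA (equations for A, K, V, Y)."""
    def __init__(self, channels):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, 1, 1)         # attention coefficients A
        self.conv_k = nn.Conv2d(channels, 1, 1)         # spatial weights K
        self.conv_v = nn.Conv2d(channels, channels, 1)  # feature branch V
        self.conv_m = nn.Conv2d(channels, channels, 1)  # channel adjustment Conv_m

    def forward(self, x):
        b, c, h, w = x.shape
        a = torch.sigmoid(self.conv_a(x))                       # (b, 1, h, w)
        k = F.softmax(self.conv_k(x).reshape(b, 1, h * w), -1)  # softmax over positions
        v = self.conv_v(x).reshape(b, c, h * w)
        ctx = (v * k).sum(dim=-1).reshape(b, c, 1, 1)           # sum_i V_i K_i
        y = self.conv_m(ctx) * a                                # weight context by A
        return x + y                                            # residual: X_bar = X + Y

block = AttentionContextAggregation(256)
print(block(torch.randn(1, 256, 20, 20)).shape)  # torch.Size([1, 256, 20, 20])
```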

3.3. Spatial Context-Aware Module

After the enhanced backbone feature extraction and PAN-based multi-scale feature fusion, the network already incorporates multi-scale target contextual information and represents small target characteristics effectively. We then introduce the SCAM, which leverages global context to model cross-spatial relationships between pixels, thereby suppressing irrelevant backgrounds and enhancing detection capability for targets of interest. As illustrated in Figure 4, the SCAM comprises three principal branches. The first branch integrates global context using global max pooling and global average pooling layers. The second branch applies a 1 × 1 convolution for feature transformation, while the third branch performs a 1 × 1 convolution for key-value dimensionality reduction. Matrix multiplication of the second branch with the first and third branches yields two distinct contextual representations encoding cross-channel and spatial dependencies. These undergo convolutional refinement followed by Hadamard product fusion to generate the module output. The spatial context representation of the pixels at each layer is formulated as follows:
$V_{ij} = U_{ij} + a_{ij}\sum_{j=1}^{N_i}\frac{\exp\left(w_{qk}U_{ij}\right)}{\sum_{n=1}^{N_i}\exp\left(w_{qk}U_{in}\right)}\cdot w_v U_{ij}$
$a_{ij} = \frac{\exp\left(\left[\mathrm{avg}(U_i);\mathrm{max}(U_i)\right]U_{ij}\right)}{\sum_{n=1}^{N_i}\exp\left(\left[\mathrm{avg}(U_i);\mathrm{max}(U_i)\right]U_{in}\right)}\cdot w_v$
where $U_{ij}$ and $V_{ij}$ denote the input and output of pixel $i$ in the $j$-level feature map, $N_i$ represents the total number of pixels, $w_{qk}$ and $w_v$ are linear transformation matrices for the projected feature maps, and $\mathrm{avg}(\cdot)$ and $\mathrm{max}(\cdot)$ denote global average pooling (GAP) and global max pooling (GMP), respectively.
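A rough PyTorch sketch of the three-branch SCAM described above follows. The exact placement of the matrix multiplications and refinement convolutions is our assumption based on the text and on FFCA-YOLO [42], not the authors' reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCAM(nn.Module):
    """Sketch of the Spatial Context-Aware Module (one plausible reading)."""
    def __init__(self, channels):
        super().__init__()
        self.conv_q = nn.Conv2d(channels, 1, 1)           # branch 2: query transform
        self.conv_v = nn.Conv2d(channels, channels, 1)    # branch 3: key-value reduction
        self.refine_c = nn.Conv2d(channels, channels, 1)  # refine channel context
        self.refine_s = nn.Conv2d(channels, channels, 1)  # refine spatial context

    def forward(self, x):
        b, c, h, w = x.shape
        # Branch 1: global context from average and max pooling (cf. a_ij)
        g = (F.adaptive_avg_pool2d(x, 1) + F.adaptive_max_pool2d(x, 1)).reshape(b, 1, c)
        # Branch 2: per-pixel query weights, normalized over all positions
        q = F.softmax(self.conv_q(x).reshape(b, 1, h * w), dim=-1)      # (b, 1, hw)
        # Branch 3: reduced key-value features
        v = self.conv_v(x).reshape(b, c, h * w)                         # (b, c, hw)
        # Cross-channel context: global descriptor redistributed over space
        chan_ctx = torch.bmm(g.transpose(1, 2), q).reshape(b, c, h, w)
        # Spatial context: attention-pooled features, broadcast back over space
        spat_ctx = torch.bmm(v, q.transpose(1, 2)).reshape(b, c, 1, 1)
        out = self.refine_c(chan_ctx) * self.refine_s(spat_ctx.expand_as(x))
        return x + out  # residual, as in V_ij = U_ij + ...

scam = SCAM(128)
print(scam(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```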

4. Results of the Experiments

This section presents experimental evaluations to verify the superior performance of the proposed IUAV-YOLO model in infrared small UAV detection tasks. A thorough description of the experimental settings is provided, covering datasets, evaluation indicators, implementation details, and baseline comparison methods. We then report quantitative results benchmarked against leading deep learning approaches, followed by ablation experiments that assess the individual contributions of each proposed module.

4.1. Data Preparation

To validate the performance of our proposed algorithm, we employ two datasets: the self-constructed IRSUAV (Infrared Small Unmanned Aerial Vehicle) dataset and the public SIRST-UAVB (Single Infrared Small Target-Unmanned Aerial Vehicle Bird) benchmark [43]. Currently, publicly available infrared datasets for low-slow-small (LSS) UAVs in complex environments are relatively scarce. To obtain an infrared LSS UAV target dataset under practical scenarios, we developed an infrared small UAV target dataset in complex scenes using uncooled infrared photoelectric equipment. The dataset collection spans an extended duration, with data acquired across different seasons and diverse weather conditions including sunny/cloudy days and multiple time periods (morning, noon, and evening), fully considering target signature variations induced by temperature changes under these conditions. Simultaneously, we simulated real-world scenarios by controlling UAV flight attitudes and distances during collection against heterogeneous backgrounds, including skies, buildings, trees, and urban infrastructure. All data were manually annotated. A subset of this data was curated to form the IRSUAV dataset comprising 6400 images. IRSUAV features rich urban, cloud, mountainous, and forest backgrounds, with partially intertwined scenarios significantly increasing UAV detection difficulty. Representative dataset scenes are shown in Figure 5, with UAV targets annotated by red bounding boxes.
Released to the public in late 2024, SIRST-UAVB stands as the most extensive and challenging single-frame infrared small target dataset available at the time. Spanning a full year of data collection, the dataset captures scenes under varying seasonal, meteorological, and environmental conditions, with a particular focus on complex urban landscapes. It contains 3000 annotated images, featuring 2955 UAV targets alongside 1742 avian distractors. The targets differ significantly in terms of orientation, size, and degree of occlusion, with a notable portion consisting of extremely small objects—many of which are barely visible to the naked eye. SIRST-UAVB provides a comprehensive and demanding benchmark for evaluating infrared detection algorithms focused on low, slow, and small (LSS) UAVs, particularly within cluttered urban environments. Sample images are illustrated in Figure 6, where red bounding boxes denote UAVs and green boxes indicate bird distractors.
As shown in Figure 7, we conducted a statistical analysis of the dimensions and spatial distribution of the drone target annotations in the two datasets. The dimensions of the target annotations in the SIRST-UAVB dataset mainly ranged from 10 to 80 pixels, while those in the IRSUAV dataset mainly ranged from 60 to 800 pixels.

4.2. Evaluation Metrics

To better reflect the detection performance of the designed algorithm for infrared low, slow, and small unmanned aerial vehicle targets, we adopt the evaluation metrics summarized in Table 1.
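For reference, the precision, recall, and mAP entries in Table 1 follow the standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives:
$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$
$\mathrm{AP} = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad \mathrm{mAP}_{0.5\text{--}0.95} = \frac{1}{10}\sum_{t \in \{0.50, 0.55, \ldots, 0.95\}} \mathrm{AP}_{t}$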

4.3. Parameter Settings

The experiments were conducted on an NVIDIA RTX 4090 GPU with Python 3.10, PyTorch 2.1, and CUDA 12.1. To ensure a thorough performance evaluation, the IRSUAV and SIRST-UAVB datasets were randomly split into training, validation, and test subsets with an 8:1:1 ratio. Model training was conducted from scratch under different configurations. For the IRSUAV dataset, training was carried out over 300 epochs using the Stochastic Gradient Descent (SGD) optimizer with a batch size of 32, a momentum of 0.937, a weight decay of 0.0005, and an initial learning rate of 0.01; other parameters were kept at their defaults. For the RT-DETR algorithm, the hyperparameters were fine-tuned to its architectural characteristics: a learning rate of 0.0001, a momentum of 0.9, a weight decay of 0.0001, and mosaic as the default data augmentation. To fully exploit the learning capacity of the SIRST-UAVB dataset, parameters were selected through comparative experimentation: training spanned 700 epochs with a patience value of 70, using SGD as the optimizer along with the same batch size, momentum, weight decay, and learning rate settings as above. During training, the validation set was used to assess model performance at the end of each epoch, and the checkpoint with the highest validation accuracy was retained for final testing on the test set.
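The SGD configuration above corresponds to the following PyTorch optimizer construction (a minimal sketch; the `model` here is a stand-in placeholder rather than the actual IUAV-YOLO network, and the batch size is set in the data loader, not the optimizer):

```python
import torch
import torch.nn as nn

# Stand-in module; in practice this would be the IUAV-YOLO network.
model = nn.Conv2d(3, 16, 3)

# SGD settings reported for the IRSUAV experiments.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
```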

4.4. Comparative Experiment

4.4.1. IRSUAV Dataset

The IUAV-YOLO model’s learning progression is visualized in Figure 8, illustrating the dynamic relationship between training loss reduction and validation metric improvements. The train/box_om metric quantifies the discrepancy between ground-truth bounding boxes and model predictions, while train/cls_om measures classification confidence errors across object categories. The train/dfl_om parameter reflects boundary regression accuracy, where lower DFL values indicate more precise localization predictions. Parallel curves track the one-to-one assignment branch through train/box_oo (bounding box refinement) and train/cls_oo (category differentiation). Validation performance is comprehensively evaluated through metrics/recall (B), metrics/precision (B), metrics/mAP50 (B), and metrics/mAP50-95 (B). The training curves demonstrate a consistent downward trend in the loss metrics coupled with progressive improvements in the validation scores, indicating effective parameter optimization and a stable convergence pattern throughout the learning phase. The validation metrics rise progressively before reaching a plateau, signifying the model’s effective generalization.
In order to verify the performance of the designed algorithm, we conducted qualitative and quantitative comparisons with other common object detection algorithms. The results, summarized in Table 2 and Figure 9 and Figure 10, demonstrate that IUAV-YOLO achieves notable gains in detection accuracy while maintaining lightweight characteristics. Specifically, the model achieves a 4.3% improvement in AP0.5 and a 2.6% gain in AP0.5–0.95 compared with the baseline YOLOv10n, along with a reduction of 0.7 million parameters. Compared with the Transformer-based RT-DETR algorithm, IUAV-YOLO outperforms it by 1.7% in AP0.5, while requiring only approximately one-eighth of the parameter count. Additionally, experimental observations indicate that RT-DETR suffers from training instability, with accuracy fluctuations of around 1% under identical conditions. When benchmarked against YOLOv13n, the proposed model improves AP0.5 by 5.8% and AP0.5–0.95 by 3.2%, while reducing the parameter size by 0.3 M. Relative to YOLOv13s, AP0.5 increases by 3.8%, AP0.5–0.95 by 0.2%, and model size is reduced by 13.5 M. Similarly, IUAV-YOLO surpasses YOLOv12n with a 5.4% gain in AP0.5 and a 2.8% increase in AP0.5–0.95, along with a 0.4 M reduction in parameters. Against YOLOv12s, performance improves by 1.2% (AP0.5) and 0.5% (AP0.5–0.95), with 13.9 M fewer parameters. Compared with Ultralytics’ YOLOv11n, IUAV-YOLO achieves improvements of 1.8% and 2.1% in AP0.5 and AP0.5–0.95, respectively, while reducing model size by 0.4 M. For YOLOv11s, the proposed method shows gains of 0.8% in AP0.5 and 1.5% in AP0.5–0.95 and compresses the model to roughly a quarter of its size. Furthermore, IUAV-YOLO exceeds the widely used YOLOv8n by 2% in AP0.5 and 1.6% in AP0.5–0.95 and reduces parameters by 1.2 M. Compared with IRSTD-YOLO, an infrared small target detection algorithm built upon YOLOv11s, our approach achieves a 1.6% improvement in AP0.5 while cutting 17.2 M parameters, underscoring its effectiveness and efficiency. Finally, we also evaluated two-stage object detection algorithms, over which the proposed algorithm likewise demonstrates significant performance advantages. Based on statistics from multiple images, the IUAV-YOLO model achieves an average inference speed of 88.7 FPS on 640 × 512 images when tested on an NVIDIA RTX 4090.
In summary, the proposed detection algorithm offers clear performance advantages while maintaining low model complexity and a small parameter count. As shown in Figure 9 and Figure 10, we selected data captured against sky, mountain, suburban, and urban backgrounds at different times of day to display the detection results. A green circle indicates a correct UAV detection by the model, and a red circle indicates a detection error, i.e., a false alarm. The advanced YOLO-series object detectors all, to varying degrees, either miss the object of interest or fail to separate it from visually similar background. The RT-DETR algorithm achieves better detection accuracy than the YOLO-series algorithms, but its high training cost, instability, and model complexity remain shortcomings for cost-constrained, large-scale applications. The proposed algorithm maintains good detection results in all of the above scenarios while keeping model complexity low. Overall, the IUAV-YOLO algorithm exhibits clear superiority in detecting small infrared UAV targets in urban environments.

4.4.2. SIRST-UAVB Dataset

To comprehensively assess the IUAV-YOLO algorithm’s performance, rigorous comparative analyses were conducted against deep learning detectors on the SIRST-UAVB dataset. The comparison outcomes are presented in Table 3. The analysis shows that traditional YOLO-based architectures face considerable challenges in accurately identifying small infrared targets. In comparison, the IUAV-YOLO model demonstrates a significant advantage over the baseline model, with the AP0.5 metric increasing by 29.7%. The table also details the detection performance of the algorithm on flying-bird interference targets.

4.5. Ablation Experiment

To better show the contribution of each part of the IUAV-YOLO network, ablation experiments are carried out on the self-built IRSUAV dataset by replacing specific components of IUAV-YOLO with the corresponding structures of the original baseline model. The results are shown in Table 4, where w/h denotes that the improved structure is replaced with the corresponding structure of the original baseline model.
As illustrated in Table 4, the FEM, PGA, and SCAM modules each contribute to the overall target detection accuracy of the network. We further conducted ablation experiments in which the modules were added one by one; as shown in Table 5, the results fully demonstrate the effectiveness of each module.
Effect of the FEM Module: When the backbone architecture incorporating the FEM module is replaced with that of the original baseline model, the detection performance drops significantly—AP0.5 and AP0.5–0.95 decrease by 4.8% and 2.3%, respectively. This decline highlights the importance of FEM, which enhances semantic representation through a cascaded structure of convolutions and multi-branch pathways. It effectively captures diverse discriminative features while preserving small target details. In contrast, the baseline backbone downscales input features through multiple convolutional operations, resulting in the loss of critical small-object information and weak feature discriminability at deeper layers.
Effect of the PGA Module: Substituting the enhanced PGA module with the original SPPF structure leads to reductions of 1.9% in AP0.5 and 0.8% in AP0.5–0.95. The PGA module incorporates both global average pooling and global max pooling to integrate broader background context. Additionally, it employs a context aggregation mechanism that captures both global and local spatial dependencies, thereby improving small object detection performance. The original SPPF lacks this spatial contextual awareness, limiting its effectiveness for such targets.
Effect of the SCAM Module: Eliminating the SCAM module results in a drop of 1.9% in AP0.5 and 1.2% in AP0.5–0.95. While FEM and PGA provide enhanced local feature representation and context extraction, modeling the spatial-channel relationships between targets and their surroundings remains essential. The SCAM module addresses this by capturing long-range contextual dependencies and re-establishing pixel-wise correlations, effectively suppressing irrelevant background noise and enhancing the model’s target-background discrimination ability. Without SCAM, the model’s capacity to distinguish targets from similar backgrounds is significantly compromised.
In this study, the Gradient-weighted Class Activation Mapping (Grad-CAM) technique is employed to visualize the model’s attention distribution during target detection. This method applies a color mapping to the two-dimensional grayscale activation maps, where regions with redder tones indicate stronger attention from the model. As illustrated in Figure 11 and Figure 12, several infrared images are selected to demonstrate the contribution of the FEM and SCAM modules to the model’s attention mechanism. It can be observed that, after incorporating the FEM module, the model focuses more on the target regions, indicating that the FEM module effectively enhances target feature representation. However, this process also introduces certain background-like features. In contrast, after the SCAM module is applied, the target regions exhibit more intense red coloration, suggesting that the model becomes more concentrated on the targets while reducing attention to background areas with similar characteristics. Overall, the SCAM module suppresses background interference to a certain extent, thereby improving the model’s ability to distinguish targets from the background and enhancing detection performance.
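For reference, a minimal Grad-CAM sketch of the kind used to produce such visualizations is shown below. The network and target layer are placeholders (a torchvision ResNet-18), not the IUAV-YOLO detector, and the exact layer the authors visualize is not specified in the text.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Placeholder network and layer; the paper computes the maps on IUAV-YOLO.
model = models.resnet18(weights=None).eval()
target_layer = model.layer4[-1]

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)
score = model(x)[0].max()  # score of the top-scoring class
model.zero_grad()
score.backward()

# Grad-CAM: channel weights are spatially averaged gradients; the map is the
# ReLU of the weighted sum of activations, upsampled to the input resolution.
weights = grads["a"].mean(dim=(2, 3), keepdim=True)           # (1, C, 1, 1)
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalize to [0, 1]
```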
To further investigate the specific impact of the improved components in the PGA module, we conducted ablation studies on its substructures. We define SPPFimprov as the original SPPF with only the addition of global average and max pooling branches and SPPFattention as the version that solely incorporates the attention mechanism. Results, detailed in Table 6, reveal that removing either enhancement substantially affects detection accuracy, confirming the crucial role of global contextual branches in strengthening spatial perception.
We also conducted a preliminary exploration of the effect of fusing features at different scales within the network, as illustrated in Figure 1, with the corresponding results summarized in Table 7. The findings indicate that fusing deep and shallow features at different scales in IUAV-YOLO (red solid line) enables the model to better adapt to targets of varying sizes and improves its multi-scale detection capability.

5. Discussion

This paper presents an algorithm named IUAV-YOLO for object detection of infrared small unmanned aerial vehicles. Addressing the unique characteristics of infrared small targets while optimizing the utilization of convolutional features, we implement a re-designed backbone network. Our enhanced architecture employs a parameter-efficient, lightweight multi-branch configuration to optimize discriminative feature extraction and enhance the model’s capacity for small object representation. The proposed PGA mechanism improves spatial comprehension through partial background data integration and attention-driven contextual modeling. Complementing this, the SCAM component employs global pixel-level correlation analysis to minimize background interference while emphasizing critical target regions.
The findings reveal that IUAV-YOLO effectively addresses challenging scenarios involving occluded objects and blurred textures in dynamic infrared environments. When benchmarked against YOLOv10n, the proposed architecture achieves a 4.3% enhancement in AP0.5 and a 2.6% improvement in AP0.5–0.95 while operating with 0.7 million fewer parameters, highlighting its computational efficiency for resource-constrained deployment. On the SIRST-UAVB dataset, our solution demonstrates a 29.7% increase in AP0.5 and a 16.3% gain in AP0.5–0.95 alongside a 0.6 million parameter reduction, validating the backbone network’s efficacy for small UAV detection at low altitudes. Comprehensive ablation studies systematically evaluate each component’s contribution and reveal the advantages of multi-scale feature integration, providing valuable insights for developing next-generation UAV countermeasure systems. However, as shown in Figure 13, where the red box marks the UAV target, our algorithm still exhibits certain detection failures when the contrast between the target and the background is very low. In future work, we plan to further optimize the model or adopt a multi-frame detection strategy to enhance the algorithm’s capability for target detection in complex background scenarios.

Author Contributions

All authors contributed to the conception of this article. L.Z. and Y.Z. developed the initial ideas and framework; Y.L. and H.Z. performed the literature search and analysis; L.Z. drafted the manuscript; L.Z., Y.L. and H.Z. contributed to the visualization of the content; and Y.Z. reviewed and revised the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The source code will be available at https://github.com/deeplearningzl/IUAV-YOLO (accessed on 9 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hao, Z.; Wang, Z.; Xu, X.; Jiang, Z.; Sun, Z. YOLO-ISTD: An infrared small target detection method based on YOLOv5-S. PLoS ONE 2024, 19, e0303451. [Google Scholar] [CrossRef]
  2. Xu, Y.; Shao, A.; Kong, X.; Wu, J.; Chen, Q.; Gu, G.; Wan, M. Infrared Small Target Detection Based on Sub-Maximum Filtering and Local Intensity Weighted Gradient Measure. IEEE Sens. J. 2024, 24, 22236–22248. [Google Scholar] [CrossRef]
  3. Kapur, J.N.; Sahoo, P.K.; Wong, A.K. A new method for gray-level picture thresholding using the entropy of the histogram. Comput. Vis. Graph. Image Process. 1985, 29, 273–285. [Google Scholar] [CrossRef]
  4. Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [Google Scholar] [CrossRef]
  5. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the Signal and Data Processing of Small Targets 1999, Denver, CO, USA, 19–23 July 1999; Volume 3809, pp. 74–83. [Google Scholar]
  6. Tomasi, C.; Manduchi, R. Bilateral filtering for gray and color images. In Proceedings of the Sixth International Conference on Computer Vision, Bombay, India, 4–7 January 1998; pp. 839–846. [Google Scholar] [CrossRef]
  7. Wei, Y.; You, X.; Li, H. Multiscale patch-based contrast measure for small infrared target detection. Pattern Recognit. 2016, 58, 216–226. [Google Scholar] [CrossRef]
  8. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric Contextual Modulation for Infrared Small Target Detection. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2021; pp. 949–958. [Google Scholar] [CrossRef]
  9. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  10. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  11. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef]
  12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  14. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  15. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  16. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  17. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for Infrared Small Object Detection. IEEE Trans. Image Process. 2023, 32, 364–376. [Google Scholar] [CrossRef]
  18. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar] [CrossRef]
  19. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the International Workshop on Deep Learning in Medical Image Analysis, Granada, Spain, 20 September 2018; pp. 3–11. [Google Scholar]
  20. Hou, Q.; Wang, Z.; Tan, F.; Zhao, Y.; Zheng, H.; Zhang, W. RISTDnet: Robust Infrared Small Target Detection Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  21. Hou, Q.; Zhang, L.; Tan, F.; Xi, Y.; Zheng, H.; Li, N. ISTDU-Net: Infrared Small-Target Detection U-Net. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  22. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional Local Contrast Networks for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824. [Google Scholar] [CrossRef]
  23. Zhang, M.; Zhang, R.; Zhang, J.; Guo, J.; Li, Y.; Gao, X. Dim2Clear Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
  24. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense Nested Attention Network for Infrared Small Target Detection. IEEE Trans. Image Process. 2023, 32, 1745–1758. [Google Scholar] [CrossRef]
  25. Wang, H.; Zhou, L.; Wang, L. Miss Detection vs. False Alarm: Adversarial Learning for Small Object Segmentation in Infrared Images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8508–8517. [Google Scholar] [CrossRef]
  26. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape Matters for Infrared Small Target Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 867–876. [Google Scholar] [CrossRef]
  27. Tong, X.; Su, S.; Wu, P.; Guo, R.; Wei, J.; Zuo, Z.; Sun, B. MSAFFNet: A Multiscale Label-Supervised Attention Feature Fusion Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  28. Wang, K.; Du, S.; Liu, C.; Cao, Z. Interior Attention-Aware Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  29. Yang, H.; Mu, T.; Dong, Z.; Zhang, Z.; Wang, B.; Ke, W.; Yang, Q.; He, Z. PBT: Progressive Background-Aware Transformer for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  30. Hu, C.; Huang, Y.; Li, K.; Zhang, L.; Long, C.; Zhu, Y.; Pu, T.; Peng, Z. DATransNet: Dynamic Attention Transformer Network for Infrared Small Target Detection. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5. [Google Scholar] [CrossRef]
  31. Tong, X.; Zuo, Z.; Su, S.; Wei, J.; Sun, X.; Wu, P.; Zhao, Z. ST-Trans: Spatial-Temporal Transformer for Infrared Small Target Detection in Sequential Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–19. [Google Scholar] [CrossRef]
  32. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  33. Zhang, Y.; Zhang, Y.; Shi, Z.; Zhang, J.; Wei, M. Design and Training of Deep CNN-Based Fast Detector in Infrared SUAV Surveillance System. IEEE Access 2019, 7, 137365–137377. [Google Scholar] [CrossRef]
  34. Ren, K.; Chen, Z.; Gu, G.; Chen, Q. Research on infrared small target segmentation algorithm based on improved mask R-CNN. Optik 2023, 272, 170334. [Google Scholar] [CrossRef]
  35. Zhao, J. Wireless and Satellite Systems. In Proceedings of the 13th EAI International Conference, WiSATS 2022, Singapore, 12–13 March 2023; Springer Nature: Berlin/Heidelberg, Germany, 2023; Volume 509. [Google Scholar]
  36. Yue, T.; Lu, X.; Cai, J.; Chen, Y.; Chu, S. YOLO-MST: Multiscale deep learning method for infrared small target detection based on super-resolution and YOLO. Opt. Laser Technol. 2025, 187, 112835. [Google Scholar] [CrossRef]
  37. Liu, B.; Jiang, Q.; Wang, P.; Yao, S.; Zhou, W.; Jin, X. IRMSD-YOLO: Multiscale Dilated Network With Inverted Residuals for Infrared Small Target Detection. IEEE Sens. J. 2025, 25, 16006–16019. [Google Scholar] [CrossRef]
  38. Zhu, J.; Qin, C.; Choi, D. YOLO-SDLUWD: YOLOv7-based small target detection network for infrared images in complex backgrounds. Digit. Commun. Netw. 2025, 11, 269–279. [Google Scholar] [CrossRef]
  39. Zhou, G.; Liu, X.; Bi, H. Recognition of UAVs in Infrared Images Based on YOLOv8. IEEE Access 2025, 13, 1534–1545. [Google Scholar] [CrossRef]
  40. Phat, N.T.; Giang, N.L.; Duy, B.D. GAN-UAV-YOLOv10s: Improved YOLOv10s network for detecting small UAV targets in mountainous conditions based on infrared image data. Neural Comput. Appl. 2025, 37, 17217–17229. [Google Scholar] [CrossRef]
  41. Tang, Y.; Xu, T.; Qin, H.; Li, J. IRSTD-YOLO: An Improved YOLO Framework for Infrared Small Target Detection. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5. [Google Scholar] [CrossRef]
  42. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  43. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped convolution and scale-based dynamic loss for infrared small target detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 9202–9210. [Google Scholar]
Figure 1. IUAV-YOLO algorithm framework.
Figure 2. Feature enhancement module.
Figure 3. Pyramid global attention module.
Figure 4. Spatial context-aware module.
Figure 5. IRSUAV dataset: representative scene samples.
Figure 6. SIRST-UAVB dataset: representative scene samples.
Figure 7. The target size and spatial distribution of the dataset.
Figure 8. Model training results (train represents the training set and val represents the validation set).
Figure 9. Examples of results from different algorithms (green circles are true targets, red circles are false alarms).
Figure 10. Examples of results from different algorithms (green circles are true targets, red circles are false alarms).
Figure 11. Heat map distributions before and after the FEM module.
Figure 12. Heat map distributions before and after the SCAM module.
Figure 13. Examples of failed detection by the IUAV-YOLO algorithm.
Table 1. Evaluation metrics used for infrared UAV detection.
Metric | Description
Precision (P) | Ratio of correct positive predictions to all detections.
Recall (R) | Ratio of correctly detected positives to all ground-truth positives.
mAP50 | Mean Average Precision at IoU = 0.5.
mAP50–95 | Mean Average Precision averaged over IoU thresholds from 0.5 to 0.95.
Parameters (Param) | Number of model parameters.
Frames Per Second (FPS) | Number of images that can be processed per second.
GFLOPs | Floating-point operation volume in billions, reflecting the computational complexity of the model.
Table 2. Comparison of the model with different advanced algorithms.
Algorithm | AP0.5 | AP0.5:0.95 | Precision | Recall | Params | FPS | GFLOPs
RT-DETR (res18) | 0.836 | 0.420 | 0.910 | 0.808 | 40.5 M | - | 57.2
Cascade R-CNN (res50) | 0.804 | - | - | - | 69.1 M | - | 105.5
Faster R-CNN (res50) | 0.776 | - | - | - | 41.3 M | - | 75.5
YOLOv13s | 0.815 | 0.380 | 0.893 | 0.769 | 18.6 M | - | 20.7
YOLOv13n | 0.795 | 0.368 | 0.883 | 0.721 | 5.4 M | - | 6.2
YOLOv12s | 0.841 | 0.395 | 0.904 | 0.777 | 19.0 M | - | 21.2
YOLOv12n | 0.799 | 0.372 | 0.886 | 0.720 | 5.5 M | - | 6.5
YOLOv11s | 0.845 | 0.384 | 0.916 | 0.794 | 19.2 M | - | 21.3
YOLOv11n | 0.835 | 0.378 | 0.911 | 0.777 | 5.5 M | - | 6.3
YOLOv10s | 0.840 | 0.393 | 0.881 | 0.749 | 16.5 M | - | 24.4
YOLOv10n | 0.810 | 0.374 | 0.877 | 0.738 | 5.8 M | - | 8.2
YOLOv9t | 0.824 | 0.384 | 0.917 | 0.744 | 4.2 M | - | 6.4
YOLOv8s | 0.839 | 0.393 | 0.881 | 0.778 | 20.0 M | - | 23.6
YOLOv8n | 0.833 | 0.383 | 0.902 | 0.765 | 6.3 M | - | 8.1
IRSTD-YOLO | 0.837 | 0.401 | 0.897 | 0.777 | 22.3 M | - | 43.2
IUAV-YOLO (Ours) | 0.853 | 0.400 | 0.909 | 0.794 | 5.1 M | 88.7 | 24.2
Table 3. Comparison of the model with different advanced algorithms.
Algorithm | Category | AP0.5 | AP0.5:0.95 | Precision | Recall | Params (M) | GFLOPs
YOLOv13s | All | 0.617 | 0.326 | 0.896 | 0.486 | 18.7 | 20.7
 | UAV | 0.719 | 0.389 | 0.910 | 0.581 | |
 | Bird | 0.514 | 0.263 | 0.883 | 0.392 | |
YOLOv13n | All | 0.604 | 0.311 | 0.846 | 0.480 | 5.4 | 6.2
 | UAV | 0.711 | 0.374 | 0.888 | 0.587 | |
 | Bird | 0.496 | 0.248 | 0.803 | 0.373 | |
YOLOv12s | All | 0.640 | 0.346 | 0.861 | 0.512 | 19.0 | 21.2
 | UAV | 0.751 | 0.413 | 0.904 | 0.613 | |
 | Bird | 0.529 | 0.278 | 0.818 | 0.402 | |
YOLOv12n | All | 0.564 | 0.287 | 0.836 | 0.472 | 5.6 | 6.5
 | UAV | 0.705 | 0.362 | 0.883 | 0.585 | |
 | Bird | 0.423 | 0.211 | 0.789 | 0.359 | |
YOLOv11s | All | 0.662 | 0.363 | 0.898 | 0.528 | 19.2 | 21.3
 | UAV | 0.764 | 0.430 | 0.938 | 0.611 | |
 | Bird | 0.561 | 0.297 | 0.859 | 0.444 | |
YOLOv11n | All | 0.601 | 0.315 | 0.841 | 0.494 | 5.5 | 6.3
 | UAV | 0.700 | 0.373 | 0.866 | 0.582 | |
 | Bird | 0.502 | 0.257 | 0.815 | 0.405 | |
YOLOv10s | All | 0.637 | 0.335 | 0.863 | 0.522 | 16.6 | 24.4
 | UAV | 0.723 | 0.388 | 0.895 | 0.599 | |
 | Bird | 0.551 | 0.283 | 0.831 | 0.444 | |
YOLOv10n | All | 0.618 | 0.333 | 0.849 | 0.502 | 5.8 | 8.2
 | UAV | 0.725 | 0.393 | 0.896 | 0.586 | |
 | Bird | 0.511 | 0.272 | 0.802 | 0.418 | |
YOLOv8n | All | 0.618 | 0.325 | 0.867 | 0.490 | 5.7 | 8.1
 | UAV | 0.726 | 0.389 | 0.902 | 0.576 | |
 | Bird | 0.510 | 0.261 | 0.832 | 0.405 | |
IUAV-YOLO (Ours) | All | 0.915 | 0.496 | 0.914 | 0.868 | 5.2 | 24.2
 | UAV | 0.956 | 0.555 | 0.957 | 0.912 | |
 | Bird | 0.873 | 0.437 | 0.872 | 0.824 | |
Table 4. Different module structure influences.
Algorithm | AP0.5 | AP0.5:0.95 | Precision | Recall | Params
IUAV-YOLO | 0.853 | 0.399 | 0.912 | 0.783 | 5.1 M
IUAV-YOLO (w/h FEM) | 0.805 | 0.376 | 0.887 | 0.730 | 8.7 M
IUAV-YOLO (w/h PGA) | 0.834 | 0.391 | 0.871 | 0.770 | 4.5 M
IUAV-YOLO (w/h SCAM) | 0.834 | 0.387 | 0.892 | 0.773 | 4.7 M
YOLOv10n (Baseline) | 0.810 | 0.374 | 0.877 | 0.738 | 5.8 M
Table 5. Add experiments module by module.
Algorithm | AP0.5 | AP0.5:0.95 | Precision | Recall | Params
YOLOv10n (Baseline) | 0.810 | 0.374 | 0.877 | 0.738 | 5.8 M
YOLOv10n + FEM | 0.816 | 0.386 | 0.845 | 0.750 | 4.1 M
YOLOv10n + FEM + PGA | 0.834 | 0.387 | 0.892 | 0.773 | 4.7 M
YOLOv10n + FEM + PGA + SCAM (IUAV-YOLO) | 0.853 | 0.399 | 0.912 | 0.783 | 5.1 M
Table 6. Effects of each part of PGA.
Algorithm | AP0.5 | AP0.5:0.95 | Precision | Recall | Params
PGA | 0.853 | 0.399 | 0.912 | 0.783 | 5.1 M
SPPFimprov | 0.837 | 0.392 | 0.887 | 0.788 | 4.5 M
SPPFattention | 0.838 | 0.394 | 0.892 | 0.776 | 4.7 M
Table 7. Influence of fusion at different scales.
Algorithm | AP0.5 | AP0.5:0.95 | Precision | Recall | Params
Solid red line | 0.853 | 0.399 | 0.912 | 0.783 | 5.1 M
Black dashed line | 0.849 | 0.424 | 0.779 | 0.849 | 5.1 M
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
