Article

YOLO-UIR: A Lightweight and Accurate Infrared Object Detection Network Using UAV Platforms

1 College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
2 Collaborative Innovation Center of Ultra-Precision Manufacturing, Hunan Institute of Advanced Technology, Changsha 410072, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(7), 479; https://doi.org/10.3390/drones9070479
Submission received: 19 May 2025 / Revised: 2 July 2025 / Accepted: 4 July 2025 / Published: 7 July 2025
(This article belongs to the Special Issue Intelligent Image Processing and Sensing for Drones, 2nd Edition)

Abstract

Within the field of remote sensing, Unmanned Aerial Vehicle (UAV) infrared object detection plays a pivotal role, especially in complex environments. However, existing methods face challenges such as insufficient accuracy or low computational efficiency, particularly in the detection of small objects. This paper proposes a lightweight and accurate UAV infrared object detection model, YOLO-UIR, for small object detection from a UAV perspective. The model is based on the YOLO architecture and mainly includes the Efficient C2f module, lightweight spatial perception (LSP) module, and bidirectional feature interaction fusion (BFIF) module. The Efficient C2f module significantly enhances feature extraction capabilities by combining local and global features through an Adaptive Dual-Stream Attention Mechanism. Compared with the existing C2f module, the introduction of Partial Convolution reduces the model’s parameter count while maintaining high detection accuracy. The BFIF module further enhances feature fusion effects through cross-level semantic interaction, thereby improving the model’s ability to fuse contextual features. Moreover, the LSP module efficiently combines features from different distances using Large Receptive Field Convolution Layers, significantly enhancing the model’s long-range information capture capability. Additionally, the use of Reparameterized Convolution and Depthwise Separable Convolution ensures the model’s lightweight nature, making it highly suitable for real-time applications. On the DroneVehicle and HIT-UAV datasets, YOLO-UIR achieves superior detection performance compared to existing methods, with an mAP of 71.1% and 90.7%, respectively. The model also demonstrates significant advantages in terms of computational efficiency and parameter count. Ablation experiments verify the effectiveness of each optimization module.

1. Introduction

In recent years, Unmanned Aerial Vehicles (UAVs) have been widely applied in the field of remote sensing due to their broad field of view and high mobility. These applications cover a wide range of critical domains, including precision agriculture, environmental monitoring, resource exploration, and security surveillance [1,2,3,4,5]. Infrared imaging technology, as an essential complementary technology to remote sensing, provides effective imaging support under nighttime and low-light conditions. Although its performance may be limited under certain complex meteorological conditions, such as heavy rain or thick smoke, it remains a valuable tool for UAV remote sensing missions, especially when visual spectrum imaging is insufficient. As a result, many UAVs are now equipped with infrared electro-optical payloads to enhance their capabilities in tasks such as reconnaissance and object detection. These systems have played a significant role in Geographic Information Systems (GISs) and Earth observation. Object detection is one of the core tasks in UAV remote sensing, and its performance directly affects the execution of downstream tasks, such as object tracking, trajectory prediction, and scene understanding [6,7,8].
However, object detection in UAV infrared imagery faces numerous challenges. Due to the high altitude of UAV operations, objects often appear extremely small in the images, and the absence of color information in infrared imagery complicates object identification, as shown in Figure 1. In addition, these challenges are compounded by the hardware limitations of UAVs. UAVs have limited communication capabilities, and transmitting data to the cloud for processing results in significant latency, which is not conducive to real-time decision making. Therefore, edge computing capabilities on UAVs have become increasingly important, as they enable rapid object detection and real-time data processing, thereby enhancing the efficiency of remote sensing missions.
Driven by the advancements in artificial intelligence, deep learning-based methods have become the primary means for UAV infrared object detection, with many scholars making improvements and optimizations on top of general object detection methods [9,10,11,12,13]. Deep learning methods for object detection are generally categorized into one-stage and two-stage approaches. One-stage methods predict coordinates and classes simultaneously, while two-stage methods typically first generate region proposals, and then classify these proposals and perform bounding box regression to achieve precise localization and identification of objects. To further enhance detection accuracy, the introduction of attention mechanisms and feature fusion methods has significantly improved the detection performance of deep learning models. Multi-scale feature extractors have also provided better feature representation capabilities. However, current detection models generally face two common issues: (1) Most improvements are based on existing general object detection methods in the visible light domain and are not specifically designed for infrared image object detection. The significant differences between infrared and visible light images imply that these methods do not fully leverage the unique characteristics of infrared imagery. (2) Although some methods, such as long-range feature extraction and self-attention mechanisms, can significantly improve detection accuracy, they also bring substantial computational and parameter loads, thus hindering their effective deployment on edge devices.
To address these challenges, this paper proposes a lightweight and efficient UAV infrared image object detection network named YOLO-UIR (UAV InfRared). The network is based on YOLO and has been improved in the backbone feature extraction layer and feature fusion layer. To tackle the difficulty of detecting small objects in infrared imagery, we introduce a lightweight spatial perception module (LSP) that captures spatial features using multiple large-sized convolutional kernels, thereby strengthening feature representation and establishing reliable long-range feature representations. In terms of effective feature extraction, we improve the existing C2f [14] module by designing an adaptive dual-stream attention module, which aims to increase the weight of important channel information and thus enhance the overall effectiveness of feature information. Finally, we propose a context interaction fusion module that improves the efficiency of feature fusion while maintaining lightweight characteristics by employing a mutual query strategy among contexts. The main contributions of this paper are as follows:
(1) We propose a lightweight and accurate UAV infrared object detection network based on the YOLO framework, with optimized feature extraction and fusion mechanisms. The superior performance of this network is demonstrated through experiments on two different datasets.
(2) We design an Efficient C2f module as the basic unit for feature extraction. This module constructs a global–local self-attention mechanism and introduces a learnable weight parameter to automatically adjust the fusion weights, thereby significantly enhancing the model’s feature extraction capability.
(3) We introduce a lightweight spatial perception module (LSP) that extracts long-range information using large-kernel reparameterized depthwise separable convolution layers. This module endows the model with a larger receptive field and enables the effective integration of local and long-range information.
(4) We design a bidirectional feature interaction fusion module (BFIF) for fusing semantic information across different levels. The bidirectional interaction mechanism of this module significantly enhances the feature fusion performance.

2. Related Works

2.1. Generic Object Detection

Object detection is a critical research direction in the field of computer vision, with the core aim of accurately identifying and localizing one or multiple objects within images or videos. Traditional object detection methods predominantly rely on manually designed feature extraction mechanisms and classification through human-defined rules, such as the Histogram of Oriented Gradients (HOG) [15] and Deformable Parts Model (DPM) [16]. However, owing to their limited generalization capabilities, these traditional approaches struggle to adapt to complex and variable image scenarios and have gradually been replaced by deep learning-based methods. Deep learning-based object detection methods can be primarily categorized into two types: one-stage and two-stage methods [17]. One-stage methods typically predict the location and class of objects directly from images, without the need for an explicit region proposal step, thereby offering significant advantages in computational efficiency. The YOLO (You Only Look Once) series of algorithms [18,19,20] is a quintessential representative of one-stage methods, renowned for its efficiency, lightweight nature, and broad applicability, and has been successfully applied across multiple domains. The YOLO algorithm has evolved through continuous updates and iterations by numerous scholars. Throughout the evolution of the YOLO series, a multitude of improvement methods have been successively proposed and applied in the field of object detection, such as the feature pyramid method [21] and the anchor-free mechanism [22] introduced after YOLOv3 [20], all of which have significantly enhanced the detection performance and generalization capabilities of the models. On the other hand, two-stage methods generally consist of a Region Proposal Network (RPN) [23] and a classifier, initially predicting potential locations of objects and then classifying these regions separately, thus offering higher accuracy in detection. R-CNN (Region-based Convolutional Neural Network) [23] is a representative two-stage algorithm, and based on R-CNN, researchers have further developed improved algorithms such as Fast R-CNN [24], Faster R-CNN [25], and Mask R-CNN [26]. During the iterative process of this series of algorithms, numerous innovative improvement methods have been proposed and applied in the field of object detection, effectively enhancing the detection performance and generalization capabilities of the models.

2.2. UAV Infrared Object Detection

Owing to the prevalence of small objects in UAV-captured images, many general object detection algorithms perform poorly. Consequently, numerous detection techniques specifically tailored for UAV object detection have been proposed. Wang et al. [27] proposed UAV-YOLOv8, which introduced a multi-branch detection head to improve the detection accuracy for small objects. Zhao et al. [28] proposed G-YOLO for UAV infrared object detection, which improved the lightweight backbone feature extraction network based on YOLOv8 and applied depthwise separable convolution [29] to reduce the model’s parameter count. Ding et al. [30] proposed a detection model capable of real-time operation, which optimized the model’s computational speed by utilizing partial convolution [31] and group convolution [32], thereby reducing the model’s computational load. However, these optimizations led to a decline in detection accuracy. Nevertheless, attention mechanisms can effectively enhance the model’s representation ability and improve the detection accuracy for small objects. Zhang et al. [33] proposed CE-RetinaNet, which designed a novel channel attention mechanism, while Wang et al. [34] introduced the ECA module [35] in YOLOFIV. He et al. [36] proposed ALSS-Yolo, which designed a lightweight channel attention module to enhance feature representation. Other related works include [37,38]. However, most current work focuses on improvements for visible light object detection, and these improvements are not specifically tailored for infrared object detection. This results in many of these methods failing to provide significant gains in infrared imaging object detection, with limitations still existing in detection accuracy and computational speed. In this paper, we analyze the characteristics of UAV infrared object detection and design a detection model specifically for this domain.

2.3. Model Lightweighting Methods

Despite their powerful generalization and fitting capabilities, neural networks often come at the cost of a substantial number of parameters and high computational requirements. To enable efficient edge computing, numerous methods for model lightweighting have been proposed. Depthwise separable convolution [29] is an efficient convolutional technique. Unlike traditional convolutional neural networks, depthwise separable convolution performs operations on each channel of the feature map independently, significantly reducing the computational load. However, due to the lack of interaction between channels, it is often combined with traditional convolution to achieve optimal performance. In addition, several novel lightweighting techniques have been introduced, such as reparameterized convolution [39], channel shuffle and rearrangement [40,41], and partial convolution [31]. In this paper, inspired by prior work, we integrate reparameterization and partial convolution techniques into our proposed improvements, effectively reducing the model’s parameter count and computational load.
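As a reference for how this factorization works, the following minimal PyTorch sketch (ours, not from the cited works) pairs a per-channel depthwise convolution with a 1 × 1 pointwise convolution; the printed parameter count illustrates the savings over a standard 3 × 3 convolution from 32 to 64 channels, which would require 32 × 64 × 9 = 18,432 weights.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Textbook depthwise separable convolution: a per-channel (depthwise) 3x3
    convolution followed by a 1x1 pointwise convolution that restores
    cross-channel interaction."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


conv = DepthwiseSeparableConv(32, 64)
print(conv(torch.randn(1, 32, 40, 32)).shape)     # torch.Size([1, 64, 40, 32])
print(sum(p.numel() for p in conv.parameters()))  # 32*9 + 32*64 = 2336
```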

3. Proposed Method

3.1. Overview

The lightweight UAV infrared object detection model framework proposed in this paper is illustrated in Figure 2. The model consists of three main components: the Backbone, Neck, and Head. To meet the demands of different computational platforms and processing speeds, we designed three model variants: YOLO-UIR-n, YOLO-UIR-s, and YOLO-UIR-m. These variants differ primarily in the number of channels and the number of blocks within the Efficient C2f module. The number of channels in each layer of YOLO-UIR-s is twice that of YOLO-UIR-n, while YOLO-UIR-m has three times the number of channels compared to YOLO-UIR-n. Additionally, the number of blocks in the Efficient C2f module of YOLO-UIR-m is twice that of YOLO-UIR-n. The YOLO-UIR-n variant is illustrated in Figure 2. Taking a UAV infrared image with a resolution of 640 × 512 as an example, the processing pipeline is described as follows: (1) Backbone Processing: The image is fed into the Backbone structure, where it undergoes downsampling via multiple Convolution-BatchNorm-SiLU (CBS) modules, feature extraction via the Efficient C2f module, and long-range spatial information capture via the LSP module. This results in three levels of feature maps with dimensions of 48 × 80 × 64, 96 × 40 × 32, and 192 × 20 × 16, respectively. (2) Neck Processing: The multi-level features are then input into the Neck structure, where upsampling and downsampling operations are performed. Additionally, feature fusion is conducted using the BFIF module. The processed features are then transmitted to the Head structure. (3) Head Processing: The Head structure processes the three different levels of features to predict object locations and classes. During model testing, a non-maximum suppression (NMS) post-processing operation is also performed.
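To make the scaling rule concrete, the following sketch derives the per-level channel widths and Efficient C2f block counts of the three variants from the YOLO-UIR-n values stated above; the helper names and the base block count are illustrative assumptions rather than the authors’ configuration.

```python
# Illustrative sketch of the scaling rule for the three YOLO-UIR variants
# (not the authors' configuration files). Base channels follow the YOLO-UIR-n
# feature-map widths stated above; the base block count is an assumption.
BASE_CHANNELS = (48, 96, 192)   # per-level output channels of YOLO-UIR-n
BASE_BLOCKS = 1                 # assumed Efficient C2f repeats in YOLO-UIR-n

VARIANTS = {
    "n": {"width_mult": 1, "depth_mult": 1},
    "s": {"width_mult": 2, "depth_mult": 1},   # 2x channels of n
    "m": {"width_mult": 3, "depth_mult": 2},   # 3x channels, 2x blocks of n
}

def variant_config(name: str) -> tuple[tuple[int, ...], int]:
    """Return (per-level channels, Efficient C2f block count) for a variant."""
    cfg = VARIANTS[name]
    channels = tuple(c * cfg["width_mult"] for c in BASE_CHANNELS)
    return channels, BASE_BLOCKS * cfg["depth_mult"]

print(variant_config("m"))  # ((144, 288, 576), 2)
```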
In the following sections, we will provide detailed descriptions of the components and mechanisms of the model: Section 3.2 will present the Efficient C2f module, Section 3.3 will present the LSP module, Section 3.4 will present the BFIF module, and Section 3.5 will present the loss function.

3.2. Efficient C2f

We optimize the existing C2f structure and introduce the Efficient C2f module as the primary feature extraction structure. The Efficient C2f module consists of two main components: stacked convolutional blocks, comprising a MainConv and multiple PCBlocks, and an adaptive dual-stream attention module, as shown in Figure 3. We replace the original Bottleneck structure with the PCBlock, which requires fewer computations and parameters and is therefore better suited to lightweight deployment on UAV platforms. The computational process is as follows: Given the input feature map $x \in \mathbb{R}^{H \times W \times C}$, a channel-wise split operation is first performed to obtain $x_1$ and $x_2$. A conventional convolution is then applied to $x_1$, and the resulting $x_1'$ is concatenated with the original $x_2$ to produce the final output $x_{out}$. This convolutional structure significantly reduces the number of computations and parameters, eliminates redundant information, and allows the channels that did not undergo convolution to interact with those that did through the subsequent Batch Normalization (BN) operation.
$x_1, x_2 = \mathrm{split}(x, \mathrm{ratio})$
$x_1' = \mathrm{conv2d}(x_1)$
$x_{out} = \mathrm{SiLU}(\mathrm{BN}(\mathrm{concat}(x_1', x_2)))$
where the split operation separates the feature map along the channel dimension, with the ratio being the separation proportion, typically set to 0.5. BN denotes the Batch Normalization computation, and concat represents the concatenation of feature maps along the channel dimension.
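For concreteness, the following PyTorch sketch shows one way the PCBlock computation above could be implemented; the class name, kernel size, and default split ratio are assumptions based on the description rather than the released code.

```python
import torch
import torch.nn as nn

class PCBlock(nn.Module):
    """Minimal sketch of the partial-convolution block described above.

    Only a fraction (`ratio`) of the channels is convolved; the remaining
    channels bypass the convolution and interact with the convolved ones
    through the shared BatchNorm applied after concatenation."""

    def __init__(self, channels: int, ratio: float = 0.5, kernel_size: int = 3):
        super().__init__()
        self.conv_channels = int(channels * ratio)
        self.conv = nn.Conv2d(self.conv_channels, self.conv_channels,
                              kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(
            x, [self.conv_channels, x.shape[1] - self.conv_channels], dim=1)
        x1 = self.conv(x1)                # convolve only a subset of channels
        out = torch.cat((x1, x2), dim=1)  # re-attach the untouched channels
        return self.act(self.bn(out))     # BN/SiLU over all channels


x = torch.randn(1, 64, 80, 64)
print(PCBlock(64)(x).shape)  # torch.Size([1, 64, 80, 64])
```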
To enhance the model’s feature representation capability and adaptively increase the weights of important channels, we design an adaptive dual-stream attention (ADA) module. In contrast to conventional channel attention, we observe that within the C2f structure, multiple convolution operations extract features from the input data and retain semantic features from different depth levels. Considering the differences both within individual feature maps and between levels, we design a channel attention mechanism that captures local and global features, as shown in Figure 3. After the original input x is processed by multiple serial convolutional modules, the feature map is passed through global average pooling and then through a 1D convolution and a fully connected layer to obtain local and global channel attention weights, respectively. By introducing a learnable parameter $\alpha$, we adaptively balance the local and global attention weights and apply them to the original input. In the local attention branch, we set the kernel size of the 1D convolution to 9 to ensure sufficient interaction within each feature level while keeping the attention module lightweight.
$f = \mathrm{GlobalAveragePooling}(x)$
$\mathrm{attention} = \sigma(\mathrm{conv1d}(f)) \cdot (1 - \alpha) + \sigma(\mathrm{ReLU}(\mathrm{Linear}(f))) \cdot \alpha$
$x_{out} = x \odot \mathrm{attention}$
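A minimal sketch of the ADA computation above is given below, assuming no channel reduction in the fully connected branch; it is meant to illustrate the adaptive blending via the learnable $\alpha$, not to reproduce the authors’ implementation.

```python
import torch
import torch.nn as nn

class AdaptiveDualStreamAttention(nn.Module):
    """Sketch of the adaptive dual-stream attention (ADA) equations above.

    A local branch (1D convolution over the channel descriptor, kernel 9) and
    a global branch (fully connected layer) each produce channel weights; a
    learnable scalar alpha blends them before reweighting the input."""

    def __init__(self, channels: int, k: int = 9):
        super().__init__()
        self.local = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.global_fc = nn.Linear(channels, channels)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable fusion weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        f = x.mean(dim=(2, 3))                                     # GAP -> (B, C)
        local_w = torch.sigmoid(self.local(f.unsqueeze(1)).squeeze(1))
        global_w = torch.sigmoid(torch.relu(self.global_fc(f)))
        attn = local_w * (1 - self.alpha) + global_w * self.alpha
        return x * attn.view(b, c, 1, 1)                           # reweight channels


x = torch.randn(2, 48, 40, 32)
print(AdaptiveDualStreamAttention(48)(x).shape)  # torch.Size([2, 48, 40, 32])
```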

3.3. Lightweight Spatial Perception

ASPP [42] provides a large receptive field, enabling the capture and fusion of long-range contextual information and thereby preventing the model from being trapped in local perception. However, it relies on large-dilation convolution operations, which incur significant computational costs and produce fragmented information, both of which are highly unfavorable for lightweight model design and deployment. We therefore design a lightweight spatial perception (LSP) module, as shown in Figure 4. The computational process of the LSP module is as follows: The input feature map is first split into four parts along the channel dimension using a channel-wise split operation. Each part undergoes a reparameterized large-kernel 2D depthwise separable convolution [29] to capture long-range information. The results are then concatenated along the channel dimension and transformed via a 1 × 1 convolutional layer to integrate the features. The final output is obtained by adding the transformed result to the original input through a skip connection, thereby preserving the input information.
$x_1, x_2, x_3, x_4 = \mathrm{split}(x)$
$x_{out} = \mathrm{Conv}_{1\times1}\big(\mathrm{GN}\big(\mathrm{concat}_{i=1}^{4}(\mathrm{RepDWConv}_i(x_i))\big)\big) + x$
During inference, the reparameterized convolution [39] layers can be merged. In LSP, we configure four reparameterized convolution layers with different kernel sizes. Each reparameterized convolution consists of a large-kernel convolution layer and a 3 × 3 convolution layer. These layers are separated during training but are merged for inference deployment. Therefore, the reparameterized large convolution kernels do not significantly increase the parameter count and computational cost. The principle of reparameterized convolution is as follows:
$x_{out} = \mathrm{Conv}_{l}(x) + \mathrm{Conv}_{s}(x)$
$x_{out} = \mathrm{Conv}_{rep}(x)$
where $\mathrm{Conv}_{l}$ represents the convolution operation with a large kernel, and $\mathrm{Conv}_{s}$ represents the 3 × 3 convolution layer.
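The merge itself amounts to zero-padding the 3 × 3 kernel to the large kernel size and summing the two weight tensors, as the following sketch verifies numerically for a depthwise case (bias terms and BatchNorm folding are omitted for brevity):

```python
import torch
import torch.nn.functional as F

def merge_reparam_kernels(w_large: torch.Tensor, w_small: torch.Tensor) -> torch.Tensor:
    """Zero-pad the small kernel to the large kernel size and sum the weights,
    so a single convolution reproduces Conv_l(x) + Conv_s(x) at inference."""
    pad = (w_large.shape[-1] - w_small.shape[-1]) // 2
    return w_large + F.pad(w_small, [pad] * 4)   # pad H and W symmetrically


# Sanity check: 16 channels, depthwise 7x7 large branch and 3x3 small branch.
x = torch.randn(1, 16, 32, 32)
w7 = torch.randn(16, 1, 7, 7)
w3 = torch.randn(16, 1, 3, 3)
two_branch = (F.conv2d(x, w7, padding=3, groups=16) +
              F.conv2d(x, w3, padding=1, groups=16))
merged = F.conv2d(x, merge_reparam_kernels(w7, w3), padding=3, groups=16)
print(torch.allclose(two_branch, merged, atol=1e-4))  # expected: True
```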
In the LSP module, we leverage different scales of receptive fields to hierarchically extract and fuse information from varying distances. Specifically, the large-kernel sizes are set to 7, 13, 17, and 21. The reparameterized convolution layers, which include 3 × 3 convolution layers, provide basic local feature representations for the large-kernel convolutions during training. This design choice thereby prevents overfitting of the large convolution layers.
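Putting the pieces together, a simplified inference-time version of the LSP module might look as follows; the training-time 3 × 3 reparameterization branch is omitted, and the group counts and GroupNorm settings are assumptions.

```python
import torch
import torch.nn as nn

class LSP(nn.Module):
    """Minimal sketch of the lightweight spatial perception (LSP) module.

    The input is split into four channel groups, each processed by a depthwise
    convolution with a different large kernel (7/13/17/21 as stated above);
    the branches are concatenated, normalized, fused by a 1x1 convolution, and
    added back to the input through a skip connection."""

    def __init__(self, channels: int, kernel_sizes=(7, 13, 17, 21)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        group = channels // len(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Conv2d(group, group, k, padding=k // 2, groups=group, bias=False)
            for k in kernel_sizes
        )
        self.norm = nn.GroupNorm(num_groups=4, num_channels=channels)
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        parts = torch.chunk(x, len(self.branches), dim=1)   # channel-wise split
        out = torch.cat([b(p) for b, p in zip(self.branches, parts)], dim=1)
        return self.fuse(self.norm(out)) + x                # skip connection


x = torch.randn(1, 192, 20, 16)
print(LSP(192)(x).shape)  # torch.Size([1, 192, 20, 16])
```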

3.4. Bidirectional Feature Interaction Fusion

To more effectively integrate features from different hierarchical levels, we propose a bidirectional feature interaction fusion (BFIF) module, as shown in Figure 5. The computational process of BFIF is as follows: (1) The feature map $x_h$ from the higher semantic level is upsampled and processed through a $\mathrm{Conv}_{1\times1}$ convolution to obtain a feature map of the same size as that from the lower semantic level. (2) Taking the lower-level branch as an example, the higher-level feature map $x_h$ undergoes a $\mathrm{Conv}_{1\times1}$ convolution that performs a linear transformation, resulting in a feature map of size $1 \times H \times W$. This feature map contains spatial positional information from the higher-level features, and we refer to it as the context. (3) The lower-level feature map $x_l$ is reshaped into a size of $C \times HW$ to facilitate interaction with the context. (4) The context is then matrix-multiplied with the reshaped lower-level feature map to obtain the guided channel weights, denoted attention. The attention is element-wise multiplied with the original lower-level feature map to produce an intermediate result. (5) This result is subsequently processed through a convolutional layer, followed by layer normalization and a ReLU activation function, to produce the final output.
$x_h = \mathrm{Conv}_{1\times1}(\mathrm{UpSample}(x_h))$
$\mathrm{context} = \mathrm{softmax}(\mathrm{Reshape}(x_l))$
$\mathrm{attention} = \mathrm{matmul}(\mathrm{Reshape}(x_h), \mathrm{context})$
$x_l = \mathrm{Conv}_{1\times1}(f(x_l \odot \mathrm{attention}))$
where $\mathrm{UpSample}$ denotes upsampling via linear interpolation, $\mathrm{matmul}$ denotes the matrix multiplication operation, $\odot$ denotes element-wise multiplication, and $f$ denotes the nonlinear transformation.
This process leverages feature interaction across different hierarchical levels at each spatial position, computes weight coefficients, and enhances the original input. Compared to the conventional approach of concatenating features at the channel level and directly processing them through a convolutional layer, this method improves the flow of information between features and enhances the effectiveness of feature representation.
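The sketch below implements one direction of this interaction following the step-by-step description above (the single-channel context is derived from the upsampled higher-level map and used to pool the lower-level map into channel weights); the symmetric direction and the exact normalization and activation choices are assumptions, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextGuidedFusion(nn.Module):
    """Sketch of one direction of the BFIF interaction described above."""

    def __init__(self, low_channels: int, high_channels: int):
        super().__init__()
        self.align = nn.Conv2d(high_channels, low_channels, kernel_size=1)
        self.context = nn.Conv2d(low_channels, 1, kernel_size=1)
        self.out = nn.Sequential(
            nn.Conv2d(low_channels, low_channels, kernel_size=1),
            nn.GroupNorm(1, low_channels),   # LayerNorm-like over channels
            nn.ReLU(inplace=True),
        )

    def forward(self, x_low: torch.Tensor, x_high: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x_low.shape
        # Upsample and align the higher-level map to the lower-level resolution.
        x_high = self.align(F.interpolate(x_high, size=(h, w), mode="bilinear",
                                          align_corners=False))
        # Single-channel spatial context, softmax-normalized over positions.
        ctx = torch.softmax(self.context(x_high).view(b, 1, h * w), dim=-1)
        # Pool the lower-level map into per-channel weights, then reweight it.
        attn = torch.bmm(x_low.view(b, c, h * w), ctx.transpose(1, 2))  # (B, C, 1)
        return self.out(x_low * attn.view(b, c, 1, 1))


low = torch.randn(1, 48, 80, 64)
high = torch.randn(1, 96, 40, 32)
print(ContextGuidedFusion(48, 96)(low, high).shape)  # torch.Size([1, 48, 80, 64])
```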

3.5. Loss Function

The total loss function $L_{total}$ of our proposed YOLO-UIR model consists of three parts, consistent with YOLOv8. Since the model adopts an anchor-free structure, we use CIoU [43] (Complete-IoU) and DFL (Distribution Focal Loss) [14] as the regression losses $L_{box}$ and $L_{dfl}$ for optimizing the coordinates of the predicted bounding boxes, and CE loss (Cross-Entropy loss) as the classification loss $L_{cls}$. The formulas for $L_{dfl}$ and $L_{box}$ are relatively involved and can be found in their respective references. In the following equations, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weight balancing parameters.
$L_{total} = \lambda_1 L_{cls} + \lambda_2 L_{box} + \lambda_3 L_{dfl}$
$L_{cls} = -\frac{1}{N} \sum_{i=1}^{N} y_i^c \cdot \log(p_i^c)$
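As a small worked example of the weighted sum, the following sketch combines the three loss terms with the λ values reported later in Section 4.3; the individual terms themselves are assumed to come from a YOLOv8-style head and are represented here by placeholder tensors.

```python
import torch

def total_loss(loss_cls: torch.Tensor, loss_box: torch.Tensor,
               loss_dfl: torch.Tensor, lambdas=(0.5, 7.5, 0.375)) -> torch.Tensor:
    """Weighted sum of the three loss terms, as in the total-loss equation."""
    l1, l2, l3 = lambdas
    return l1 * loss_cls + l2 * loss_box + l3 * loss_dfl


# Placeholder loss values, purely illustrative.
print(total_loss(torch.tensor(0.8), torch.tensor(0.3), torch.tensor(0.5)))
# tensor(2.8375)  =  0.5*0.8 + 7.5*0.3 + 0.375*0.5
```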

4. Experiments and Analysis

4.1. Experiment Datasets

(1) HIT-UAV
The HIT-UAV [44] dataset is a specialized infrared dataset designed for object detection in Unmanned Aerial Vehicle (UAV) applications, created by a research team from Harbin Institute of Technology. This dataset comprises 2898 infrared images extracted from 43,470 frames captured by UAVs across various real-world scenarios, including schools, parking lots, roads, and playgrounds, under both daytime and nighttime lighting conditions, as shown in Figure 6. This diverse range of scenarios ensures the dataset’s broad applicability across different environments. The dataset includes five object classes: Person, Bicycle, Car, OtherVehicle, and DontCare. The OtherVehicle and DontCare classes represent a very small proportion of the total dataset. Therefore, to enhance the dataset’s cleanliness and utility, OtherVehicle instances were merged into the Car class, and DontCare samples were excluded from the dataset. The images in the dataset have a resolution of 512 × 640. Prior to experimentation, the dataset was divided into training, validation, and test sets in a 7:1:2 ratio, resulting in 2029 training images, 290 validation images, and 579 test images. This division facilitates the training, tuning, and evaluation of object detection models, ensuring a balanced and representative distribution of data for robust algorithm development.
(2) DroneVehicle
The DroneVehicle dataset [45] is a large-scale aerial vehicle detection dataset collected and annotated by a research team from Tianjin University. This dataset comprises 28,439 pairs of RGB and infrared images captured by drones across various real-world scenarios, including daytime and nighttime conditions, as shown in Figure 7. In this study, we exclusively utilized the infrared images for our experiments. The dataset encompasses five object classes: Car, Truck, Bus, Van, and Freight Car. The images in the dataset have a resolution of 840 × 712. To facilitate more effective model training, we uniformly resized the images to 640 × 512. Similar to the HIT-UAV dataset, we divided the DroneVehicle dataset into training, validation, and test sets in a 7:1:2 ratio, resulting in 19,907 training images, 2844 validation images, and 5688 test images.

4.2. Evaluation Metrics

In order to comprehensively evaluate the performance of the model, we have selected precision, recall, and mean Average Precision (mAP) as the primary metrics, which are commonly used in object detection tasks. These metrics provide a balanced assessment of the model’s accuracy and robustness in detecting objects across various classes and scenarios. Precision measures the proportion of true positive detections among all positive predictions, while recall quantifies the proportion of true positives among all actual positive instances. Meanwhile, mAP evaluates the model’s ability to detect objects accurately across different Intersection over Union (IoU) thresholds, providing a more holistic view of detection performance.
Furthermore, to assess the computational efficiency of the model, we have incorporated the number of model parameters and FLOPs as additional metrics. The number of parameters indicates the model’s complexity and storage requirements, which are critical factors for deployment in resource-constrained environments. FLOPs, on the other hand, quantify the computational load required for a single forward pass of the model, thereby reflecting its suitability for real-time applications. By considering both performance and efficiency metrics, we aim to provide a thorough evaluation of the model’s capabilities and limitations.
$\mathrm{Precision} = \frac{TP}{TP + FP}$
$\mathrm{Recall} = \frac{TP}{TP + FN}$
where $TP$ denotes true positives, $FN$ denotes false negatives, and $FP$ denotes false positives.
$AP = \int_{0}^{1} P(r) \, dr$
$mAP = \frac{1}{c} \sum_{i=1}^{c} AP_i$
where $P(r)$ denotes the precision at recall level $r$, and $c$ denotes the number of classes.
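For illustration, the sketch below computes precision, recall, and mAP from toy counts and per-class AP values; in practice, AP is obtained by integrating the precision-recall curve produced by the detector at a given IoU threshold.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall as defined above (guarding against zero division)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mean_average_precision(ap_per_class: list[float]) -> float:
    """mAP as the mean of per-class AP values (each AP being the area under
    that class's precision-recall curve)."""
    return sum(ap_per_class) / len(ap_per_class)


print(precision_recall(tp=90, fp=10, fn=20))       # (0.9, 0.8181818181818182)
print(mean_average_precision([0.95, 0.70, 0.60]))  # 0.75
```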

4.3. Implementation Details

We conduct our experiments using an Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz and an NVIDIA A100-PCIE-40GB GPU (Santa Clara, CA, USA) as the hardware environment. The programming language used is Python 3.9, and the PyTorch 1.12 library is employed. We implement the compared models using MMDetection and Ultralytics.
In terms of training and testing parameter settings, we set the values of $\lambda_1$, $\lambda_2$, and $\lambda_3$ in Equation (15) to 0.5, 7.5, and 0.375, respectively. To ensure a fair comparison, we do not fine-tune these hyperparameters and keep them as consistent as possible with those used in the other compared models. For the DroneVehicle dataset, we train for 100 epochs, while for the HIT-UAV dataset, we train for 150 epochs. Data augmentation includes random blurring, mirroring, and flipping. We replace the BCE (Binary Cross-Entropy) loss with focal loss for calculating the confidence loss of bounding boxes, thereby alleviating the class imbalance between positive and negative samples. The initial learning rate is set to 0.004 and is decayed using a cosine annealing schedule, eventually dropping to $1 \times 10^{-5}$. The optimizer used is SGD. For post-processing, we employ traditional NMS with a score threshold of 0.05 and an IoU threshold of 0.65 for all models. The $\alpha$ in Equation (5) is initialized to 0.5 and is then optimized by gradient descent.
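A hypothetical sketch of the optimizer and learning-rate schedule described above is shown below; the momentum and weight-decay values are common YOLO defaults and are assumptions rather than settings reported in the paper, and the placeholder module stands in for YOLO-UIR.

```python
import torch

# SGD with initial LR 0.004, cosine annealing down to 1e-5, as stated above.
model = torch.nn.Conv2d(1, 8, 3)          # placeholder for YOLO-UIR
epochs = 100                              # DroneVehicle; 150 for HIT-UAV
optimizer = torch.optim.SGD(model.parameters(), lr=0.004,
                            momentum=0.937, weight_decay=5e-4)  # assumed values
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs,
                                                       eta_min=1e-5)

for epoch in range(epochs):
    # ... one training epoch (forward pass, loss, backward, optimizer.step()) ...
    scheduler.step()
print(f"final learning rate: {scheduler.get_last_lr()[0]:.2e}")
```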

4.4. Comparative Experiments

To comprehensively evaluate and compare the performance of our proposed model, we selected a range of state-of-the-art object detection models as benchmarks. These include one-stage detectors such as YOLOv7 [46], YOLOv8 [14], YOLOv9 [47], YOLOv10 [48], and YOLOv11 [49], the two-stage detector Cascade-RCNN [50], and the Transformer-based architecture Swin-Transformer [51]. Additionally, we included models specifically designed for UAV infrared object detection, namely G-YOLO [28], LRI-YOLO [30], and CEMP [52], for comparative analysis. In the comparative experiments, all models were configured according to the parameters set in Section 4.3.

4.4.1. Results on DroneVehicle Dataset

The comparative experimental results on the DroneVehicle dataset are presented in Table 1. As evident from the table, the proposed YOLO-UIR model achieves the highest mAP across three different scales of comparative experiments. It also demonstrates significant advantages in terms of FLOPs and parameters, being more lightweight than other models except for G-YOLO. Figure 8 displays selected detection results, highlighting the challenges posed by occlusion, blurriness, and small inter-class differences in the DroneVehicle dataset. For instance, the freight car in the lower left corner of the first column is challenging to detect or classify correctly due to its small size and limited representative information. However, the YOLO-UIR model, augmented with a feature-enhanced attention module, can effectively capture subtle texture features in the original image, leading to accurate detections. Similar observations can be made in columns 3 and 4. Column 5 illustrates detection results in a blurred environment, where background objects resembling vehicle objects can cause local perception to lead to model misdetection, as seen in the results of CEMP-YOLO, YOLOv9, YOLOv11, and LRI-YOLO. The YOLO-UIR model, with its large receptive field module improvement, can integrate information over longer distances, thereby minimizing the likelihood of mistaking local background for objects.
Figure 9 and Figure 10 depict the Precision–Recall (P-R) curves for different classes of YOLO-UIR and the overall classes of various models, respectively. Figure 9 indicates that the proposed model performs well in detecting cars and buses, but precision drops sharply for other classes in the recall range of 0.8 to 1.0, suggesting that a certain proportion of these objects remain undetected and underscoring the difficulty of detecting such classes. In Figure 10, the YOLO-UIR model exhibits a larger area under the curve, outperforming all other models in precision at the same recall rate.

4.4.2. Results on HIT-UAV Dataset

The comparative experimental results on the HIT-UAV dataset are presented in Table 2. As shown in the table, the proposed YOLO-UIR model consistently achieves higher mAP values than existing detection models across different scales of experiments, while also exhibiting lower computational complexity in terms of floating-point operations and parameter count except for G-YOLO. Example detection results on the HIT-UAV dataset are illustrated in Figure 11. Unlike the DroneVehicle dataset, HIT-UAV contains a higher proportion of extremely small objects with resolutions generally below 10 pixels, such as pedestrians and bicycles. These objects often overlap in densely populated areas, posing significant challenges to detection models. In the first column of Figure 11, a long-range road vehicle detection scenario is demonstrated. The imaging characteristics and size of vehicles in such scenes differ from those in other contexts, necessitating enhanced feature extraction capabilities from the detection model. In this case, YOLOv9, G-YOLO, and LRI-YOLO failed to detect the object. Column 3 shows a detection scenario with occluded vehicles, where YOLOv8 and YOLOv11 exhibited duplicate detections. This phenomenon is attributed to the models’ insufficient receptive fields, which prevent effective integration of surrounding feature information for accurate object boundary localization. Column 4 presents a dense detection scenario, where the proposed YOLO-UIR model accurately identifies objects despite interference from streetlights and other background elements. This demonstrates the model’s robustness in complex environments.
Figure 12 and Figure 13 depict the P-R curves for different classes of YOLO-UIR and the overall P-R curves comparing various models, respectively. Although YOLO-UIR achieves higher precision and recall rates overall, some objects in the Person and Bicycle classes remain undetected, highlighting the ongoing challenges in detecting these small and easily occluded objects.

4.5. Ablation Studies and Analysis

To thoroughly validate the effectiveness of the proposed improved modules, we conducted a series of ablation studies on the DroneVehicle dataset using the YOLO-UIR-n model. These studies primarily focused on the Efficient C2f, LSP, and BFIF modules. We systematically replaced these improved modules with their original counterparts in the YOLO architecture and analyzed their performance in detail.

4.5.1. Ablation Study on Efficient C2f Module

As shown in Table 3, after replacing the conventional C2f module with the Efficient C2f module, the mAP of the model increased by 1.3%, while the number of parameters and FLOPs decreased by 0.4M and 0.9G, respectively. These results demonstrate the high efficiency of the Efficient C2f module. To further investigate the underlying mechanism of this performance improvement, we visualized the feature maps output by the second Efficient C2f module in the Backbone, as shown in Figure 14. It can be clearly observed from the figure that the features extracted by the Efficient C2f module exhibit more pronounced semantic representations at the channel level. Taking vehicle objects as an example, more channels show significant responses to vehicle objects in the 48-channel feature maps, whereas the baseline model primarily contains more texture information. This indicates that the Efficient C2f module effectively enhances the semantic information representation of features, reduces channel redundancy, and strengthens the aggregation of effective semantic information.
To further compare existing attention mechanisms, we conducted comparative experiments with other attention modules, and the ablation results are shown in Table 4. In Table 4, we incorporated the SE [53], ECA [35], CBAM [54], and our proposed ADA attention modules, respectively. On the DroneVehicle dataset, ADA achieved the highest mAP, which is 0.5% and 0.4% higher than that of SE and ECA, respectively, demonstrating the superiority of ADA. We attribute this to the fact that different feature channels have varying importance across different stages and receptive fields. Neither a single local attention mechanism nor a global attention mechanism can fully meet the needs of the entire network. Therefore, by designing an attention mechanism with adaptive weights that integrates both local and global attention, we achieved better performance. When the CBAM module was added to the model, the mAP was even 0.2% lower than the baseline. We believe this is because the spatial attention in CBAM caused the model to overly focus on local information, leading to an increase in false detections.

4.5.2. Ablation Study on Lightweight Spatial Perception

In the ablation study of the LSP module, we compared it with the conventional SPFFBottleNeck module, and the results are shown in Table 3. The results indicate that the introduction of the LSP module significantly increased the model’s mAP by 1.7%, with a slight decrease in the number of parameters and FLOPs. We attribute this improvement in detection performance mainly to the multi-scale large-kernel convolutions. These depthwise separable large-kernel convolutions provide the model with a broader capability for capturing contextual information. For example, in Figure 15, models without large receptive fields are easily misled by residual images, leading to false detections. In contrast, the model with an LSP module can incorporate contextual information for object detection. For instance, vehicle objects are typically located on roads and have a certain separation from buildings. Some objects that are prone to false detection may resemble vehicles in appearance but have significantly different surrounding environments. Therefore, models with larger receptive fields can effectively avoid misclassifying such objects.

4.5.3. Ablation Study on Bidirectional Feature Interaction Fusion Module

In the ablation study of the BFIF module, we compared it with conventional 1 × 1 convolution, and the results are shown in Table 3. The results show that the introduction of the BFIF module increased the model’s mAP by 1.8%, with only minor changes in the number of parameters and FLOPs. This indicates that the BFIF module can significantly improve detection accuracy while maintaining the original computational cost. To further verify its mechanism, we visualized the features of the model’s Neck structure using the ablation-cam technique [55], as shown in Figure 16. It can be observed from the figure that the semantic information of the model is more effectively propagated and integrated due to the contextual interaction fusion. Particularly in the localization of object centers, the model achieves more precise identification, which is crucial for dense object detection tasks. For example, in the lower-left corner of the second row in Figure 16, the baseline model’s detection results show numerous repeated predictions of vehicles, while the model with the BFIF module can more accurately represent the object centers and the dimensions of the bounding boxes.

4.6. Supplemental Experiments

4.6.1. Computational and Training Efficiency Analysis

We conducted an experiment to evaluate the computational and training efficiency of the proposed model, including its inference speed and training time on the DroneVehicle dataset, as shown in Table 5. For computational performance assessment, the model was executed on a single core, as detailed in the Implementation Details section, utilizing the ONNX Runtime framework. The input tensor shape was set to 1 × 640 × 512 , with a batch size of 1. The model was tested over 500 iterations to determine the number of samples processed per second. For training, the model was deployed on two GPU devices.
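The throughput measurement described above could be reproduced with a script along the following lines; the ONNX file name, the single-channel input layout, and the exact thread settings are assumptions.

```python
import time
import numpy as np
import onnxruntime as ort

# ONNX Runtime on a single CPU core, batch size 1, 500 iterations.
opts = ort.SessionOptions()
opts.intra_op_num_threads = 1             # restrict execution to one core
session = ort.InferenceSession("yolo_uir_n.onnx", sess_options=opts,
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 1, 640, 512).astype(np.float32)  # assumed 1-channel input

start = time.perf_counter()
for _ in range(500):
    session.run(None, {input_name: dummy})
elapsed = time.perf_counter() - start
print(f"{500 / elapsed:.1f} samples/s")
```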
Despite having fewer FLOPs, models such as LRI-YOLO exhibited lower inference speeds due to their complex internal structures, which involve multiple nonlinear operations and memory-intensive computations. In contrast, our proposed YOLO-UIR model achieved the highest inference speed of 47 frames per second, which is significantly faster than other models. Additionally, YOLO-UIR demonstrated the shortest training time of 1.7 h, indicating its superior training efficiency. It is worth noting that YOLOv9, with its programmable gradient information (PGI) mechanism, required a much longer training time of 8.4 h, which can be attributed to the additional computational complexity introduced during the training phase.

4.6.2. Infrared Detection in Non-UAV Perspectives

To further validate the infrared detection capabilities of the YOLO-UIR model in non-UAV perspectives, we conducted additional experiments on the LLVIP dataset. The LLVIP dataset is a visible-infrared paired dataset specifically designed for low-light vision tasks. It contains 30,976 images, or 15,488 pairs, most of which were captured in very dark scenes. The dataset features strict temporal and spatial alignment between visible and infrared images, making it suitable for tasks such as image fusion, low-light pedestrian detection, and image-to-image translation. Importantly, the dataset includes a large number of annotated pedestrians, which enhances its utility for detection tasks. For our experiments, we utilized the infrared images from the LLVIP dataset to evaluate the performance of YOLO-UIR in detecting objects under low-light conditions.
As shown in Table 6, the YOLO-UIR model achieved the highest mAP of 94.3% on the LLVIP dataset [56], outperforming all other comparative models. Moreover, YOLO-UIR exhibited a computational complexity of 3.0G FLOPs and a model size of 1.4M parameters, both of which are relatively low. These results indicate that YOLO-UIR not only possesses superior detection accuracy in non-UAV infrared scenarios but also maintains high computational efficiency and model compactness. This further corroborates the generalization and robustness of the YOLO-UIR model, demonstrating its ability to effectively handle complex and diverse infrared scenes beyond UAV perspectives.

4.6.3. Hyperparameter and Data Augmentation Effects

We conducted experiments to investigate the impact of hyperparameters and data augmentation on model performance. The results indicate that hyperparameters significantly influence model performance. As shown in Table 7, using grid search, we evaluated different values of the hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$. Specifically, the model achieved the highest mAP of 71.1% when $\lambda_1$ was set to 0.5. For $\lambda_2$, an mAP of 71.1% was obtained with values of 1 and 7.5, while a value of 20 resulted in a lower mAP of 64.4%. For $\lambda_3$, the highest mAP of 71.1% was achieved with a value of 0.375. These findings suggest that $\lambda_1$, $\lambda_2$, and $\lambda_3$ play crucial roles in enhancing model performance at specific values.
Regarding data augmentation, Table 8 presents the effects of three augmentation techniques: Blur, Flip, and Mirror. The results show that Blur has a significant positive impact on model performance. Without Blur, the mAP was 67.2%, whereas with Blur, the mAP increased to 71.1%. This improvement is likely due to Blur simulating the characteristics of blurred imaging in infrared scenes, thereby introducing more challenging samples and enhancing the model’s detection capability in such scenarios. In contrast, Flip had a minimal effect on model performance. The mAP was 70.6% without Flip, which is only slightly lower than the mAP of 71.1% with Flip. This is probably because natural scene images often have strong contextual relationships, and the Flip operation did not generate additional samples that closely resemble the test scenarios.

5. Conclusions

This paper proposes the YOLO-UIR model to address the issue of infrared small object detection from the perspective of UAVs. The model is based on the YOLO framework and incorporates three key modules: Efficient C2f, LSP, and BFIF. These modules are designed to enhance the model’s feature extraction capabilities and feature fusion performance. Extensive comparative experiments conducted on the DroneVehicle and HIT-UAV datasets demonstrate that YOLO-UIR outperforms existing models in terms of detection accuracy and computational efficiency. It achieves mAP scores of 71.1% and 90.7% on the two datasets, respectively, while maintaining a low computational cost with only 3.0G FLOPs. Additionally, ablation studies validate the effectiveness of each optimization module.
Despite the significant performance improvements of YOLO-UIR, some limitations remain. For instance, the model still struggles to recognize certain difficult-to-detect object classes, such as trucks, freight cars, and vans. These objects are challenging due to their relatively small sample size, large size, and significant imaging differences across various perspectives, as shown in Figure 10. Therefore, future work will focus on improving the model’s detection capabilities for these challenging objects to further enhance its generalization and robustness.

Author Contributions

Conceptualization, C.W. and R.W.; methodology, R.W.; validation, Z.B. and Z.W.; formal analysis, T.H.; investigation, R.W.; resources, C.W.; data curation, R.W.; writing—original draft preparation, R.W.; writing—review and editing, C.W.; visualization, R.W.; supervision, C.W.; All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this paper are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Feroz, S.; Abu Dabous, S. Uav-based remote sensing applications for bridge condition assessment. Remote Sens. 2021, 13, 1809. [Google Scholar] [CrossRef]
  2. Duan, Z.; Liu, J.; Ling, X.; Zhang, J.; Liu, Z. Ernet: A rapid road crack detection method using low-altitude UAV remote sensing images. Remote Sens. 2024, 16, 1741. [Google Scholar] [CrossRef]
  3. Xue, H.; Liu, K.; Wang, Y.; Chen, Y.; Huang, C.; Wang, P.; Li, L. MAD-UNet: A Multi-Region UAV Remote Sensing Network for Rural Building Extraction. Sensors 2024, 24, 2393. [Google Scholar] [CrossRef]
  4. Zhu, J.; Li, Y.; Wang, C.; Liu, P.; Lan, Y. Method for Monitoring Wheat Growth Status and Estimating Yield Based on UAV Multispectral Remote Sensing. Agronomy 2024, 14, 991. [Google Scholar] [CrossRef]
  5. Liu, S.; Wu, R.; Qu, J.; Li, Y. HDA-Net: Hybrid convolutional neural networks for small objects recognization at airports. IEEE Trans. Instrum. Meas. 2022, 71, 1–14. [Google Scholar] [CrossRef]
  6. Debnath, D.; Vanegas, F.; Sandino, J.; Hawary, A.F.; Gonzalez, F. A Review of UAV Path-Planning Algorithms and Obstacle Avoidance Methods for Remote Sensing Applications. Remote Sens. 2024, 16, 4019. [Google Scholar] [CrossRef]
  7. Min, X.; Zhou, W.; Hu, R.; Wu, Y.; Pang, Y.; Yi, J. Lwuavdet: A lightweight uav object detection network on edge devices. IEEE Internet Things J. 2024. [Google Scholar] [CrossRef]
  8. Sagar, A.S.; Tanveer, J.; Chen, Y.; Dang, L.M.; Haider, A.; Song, H.K.; Moon, H. BayesNet: Enhancing UAV-Based Remote Sensing Scene Understanding with Quantifiable Uncertainties. Remote Sens. 2024, 16, 925. [Google Scholar] [CrossRef]
  9. Bouguettaya, A.; Zarzour, H.; Kechida, A.; Taberkit, A.M. Vehicle detection from UAV imagery with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6047–6067. [Google Scholar] [CrossRef]
  10. Ni, J.; Zhu, S.; Tang, G.; Ke, C.; Wang, T. A small-object detection model based on improved YOLOv8s for UAV image scenarios. Remote Sens. 2024, 16, 2465. [Google Scholar] [CrossRef]
  11. Wang, S.; Jiang, H.; Yang, J.; Ma, X.; Chen, J. Amfef-detr: An end-to-end adaptive multi-scale feature extraction and fusion object detection network based on uav aerial images. Drones 2024, 8, 523. [Google Scholar] [CrossRef]
  12. Li, J.; Xu, Y.; Nie, K.; Cao, B.; Zuo, S.; Zhu, J. PEDNet: A lightweight detection network of power equipment in infrared image based on YOLOv4-tiny. IEEE Trans. Instrum. Meas. 2023, 72, 1–12. [Google Scholar] [CrossRef]
  13. Fu, J.; Li, F.; Zhao, J. Regional Saliency Combined With Morphological Filtering for Infrared Maritime Target Detection in Unmanned Aerial Vehicles Images. IEEE Trans. Instrum. Meas. 2025. [Google Scholar] [CrossRef]
  14. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. 2023. Version 8.0.0, AGPL-3.0 License. Available online: https://github.com/ultralytics/ultralytics (accessed on 3 July 2025).
  15. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 21–23 September 2005; IEEE: New York, NY, USA, 2005; Volume 1, pp. 886–893. [Google Scholar]
  16. Felzenszwalb, P.; McAllester, D.; Ramanan, D. A discriminatively trained, multiscale, deformable part model. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; IEEE: New York, NY, USA, 2008; pp. 1–8. [Google Scholar]
  17. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  18. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  19. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  20. Farhadi, A.; Redmon, J. Yolov3: An incremental improvement. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 1804, pp. 1–6. [Google Scholar]
  21. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  22. Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 840–849. [Google Scholar]
  23. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  24. Girshick, R. Fast r-cnn. arXiv 2015, arXiv:1504.08083. [Google Scholar]
  25. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  26. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  27. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef] [PubMed]
  28. Zhao, X.; Zhang, W.; Xia, Y.; Zhang, H.; Zheng, C.; Ma, J.; Zhang, Z. G-YOLO: A Lightweight Infrared Aerial Remote Sensing Target Detection Model for UAVs Based on YOLOv8. Drones 2024, 8, 495. [Google Scholar] [CrossRef]
  29. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  30. Ding, B.; Zhang, Y.; Ma, S. A Lightweight Real-Time Infrared Object Detection Model Based on YOLOv8 for Unmanned Aerial Vehicles. Drones 2024, 8, 479. [Google Scholar] [CrossRef]
  31. Chen, J.; Kao, S.h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  32. Cohen, T.; Welling, M. Group equivariant convolutional networks. In Proceedings of the International Conference on Machine Learning. PMLR, New York, NY, USA, 19–24 June 2016; pp. 2990–2999. [Google Scholar]
  33. Zhang, Y.; Cai, Z. CE-RetinaNet: A channel enhancement method for infrared wildlife detection in UAV images. IEEE Trans. Geosci. Remote Sens. 2023. [Google Scholar] [CrossRef]
  34. Wang, H.; Wang, C.; Fu, Q.; Si, B.; Zhang, D.; Kou, R.; Yu, Y.; Feng, C. YOLOFIV: Object detection algorithm for around-the-clock aerial remote sensing images by fusing infrared and visible features. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024. [Google Scholar] [CrossRef]
  35. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  36. He, A.; Li, X.; Wu, X.; Su, C.; Chen, J.; Xu, S.; Guo, X. Alss-yolo: An adaptive lightweight channel split and shuffling network for tir wildlife detection in uav imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024. [Google Scholar] [CrossRef]
  37. Fang, H.; Xia, M.; Zhou, G.; Chang, Y.; Yan, L. Infrared small UAV target detection based on residual image prediction via global and local dilated residual networks. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  38. Wang, Q.; Chi, Y.; Shen, T.; Song, J.; Zhang, Z.; Zhu, Y. Improving RGB-infrared object detection by reducing cross-modality redundancy. Remote Sens. 2022, 14, 2020. [Google Scholar] [CrossRef]
  39. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13733–13742. [Google Scholar]
  40. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  41. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetv2: Enhance cheap operation with long-range attention. Adv. Neural Inf. Process. Syst. 2022, 35, 9969–9982. [Google Scholar]
  42. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  43. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef] [PubMed]
  44. Suo, J.; Wang, T.; Zhang, X.; Chen, H.; Zhou, W.; Shi, W. HIT-UAV: A high-altitude infrared thermal dataset for Unmanned Aerial Vehicle-based object detection. Sci. Data 2023, 10, 227. [Google Scholar] [CrossRef]
  45. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
  46. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  47. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–21. [Google Scholar]
  48. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2025, 37, 107984–108011. [Google Scholar]
  49. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
  50. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  51. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  52. Hong, Y.; Wang, L.; Su, J.; Li, Y.; Fang, S.; Li, W.; Li, M.; Wang, H. CEMP-YOLO: An infrared overheat detection model for photovoltaic panels in UAVs. Digit. Signal Process. 2025, 161, 105072. [Google Scholar] [CrossRef]
  53. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  54. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  55. Desai, S.; Ramaswamy, H.G. Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 983–991. [Google Scholar]
  56. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3496–3504. [Google Scholar]
Figure 1. Challenges in UAV infrared object detection.
Figure 2. YOLO-UIR-n network overview.
Figure 3. Efficient C2f module.
Figure 4. Lightweight spatial perception (LSP) module.
Figure 5. Bidirectional feature interaction fusion (BFIF) module.
Figure 6. HIT-UAV dataset.
Figure 7. DroneVehicle dataset.
Figure 8. Comparative detection examples of different models on the DroneVehicle dataset. Yellow arrows indicate obvious false positives or false negatives.
Figure 9. P-R curves of YOLO-UIR-n for different classes on the DroneVehicle dataset.
Figure 10. P-R curves of different models on the DroneVehicle dataset.
Figure 11. Comparative detection examples of different models on the HIT-UAV dataset. Yellow arrows indicate obvious false positives or false negatives.
Figure 12. P-R curves of YOLO-UIR-n for different classes on the HIT-UAV dataset.
Figure 13. P-R curves of different models on the HIT-UAV dataset.
Figure 14. Ablation study results for the Efficient C2f module.
Figure 15. Ablation results for the LSP module. Yellow arrows indicate obvious false positives.
Figure 16. Ablation results for the BFIF module. Yellow arrows indicate obvious false positives or false negatives.
Table 1. Comparative results of different models on the DroneVehicle dataset.

| Methods | AP Car | AP Truck | AP Freight | AP Bus | AP Van | mAP (%) | Flops (G) | Parameters (M) |
|---|---|---|---|---|---|---|---|---|
| YOLOv7-t | 94.0 | 53.8 | 52.3 | 89.1 | 37.4 | 65.3 ± 0.3 | 10.6 | 6.0 |
| YOLOv8-n | 94.7 | 55.6 | 57.0 | 91.4 | 47.9 | 69.3 ± 0.3 | 6.6 | 3.0 |
| YOLOv9-t | 94.4 | 52.3 | 51.4 | 91.3 | 44.5 | 66.8 ± 0.2 | 6.2 | 3.2 |
| YOLOv10-n | 94.5 | 49.1 | 49.9 | 88.4 | 42.4 | 64.9 ± 0.4 | 7.8 | 3.0 |
| YOLOv11-n | 94.8 | 54.0 | 53.5 | 91.4 | 39.0 | 66.6 ± 0.6 | 5.2 | 2.6 |
| G-YOLO | 93.4 | 53.8 | 49.4 | 89.1 | 42.9 | 65.7 ± 0.4 | 3.7 | 0.8 |
| LRI-YOLO | 94.5 | 56.5 | 54.3 | 91.1 | 46.6 | 68.6 ± 0.3 | 3.8 | 1.6 |
| CEMP-YOLO | 94.3 | 57.9 | 53.5 | 91.8 | 50.7 | 69.2 ± 0.2 | 3.8 | 2.1 |
| YOLO-UIR-n (Ours) | 94.9 | 60.1 | 59.2 | 90.8 | 50.6 | 71.1 ± 0.3 | 3.0 | 1.4 |
| YOLOv8-s | 94.7 | 62.6 | 62.7 | 93.0 | 53.6 | 73.3 ± 0.2 | 22.8 | 11.1 |
| YOLOv9-s | 94.3 | 62.5 | 63.5 | 91.8 | 51.2 | 72.7 ± 0.2 | 21.1 | 9.6 |
| YOLOv10-s | 94.4 | 57.8 | 61.6 | 90.6 | 50.2 | 70.9 ± 0.6 | 22.0 | 9.3 |
| YOLOv11-s | 95.4 | 63.4 | 59.9 | 92.4 | 49.1 | 72.0 ± 0.3 | 18.8 | 8.5 |
| YOLO-UIR-s (Ours) | 94.9 | 65.8 | 66.4 | 92.4 | 54.1 | 74.7 ± 0.2 | 8.7 | 4.6 |
| YOLOv8-m | 94.7 | 61.0 | 60.1 | 92.9 | 52.1 | 72.2 ± 0.4 | 63.2 | 25.9 |
| YOLOv9-m | 94.0 | 64.9 | 67.0 | 92.4 | 54.9 | 74.6 ± 0.1 | 61.0 | 32.6 |
| YOLOv10-m | 94.8 | 61.1 | 65.1 | 92.5 | 53.0 | 73.3 ± 0.2 | 62.4 | 19.7 |
| YOLOv11-m | 95.7 | 62.0 | 58.5 | 92.8 | 49.8 | 71.8 ± 0.3 | 54.2 | 20.1 |
| Swin-Transformer | 93.4 | 39.0 | 46.7 | 82.8 | 31.5 | 58.7 ± 1.7 | 152.2 | 38.5 |
| Cascade-RCNN | 93.7 | 41.6 | 46.4 | 84.4 | 40.7 | 61.4 ± 0.9 | 208.1 | 69.4 |
| YOLO-UIR-m (Ours) | 94.6 | 69.9 | 70.7 | 92.8 | 59.7 | 77.5 ± 0.2 | 19.2 | 9.1 |
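The mAP values in Tables 1 and 2 are consistent with an unweighted average of the per-class APs. As an illustration only (not the authors' evaluation code), the minimal sketch below shows this aggregation, using the YOLO-UIR-n row of Table 1 as a check; the function name and dictionary layout are hypothetical.

```python
# Minimal sketch: mAP as the unweighted mean of per-class average precision.
# Assumes per-class AP values (in %) have already been computed; the numbers
# below are taken from the YOLO-UIR-n row of Table 1 (DroneVehicle).

def mean_average_precision(per_class_ap: dict) -> float:
    """Return the unweighted mean of the per-class AP values."""
    return sum(per_class_ap.values()) / len(per_class_ap)

uir_n_dronevehicle = {
    "car": 94.9, "truck": 60.1, "freight": 59.2, "bus": 90.8, "van": 50.6,
}
print(round(mean_average_precision(uir_n_dronevehicle), 1))  # 71.1, matching Table 1
```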
Table 2. Comparative results of different models on the HIT-UAV dataset.

| Method | AP Person | AP Car | AP Bicycle | mAP (%) | Flops (G) | Parameters (M) |
|---|---|---|---|---|---|---|
| YOLOv7-tiny | 80.6 | 91.1 | 70.4 | 80.7 ± 0.5 | 10.6 | 6.0 |
| YOLOv8-n | 89.2 | 96.9 | 83.8 | 90.0 ± 0.2 | 6.6 | 3.0 |
| YOLOv9-t | 88.0 | 95.8 | 81.2 | 88.3 ± 0.2 | 6.2 | 3.2 |
| YOLOv10-n | 88.1 | 96.5 | 84.3 | 89.6 ± 0.2 | 7.8 | 3.0 |
| YOLOv11-n | 85.0 | 96.6 | 80.2 | 87.3 ± 0.2 | 5.2 | 2.7 |
| G-YOLO | 87.2 | 96.2 | 81.4 | 88.3 ± 0.2 | 3.7 | 0.8 |
| LRI-YOLO | 88.4 | 96.7 | 85.3 | 90.1 ± 0.1 | 3.8 | 1.6 |
| CEMP-YOLO | 87.2 | 96.3 | 83.0 | 88.8 ± 0.2 | 3.8 | 2.1 |
| YOLO-UIR-n (Ours) | 88.6 | 96.9 | 86.6 | 90.7 ± 0.1 | 3.0 | 1.4 |
| YOLOv8-s | 90.1 | 97.3 | 86.9 | 91.4 ± 0.1 | 22.8 | 11.1 |
| YOLOv9-s | 89.0 | 97.0 | 86.0 | 90.7 ± 0.2 | 21.1 | 9.6 |
| YOLOv10-s | 89.4 | 96.9 | 87.3 | 91.2 ± 0.1 | 22.0 | 9.3 |
| YOLOv11-s | 88.3 | 96.9 | 83.7 | 89.6 ± 0.1 | 18.8 | 8.5 |
| YOLO-UIR-s (Ours) | 90.2 | 97.1 | 88.0 | 91.8 ± 0.1 | 8.7 | 4.6 |
| YOLOv8-m | 89.8 | 97.2 | 88.4 | 91.8 ± 0.1 | 63.2 | 25.9 |
| YOLOv9-m | 90.3 | 97.1 | 87.2 | 91.6 ± 0.1 | 61.0 | 32.6 |
| YOLOv10-m | 89.9 | 97.2 | 89.4 | 92.1 ± 0.1 | 62.4 | 19.7 |
| YOLOv11-m | 86.3 | 96.6 | 85.0 | 89.3 ± 0.1 | 54.2 | 20.1 |
| Swin-Transformer | 60.4 | 87.3 | 67.7 | 71.8 ± 2.6 | 152.2 | 38.5 |
| Cascade-RCNN | 71.5 | 90.4 | 75.7 | 79.2 ± 1.7 | 208.1 | 69.4 |
| YOLO-UIR-m (Ours) | 90.0 | 96.6 | 90.1 | 92.3 ± 0.1 | 19.2 | 9.1 |
Table 3. Ablation studies of different modules on the DroneVehicle dataset. (The checkmarks indicating which module combination each row uses did not survive extraction; only the metrics are reproduced here.)

| Efficient C2f | LSP | BFIF | mAP (%) | Flops (G) | Parameters (M) |
|---|---|---|---|---|---|
|  |  |  | 68.2 | 4.2 | 1.8 |
|  |  |  | 69.5 | 3.3 | 1.4 |
|  |  |  | 69.9 | 4.0 | 1.7 |
|  |  |  | 70.0 | 4.0 | 1.9 |
|  |  |  | 70.2 | 3.1 | 1.3 |
|  |  |  | 71.1 | 3.0 | 1.4 |
Table 4. Different attention module comparison.

| Method | SE | ECA | CBAM | ADA (Ours) |
|---|---|---|---|---|
| mAP (%) | 69.0 | 69.1 | 68.0 | 69.5 |
Table 5. Computational and training efficiency analysis.

| Model | Inference Speed (frames/s) | Training Time (h) |
|---|---|---|
| YOLOv7-t | 40 | 3.0 |
| YOLOv8-n | 36 | 2.1 |
| YOLOv9-t | 16 | 8.4 |
| YOLOv10-n | 25 | 2.2 |
| YOLOv11-n | 39 | 2.3 |
| G-YOLO | 38 | 2.4 |
| LRI-YOLO | 22 | 2.6 |
| CEMP-YOLO | 30 | 2.4 |
| YOLO-UIR (Ours) | 47 | 1.7 |
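For context on the inference speeds (frames/s) reported in Table 5, the sketch below shows one common way to time single-image inference in PyTorch. It is a hypothetical measurement harness, not the authors' benchmarking setup; the model object, input resolution, and iteration counts are placeholder assumptions.

```python
# Minimal sketch of a frames-per-second measurement for batch-size-1 inference.
import time
import torch

def measure_fps(model: torch.nn.Module, img_size: int = 640,
                warmup: int = 20, iters: int = 200,
                device: str = "cuda") -> float:
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):           # warm-up to stabilize clocks and caches
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()      # wait for queued GPU work before timing ends
        elapsed = time.perf_counter() - start
    return iters / elapsed                # frames per second at batch size 1
```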
Table 6. Comparative results of different models on the LLVIP dataset.

| Model | mAP (%) | Flops (G) | Parameters (M) |
|---|---|---|---|
| YOLOv7-t | 89.9 | 10.6 | 6.0 |
| YOLOv8-n | 94.0 | 6.6 | 3.0 |
| YOLOv9-t | 93.8 | 6.2 | 3.2 |
| YOLOv10-n | 92.7 | 7.8 | 3.0 |
| YOLOv11-n | 94.0 | 5.2 | 2.7 |
| G-YOLO | 93.7 | 3.7 | 0.8 |
| LRI-YOLO | 92.9 | 3.8 | 1.6 |
| CEMP-YOLO | 94.2 | 3.8 | 2.1 |
| YOLO-UIR (Ours) | 94.3 | 3.0 | 1.4 |
Table 7. Impact of hyperparameters on model performance.

| λ1 | 0.1 | 0.5 | 2 |
|---|---|---|---|
| mAP (%) | 58.9 | 71.1 | 71.0 |

| λ2 | 1 | 7.5 | 20 |
|---|---|---|---|
| mAP (%) | 71.1 | 71.1 | 64.4 |

| λ3 | 0.1 | 0.375 | 1 |
|---|---|---|---|
| mAP (%) | 67.9 | 71.1 | 65.9 |
Table 8. Influence of data augmentation techniques on model performance.

| Method | w/o Blur | w/o Flip | w/o Mirror |
|---|---|---|---|
| mAP (%) | 67.2 | 70.6 | 69.4 |
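As a hedged illustration of the augmentations named in Table 8, the sketch below builds a blur/flip/mirror pipeline with Albumentations for box-labelled images. Interpreting "flip" as a vertical flip and "mirror" as a horizontal flip, and the probabilities used, are assumptions; the paper's exact augmentation settings are not given in this excerpt.

```python
# Minimal sketch of a blur/flip/mirror augmentation pipeline for detection data.
import albumentations as A

train_transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                  # "mirror" (left-right), assumed
        A.VerticalFlip(p=0.5),                    # "flip" (up-down), assumed
        A.GaussianBlur(blur_limit=(3, 5), p=0.2), # mild blur, illustrative probability
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: out = train_transform(image=img, bboxes=boxes, class_labels=labels)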
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
