1. Introduction
Infrared object detection has become a significant research topic in recent years due to its wide range of applications in military surveillance, maritime rescue, and fire monitoring [1,2,3,4,5,6]. Infrared sensors offer robust performance under varying illumination conditions, capturing the outlines of objects emitting thermal radiation even in dark or low-light environments, which provides considerable advantages over visible-light imaging.
Detecting infrared weak objects presents significant challenges due to their low contrast against complex backgrounds and the absence of distinct structural or textural features, often rendering them as blurry, discrete, point-like objects [7,8,9]. Detection methodologies are broadly categorized into motion-based and appearance-based approaches. Motion-based methods [10,11] analyze inter-frame relationships to identify objects; however, they struggle with stationary objects. Appearance-based methods are further divided into traditional techniques and deep learning approaches. Traditional methods [12,13] rely on handcrafted mathematical functions to extract object features, performing adequately in specific contexts but lacking adaptability to diverse environments. Deep learning approaches [14,15,16] train models on large-scale datasets to detect objects through feature learning, demonstrating rapid detection speeds and exceptional accuracy on well-defined, large-scale datasets. However, these methods typically exhibit lower accuracy when applied to infrared weak object detection datasets. This is because infrared weak objects typically occupy only a few pixels in the image, making high-level feature extraction challenging due to their small size and low signal-to-noise ratio. Furthermore, insufficient receptive fields lead to the loss of local details, resulting in confusion between objects and background, thereby degrading detection performance. Additionally, in drone imagery, the significant scale variation of objects further complicates detection. To address these issues, context-enhancing approaches can be employed to improve the utilization of both low- and high-level features, ultimately enhancing detection robustness.
Background noise frequently obscures object features, making effective suppression crucial in infrared weak object detection [17,18,19]. Current research on background noise suppression is divided into traditional and deep learning methods. Traditional approaches [20,21] are less effective in complex backgrounds, such as urban structures, forests, and cloud cover. In contrast, deep learning methods [22,23,24], which often incorporate segmentation algorithms, require extensive datasets to effectively differentiate between background and objects. Moreover, the constrained receptive fields of many detection models hinder their ability to capture global context, further impairing the localization of weak objects.
To address these challenges, a new model, IRWT-YOLO, is developed in this study. It incorporates a segmentation-based data augmentation strategy and multiple architectural innovations designed to enhance the model’s ability to detect infrared weak objects.
The main contributions of this work are summarized as follows:
To address background interference in infrared weak object detection, we propose a segmentation-based augmentation strategy that enhances object–background separation using the Segment Anything Model (SAM), thereby improving detection performance for UAV targets.
A novel YOLO-based detection framework, IRWT-YOLO, is proposed. It integrates DCPPA and RCSCAA modules, with BiFormer embedded into the backbone to enrich contextual information. An additional detection head is included to improve performance for weak objects.
A dual-branch receptive field expansion module, DCPPA, is designed to preserve weak object features and improve detection robustness in complex scenes.
To enhance feature extraction and improve weak object detection performance, we integrate the RCSCAA module into the network’s neck and replace part of the C2f in the backbone with BiFormer, effectively boosting feature extraction capabilities.
The remainder of this paper is organized as follows:
Section 2 reviews related work on infrared weak object detection and background suppression techniques. In
Section 3, the overall architecture of the proposed IRWT-YOLO model and the structural details of the DCPPA and RCSCAA modules are introduced.
Section 4 presents the experimental setup, results, and comparisons with existing state-of-the-art methods on several public datasets.
Section 5 provides a discussion of the findings. Finally,
Section 6 concludes the paper and outlines future research directions.
2. Related Work
Section 2 reviews the current research status, with
Section 2.1 summarizing infrared weak object detection methods and
Section 2.2 reviewing recent progress in background suppression techniques.
2.1. Infrared Weak Object Detection
The task of infrared weak object detection is to identify low-contrast objects in complex backgrounds. Infrared weak object detection can be divided into two types: motion-based methods [25] and appearance-based methods [26]. The motion-based approach is well suited for processing continuous dynamic video information but relies on certain prior assumptions. However, these assumptions are difficult to satisfy in practice because of limitations in sensor hardware and the large positional shifts that occur within a few frames during high-speed UAV flight, which lead to tracking loss [20]. In contrast, the appearance-based detection method identifies UAVs on a single frame, offering simplicity and ease of implementation. Therefore, we have opted for the appearance-based method to identify UAV objects.
Appearance-based methods include both traditional approaches and deep learning techniques. Traditional methods include TopHat [27], the LCM method [28], Max-Median [29], WSLCM [30], and MSLSTIPT [31]. However, traditional methods struggle to handle clutter and noise interference and are often limited by application scenarios, leading to poor generalizability of the model.
With the advancement of hardware devices, deep learning methods have gradually become mainstream and can be categorized into two-stage and single-stage methods. Two-stage detectors, such as R-CNN [32], Fast R-CNN [33], Faster R-CNN [34], Cascade R-CNN [35], and Libra R-CNN [36], first generate region proposals and then refine them, resolving redundant overlapping candidates to reduce wasted computation and achieve higher accuracy.
Single-stage detectors, such as YOLO [37], are known for their fast detection speed and real-time performance. These methods perform well in general object detection scenarios but often struggle with false positives, missed detections, and misdetections, particularly when dealing with weak objects in infrared images. As one of the widely used standards in the field of object detection, YOLOv5 [38] effectively balances detection accuracy and speed. YOLOv8 [39] further optimizes the backbone network, enhancing the model’s ability to capture and integrate multi-scale information, thereby improving the detection of small objects in complex scenes. Liu et al. propose DAB-DETR [40], an end-to-end object detection model that reintroduces anchor boxes into DETR [41] and accelerates convergence. Deformable DETR [42] introduces deformable attention to reduce computational cost and improve the precision of small object detection. Although the aforementioned methods have improved the detection accuracy of infrared objects, they still have limitations, such as difficulties in handling complex backgrounds, weak contextual relationships, and restricted receptive field size. To address these challenges, we propose the IRWT-YOLO model, which aims to enhance weak object detection performance. By strengthening contextual relationships, improving feature utilization, and enhancing multi-scale object detection capabilities, the model expands the receptive field while preserving low-level object features to improve small object detection. The proposed method suppresses false alarm rates and improves mean accuracy, achieving precise detection of UAV objects with limited resources.
2.2. Background Suppression Methods
Background suppression encompasses traditional methods and deep learning approaches. Traditional background suppression methods, which are based on unsupervised algorithms, achieve background suppression through the manual design of spatial filters. Traditional methods can be divided into two types. One type assumes that the background neighborhood is homogeneous and that small objects disrupt the local correlation. It includes top-hat filtering [43,44], maximum mean/maximum median filtering [29], Two-Dimensional Least Mean Square [45], and Quaternion Discrete Cosine Transform [46]. Therefore, it is necessary to design an appropriate filter to extract the weak objects from the background. However, in practical applications, there are many restrictions imposed by the background, making it difficult to construct a universal adaptive filter. Some researchers emphasize feature extraction under the same homogeneous-background assumption, extracting color features [47], texture features [48], motion features [49], and texture features [50] separately and then applying hierarchical fusion for shadow detection and removal, thereby reducing the influence of lighting conditions. RLCM [51] proposes a local contrast calculation method: multi-scale RLCM is computed for each pixel to enhance the true object and suppress all types of interference. The background suppression method based on wavelet transformation suppresses the background by separating the low-frequency and high-frequency sub-images and then selecting an appropriate threshold to detect the object [52]. EGMM [53] places a greater emphasis on feature representation, enhancing the representational capacity of features through the introduction of belief function theory and evidential Gaussian mixture distributions, thereby more effectively handling uncertainties and complex data.
Another type of method assumes that infrared images are composed of low-rank background signals, sparse object signals, and noise signals, and extracts the sparse object signals from the original data [54]. Li [55] assumed that the infrared background image is a low-rank matrix and the infrared object image is a sparse matrix; they reconstructed the object and background images by optimizing the two matrices and then obtained the small object through threshold segmentation. However, both of the above traditional approaches are difficult to apply when the object size is variable, and a fixed filter results in a high false alarm rate when the object size changes.
Deep learning-based methods have gained significant attention due to their ability to automatically learn hierarchical features from raw data, eliminating the need for manual feature extraction [56,57]. KCPNet [58] enhances the object mask information and suppresses background interference through dual mask attention. SMPISD-MTPNet [59] uses a two-layer sliding window structure to calculate background pixel changes and the local contrast of the object layer by layer. Tian [60] proposed reducing the interference of background clutter on mask generation by introducing structural information constraints, and assisting object localization by constraining the context information of subsequent frames with the previous frame. Zhang [61] effectively improved model accuracy and generalization ability in small-sample scenarios by integrating PatchUp, a self-supervised auxiliary loss, and feature preprocessing. A4-UNET [62] integrates the Combined Attention Module and attention gates in its skip connections, enhancing pixel grouping to suppress background-irrelevant information and highlight the target.
Traditional methods define infrared object detection as a segmentation problem [63]. However, in pursuit of higher detection rates, they sacrifice the integrity of the segmented objects [64], which contradicts the original intent of segmentation. To address this issue, we continue the idea of traditional methods by combining them with deep learning. We integrate the Segment Anything Model [65], which combines interactivity and zero-shot capabilities, enabling it to be directly applied to downstream tasks.
3. Methods
This section provides a detailed description of the proposed methodology.
Section 3.1 outlines the background suppression approach. The IRWT-YOLO model architecture is introduced in
Section 3.2.
Section 3.3 and
Section 3.4 elaborate on the construction details of the DCPPA and RCSCAA modules, respectively.
3.1. Background Suppression as an Image Enhancement Strategy
Although the original bounding boxes may contain irrelevant information, the object’s center point generally remains within the object area, with minimal offset. Therefore, in this section, we propose an image enhancement method that decreases background interference by segmenting the weak object from the infrared image. The proposed method’s process is shown in
Figure 1.
We choose the Segment Anything Model (SAM) [65] for background suppression and image enhancement. SAM has been proven effective in the field of image segmentation, offering superior efficiency and better preservation of object information compared to previous background suppression methods [66]. As a pre-trained model with zero-shot capability, SAM can combine a user-provided image with prompts to identify the corresponding objects within the image.
In the image encoder, a Vision Transformer (ViT) is pre-trained using the Masked Autoencoder (MAE) approach. The prompt encoder is designed to process both sparse and dense input prompts. Leveraging the powerful text encoder from CLIP, the model performs positional encoding on both image and text prompts. After convolutional processing of the dense input (mask), it is element-wise summed with the image embedding. The mask decoder is a bidirectional Transformer decoder, which utilizes both self-attention and cross-attention mechanisms for refining segmentation masks.
The original labels contain $x_c$, the center’s x-coordinate as a percentage of the image width, and $y_c$, the center’s y-coordinate as a percentage of the image height. $w$ and $h$ represent the object’s width and height as relative proportions of the image’s width and height, ranging from 0 to 1. Based on the original labels in the dataset, we extract the center point coordinates and the object’s width and height, and calculate the number of pixels occupied by the object. The following formulae convert the YOLO-format coordinates to actual pixel coordinates, where $W$ and $H$ denote the width and height of the original images, $x_{min}$ represents the left x value of the bounding box, $x_{max}$ represents the right x value of the bounding box, $y_{min}$ represents the upper y value of the bounding box, and $y_{max}$ represents the lower y value of the bounding box:

$x_{min} = (x_c - w/2) \cdot W, \quad x_{max} = (x_c + w/2) \cdot W,$
$y_{min} = (y_c - h/2) \cdot H, \quad y_{max} = (y_c + h/2) \cdot H.$
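In code, this conversion amounts to a few lines. The following Python sketch is illustrative only; the function and variable names (e.g., yolo_to_pixel) are assumptions rather than the paper’s implementation.

```python
def yolo_to_pixel(xc, yc, w, h, img_w, img_h):
    """Convert a normalized YOLO box (xc, yc, w, h) to pixel coordinates."""
    x_min = (xc - w / 2.0) * img_w   # left x value of the bounding box
    x_max = (xc + w / 2.0) * img_w   # right x value of the bounding box
    y_min = (yc - h / 2.0) * img_h   # upper y value of the bounding box
    y_max = (yc + h / 2.0) * img_h   # lower y value of the bounding box
    cx, cy = xc * img_w, yc * img_h  # object center point in pixels
    return (x_min, y_min, x_max, y_max), (cx, cy)

# Example: a 640 x 512 image with a centered object covering 5% of each side
box, center = yolo_to_pixel(0.5, 0.5, 0.05, 0.05, 640, 512)
```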
We set the center point as a positive label point and use both the bounding box and center point as the input prompt, ensuring that even in cases of blurred object edges, no additional background information is included.
Additionally, we input UAV as semantic prompt information to assist in locating the object’s shape. To resolve semantic ambiguity, SAM outputs three masks by default, from which the one with the highest score is selected as the optimal mask for generating mask information. Subsequently, the minimum bounding box of the mask pattern is calculated and used as the new bounding box to recalculate the YOLO-format object label information. The minimum bounding box is defined as the smallest rectangle that completely encloses a binary mask while maintaining edges parallel to the image coordinate axes. This geometric representation effectively suppresses irrelevant background interference by preserving only the essential spatial extent of the target object. The minimum bounding box extraction process enhances detection robustness through two key mechanisms: elimination of extraneous background pixels that may introduce noise, and preservation of the object’s fundamental geometric characteristics.
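The prompting and re-labeling workflow described above can be sketched with the official segment_anything package. This is a minimal sketch under stated assumptions: the checkpoint path and function names are examples, and the semantic text prompt is omitted because the public SamPredictor interface exposes point and box prompts.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM once; "sam_vit_h.pth" is an example checkpoint path.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

def refine_label(image_rgb, box_xyxy, center_xy):
    """Refine one YOLO label: SAM mask -> minimum bounding box -> new label."""
    predictor.set_image(image_rgb)                            # H x W x 3 uint8 RGB
    masks, scores, _ = predictor.predict(
        point_coords=np.array([center_xy], dtype=np.float32), # center as positive point
        point_labels=np.array([1]),
        box=np.array(box_xyxy, dtype=np.float32),             # original bounding box
        multimask_output=True,                                 # SAM outputs three masks
    )
    best = masks[int(scores.argmax())]                         # highest-scoring mask

    ys, xs = np.nonzero(best)                                  # pixels inside the mask
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()    # minimum bounding box

    h_img, w_img = best.shape                                  # back to YOLO format
    xc = (x0 + x1) / 2.0 / w_img
    yc = (y0 + y1) / 2.0 / h_img
    bw = (x1 - x0) / w_img
    bh = (y1 - y0) / h_img
    return xc, yc, bw, bh
```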
In summary, by applying SAM as an image enhancement method, we effectively enhance the accuracy of positive sample information, reduce background interference, and make weak objects more prominent. This provides an effective background suppression technique for infrared weak object detection, improving the model’s ability to detect objects in various challenging background environments.
3.2. IRWT-YOLO Model
In infrared weak object detection, camera zoom and UAV flight cause large changes in object size, so object features are not universal, and small objects occupy too few pixels to extract effective features. The lower-level features may not provide enough contextual information to detect small objects accurately. To address this challenge, we propose a novel model, IRWT-YOLO, with improvements in three aspects: the backbone, neck, and head networks. We propose an enhanced backbone network for YOLOv8, integrating both BiFormer and C2f as convolutional layers within the backbone architecture. In this framework, C2f performs the feature extraction task in YOLOv8, generating feature maps through convolution after the bottleneck-generated feature maps are concatenated. By employing bi-level routing attention, BiFormer mitigates the influence of background noise.
This mechanism enhances the detection performance of weak objects and small-scale objects in infrared images. By effectively integrating the C2f module with the BiFormer structure, the model preserves critical low-level features during multi-scale feature extraction, thereby improving robustness in weak object detection. However, due to the high computational complexity of the BiFormer structure, replacing all C2f modules with BiFormer would lead to excessive model parameters. Replacing the module at the second layer enhances fine-grained feature representation, while implementing it at the fourth layer strengthens global attention mechanisms through interaction with contextual features. Considering the balance between computational efficiency and detection accuracy, this study ultimately replaces the C2f modules with BiFormer structures in the second and fourth layers of the network.
Figure 2 illustrates the overall architecture of the proposed IRWT-YOLO model.
The Bi-level Routing Attention (BRA) module in BiFormer leverages sparse sampling, which not only reduces computational overhead but also preserves fine-grained details, ensuring that weak object features are retained. The BRA structure is shown in
Figure 3. By incorporating a sparse self-attention mechanism, we focus on the most relevant key-value pairs, effectively mitigating the loss of fine-grained details typically encountered in conventional downsampling techniques. This modification contributes to the retention of more detailed information from the input image. First, the input image is partitioned into smaller patches through convolutional operations, which maintains the relative spatial relationships of the image. Subsequently, key features are extracted via progressively expanded feature channels. The BRA module learns the interrelationships between spatial positions, and finally, an MLP (Multi-Layer Perceptron) module is utilized to further enhance the extraction of detailed feature representations at each spatial location.
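To make the routing process concrete, the following PyTorch sketch implements a simplified, single-head version of bi-level routing attention. The region size, top-k value, and the absence of relative position bias are simplifying assumptions made for illustration and do not reproduce BiFormer’s released code.

```python
import torch
import torch.nn as nn

class BiLevelRoutingAttention(nn.Module):
    """Simplified single-head sketch of bi-level routing attention (BRA)."""

    def __init__(self, dim, region_size=8, topk=4):
        super().__init__()
        self.r, self.topk = region_size, topk
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                 # x: (B, H, W, C)
        B, H, W, C = x.shape
        r = self.r
        nh, nw = H // r, W // r                           # regions per side
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def to_regions(t):                                # -> (B, R, r*r, C)
            t = t.reshape(B, nh, r, nw, r, C).permute(0, 1, 3, 2, 4, 5)
            return t.reshape(B, nh * nw, r * r, C)

        qr, kr, vr = map(to_regions, (q, k, v))
        R = nh * nw

        # 1) coarse region-level routing: each region selects its top-k regions
        affinity = qr.mean(dim=2) @ kr.mean(dim=2).transpose(-1, -2)   # (B, R, R)
        topk_idx = affinity.topk(self.topk, dim=-1).indices            # (B, R, k)

        # 2) gather key/value tokens of the routed regions only (sparse sampling)
        idx = topk_idx[..., None, None].expand(-1, -1, -1, r * r, C)
        k_g = torch.gather(kr[:, None].expand(-1, R, -1, -1, -1), 2, idx)
        v_g = torch.gather(vr[:, None].expand(-1, R, -1, -1, -1), 2, idx)
        k_g = k_g.reshape(B, R, -1, C)                    # (B, R, k*r*r, C)
        v_g = v_g.reshape(B, R, -1, C)

        # 3) fine-grained token-to-token attention inside the routed regions
        attn = (qr @ k_g.transpose(-1, -2)) * self.scale
        out = attn.softmax(dim=-1) @ v_g                  # (B, R, r*r, C)

        out = out.reshape(B, nh, nw, r, r, C).permute(0, 1, 3, 2, 4, 5)
        return self.proj(out.reshape(B, H, W, C))

# Shape check: a 64-channel feature map of size 32 x 32
y = BiLevelRoutingAttention(64)(torch.randn(2, 32, 32, 64))   # -> (2, 32, 32, 64)
```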
To further improve the accuracy of small object detection and address the challenge of large-scale variations in the dataset, we add an additional detection head specifically designed for small objects. This increases the number of detection heads from three to four. The new small-object detection head is more sensitive to lower-level features, effectively mitigating the issue of small object feature loss caused by excessively deep networks. While the addition of an extra detection head increases computational cost and memory consumption, it significantly enhances the detection performance for small objects, making the model more robust and accurate.
3.3. Dual Convolution Paralleled Patch-Aware Attention Module
Weak objects often present challenges because of their limited feature representation, which makes it difficult to extract semantic features. After multiple downsampling operations, these features tend to become blurred, leading to the loss of critical information. To address this issue, we propose the DCPPA module to enhance the detection of weak objects by extracting more detailed information. The architecture of the DCPPA module is shown in
Figure 4.
The DCPPA module utilizes a multi-branch strategy to improve the detection accuracy of a weak object. One of the branches incorporates a Local-Global Attention mechanism with two different scales. This approach allocates extra attention to small objects by considering both local and global contexts at varying levels of granularity. Furthermore, to overcome the computational inefficiency of traditional 5 × 5 convolutions, we replace them with dual-branch, dual-layer 3 × 3 convolutions. This substitution ensures that the receptive field remains consistent while reducing the computational cost.
By employing dual convolution branches, the module extracts features from the original images, enabling a more comprehensive representation of weak object characteristics. This helps mitigate false alarm rates caused by erroneous positioning, where objects blend with the background as a result of their subtle appearance. Additionally, by concatenating information from different branches and integrating multi-scale features, the model can better process objects of various sizes, ensuring more accurate multi-scale object detection. The input feature tensor $F$ is convolved twice to obtain $F_2$. The process is shown in the following formula, where $W_1$ represents the weights of the first-layer convolution kernel, $b_1$ is the bias term of the first-layer convolution, and the convolution result $F_1$ of the first layer is also used as the input to the second layer:

$F_1 = W_1 * F + b_1, \quad F_2 = W_2 * F_1 + b_2.$
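A minimal sketch of this stacked dual-layer replacement is given below; the module name and the omission of normalization and activation layers are assumptions chosen to mirror the formula above.

```python
import torch
import torch.nn as nn

class StackedConv3x3(nn.Module):
    """Two stacked 3x3 convolutions: effective receptive field of 5x5,
    but with 2 * 9 * C^2 weights instead of 25 * C^2 for a single 5x5."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f):
        f1 = self.conv1(f)    # F1 = W1 * F + b1
        f2 = self.conv2(f1)   # F2 = W2 * F1 + b2 (F1 feeds the second layer)
        return f2

x = torch.randn(1, 64, 80, 80)
assert StackedConv3x3(64)(x).shape == x.shape
```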
The Local-Global Attention method is used to extract global features. To minimize interference from irrelevant information, two dual-layer 3 × 3 convolutions are employed for feature extraction, ensuring that the network captures essential patterns without being distracted by noise or unrelated background details. The equations are as follows:
Once features from different scales and branches are extracted, they are fused through the channel attention map and spatial attention map mechanisms to integrate multi-scale features effectively. This feature fusion step helps the model focus on important regions and strengthens the detection of weak objects across multiple scales. By integrating features from multiple branches, the module learns to prioritize crucial information, allowing the model to better distinguish weak objects from the background and improving its overall detection performance.
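The fusion step can be illustrated with a compact CBAM-style sketch; the exact attention layout inside DCPPA is not specified here, so the layer choices below (reduction ratio, 7 x 7 spatial kernel) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentionFusionSketch(nn.Module):
    """Illustrative channel- and spatial-attention fusion of concatenated features."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):
        # channel attention map: which feature channels matter
        f = f * torch.sigmoid(self.channel_mlp(f))
        # spatial attention map: where in the image to focus
        avg_map = f.mean(dim=1, keepdim=True)
        max_map = f.max(dim=1, keepdim=True).values
        s = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return f * s

out = AttentionFusionSketch(64)(torch.randn(1, 64, 40, 40))   # -> (1, 64, 40, 40)
```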
In summary, the DCPPA module enhances weak object detection by increasing the capacity to extract detailed features and addressing the challenges posed by the blurry or ambiguous appearance of a weak object.
3.4. RCSCAA Module
To enhance the utilization of features at different scales of a weak object in infrared images, we propose an improved version of the RepVGG/RepConv ShuffleNet-Based One-Shot Aggregation module, called the RCSCAA module. The RCSOSA module is optimized for object detection tasks and aims to enhance feature representation ability and improve detection accuracy. We design the new RCSCAA module based on the original RCSOSA module to improve the performance of the neck. RCSCAA uses RCS and the RepVGG module as its basic blocks and splits the input tensor $F$ into two identical parts, each with $C/2$ channels, to obtain branch $A$ and branch $B$. Branch $A$ is processed in parallel, and the resulting sub-branches are convolved separately to obtain $A_1$ and $A_2$, which are then spliced with branch $B$ to obtain a new feature map. The detailed structure of RCSCAA is presented in Figure 5.
In order to solve the problem of information isolation caused by grouped convolution, channel shuffle is used to rearrange the channels: the dimensions of branch $A$ and branch $B$ are reshaped, and the channels are transposed and cross-mixed. The working process of RCS is shown in the following formula, where $A_1$ and $A_2$ are obtained after parallel processing of branch $A$, branch $A'$ is obtained after concatenation, and the features obtained by concatenating $A'$ and $B$ are rearranged to facilitate information exchange:

$A' = \mathrm{Concat}(A_1, A_2), \quad F_{out} = \mathrm{Shuffle}\big(\mathrm{Concat}(A', B)\big).$
The architecture of RCS is shown in
Figure 6. RCS modules are sequentially stacked to perform feature extraction progressively.
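The split-convolve-concatenate-shuffle flow described above can be sketched as follows; the branch layout and channel counts are illustrative assumptions and do not reproduce the released RCS/RCSOSA code.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Rearrange channels across groups (reshape -> transpose -> flatten)."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class RCSUnitSketch(nn.Module):
    """Illustrative RCS-style unit: split -> parallel convs on branch A ->
    concatenate with untouched branch B -> channel shuffle."""

    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        # two parallel convolutions standing in for the RepVGG-style branches
        self.conv3 = nn.Conv2d(half, half // 2, 3, padding=1)
        self.conv1 = nn.Conv2d(half, half // 2, 1)

    def forward(self, f):
        a, b = f.chunk(2, dim=1)                  # branches A and B, C/2 channels each
        a1, a2 = self.conv3(a), self.conv1(a)     # parallel processing of A
        a_prime = torch.cat([a1, a2], dim=1)      # A' from concatenation
        out = torch.cat([a_prime, b], dim=1)      # splice A' with B
        return channel_shuffle(out, groups=2)     # exchange information across groups

out = RCSUnitSketch(64)(torch.randn(1, 64, 40, 40))   # -> (1, 64, 40, 40)
```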
To capture long-range contextual information, we introduce the Context Anchor Attention (CAA) module [67]. In the formula below, we employ average pooling and a 1 × 1 convolution to extract the local region feature, with $\mathcal{P}_{avg}$ denoting the average pooling operation:

$F_{pool} = \mathrm{Conv}_{1\times1}\big(\mathcal{P}_{avg}(F)\big).$

The utilization of global average pooling and one-dimensional strip convolutions facilitates the combination of local and global information, making it particularly suitable for multi-scale objects and enabling more efficient extraction of semantic information. When $l = 0$, the kernel size is $k_b = 11$. Then, we apply two depthwise strip convolutions as an approximation to a standard large-kernel depthwise convolution:

$F_w = \mathrm{DWConv}_{1\times k_b}(F_{pool}), \quad F_h = \mathrm{DWConv}_{k_b\times 1}(F_w).$

We set $k_b = 11 + 2l$, where $l$ is the block index, to increase the receptive field with increasing depth. Finally, we produce an attention weight and use it to enhance the features before output:

$A = \mathrm{Sigmoid}\big(\mathrm{Conv}_{1\times1}(F_h)\big), \quad F_{out} = (A \odot F) \oplus F.$

The sigmoid function ensures that the attention map is in the range [0, 1]. ⊙ represents element-wise multiplication, and ⊕ represents element-wise summation.
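A condensed version of this attention path is sketched below. It follows the equations above rather than the reference implementation, and the 7 x 7 pooling window and default kernel size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CAASketch(nn.Module):
    """Simplified Context Anchor Attention: pool -> 1x1 conv -> strip convs -> sigmoid."""

    def __init__(self, channels, kernel_size=11):
        super().__init__()
        self.pool = nn.AvgPool2d(7, stride=1, padding=3)          # P_avg
        self.conv0 = nn.Conv2d(channels, channels, 1)
        # two depthwise strip convolutions approximating a large k_b x k_b kernel
        self.dw_h = nn.Conv2d(channels, channels, (1, kernel_size),
                              padding=(0, kernel_size // 2), groups=channels)
        self.dw_v = nn.Conv2d(channels, channels, (kernel_size, 1),
                              padding=(kernel_size // 2, 0), groups=channels)
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, f):
        a = self.conv1(self.dw_v(self.dw_h(self.conv0(self.pool(f)))))
        a = torch.sigmoid(a)            # attention weights in [0, 1]
        return a * f + f                # (A ⊙ F) ⊕ F: enhance, keep residual

out = CAASketch(64)(torch.randn(1, 64, 40, 40))   # -> (1, 64, 40, 40)
```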
4. Experiment and Results
First,
Section 4.1 briefly describes the experimental setup, including the datasets for comparison with other advanced models, evaluation metrics, baseline methods, and parameter configurations, with additional implementation details provided in
Section 4.2.
Section 4.3 presents external comparative experiments with ten state-of-the-art object detection methods. To validate the effectiveness of each component in the proposed detector, ablation studies are conducted in
Section 4.4. For intuitive demonstration of IRWT-YOLO’s advantages, visualization results are presented in
Section 4.5. Finally,
Section 4.6 reports additional experiments on two public datasets to verify the method’s generalizability.
4.1. Datasets and Evaluation Metrics
We train our model using the 3rd Anti-UAV public dataset [68] and evaluate its performance on the test set. The evaluation metrics include $mAP_{0.5:0.95}$ (averaged over the 10 Intersection over Union (IoU) thresholds in [0.5:0.95]) and $mAP_{0.5}$. The input image size is set to 640 × 512 pixels, and label 0 is assigned to UAV objects, as shown in Figure 7.
In this paper, we select precision, recall, $mAP_{0.5}$, $mAP_{0.5:0.95}$, and FLOPs as the evaluation metrics to compare the detection performance of the model and determine its strengths and weaknesses. The precision and recall are calculated based on the following formulae, using IoU = 0.5 as the threshold:

$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad mAP = \frac{1}{C}\sum_{i=1}^{C} AP_i.$

In the formulae, $TP$ represents the number of true positives, $FP$ represents the number of false positives, $FN$ represents the number of false negatives, $C$ is the total number of categories, and $AP_i$ is the average precision for the i-th category. These metrics help assess how well the model detects true objects (precision) and how effectively it identifies all relevant objects (recall). The Mean Average Precision ($mAP$) is used to evaluate the overall accuracy of a model. IoU assesses model performance by calculating the ratio of the intersection to the union between the predicted bounding box and the ground truth. $mAP_{0.5}$ refers to the proportion of correct detections when the IoU threshold is set to 0.5. $mAP_{0.5:0.95}$ computes the values at IoU thresholds ranging from 0.5 to 0.95, with a step size of 0.05, and takes the average of these values. $mAP_{0.5:0.95}$ provides a more comprehensive assessment of the model’s accuracy and robustness, as higher IoU values indicate greater overlap between the predicted and ground truth bounding boxes, thereby increasing the difficulty of the detection task.
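For reference, the box-level IoU and the precision/recall computations reduce to a few lines; the helper names below are illustrative.

```python
def iou(box_a, box_b):
    """IoU between two boxes given as (x_min, y_min, x_max, y_max)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(tp, fp, fn):
    """Precision and recall from matched detections at a fixed IoU threshold."""
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return precision, recall
```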
4.2. Implementation Details
We conducted experiments on the model using an NVIDIA TITAN Xp Founders Edition GPU (NVIDIA, Santa Clara, CA, USA) with CUDA 11.8. We used a batch size of 4 and trained the model for 300 epochs. Since each image contains only one object, the mixup factor was set to 1.0 to optimize performance with fewer images. Because of the large size of the 3rd Anti-UAV training set and hardware limitations, we randomly selected 15,000 images from the training set and 5000 images from the validation set for training and evaluation.
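Under the assumption that IRWT-YOLO is trained through an Ultralytics-style interface (as its YOLOv8 baseline is), the setup above corresponds roughly to the following call; the model and dataset YAML file names are hypothetical.

```python
from ultralytics import YOLO

model = YOLO("irwt-yolo.yaml")          # hypothetical IRWT-YOLO model definition
model.train(
    data="anti_uav_subset.yaml",        # hypothetical config for the 15,000/5000 split
    epochs=300,
    batch=4,
    imgsz=640,                          # longest side; the paper uses 640 x 512 inputs
    mixup=1.0,                          # one object per image, per the setup above
    device=0,
)
```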
4.3. Comparisons with the State-of-the-Art
We enhance the detection capability of the model by introducing novel modules and methods. To validate the effectiveness of these improvements, all experiments use the same unified dataset, constructed by random selection from the 3rd Anti-UAV dataset and processed with SAM.
In this study, we employed several advanced models, including Faster R-CNN [34], Libra R-CNN [36], Cascade R-CNN [35], DAB-DETR [40], YOLOv5 [69], YOLOv8 [70], YOLOX [71], YOLOv8-World [72], YOLOv10 [73], and YOLO11 [74]. All methods are evaluated on the 3rd Anti-UAV dataset, and the same evaluation metrics are used to ensure fair comparisons. The results for all methods are reproduced using their publicly available code.
The results demonstrate that the proposed model achieves excellent performance, with validation of its effectiveness across four metrics. As shown in
Table 1, the precision of the method reaches 89.5%, the recall reaches 59.3%, the $mAP_{0.5}$ reaches 62.2%, and the $mAP_{0.5:0.95}$ reaches 44.8%, which is superior to the current advanced algorithms.
4.4. Ablation Studies
To validate the effectiveness of each module, we conducted ablation experiments for each component and presented the results in
Table 2 and
Figure 8. In
Figure 8, we demonstrate that the modified SAM significantly improves accuracy by reducing the amount of irrelevant negative sample information. Although precision slightly decreased after incorporating DCPPA while the mAP increased, we believe that DCPPA enhances detection accuracy by expanding the receptive field to gather more information, thereby suppressing noise and reducing the false positive rate. Additionally, RCSCAA enhances the accuracy of weak object detection, and BiFormer strengthens overall detection accuracy through bidirectional routing. After adding the DCPPA module, the mAP improves by 1.2%. The DCPPA module helps significantly improve the detection of weak objects, making it particularly useful for handling small and difficult-to-detect objects. By replacing the C2f module with the BiFormer module in the backbone network, the model’s ability to relate contextual information is strengthened, its detection ability under multi-scale conditions is strengthened, and the feature mismatch problem caused by large-scale changes is alleviated.
Figure 8 illustrates the qualitative comparisons. By applying SAM to improve the bounding boxes of the object detection task, we shrink the object bounding box area while retaining the UAV object information and reducing irrelevant background noise. As a result, the $mAP_{0.5:0.95}$ of the standard YOLOv8 model increases from 22.7% to 42.5%, and that of IRWT-YOLO increases from 23.3% to 44.8%, demonstrating that the integration of segmentation improves detection performance, especially in complex environments with cluttered backgrounds.
We not only improved the detection accuracy but also maintained other metrics consistent with the baseline. This suggests that the integration of the Segment Anything Model enhances the accuracy of boundary localization and improves the detection performance at higher IoU thresholds. $mAP_{0.5:0.95}$, which evaluates performance across multiple IoU thresholds, is more sensitive to such improvements, particularly at higher IoU values. In contrast, metrics like precision and recall, which are influenced more by classification and overall detection performance, remain less affected. These findings underscore the value of SAM in enhancing detection accuracy in scenarios demanding high localization precision, despite its limited impact on global detection metrics such as precision and recall.
4.5. Results Visualization
To verify the effectiveness of the model, we adopted Grad-CAM to generate visual heatmaps of the model (Figure 9). In these visualizations, we observed that when UAVs appear at the edges of buildings, the detection process is often interfered with. YOLOv8 detects a potential object in these regions, but with very low confidence, making it unable to confirm the presence of a UAV. On the other hand, YOLO11 fails to detect potential positions altogether. Furthermore, both models allocate attention to irrelevant areas of the image, leading to inefficient use of resources. When the UAV is further away and occupies only a few pixels, making feature extraction difficult, our model still demonstrates superior detection capabilities, while the other two models fail to detect the UAV.
Additionally, we conducted experiments with complex background interference, selecting scenarios with mountain forests, clouds, and buildings as backgrounds. In these complex environments, distinguishing UAVs becomes challenging for the models. UAVs with low contrast against similarly low-contrast complex backgrounds are particularly difficult to identify. As shown in the images, YOLOv11 allocates a significant portion of its attention to irrelevant areas in an attempt to locate the object, while YOLOv8 can detect the UAV but with much lower confidence compared to the IRWT-YOLO model. To demonstrate the superiority of the IRWT-YOLO model, we use heatmaps to validate its performance in
Figure 9.
4.6. Experiments on Publicly Available Datasets
We note that most current infrared weak object datasets are small-sample datasets. To address this issue, we additionally evaluate the IRWT-YOLO model on the SIRSTv2 [9,75,76] and IRSTD-1k [77] datasets. Both datasets contain significantly fewer samples than the 3rd Anti-UAV dataset. The IRSTD-1k dataset includes 640 training images, 160 validation images, and 200 test images, while the SIRSTv2 dataset consists of 511 training images, 255 validation images, and 255 test images. Both datasets feature diverse cluttered backgrounds that introduce interference challenges.
The IRWT-YOLO model demonstrates substantial accuracy improvements over the baseline YOLOv8 model on both datasets. The results are represented in
Table 3. On the IRSTD-1k dataset, IRWT-YOLO achieves a 13.6% improvement in $mAP_{0.5}$ and a 7.0% improvement in $mAP_{0.5:0.95}$. On the SIRSTv2 dataset, it achieves a 19.6% increase in $mAP_{0.5}$ and a 9.6% increase in $mAP_{0.5:0.95}$. These results highlight the significant performance gains of the IRWT-YOLO model across multiple datasets, particularly under low-data and small-training scenarios. The model effectively mitigates overfitting, a common challenge under limited sample conditions, and converges more rapidly. It is confirmed that the IRWT-YOLO model maintains excellent performance even in the absence of sufficient training samples, demonstrating its outstanding transferability and adaptability, making it a robust solution for small-sample and complex-background scenarios.
5. Discussion
The experimental results show that the proposed method effectively enhances the accuracy of infrared weak object detection, achieving higher average precision than existing methods. It addresses the challenges of irrelevant background interference and the difficulty of detecting weak objects, significantly improving detection accuracy. By applying image segmentation for background suppression, we enhance the object’s visibility, eliminate background interference, and strengthen the model’s detection performance. We designed the IRWT-YOLO model specifically for infrared UAV images. By integrating BiFormer into the backbone, the model’s ability to understand and relate contextual information is improved, mitigating the issue of large object size variations due to camera zoom and other factors.
Furthermore, to address the difficulty of extracting features from weak small objects, we propose the DCPPA module, which uses dual convolution branches to expand the receptive field and enhance small object detection capabilities. We also introduce the RCSCAA module to strengthen long-range semantic feature extraction. The improvement in IRWT-YOLO’s detection performance can be attributed to the combination of background suppression, global and local context enhancement, and enhanced feature extraction capabilities. The effective integration of image segmentation reduces background clutter, allowing the model to focus on weak objects. These results align with previous studies, confirming the effectiveness of background suppression in infrared imaging tasks. Compared to earlier background suppression methods, the proposed approach demonstrates superior robustness under various environmental conditions. Previous methods often struggle with complex backgrounds, whereas this approach, leveraging image segmentation, maintains strong performance across different datasets.
However, despite the improvement in detection accuracy, the IRWT-YOLO model introduces additional parameters and increased computational complexity. Future research will focus on optimizing the model architecture and exploring lightweight configurations to accelerate inference speed and enhance practical deployment feasibility. Moreover, while the background suppression strategy has effectively reduced interference, some residual background noise remains. To address this, future work will further refine the background suppression mechanism to minimize background interference and enhance the detection capability of infrared weak objects.
6. Conclusions
Infrared UAV image detection is challenged by weak object visibility and background interference, which complicate the detection process. To address this issue, we propose using image segmentation methods to refine the object bounding boxes in the training dataset and introduce a new model, IRWT-YOLO. By applying image segmentation to improve the label information, we ensure that the bounding boxes represent the object’s minimal bounding box, reducing background clutter and enabling the model to focus more effectively on UAV-related information. This approach is particularly suitable for handling discrete, hard-to-annotate fuzzy object points and complex objects with low contrast relative to the background.
In the IRWT-YOLO model, we design new modules and structures, including the integration of BiFormer into the backbone network and introduction of the RCSCAA and DCPPA modules to enhance weak object detection performance. BiFormer utilizes a dual-route mechanism to effectively link and leverage contextual information, strengthening multi-scale object detection capabilities and improving the flow of features across different scales. RCSCAA strengthens feature extraction by introducing the CAA mechanism, which effectively captures long-range dependencies and enhances the ability to extract semantic features of the object. The DCPPA module, through a multi-branch structure, more comprehensively captures feature information from the input image, enhancing the model’s ability to handle complex images.
We conduct extensive experiments and validations using the 3rd Anti-UAV infrared UAV dataset, demonstrating the superiority of the proposed method from multiple angles. We also conduct supplementary experiments on the SIRSTv2 and IRSTD-1k datasets. The results show that the proposed approach outperforms traditional segmentation and deep learning models. Compared to the baseline model YOLOv8, IRWT-YOLO achieves precision improvements of up to 15.5% on SIRSTv2, recall improvements of up to 14.5% on IRSTD-1k, and mAP improvements of up to 21.0% on the 3rd Anti-UAV dataset. Furthermore, we compare the IRWT-YOLO model with ten other state-of-the-art single-stage and two-stage object detection models and achieve the best overall performance.
While IRWT-YOLO significantly enhances detection accuracy, the increased model complexity presents challenges for real-time applications. Future work will focus on developing lightweight architectures through techniques such as model pruning, quantization, and knowledge distillation, aiming to accelerate inference while maintaining high detection performance.