S-Drone-YOLO: A Parameter-Efficient P2-Guided Quality-Aware YOLO Detector for Infrared Small UAV Detection

Aldubaikhi, Ali; Patel, Sarosh

doi:10.3390/app16125854


by Ali Aldubaikhi * and Sarosh Patel

Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 5854; https://doi.org/10.3390/app16125854

Submission received: 17 May 2026 / Revised: 6 June 2026 / Accepted: 8 June 2026 / Published: 10 June 2026

(This article belongs to the Special Issue Recent Advances and New Trends in Computer Vision and Image Processing)


Abstract

Infrared small-UAV detection remains difficult because the target often appears as a weak thermal point rather than a clear object. This problem is clear in the SIDD dataset, where most test targets are smaller than 32 × 32 pixels. To address this case, this paper proposes S-Drone-YOLO, a compact YOLO-based detector that maintains a high-resolution P2 prediction path and leverages it carefully during classification. The model starts from a lightweight YOLOv5-style detector. It adds a stride-4 P2 path and replaces the C3 neck blocks with C2fAttn to improve feature reuse before prediction. Two components are then added to the Architecture II design. The Coordinate-Aware Residual C2f Block, CAR-C2f, strengthens the P2 branch using coordinate attention and residual scaling. The P2-Guided Quality-Aware Detection Head (P2-QADH) combines local P2 details with nearby P3 context. It produces a quality map that adjusts the classification logits. The regression branch, output tensor format, and training loss interface remain unchanged. On the SIDD infrared drone dataset, S-Drone-YOLO reaches 0.988 precision, 0.939 recall, 0.699 mAP50-95, and 0.962 F1-score. It uses 6.45 M parameters and 31.3 GFLOPs. Compared with the Architecture I model, recall increases by 0.8 percentage points and mAP50-95 increases by 0.4 percentage points. At the same time, the parameter count decreases by 20.3%, and GFLOPs decrease by 43.7%. Fine-tuning on five RGB UAV datasets and a second thermal dataset (ThermalUAV2UAV) yields F1 scores ranging from 0.941 to 0.999, with an mAP50-95 of 0.843 on the thermal dataset. The background analysis also shows stable F1-scores across sky, sea, city, and mountain scenes. These results suggest that controlled P2 guidance can improve infrared small-UAV detection while keeping the model size practical.

Keywords: small object detection; UAV detection; infrared drone detection; YOLO; P2 detection head; coordinate attention; quality-aware detection

1. Introduction

The main difficulty in this study arises from the target’s size and the weak visual evidence. In many UAV surveillance scenes, the drone is far from the camera and occupies only a few pixels. Aldubaikhi and Patel describe common problems in small-object detection, including feature loss, scale imbalance, low signal-to-noise ratio, unclear context, and sensitivity to annotation quality [1]. These problems are exacerbated in infrared UAV detection because the target may be blurred, low-contrast, or blended with the background.

The COCO protocol defines a small object as one with a bounding box area of less than 32 × 32 pixels. Medium objects have bounding boxes with areas ranging from 32 × 32 to 96 × 96 pixels. Large objects have bounding box areas of 96 × 96 pixels or more [2]. In the SIDD test split used in this study, 801 out of 949 UAV targets are smaller than 32 × 32 pixels. This represents 84.40% of the test targets. Therefore, the evaluation primarily tests the detector’s ability to detect weak UAV targets near the image resolution limit.
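The size thresholds and the small-target share quoted above can be checked with a short sketch. The function names are illustrative and not taken from the paper; the thresholds follow the wording of this paragraph (small below 32 × 32, large at 96 × 96 or more).

```python
# COCO-style size buckets by bounding-box area in pixels,
# using the thresholds as stated in the paragraph above.
SMALL_MAX_AREA = 32 * 32     # below this area: small
MEDIUM_MAX_AREA = 96 * 96    # below this area (and not small): medium


def coco_size_bucket(width: float, height: float) -> str:
    """Return 'small', 'medium', or 'large' for a box of the given size."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def share_percent(count: int, total: int) -> float:
    """Percentage of `count` in `total`, rounded to two decimals."""
    return round(100.0 * count / total, 2)


print(coco_size_bucket(10, 10))    # a 10 x 10 UAV box falls in the small bucket
print(share_percent(801, 949))     # share of small targets in the SIDD test split
```

The second call reproduces the 84.40% figure from the 801 and 949 counts reported in the text.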

Infrared imagery makes small-UAV detection more difficult than in many visible-light cases. It is useful for surveillance in low-light conditions, but it removes color and most texture information. In many images, the detector sees only a weak thermal response with an unclear boundary. Some background regions can also resemble small UAVs. Examples include clouds, waves, building edges, and mountain textures. Yuan et al. introduced SIDD for single-frame infrared small-drone detection [3]. The dataset includes mountain, city, sky, and sea scenes, and each scene type creates different background interference.

Recent UAV detectors often use high-resolution heads or attention modules to reduce the loss of small-object details. BRA-YOLOv10 adds a high-resolution small-target layer for UAV detection on SIDD [4]. YOLO-Drone also uses a high-resolution head and changes the feature extraction process [5]. Another YOLOv8-based detector reports better results by using attention and a tiny detection head [6]. These studies support the use of shallow spatial features for detecting small UAVs. However, they do not clearly explain how P2 information should influence the final classification confidence.

This paper focuses on this practical gap. Adding a P2 branch can improve recall but also increase computation. Attention can improve feature selection, but if it is not controlled, it may weaken small UAV cues during training. S-Drone-YOLO addresses these issues in three ways. It keeps the P2 path for high-resolution small-target information. It refines this path with residual coordinate-aware attention. It also uses a controlled quality signal to adjust the classification logits. The design keeps the standard Ultralytics YOLO detection interface unchanged. This makes training and evaluation easier to repeat.

The main contributions of this work are summarized below:

A four-scale YOLO detector is designed for infrared UAV detection. It adds a stride-4 P2 path to the standard P3–P5 prediction structure, enabling small targets to be evaluated before significant spatial detail is lost.
A Coordinate-Aware Residual C2f Block (CAR-C2f) is placed only in the P2 branch. It strengthens location-sensitive features while preserving the same input and output channel dimensions.
A P2-Guided Quality-Aware Detection Head (P2-QADH) is introduced. It uses the nearby P3 context to refine P2 and applies a controlled P2-based quality bias to classification logits without changing the regression branch.
A full experimental analysis is reported on SIDD, including component ablation, COCO-size target analysis, comparison with recent YOLO baselines, fine-tuning on five UAV datasets, background-based analysis, and confusion-matrix-based error analysis.

2. Related Work

2.1. Small-Object Detection Challenges and Evaluation

Small-object detection should not be treated solely as a scale issue. The recent SOD survey [1] shows that small targets lose useful responses after repeated downsampling. They also suffer from low signal-to-noise ratio, scale imbalance, weak context, and higher annotation sensitivity. The COCO object-size protocol [2] is useful here because a detector may achieve a good overall score while still missing many very small objects. For UAV surveillance, these missed targets are the most important failure cases.

A fair evaluation should consider both accuracy and cost. Precision measures the fraction of predicted UAVs that are correct. Recall measures the number of real UAVs detected. mAP50-95 evaluates localization quality across stricter IoU thresholds. Parameter count and GFLOPs describe the computational cost of the model. These metrics should be reported together because a small accuracy gain may not be useful if the model becomes too heavy for practical deployment.

2.2. YOLO Detectors and UAV Detection

One-stage detectors, such as YOLO and SSD, predict bounding boxes and classes in a single forward pass [7,8]. This makes them attractive for UAV detection, where speed and model size are usually important. Later YOLO variants improved the backbone, feature fusion, prediction strategy, and training procedure. YOLOv10 and YOLOv7 also follow this efficiency-oriented direction [9,10].

Feature-pyramid design is central to UAV detection. FPN introduced top-down multiscale fusion [11], PANet added bottom-up aggregation [12], and CSPNet improved gradient flow and efficiency through cross-stage partial connections [13]. These designs influenced many YOLO backbones and necks. However, the standard P3–P5 structure can still be weak for tiny UAVs because the first prediction level has a stride of 8. If a UAV is only 5 × 5 or 10 × 10 pixels, much of its local evidence may already be weakened by the time of prediction.

High-resolution detection heads address this issue more directly. BRA-YOLOv10 used a high-resolution small-target layer on SIDD [4]. YOLO-Drone also added a high-resolution head and modified the backbone to reduce information loss [5]. A YOLOv8-based drone detector also improved performance by using attention with a tiny detection head [6]. S-Drone-YOLO follows this direction, but its P2 branch is not only an extra output. It also provides a quality cue that helps control classification confidence.

2.3. Attention-Based Feature Refinement

Attention is useful when it is placed where the model needs it most. Coordinate Attention is suitable for this work because it keeps horizontal and vertical position information [14]. For a UAV that covers only a few pixels, position cues can be as important as channel cues. CAR-C2f uses coordinate attention in a residual form so that the P2 feature is refined without forcing the attention output to dominate the original feature.

C2f-style aggregation also helps the neck reuse intermediate features. A C3 block uses one processed branch and one shortcut branch, then fuses them at the output. C2f collects intermediate bottleneck outputs, and C2fAttn adds attention-guided refinement. In S-Drone-YOLO, replacing C3 with C2fAttn is useful because infrared backgrounds can contain small structures that look like UAVs, such as wave edges, bright building parts, skyline points, or thermal noise.

2.4. Infrared UAV Datasets and Generalization

SIDD is the main dataset in this study. It contains 4737 infrared images with a resolution of 640 × 512, covering mountain, city, sky, and sea backgrounds [3]. The dataset focuses on quadrotor UAVs in realistic intrusion scenes and provides pixel-level annotation. Recent drone-detection studies, including BRA-YOLOv10, have also used SIDD [4], which makes it a relevant benchmark for comparing infrared small-UAV detectors.

Six UAV datasets are used for fine-tuning-based generalization tests. TIB-Net contains 2850 images with multi-rotor and fixed-wing UAVs in low-altitude scenes [15]. Det-Fly contains 13,271 DJI Mavic images in sky, city, field, and mountain backgrounds [16]. UAVfly contains 10,281 images from urban, suburban, desert, field, lake, sky, and mountain scenes [17]. LRDD v1 contains 21,190 long-range drone images with small targets and difficult backgrounds [18]. DUT Anti-UAV provides 10,000 image sequences for visible-light UAV detection and tracking [19]. The Thermal UAV 2UAV dataset adds 3856 thermal images captured by a DJI Zenmuse XT2 sensor (Da-Jiang Innovations Science and Technology Co., Ltd., Shenzhen, China) mounted on a DJI Matrice 210 RTK. It includes four quadcopter types and two hexacopter types, with background, single-UAV, and two-UAV images, making it useful for testing the generalization of thermal UAV-to-UAV detection [20].

2.5. Rationale for the YOLOv5 Base Architecture

Later YOLO variants, such as YOLOv8, YOLOv10, YOLO11, YOLO12, and YOLO26, introduce useful improvements in head design, training strategy, and feature aggregation. They are used in this paper as evaluation baselines (Section 5.3). However, the contribution of S-Drone-YOLO is not a new training recipe or backbone; it is a small set of architectural components (the stride-4 P2 path, the CAR-C2f block, and the P2-QADH head), whose effects need to be cleanly measured. YOLOv5s is selected as the base architecture for three reasons. (i) YOLOv5s is small (5.27 M parameters, 7.7 GFLOPs), so the cost of the proposed components is visible rather than absorbed by a larger backbone. (ii) The YOLOv5 head and loss interface (anchor-based or anchor-free Ultralytics head with the standard TaskAlignedAssigner) is well documented and stable, which makes the controlled modification of a single component (replacing the standard head with P2-QADH while keeping the regression branch intact) straightforward. (iii) The YOLOv5 codebase is widely reproduced for drone-detection studies, including BRA-YOLOv10 [4] and the YOLOv8-based drone detector of Zamri et al. [6], so the results in this paper can be compared on a common architectural footing. Newer YOLO versions are not used as the base because the proposed P2-QADH and CAR-C2f blocks are architecture-agnostic and can, in principle, be transplanted into any YOLO family in future work; this ablation uses YOLOv5 as the cleanest test environment.

3. Method

S-Drone-YOLO is built on top of the standard YOLOv5s pipeline and inherits the usual three-stage layout: backbone, feature-pyramid neck, and multi-scale detection head. Four architectural changes are introduced, each targeting a specific failure mode of small infrared UAV detection. (i) A stride-4 P2 prediction path is added so that very small targets are evaluated before too much spatial detail is lost. (ii) The C3 blocks in the neck are replaced by C2fAttn blocks to provide richer feature aggregation before the head. (iii) Inside the P2 branch, a Coordinate-Aware Residual C2f block (CAR-C2f) refines the high-resolution feature using residual coordinate attention. (iv) A P2-Guided Quality-Aware Detection Head (P2-QADH) replaces the standard head and uses the enhanced P2 feature to produce a quality map that biases classification logits at every level, while the regression branch and the YOLO output-tensor format remain unchanged.

The unifying idea behind these changes is P2-guided feature control: the P2 pathway is treated not just as an extra prediction branch, but as a controlled source of small-target evidence that softly modulates the classifier at every scale.

The development is presented in two stages. Architecture I introduces the P2 path and the C2fAttn neck and is analyzed in Section 5.1. Architecture II adds CAR-C2f and P2-QADH on top of Architecture I and is analyzed in Section 5.2; it constitutes the final S-Drone-YOLO design whose results are reported in Sections 5.3–5.8. Section 3.1 gives the formal architectural overview, and Sections 3.2–3.4 describe each component in detail.

3.1. Design Logic and Overall Architecture

The architecture as shown in Figure 1 was designed around the point where small-UAV evidence is most likely to disappear. The baseline model adopted in this work is the lightweight YOLOv5s configuration (depth_multiple = 0.33, width_multiple = 0.50) from the Ultralytics implementation. YOLOv5s was chosen for three reasons. First, its parameter count (around 5.27 M) and 7.7 GFLOPs are well-suited to embedded UAV-surveillance deployment. Second, its CSP-based backbone and PANet-style neck provide a stable starting point that allows controlled ablation of the proposed P2 pathway, CAR-C2f block, and P2-QADH head without confounding effects from large architectural differences. Third, the YOLOv5 training and evaluation pipeline (Ultralytics v8.3.138) is mature and reproducible, which is important for a study that introduces multiple architectural components in sequence. Architecture I (Section 5.1) refers to YOLOv5s extended with the stride-4 P2 path and C2fAttn neck, while Architecture II (Section 5.2) refers to S-Drone-YOLO obtained after adding CAR-C2f and P2-QADH. In a standard P3–P5 YOLO head, the first prediction stride is 8. For a 5 × 5 or 10 × 10 UAV, this can remove useful local detail before prediction. S-Drone-YOLO therefore keeps a stride-4 P2 path and treats it as the main source of fine spatial evidence. The final detector predicts at four levels, P2, P3, P4, and P5, while preserving the usual end-to-end YOLO workflow.

Let the input image be denoted by $X \in \mathbb{R}^{3 \times H \times W}$. The detector produces four pyramid features for prediction:

$$\mathcal{F} = \{P_{2}, P_{3}, P_{4}, P_{5}\}, \quad P_{i} \in \mathbb{R}^{C_{i} \times H_{i} \times W_{i}} \tag{1}$$

Their spatial sizes depend on the detection stride:

$$H_{2} = \frac{H}{4},\ W_{2} = \frac{W}{4}; \quad H_{3} = \frac{H}{8},\ W_{3} = \frac{W}{8}; \quad H_{4} = \frac{H}{16},\ W_{4} = \frac{W}{16}; \quad H_{5} = \frac{H}{32},\ W_{5} = \frac{W}{32} \tag{2}$$

With a 640 × 640 input, P2 has a 160 × 160 grid, while P3 has an 80 × 80 grid. This difference is important for very small UAVs. A target occupying only a few pixels can still produce a visible local response on P2, but the same response may be mixed with background noise at coarser levels.
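The grid sizes in Equation (2) and the effect of stride on a tiny target can be illustrated with a minimal sketch. The helper names are hypothetical, and "cells spanned" is only a rough proxy for how much of a target survives downsampling.

```python
# Detection strides of the four prediction levels, as given in Equation (2).
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}


def grid_sizes(height: int, width: int) -> dict:
    """Grid size (H_i, W_i) of each prediction level for an H x W input."""
    return {level: (height // s, width // s) for level, s in STRIDES.items()}


def cells_spanned(target_px: float, stride: int) -> float:
    """Number of grid cells covered along one axis by a target of `target_px` pixels."""
    return target_px / stride


sizes = grid_sizes(640, 640)
for level, (h, w) in sizes.items():
    print(f"{level}: {h} x {w}")

# A 5-pixel-wide UAV covers more than one cell on P2 but less than one on P3.
print(cells_spanned(5, STRIDES["P2"]), cells_spanned(5, STRIDES["P3"]))
```

Under this rough measure, a 5 × 5 target still spans a full cell at stride 4 but only a fraction of a cell at stride 8, which is the motivation given for keeping the P2 path.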

3.2. Replacement of C3 with C2fAttn in the Neck

In addition to the P2 path, Architecture I replaces the C3 neck blocks with C2fAttn. The C3 block uses a common CSP structure. One branch passes through bottleneck layers, while the other branch works as a shortcut before the features are fused. This structure is stable, but it reuses few intermediate features. C2fAttn gives the neck stronger feature aggregation. It keeps more intermediate information and uses attention-guided refinement to reduce unwanted background responses before the final detection head. This is useful for small infrared UAV detection, where weak drone features can easily be confused with background patterns. Figure 2 shows the conceptual comparison between C3 and C2fAttn.

This change is made at the neck because the feature maps already contain multiscale information there. The aim is to improve feature selection before the final P2–P5 pyramid is passed to the detection head. This replacement keeps the standard YOLO training interface unchanged. It also does not require a new target assignment strategy.

3.3. Coordinate-Aware Residual C2f Block

CAR-C2f is inserted only before detection in the P2 path. This placement is intentional. The P2 branch has the highest spatial resolution and carries the most useful detail for tiny UAVs, so it receives the extra refinement. P3–P5 are left unchanged to avoid unnecessary computation and to reduce the risk of disturbing deeper semantic features.

The block in Figure 3 has three stages that map directly to Equations (3)–(9). Stage A (Equations (3)–(5)): a $\mathrm{Conv}_{1 \times 1}$ expands the input $X$, and the result is split into two halves $Y_{0}$ and $Y_{1}$; $Y_{1}$ is then processed by $n$ bottleneck blocks producing $Y_{2}, \dots, Y_{n + 1}$, and all of these features are concatenated and projected by a second $\mathrm{Conv}_{1 \times 1}$ to give the aggregated feature $F_{agg}$. In Figure 3 this corresponds to the C2f-style block on the left. Stage B (Equations (6) and (7)): Coordinate Attention forms horizontal and vertical pooled descriptors from $F_{agg}$ and produces $F_{ca}$, shown in the middle block of Figure 3. Stage C (Equations (8) and (9)): $F_{agg}$ and $F_{ca}$ are combined with a learnable residual scale $\alpha_{a} = \tanh(\theta_{a})$, and a global residual connection adds the input $X$ back. This is the right-hand path in Figure 3. In all expressions, $e = 0.5$ is the channel expansion ratio that controls the hidden width $c = \lfloor eC \rfloor$, and $n = 1$ is the number of bottleneck stages used in Architecture II.

For an input feature map $X \in \mathbb{R}^{C \times H \times W}$, CAR-C2f first applies a $1 \times 1$ convolution and splits the result into two hidden feature partitions:

$$[Y_{0}, Y_{1}] = \mathrm{Split}(\mathrm{Conv}_{1 \times 1}(X)), \quad Y_{0}, Y_{1} \in \mathbb{R}^{c \times H \times W}, \quad c = \lfloor eC \rfloor \tag{3}$$

The last hidden feature is then passed through $n$ bottleneck transformations:

$$Y_{k + 1} = B_{k}(Y_{k}), \quad k = 1, \dots, n \tag{4}$$

In the final implementation, $n = 1$ and $e = 0.5$. Shortcut connections, depthwise convolution, residual attention, and an initial attention scale $\theta_{a} = 0$ are used. The aggregated feature is written as:

$$F_{agg} = \mathrm{Conv}_{1 \times 1}(\mathrm{Concat}(Y_{0}, Y_{1}, Y_{2}, \dots, Y_{n + 1})) \tag{5}$$

Coordinate Attention is then used to form horizontal and vertical pooled descriptors [14]:

$$z_{h}(c, h) = \frac{1}{W} \sum_{w = 1}^{W} F_{agg}(c, h, w), \quad z_{w}(c, w) = \frac{1}{H} \sum_{h = 1}^{H} F_{agg}(c, h, w) \tag{6}$$

The horizontal and vertical attention maps, $A_{h}$ and $A_{w}$, are applied to the aggregated feature:

$$F_{ca}(c, h, w) = F_{agg}(c, h, w) \cdot A_{h}(c, h) \cdot A_{w}(c, w) \tag{7}$$

Instead of replacing the aggregated feature directly, CAR-C2f applies residual attention scaling:

$$F_{out} = F_{agg} + \alpha_{a} F_{ca}, \quad \alpha_{a} = \tanh(\theta_{a}) \tag{8}$$

If the input and output shapes are the same, a global residual connection is added:

$$Y = X + F_{out} \tag{9}$$

The initialization of $\theta_{a} = 0$ is used to keep the block close to a stable residual path during the initial training phase. The model then learns how much coordinate-aware refinement should influence P2. This choice reduces early instability and limits the risk of suppressing weak UAV responses.
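The effect of the zero initialization can be seen in a small single-channel sketch of Equations (6)–(8). This is a simplified illustration, not the authors' implementation: the learned convolutions inside Coordinate Attention are replaced by a plain sigmoid on the pooled descriptors, and the function names and the value `theta_a = 0.5` are assumptions made for the example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def coordinate_pool(feature):
    """Eq. (6) for one channel: row means (z_h) and column means (z_w)."""
    height, width = len(feature), len(feature[0])
    z_h = [sum(row) / width for row in feature]
    z_w = [sum(feature[r][c] for r in range(height)) / height for c in range(width)]
    return z_h, z_w


def residual_coordinate_attention(feature, theta_a: float):
    """Eqs. (7)-(8) for one channel: F_out = F_agg + tanh(theta_a) * F_ca."""
    z_h, z_w = coordinate_pool(feature)
    # Assumption: the learned layers of Coordinate Attention are replaced
    # by a plain sigmoid so the example stays dependency-free.
    a_h = [sigmoid(z) for z in z_h]
    a_w = [sigmoid(z) for z in z_w]
    alpha = math.tanh(theta_a)
    return [
        [value + alpha * value * a_h[r] * a_w[c] for c, value in enumerate(row)]
        for r, row in enumerate(feature)
    ]


f_agg = [[1.0, 3.0], [5.0, 7.0]]
# With theta_a = 0 the residual scale is zero, so F_out equals F_agg.
at_init = residual_coordinate_attention(f_agg, theta_a=0.0)
# Hypothetical learned value: the attention branch now contributes.
after_training = residual_coordinate_attention(f_agg, theta_a=0.5)
```

At initialization the attention branch is switched off entirely, so the block starts from the plain C2f-style output and only gradually mixes in the coordinate-aware term as $\theta_{a}$ moves away from zero.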

3.4. P2-Guided Quality-Aware Detection Head

P2-QADH replaces the standard detection head but keeps the expected YOLO output format. In the four-scale setting, the head receives $\{P_{2}, P_{3}, P_{4}, P_{5}\}$. Its role is not to redesign the full detection pipeline. Instead, it improves the P2 feature with nearby P3 context and produces a P2-derived quality map that adjusts classification logits. The regression branches remain structurally unchanged.

P2-QADH operates in three steps, as shown in Figure 4. Step 1, P2/P3 fusion (Equations (10)–(12)): the P3 feature is upsampled to the P2 resolution, concatenated with P2, and projected by a $\mathrm{Conv}_{1 \times 1}$ to $F_{23}$; a depthwise convolution $\mathrm{DWConv}(F_{23} - P_{2})$ then produces a refinement signal $\Delta_{2}$. This corresponds to the “P3 upsample + fusion” branch in the upper-left part of Figure 4. Step 2, hybrid quality gate (Equations (13)–(15)): a spatial gate $G_{s}$ and a channel gate $G_{c}$ are computed from P2 using depthwise and 1 × 1 convolutions and global average pooling. Their product is rescaled to $\tilde{G}$, which is bounded below by a learnable floor $f$ (Equation (14)), and the enhanced P2 feature is $P_{2}^{'} = P_{2} + \alpha_{scale} \tanh(\theta_{\alpha}) \tilde{G} \odot \Delta_{2}$ (Equation (15)). In Figure 4, these gates are shown as the central “spatial gate” and “channel gate” modules, and the gated residual addition is labelled “$P_{2}^{'}$”. Step 3, quality-aware classification (Equations (16)–(18)): a 1 × 1 projection on $P_{2}^{'}$ followed by a sigmoid produces a single-channel quality map $Q_{2}$; the map is resized to each of the four detection levels and added to the classification logits $Z_{i}$ with a learnable scale $\lambda_{i} = \lambda_{QA} \tanh(\theta_{i})$. The regression branch $R_{i}$ is unchanged, so the YOLO output tensor $O_{i} = \mathrm{Concat}(R_{i}, Z_{i}^{'})$ keeps the standard format. In Figure 4 this corresponds to the “quality map” block on the right and the four arrows labelled $Q_{2} \to Z_{i}$. The final hyperparameters in the deployed model are $\alpha_{scale} = 0.25$ and $\lambda_{QA} = 1.25$.

The P3 feature is first resized to match the spatial resolution of P2:

$$P_{3}^{\uparrow} = \mathrm{Upsample}(P_{3},\ \mathrm{size} = \mathrm{size}(P_{2})) \tag{10}$$

The resized P3 feature and the original P2 feature are fused as follows:

$$F_{23} = \mathrm{Conv}_{1 \times 1}(\mathrm{Concat}(P_{2}, P_{3}^{\uparrow})) \tag{11}$$

A lightweight residual refinement signal is computed from the difference between this fused feature and the original P2:

$$\Delta_{2} = \mathrm{DWConv}(F_{23} - P_{2}) \tag{12}$$

A hybrid spatial-channel gate is then computed from P2:

$$G_{s} = \sigma(\mathrm{Conv}_{1 \times 1}(\mathrm{DWConv}(P_{2}))), \quad G_{c} = \sigma(\mathrm{Conv}_{1 \times 1}(\mathrm{GAP}(P_{2}))), \quad G = G_{s} \odot G_{c} \tag{13}$$

A learnable floor is added so that the gate does not push useful responses too close to zero:

$$\tilde{G} = G(1 - f) + f, \quad f = 0.25 \cdot \sigma(\theta_{f}) \tag{14}$$
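As a quick check of Equation (14), the range of the floored gate follows directly from the sigmoid ranges. The derivation below assumes only that $G$ is an elementwise product of two sigmoid outputs, so each entry lies strictly between 0 and 1.

```latex
% Assumes 0 < G < 1 elementwise and 0 < \sigma(\theta_f) < 1.
\begin{aligned}
f &= 0.25\,\sigma(\theta_{f}) \in (0,\ 0.25), \\
\tilde{G} &= G(1 - f) + f = f + (1 - f)\,G \in (f,\ 1).
\end{aligned}
```

For an illustrative value $\theta_{f} = 0$ (chosen here for the example, not reported in the paper), $\sigma(0) = 0.5$ gives $f = 0.125$, so every gate entry stays within $(0.125, 1)$ and the refinement signal is never fully suppressed.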

The enhanced P2 feature is defined as:

$$P_{2}^{'} = P_{2} + \alpha_{scale} \tanh(\theta_{\alpha})\, \tilde{G} \odot \Delta_{2} \tag{15}$$

The final setting uses $\alpha_{scale} = 0.25$ and $\lambda_{QA} = 1.25$. A quality map is produced from the enhanced P2 feature:

$$Q_{2} = \sigma(\psi(P_{2}^{'})), \quad Q_{2} \in \mathbb{R}^{1 \times H_{2} \times W_{2}} \tag{16}$$

For each detection level $i \in \{2, 3, 4, 5\}$, the quality map is resized to the corresponding classification-logit resolution and used as a controlled bias:

$$Q_{i} = \mathrm{Resize}(Q_{2},\ \mathrm{size} = \mathrm{size}(Z_{i})), \quad \lambda_{i} = \lambda_{QA} \tanh(\theta_{i}), \quad Z_{i}^{'} = Z_{i} + \lambda_{i} Q_{i} \tag{17}$$

The output at each detection level is then:

$$O_{i} = \mathrm{Concat}(R_{i}, Z_{i}^{'}) \tag{18}$$

Here, $R_{i}$ denotes the regression output and $Z_{i}^{'}$ denotes the quality-adjusted classification logits. The quality signal is deliberately limited to classification. Regression is left unchanged because modifying the box prediction could affect localization stability and the standard Ultralytics loss pipeline. This choice keeps the output tensor structure compatible with YOLO while allowing the high-resolution P2 feature to influence confidence estimation.
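The logit adjustment in Equation (17) can be sketched on a toy quality map. This is an illustration under stated assumptions rather than the released head: the nearest-neighbour resize, the helper names, and the value `theta_3 = 0.4` are hypothetical, and only $\lambda_{QA} = 1.25$ comes from the text.

```python
import math

LAMBDA_QA = 1.25  # value reported in the text


def quality_adjusted_logit(z: float, q: float, theta_i: float,
                           lambda_qa: float = LAMBDA_QA) -> float:
    """Eq. (17) at one location: Z' = Z + lambda_QA * tanh(theta_i) * Q."""
    return z + lambda_qa * math.tanh(theta_i) * q


def resize_nearest(q_map, out_h: int, out_w: int):
    """Nearest-neighbour resize of a 2-D list (an assumed interpolation mode)."""
    in_h, in_w = len(q_map), len(q_map[0])
    return [
        [q_map[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Toy 4 x 4 quality map at P2 resolution; values lie in (0, 1) like a sigmoid output.
q2 = [[0.1, 0.1, 0.9, 0.9],
      [0.1, 0.1, 0.9, 0.9],
      [0.2, 0.2, 0.8, 0.8],
      [0.2, 0.2, 0.8, 0.8]]
q3 = resize_nearest(q2, 2, 2)            # resized to a 2 x 2 P3 logit grid
z3 = [[-2.0, -2.0], [-2.0, -2.0]]        # toy classification logits at P3
theta_3 = 0.4                            # hypothetical learned scale parameter
z3_adj = [
    [quality_adjusted_logit(z, q, theta_3) for z, q in zip(z_row, q_row)]
    for z_row, q_row in zip(z3, q3)
]
```

Because $|\tanh(\theta_{i})| < 1$ and $Q_{i}$ lies in $(0, 1)$, the shift applied to any logit is bounded in magnitude by $\lambda_{QA}$, which is one way to read the "controlled bias" described above. The regression output is not touched by this step.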

4. Experimental Preparation

4.1. SIDD Dataset

The main experiments use the SIDD dataset for infrared small-drone detection [3]. SIDD has 4737 infrared images with a resolution of 640 × 512. The dataset includes four background types: 2151 mountain images, 1093 city images, 780 sky images, and 713 sea images. Figure 5 illustrates examples of the SIDD dataset scenes. The targets are quadrotor UAVs. These targets match low, slow, and small drone intrusion cases. The SIDD target-size distribution using the COCO object-size definition is shown in Table 1. The dataset is split 80:20 into training and test sets. Figure 6 shows the target-size distribution in SIDD for the train and test sets.

4.2. Generalization Datasets

To test adaptation beyond SIDD, Architecture II is fine-tuned on six UAV datasets. For the five RGB datasets, the input size is set to 1920, and the batch size is 8. For the additional Thermal UAV 2UAV dataset, the input size is set to 640, and the batch size is 16, following its native thermal image resolution. These datasets differ in sensor type, background, UAV shape, target distance, and imaging conditions. TIB-Net includes multiple UAV types and low-altitude backgrounds [15]. Det-Fly focuses on DJI Mavic targets in four environments [16]. UAVfly covers several geographic scenes and air-to-air detection cases [17]. LRDD v1 focuses on long-range drones with very small targets [18]. DUT Anti-UAV includes many UAV types, complex outdoor scenes, blur, camera motion, occlusion, and weather changes [19]. The Thermal UAV 2UAV dataset adds a thermal UAV-to-UAV detection setting. It contains thermal images captured by a DJI Zenmuse XT2 sensor mounted on a DJI Matrice 210 RTK. It includes four quadcopter types and two hexacopter types, making it useful for testing adaptation for infrared aerial-view UAV detection [20]. Table 2 shows the characteristics of external UAV datasets.

4.3. Implementation Details

The training configuration is summarized as follows. SIDD training used 250 epochs, an input resolution of 640, and the Ultralytics settings reported in Table 3. External fine-tuning was conducted on six UAV datasets. For the five RGB datasets, the input resolution was set to 1920 and the batch size to 8. For the IR dataset, the input resolution was set to 640 with a batch size of 16, matching its thermal image resolution and allowing a larger batch size. Unless stated otherwise, the remaining settings followed the same protocol. All experiments were performed on a single workstation equipped with an NVIDIA GeForce RTX 4090 GPU with 24 GB VRAM, an Intel Core i9-12900K CPU, and 64 GB of system RAM. Training was performed with CUDA 11.8, PyTorch 2.4.1, Windows 11, and Ultralytics 8.3.138. Architecture I and Architecture II were trained under the same protocol.

All comparisons report precision, recall, mAP50-95, F1-score, parameter count, and GFLOPs. GFLOPs are reported under the same input-resolution setting within each experiment group. To address the practical question of inference speed, frames per second (FPS) is also reported in the YOLO baseline comparison (Section 5.3). FPS values are measured on the same RTX 4090 workstation described above, at batch size 1 and an input resolution of 640, after a 50-iteration warm-up, and are computed as the reciprocal of the average per-image forward-pass time over 1000 inference runs:

$$\mathrm{FPS} = \frac{N}{\sum_{j = 1}^{N} t_{j}}, \quad N = 1000 \tag{19}$$

where $t_{j}$ is the forward-pass time of the $j$-th inference run and $N$ is the number of runs. FPS includes only the model’s forward pass; preprocessing and postprocessing times are excluded, so the comparison reflects the architectural cost of each detector. Parameter count and GFLOPs are still reported in every table because they are independent of the hardware and software stack, whereas FPS depends on the deployment environment.
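The measurement protocol behind Equation (19) can be sketched as a small timing harness. This is a generic sketch, not the authors' benchmarking script: `forward` stands for any zero-argument callable that runs one forward pass, and the helper names are hypothetical. The defaults mirror the 50-iteration warm-up and 1000 runs described above.

```python
import time


def fps_from_times(times) -> float:
    """Eq. (19): number of runs divided by the total forward-pass time in seconds."""
    return len(times) / sum(times)


def measure_fps(forward, n_runs: int = 1000, warmup: int = 50) -> float:
    """Time `forward()` over `n_runs` calls after `warmup` untimed calls."""
    for _ in range(warmup):
        forward()                      # warm-up iterations are not timed
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        forward()                      # only the forward pass is inside the timed region
        times.append(time.perf_counter() - start)
    return fps_from_times(times)


# Stand-in workload used only to exercise the harness.
demo_fps = measure_fps(lambda: sum(range(10_000)), n_runs=20, warmup=2)
print(f"{demo_fps:.1f} calls per second")
```

On a GPU, kernels are launched asynchronously, so a real measurement would typically also synchronize the device before reading the timer; the text does not state how this was handled, so it is left out of the sketch.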

4.4. Evaluation Metrics

Precision measures the proportion of predicted UAV detections that are correct. A high precision value means fewer false alarms:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{20}$$

Recall measures the proportion of ground-truth UAVs that are detected. A high recall value means fewer missed targets:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{21}$$

Here, $TP$ denotes a correctly detected UAV, $FP$ denotes a background region predicted as a UAV, and $FN$ denotes a missed UAV. The F1-score combines precision and recall into one balanced value:

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{22}$$

The mAP50-95 metric averages average precision over ten IoU thresholds, from 0.50 to 0.95:

$$\mathrm{mAP}_{50:95} = \frac{1}{10} \sum_{t \in \{0.50, 0.55, \dots, 0.95\}} \mathrm{AP}_{t} \tag{23}$$

mAP50-95 is stricter than mAP50 because it rewards accurate localization at higher overlap thresholds. GFLOPs show the number of billion floating-point operations needed for one forward pass, while parameter count describes model size and memory demand.
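Equations (20)–(23) translate directly into a few lines of code. The counts and AP values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Eq. (20): fraction of predicted UAVs that are correct."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Eq. (21): fraction of ground-truth UAVs that are detected."""
    return tp / (tp + fn)


def f1_score(p: float, r: float) -> float:
    """Eq. (22): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def map50_95(ap_values) -> float:
    """Eq. (23): mean AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    if len(ap_values) != 10:
        raise ValueError("expected one AP value per IoU threshold (10 in total)")
    return sum(ap_values) / 10


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed UAVs.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
f1 = f1_score(p, r)

# Hypothetical AP values that fall as the IoU threshold tightens.
ap = [0.98, 0.97, 0.95, 0.92, 0.88, 0.80, 0.68, 0.50, 0.28, 0.04]
print(round(p, 3), round(r, 3), round(f1, 3), round(map50_95(ap), 3))
```

The falling AP values illustrate why mAP50-95 is lower and stricter than mAP50: the high-IoU thresholds pull the average down unless localization is tight.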

5. Results and Analysis

5.1. Architecture I Development: P2 and C2fAttn

The first development phase studies two changes: adding a high-resolution P2 branch and replacing C3 neck blocks with C2fAttn. The YOLOv5-style baseline reaches 0.864 recall and 0.679 mAP50-95. After adding P2, recall rises to 0.925. This is the largest gain in Architecture I and shows that SIDD strongly benefits from stride-4 prediction. Replacing C3 with C2fAttn increases recall to 0.931 and mAP50-95 to 0.695, forming the S-Drone-YOLO model (Architecture I).

The P2 branch provides the main recall improvement because most SIDD test targets are smaller than 32 × 32 pixels. The result also shows a cost problem. The S-Drone-YOLO (Architecture I) model requires 55.6 GFLOPs, so the Architecture II design aims to maintain the recall gain while reducing the computational burden.

The role of C2fAttn deserves a clearer justification at this stage, because Table 4 shows that adding C2fAttn on top of YOLOv5+P2 raises the parameter count from 5.40 M to 8.09 M and the GFLOPs from 19.5 to 55.6, while improving recall by only 0.6 percentage points and mAP50-95 by 1.1 percentage points. Taken in isolation, this trade-off is not attractive, and the C2fAttn block could indeed be removed at this stage. C2fAttn is retained because Architecture I is not the deployment target; it serves as the foundation on which CAR-C2f and P2-QADH are added in the second stage. The richer feature aggregation produced by C2fAttn is what makes the P2 feature consumed by CAR-C2f and the P2/P3 fusion used in P2-QADH informative. Table 5 confirms this design choice: starting from Architecture I and adding CAR-C2f and P2-QADH reduces the parameter count to 6.45 M and GFLOPs to 31.3, while increasing recall to 0.939 and mAP50-95 to 0.699. The combined Architecture II therefore recovers most of the cost added by C2fAttn (43.7% GFLOPs reduction and 20.3% parameter reduction relative to Architecture I) and still delivers the best accuracy. In a configuration where the extra modules cannot be used, the C3 neck is, in fact, the preferable choice; C2fAttn is justified specifically because it pairs with the P2-guided refinement introduced in Architecture II.

5.2. Architecture II Model Performance and Component Ablation

The S-Drone-YOLO (Architecture II) model adds CAR-C2f and P2-QADH to the Architecture I model. It reaches 0.988 precision, 0.939 recall, 0.699 mAP50-95, and 0.962 F1-score with 6.45 M parameters and 31.3 GFLOPs. Compared with the Architecture I model, recall increases by 0.008, mAP50-95 increases by 0.004, and F1-score increases by 0.003. At the same time, parameters decrease by 20.3%, and GFLOPs decrease by 43.7%.

CAR-C2f improves recall and mAP50-95 while reducing GFLOPs compared with the Architecture I model. Adding P2-QADH gives the best recall, mAP50-95, F1-score, parameter count, and GFLOPs among the final-stage variants. Precision decreases slightly compared with the Architecture I model, but the Architecture II model detects more UAV targets and achieves the best-balanced F1-score. Figure 7 shows the incremental effect of the architectural changes on recall and mAP50-95. For this task, the recall gain is important because a missed UAV is usually more serious than a small increase in false alarms.

5.3. Comparison with Recent YOLO Baselines

The Architecture II model is compared with recent YOLO baselines trained and evaluated under the same SIDD protocol, as shown in Figure 8. To match the parameter scale of S-Drone-YOLO (6.45 M parameters), the small (s) variant of each YOLO family is used: YOLOv8s, YOLO11s, YOLO12s, and YOLO26s [21,22,23,24]. The s variant is the closest scale to the proposed model and is the standard small-edge configuration in the Ultralytics ecosystem. The nano (n) variants were excluded because they are significantly smaller (about 3 M parameters) and would not provide a fair head-to-head comparison; the medium (m) and large (l) variants were excluded because they are substantially heavier than S-Drone-YOLO and would bias the comparison in the opposite direction.

S-Drone-YOLO obtains the highest recall, mAP50-95, and F1-score among the compared models. Recall improves by 6.8 percentage points over YOLOv8s, 6.7 points over YOLO11s, 8.0 points over YOLO12s, and 2.1 points over YOLO26s. The proposed model also has the smallest parameter count. YOLO26s uses the fewest GFLOPs (20.5) but achieves an mAP50-95 of 0.659, compared with 0.699 for S-Drone-YOLO. This result suggests that the extra computation is spent in a useful part of the network, mainly the high-resolution pathway needed for small UAVs.

Inference speed (FPS), measured on the RTX 4090 workstation at batch size 1 and input resolution 640, is reported in the last column of Table 6. FPS does not follow GFLOPs monotonically. YOLOv8s reaches the highest throughput despite higher GFLOPs because its convolutional blocks map efficiently to GPU kernels. In comparison, the attention-based YOLO12s and YOLO26s show lower FPS at lower GFLOPs due to memory-bound operations. S-Drone-YOLO runs at 232 FPS, the lowest throughput in Table 6 but still well above real-time, with its cost concentrated in the high-resolution P2 pathway. Figure 9 shows the accuracy-compute trade-off between S-Drone-YOLO and the YOLO baselines.
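The recall gaps and the non-monotonic relation between GFLOPs and FPS can be reproduced directly from Table 6. The sketch below hard-codes the table values; the variable names are illustrative only.

```python
# Recall, GFLOPs, and FPS copied from Table 6.
table6 = {
    "S-Drone-YOLO": {"recall": 0.939, "gflops": 31.3, "fps": 232},
    "YOLOv8s": {"recall": 0.871, "gflops": 28.4, "fps": 476},
    "YOLO11s": {"recall": 0.872, "gflops": 21.3, "fps": 360},
    "YOLO12s": {"recall": 0.859, "gflops": 21.2, "fps": 237},
    "YOLO26s": {"recall": 0.918, "gflops": 20.5, "fps": 245},
}

ours = table6["S-Drone-YOLO"]

# Recall gain of S-Drone-YOLO over each baseline, in percentage points.
recall_gain_pp = {
    name: round((ours["recall"] - row["recall"]) * 100, 1)
    for name, row in table6.items()
    if name != "S-Drone-YOLO"
}

# Order the models by increasing GFLOPs and test whether FPS only decreases.
by_gflops = sorted(table6.values(), key=lambda row: row["gflops"])
fps_sequence = [row["fps"] for row in by_gflops]
fps_monotonic = all(a >= b for a, b in zip(fps_sequence, fps_sequence[1:]))

print(recall_gain_pp)
print(fps_sequence, fps_monotonic)
```

The monotonicity check fails for these values, which matches the observation that throughput depends on how the operations map to the GPU and not only on the operation count.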

5.4. Performance by Target

The model is also evaluated using the COCO object-size definition. The SIDD test set lacks large UAV instances, so the benchmark primarily measures small and medium targets. On medium targets, the Architecture II model reaches 0.993 recall and 0.986 precision. On small targets below 32 × 32 pixels, it keeps 0.936 recall and 0.965 precision across 801 targets, as shown in Table 7. This result is important because small targets represent 84.40% of the test set.
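The size split used here can be sketched as a small helper. The thresholds follow the bucket boundaries listed in Table 1; the function name `coco_size_bucket` and the choice to bucket by box area (width times height) are assumptions made for illustration.

```python
def coco_size_bucket(width: float, height: float) -> str:
    """Assign a box to a size bucket using the area thresholds from Table 1."""
    area = width * height
    if area < 32 * 32:
        return "small"
    if area <= 96 * 96:
        return "medium"
    return "large"


# Test-set counts copied from Table 1.
test_counts = {"small": 801, "medium": 148, "large": 0}
total_targets = sum(test_counts.values())
small_share = test_counts["small"] / total_targets * 100.0

print(coco_size_bucket(10, 12), coco_size_bucket(40, 40))
print(f"{total_targets} targets, {small_share:.2f}% small")
```

The share computed from the counts agrees with the 84.40% reported for small targets.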

5.5. Fine-Tuning Generalization on External UAV Datasets

Fine-tuning is performed on five external UAV datasets using an input resolution of 1920 and a batch size of 8, plus a sixth thermal infrared dataset, fine-tuned at an input resolution of 640 and a batch size of 16. This experiment is not a zero-shot test. Its purpose is to examine whether the architecture adapts well after task-specific fine-tuning. The selected datasets include different sensors, backgrounds, UAV appearances, and target distances.
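The two fine-tuning regimes can be summarized as a pair of settings dictionaries. This is a sketch under stated assumptions: the key names mirror common Ultralytics training-argument names, the helper `finetune_settings` is hypothetical, and applying the Table 3 optimizer settings unchanged to the thermal dataset is assumed and not stated explicitly in the text.

```python
# Shared optimizer and schedule settings, copied from Table 3.
BASE_SETTINGS = {
    "epochs": 250,
    "optimizer": "SGD",
    "lr0": 0.01,
    "lrf": 0.01,
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "pretrained": True,
    "seed": 0,
    "deterministic": True,
    "amp": True,
    "patience": 600,
}

# Resolution and batch size differ between the RGB and thermal datasets.
MODALITY_OVERRIDES = {
    "rgb": {"imgsz": 1920, "batch": 8},
    "ir": {"imgsz": 640, "batch": 16},
}


def finetune_settings(modality: str) -> dict:
    """Return the assumed fine-tuning settings for 'rgb' or 'ir' datasets."""
    if modality not in MODALITY_OVERRIDES:
        raise ValueError(f"unknown modality: {modality!r}")
    return {**BASE_SETTINGS, **MODALITY_OVERRIDES[modality]}


rgb_settings = finetune_settings("rgb")
ir_settings = finetune_settings("ir")
print(rgb_settings["imgsz"], rgb_settings["batch"])
print(ir_settings["imgsz"], ir_settings["batch"])
```

Such a dictionary could be unpacked into a training call, but the exact call used by the authors is not given in the text.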

The Architecture II model performs well on all five RGB datasets. UAVfly achieves near-saturation in precision, recall, and F1-score. LRDD v1 and Det-Fly also show strong mAP50-95 under varied real-world conditions. TIB-Net has a lower mAP50-95 (0.431), but it still reaches 0.960 recall and 0.941 F1-score. This means the detector retrieves most targets, while localization is less stable under strict IoU thresholds. Very small boxes, annotation variation, and background clutter may explain this pattern.

The sixth dataset in Table 8, ThermalUAV2UAV [20], is a thermal infrared dataset and directly tests the model on a second infrared source beyond SIDD. It contains 3856 thermal images of four quadcopter types and two hexacopter types, acquired with a thermal sensor mounted on board a UAV, so the targets appear from many viewing angles. The model was fine-tuned on this dataset at an input resolution of 640 and a batch size of 16. It reaches 0.934 precision, 0.970 recall, 0.843 mAP50-95, and 0.951 F1-score on the 476-image test split. The mAP50-95 of 0.843 is the second highest among the six datasets in Table 8, behind only UAVfly (0.897), which indicates that the P2-guided design transfers well to a different thermal sensor and to multi-rotor types not present in SIDD. This result complements the SIDD evaluation and suggests that the architecture adapts to two independent infrared sources.

Five of the six fine-tuning datasets are RGB by design: SIDD and ThermalUAV2UAV are dedicated infrared benchmarks, while the five RGB datasets test whether an architecture designed around infrared characteristics still adapts when the modality changes. Public infrared UAV datasets remain limited, which is why a larger multi-source infrared cross-dataset evaluation is left as future work.

5.6. Per-Background Analysis on SIDD

The SIDD dataset includes four background types: sky, sea, city, and mountain. These scenes do not produce the same errors. Sky images usually have a simpler background. Sea and mountain images are harder because they may contain small thermal patterns or texture details that resemble UAVs. Table 9 reports the model performance for each background category. The F1-score is calculated from precision and recall using the standard F1 equation.
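For reference, the standard F1 equation is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The minimal sketch below applies it to the precision and recall values in Table 9; the function and variable names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from Table 9.
table9 = {
    "Sky": (1.000, 1.000),
    "Sea": (0.992, 0.993),
    "City": (0.995, 0.963),
    "Mountain": (0.977, 0.886),
}

f1_by_background = {
    name: round(f1_score(p, r), 3) for name, (p, r) in table9.items()
}
print(f1_by_background)
```

Because the harmonic mean is pulled toward the smaller of the two values, the lower recall in mountain scenes dominates that category's F1-score.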

As Figure 10 and Table 9 show, the model gives its best results in sky scenes, where both precision and recall reach 1.000. City scenes also perform well, with an F1-score of 0.979 and an mAP50-95 of 0.797. Sea scenes keep a high F1-score of 0.992, but their mAP50-95 decreases to 0.615. This means that most UAVs are correctly detected, but precise box localization becomes harder as stricter IoU thresholds are used. Mountain scenes are the hardest case. The model reaches 0.886 recall and 0.929 F1-score in this background. This result is reasonable because mountain scenes often contain many small bright and dark structures. These structures can hide weak thermal UAV signals or make them harder to separate from the background.

5.7. Confusion Matrix and Error Analysis

The confusion matrix in Figure 11 gives a threshold-level view of the remaining errors. It reports 897 true-positive UAV detections, 29 false positives, and 52 false negatives. Based on these counts, precision is 0.969, recall is 0.945, and F1-score is 0.957 at the selected validation threshold (Table 10). These values are useful for checking the error pattern at one operating point, but they should be read carefully alongside Table 6, because Table 6 and Table 10 report different operating points and therefore different precision, recall, and F1-score values.

Table 6 reports each model at the confidence threshold that maximizes the F1-score across the full precision-recall curve, which is the standard Ultralytics validation behavior for cross-model comparison. Table 10 instead fixes the threshold at the Ultralytics default of conf = 0.001 used by the confusion-matrix routine, so that low-confidence predictions are also counted. The lower threshold admits more true positives (897 vs. approximately 891, as implied by the 0.939 recall in Table 6 over the 949 test targets), but also more false positives, which is why precision drops from 0.988 to 0.969 while recall rises from 0.939 to 0.945. Both views are consistent: they describe the same model at two different operating points on the same precision-recall curve. mAP50-95 does not depend on a single confidence threshold and should be used as the primary accuracy criterion.
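The values derived from the confusion matrix follow from the three counts alone. The sketch below recomputes them from the counts in Table 10; the helper name `detection_metrics` is illustrative.

```python
from typing import Tuple


def detection_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) computed from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Counts copied from Table 10.
precision, recall, f1 = detection_metrics(tp=897, fp=29, fn=52)
ground_truth_targets = 897 + 52  # true positives plus missed targets

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"ground-truth targets: {ground_truth_targets}")
```

The count of 949 ground-truth targets matches the sum of the small and medium test targets in Table 1.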

The remaining false negatives are important in surveillance applications. Missing a UAV may be more serious than using a moderate amount of extra computation. The background analysis shows that mountain scenes are the hardest case for the model. This supports the decision to keep P2 features and refine them with coordinate-aware attention. The false positives are likely due to small thermal patterns that resemble drones. These may include building edges, waves, cloud boundaries, or textured terrain.

5.8. Visual Detection Analysis

A visual comparison is provided between S-Drone-YOLO and YOLO26, as presented in Figure 12. YOLO26 is used here because it has the strongest recall among the selected YOLO baselines. The examples show that S-Drone-YOLO consistently detects small UAVs in complex infrared backgrounds. It also keeps the bounding boxes confident and stable when the target occupies only a very small area in the image.

6. Discussion

6.1. Effect of the P2 Pathway

The ablation results show that the P2 branch is the primary driver of the recall gain. When P2 is added, recall increases from 0.864 to 0.925. This result fits the target distribution in SIDD, where most test objects are smaller than the COCO small-object threshold. Adding P2 increases computation, but the Architecture II model controls much of this cost through the revised detection head and the P2-specific refinement.

6.2. Role of CAR-C2f and P2-QADH

CAR-C2f improves the P2 branch by combining coordinate attention with residual scaling. This helps the model handle small infrared UAVs that can appear similar to clouds, waves, skyline edges, or building boundaries. The residual attention scale also makes the block safer to train. It lets the block begin with stable behavior, then gradually learn how much attention to allocate to the feature map.
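The stabilizing effect of a residual attention scale can be illustrated with a toy example on plain numbers. This is only a sketch of one common gating pattern, not the CAR-C2f formulation, which is defined in Section 3.3 and operates on feature maps with coordinate attention. The function `residual_attention`, the blending form, and the idea of starting from a scale of zero are all assumptions made for illustration.

```python
from typing import List


def residual_attention(features: List[float], attention: List[float], scale: float) -> List[float]:
    """Blend each feature x with its attended version a * x.

    scale = 0.0 returns the input unchanged; scale = 1.0 applies the
    attention weights fully. The blending form is an illustrative assumption.
    """
    return [x + scale * (a * x - x) for x, a in zip(features, attention)]


features = [1.0, 2.0]
attention = [0.5, 0.25]

at_start = residual_attention(features, attention, scale=0.0)
halfway = residual_attention(features, attention, scale=0.5)
full = residual_attention(features, attention, scale=1.0)
print(at_start, halfway, full)
```

With a scale of zero the block behaves as an identity mapping, so early training is not disturbed by untrained attention weights; as the scale grows, the attended features contribute more.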

P2-QADH links the fine local details in P2 with nearby semantic information from P3. Its quality map adjusts only the classification logits. It does not change the box regression part. This allows the classifier to benefit from high-resolution small-target cues while maintaining stable localization and preserving the standard YOLO output format. The design is also more controlled than adding large attention modules to all detection scales.
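The separation between classification and regression can be pictured with a toy prediction record. The sketch below is not the P2-QADH implementation: the additive adjustment, the `weight` parameter, and the names `Prediction` and `apply_quality` are hypothetical, and the actual head is defined in Section 3.4. It only illustrates that the class logit changes while the box values and the output layout stay the same.

```python
from typing import NamedTuple, Tuple


class Prediction(NamedTuple):
    """One head output: box regression values plus a class logit."""
    box: Tuple[float, float, float, float]
    cls_logit: float


def apply_quality(pred: Prediction, quality_logit: float, weight: float = 1.0) -> Prediction:
    """Shift only the class logit; the box values pass through unchanged.

    The additive form and the weight are assumptions made for illustration.
    """
    return Prediction(box=pred.box, cls_logit=pred.cls_logit + weight * quality_logit)


raw = Prediction(box=(12.0, 20.0, 6.0, 5.0), cls_logit=0.5)
adjusted = apply_quality(raw, quality_logit=-1.5)
print(raw)
print(adjusted)
```

Because the returned record has the same fields as the input, downstream steps such as loss computation and post-processing would not need to change under this kind of adjustment.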

6.3. Accuracy-Efficiency Trade-Off

Compared with the S-Drone-YOLO (Architecture I) model, the Architecture II model lowers GFLOPs from 55.6 to 31.3. At the same time, it improves recall and mAP50-95. Compared with YOLO26, the Architecture II model requires more GFLOPs but achieves better accuracy. It improves mAP50-95 by 4.0 percentage points and recall by 2.1 percentage points. This trade-off is reasonable for infrared small-UAV detection, because missed drones are a serious problem in surveillance tasks.

The term parameter-efficient is used carefully in this paper. The Architecture II model has the lowest parameter count among the compared models. However, it does not have the lowest GFLOPs. For this reason, it is more accurate to describe the model as parameter-efficient, rather than the lightest model in all cost measures.
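The distinction between the two cost measures can be checked against Table 6. The sketch below hard-codes the parameter counts and GFLOPs from that table; the variable names are illustrative.

```python
# (parameters in millions, GFLOPs) copied from Table 6.
table6_cost = {
    "S-Drone-YOLO (Architecture II)": (6.45, 31.3),
    "YOLOv8s": (11.1, 28.4),
    "YOLO11s": (9.4, 21.3),
    "YOLO12s": (9.2, 21.2),
    "YOLO26s": (9.4, 20.5),
}

lowest_params = min(table6_cost, key=lambda name: table6_cost[name][0])
lowest_gflops = min(table6_cost, key=lambda name: table6_cost[name][1])

print("fewest parameters:", lowest_params)
print("fewest GFLOPs:", lowest_gflops)
```

The two minima fall on different models, which is why the term parameter-efficient is preferred over a broader claim about overall lightness.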

6.4. Limitations

This study has four main limitations. First, SIDD contains single-frame infrared images, so the model cannot use motion information from video sequences. Second, the external dataset experiments are based on fine-tuning. They should not be treated as strict zero-shot generalization tests. Third, the study reports GFLOPs, parameter counts, and FPS on a workstation GPU, but real deployment still requires latency and energy tests on edge devices, such as the NVIDIA Jetson. Fourth, the model is mainly tested for one-class UAV detection. Multi-class aerial-object detection may need further study, especially for class imbalance and confusion between similar object classes.

7. Conclusions

This paper presented S-Drone-YOLO, a parameter-efficient YOLO detector for infrared small-UAV detection. The model uses P2-guided quality-aware detection to improve small-target sensitivity. The main improvement comes from using the P2 pathway as a controlled source of small-target information, rather than just an extra prediction branch. The Architecture II model combines four parts: a high-resolution P2 path, a C2fAttn-based neck, CAR-C2f for coordinate-aware P2 refinement, and P2-QADH for quality-aware classification-logit adjustment. On SIDD, the model achieves 0.988 precision, 0.939 recall, 0.699 mAP50-95, and 0.962 F1-score. It uses 6.45 M parameters and 31.3 GFLOPs. Compared with the Architecture I model, the Architecture II model reduces both parameters and GFLOPs while improving recall and mAP50-95. The target-size analysis shows strong performance on small objects. The background analysis also shows that mountain scenes are still the most difficult case. Fine-tuning on five RGB UAV datasets and a second thermal dataset shows that the model adapts well to other UAV datasets and to a second independent infrared source. Overall, the results show that controlled high-resolution feature guidance is useful for detecting small UAVs in complex infrared scenes.

Author Contributions

Conceptualization, S.P. and A.A.; methodology, S.P.; software, A.A.; validation, S.P., A.A.; formal analysis, A.A.; investigation, A.A.; resources, A.A.; data curation, A.A.; writing—original draft preparation, A.A.; writing—review and editing, A.A.; visualization, A.A.; supervision, S.P.; project administration, S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The LRDD v1 dataset is available at https://research.coe.drexel.edu/shared/ece/imaple/LRDD (accessed on 15 December 2025). The SIDD dataset is available at https://github.com/Dang-zy/SIDD.git (accessed on 5 January 2026). The Det-Fly dataset is available at https://github.com/Jake-WU/Det-Fly (accessed on 19 February 2026). The TIB-Net dataset is available at https://github.com/kyn0v/TIB-Net.git (accessed on 25 February 2026). The DUT Anti-UAV dataset is available at https://github.com/wangdongdut/DUT-Anti-UAV (accessed on 26 February 2026). The UAVfly dataset is available at https://github.com/lucien22588/UAVfly.git (accessed on 27 February 2026). The ThermalUAV2UAV dataset is available at https://github.com/GabryV00/ThermalUAV2UAV_Dataset.git (accessed on 27 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

YOLO	You Only Look Once
UAV	Unmanned Aerial Vehicle
SIDD	Single-frame Infrared Small-Drone Detection Dataset
RGB	Red, Green, and Blue
COCO	Common Objects in Context
IoU	Intersection over Union
P2	Pyramid level 2
P3	Pyramid level 3
P4	Pyramid level 4
P5	Pyramid level 5
C3	CSP Bottleneck with Three Convolutions
C2f	Cross-Stage Partial Bottleneck with Two Convolutions and Feature Fusion
C2fAttn	C2f with Attention
AP	Average Precision
mAP	Mean Average Precision
CAR-C2f	Coordinate-Aware Residual C2f Block
GFLOPs	Giga Floating-Point Operations
P2-QADH	P2-Guided Quality-Aware Detection Head
PANet	Path Aggregation Network
QA	Quality-Aware
SOD	Small Object Detection
CSPNet	Cross-Stage Partial Network
SSD	Single Shot MultiBox Detector
CNN	Convolutional Neural Network
mAP50	Mean Average Precision at IoU 0.50
mAP50-95	Mean Average Precision from IoU 0.50 to 0.95
TP	True Positive
FP	False Positive
FN	False Negative
Params	Parameters
FPS	Frames Per Second
SGD	Stochastic Gradient Descent
LR	Learning Rate
AMP	Automatic Mixed Precision
CUDA	Compute Unified Device Architecture

References

1. Aldubaikhi, A.; Patel, S. Advancements in Small-Object Detection (2023–2025): Approaches, Datasets, Benchmarks, Applications, and Practical Guidance. Appl. Sci. 2025, 15, 11882.
2. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
3. Yuan, S.; Sun, B.; Zuo, Z.; Huang, H.; Wu, P.; Li, C.; Dang, Z.; Zhao, Z. IRSDD-YOLOv5: Focusing on the Infrared Detection of Small Drones. Drones 2023, 7, 393.
4. Zhang, Q.; Wang, X.; Shi, H.; Wang, K.; Tian, Y.; Xu, Z.; Zhang, Y.; Jia, G. BRA-YOLOv10: UAV Small Target Detection Based on YOLOv10. Drones 2025, 9, 159.
5. Zhai, X.; Huang, Z.; Li, T.; Liu, H.; Wang, S. YOLO-Drone: An Optimized YOLOv8 Network for Tiny UAV Object Detection. Electronics 2023, 12, 3664.
6. Zamri, F.N.M.; Gunawan, T.S.; Yusoff, S.H.; Alzahrani, A.A.; Bramantoro, A.; Kartiwi, M. Enhanced Small Drone Detection Using Optimized YOLOv8 with Attention Mechanisms. IEEE Access 2024, 12, 90629–90643.
7. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
8. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
9. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
10. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
11. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
12. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 13713–13722.
13. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
14. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475.
15. Sun, H.; Yang, J.; Shen, J.; Liang, D.; Li, N.-Z.; Zhou, H. TIB-Net: Drone Detection Network with Tiny Iterative Backbone. IEEE Access 2020, 8, 130697–130707.
16. Zheng, Y.; Chen, Z.; Lv, D.; Li, Z.; Lan, Z.; Zhao, S. Air-to-Air Visual Detection of Micro-UAVs: An Experimental Evaluation of Deep Learning. IEEE Robot. Autom. Lett. 2021, 6, 1020–1027.
17. Cheng, Q.; Wang, Y.; He, W.; Bai, Y. Lightweight Air-to-Air Unmanned Aerial Vehicle Target Detection Model. Sci. Rep. 2024, 14, 2609.
18. Rouhi, A.; Umare, H.; Patal, S.; Kapoor, R.; Deshpande, N.; Arezoomandan, S.; Shah, P.; Han, D.K. Long-Range Drone Detection Dataset. In Proceedings of the 2024 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 6–8 January 2024; pp. 1–6.
19. Zhao, J.; Zhang, J.; Li, D.; Wang, D. Vision-Based Anti-UAV Detection and Tracking. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25323–25334.
20. Foresti, G.L.; Scagnetto, I.; Tavaris, D.; Voltan, G. Thermal UAV 2UAV Dataset for Training a Counter UAV System: A Strategic Challenge in Civil and Military Domain. Strateg. Leadersh. J. 2024, 2, 59–67.
21. Ultralytics. YOLOv8 Official Model Card and Detection Model Table. Hugging Face. 2026. Available online: https://huggingface.co/Ultralytics/YOLOv8 (accessed on 2 May 2026).
22. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
23. Tian, Y.; Ye, Q.; Doermann, D. YOLO12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524.
24. Ultralytics. YOLO26 Official Model Card and Detection Model Table. Hugging Face. 2026. Available online: https://huggingface.co/Ultralytics/YOLO26 (accessed on 2 May 2026).

Figure 1. Overall architecture of the S-Drone-YOLO model (Architecture II).

Figure 2. Conceptual comparison between C3 and C2fAttn.

Figure 3. Coordinate-Aware Residual C2f Block (CAR-C2f).

Figure 4. P2-Guided Quality-Aware Detection Head (P2-QADH).

Figure 5. Examples from the SIDD dataset scenes. Drone targets are marked with a red square.

Figure 6. Target-size distribution in SIDD for train and test sets.

Figure 7. Incremental effect of the architectural changes on recall and mAP50-95.

Figure 8. Detection accuracy comparison between S-Drone-YOLO and different YOLO baselines on SIDD.

Figure 9. Accuracy-compute trade-off.

Figure 10. Precision, recall, and F1-score by SIDD background category.

Figure 11. Confusion matrix of the S-Drone-YOLO (Architecture II) model on the SIDD test set.

Figure 12. Visual comparison between S-Drone-YOLO and YOLO26 on selected SIDD images.

Table 1. SIDD target-size distribution using the COCO object-size definition.

Object Size	Train Count	Train %	Test Count	Test %
Large, >96 × 96	0	0	0	0
Medium, 32 × 32 to 96 × 96	603	15.92	148	15.60
Small, <32 × 32	3185	84.08	801	84.40

Table 2. External UAV datasets used for fine-tuning-based generalization evaluation.

Dataset	Modality	Images	Key Characteristics	Reference
TIB-Net	RGB	2850	Multi-rotor and fixed-wing UAVs in low-altitude scenes	[15]
Det-Fly	RGB	13,271	DJI Mavic targets in the sky, city, field, and mountain backgrounds	[16]
UAVfly	RGB	10,281	Urban, suburban, desert, field, lake, sky, and mountain scenes	[17]
LRDD v1	RGB	21,190	Long-range drones with weather, scale, and background-blending challenges	[18]
DUT Anti-UAV	RGB	10,000	More than 35 UAV types, complex outdoor backgrounds, and tracking sequences	[19]
ThermalUAV2UAV	IR	3856	UAV-to-UAV thermal images captured from an onboard UAV sensor; includes four quadcopters and two hexacopters with single-UAV, two-UAV, and background images	[20]

Table 3. Main training and fine-tuning settings.

Item	SIDD Experiments	External Fine-Tuning
Task	detect	detect
Mode	train	fine-tune
Epochs	250	250
Image size	640	1920
Batch size	8	8
Optimizer	SGD	SGD
Initial learning rate	0.01	0.01
Final LR factor	0.01	0.01
Momentum	0.937	0.937
Weight decay	0.0005	0.0005
Pretrained weights	True	True
Seed	0	0
Deterministic	True	True
AMP	True	True
Patience	600	600
Validation IoU	0.7	0.7
Max detections	300	300
Augmentation	hsv_h = 0.015, hsv_s = 0.7, hsv_v = 0.4, translate = 0.1, scale = 0.5, fliplr = 0.5, mosaic = 1.0, mixup = 0.0, auto_augment = randaugment, erasing = 0.4	hsv_h = 0.015, hsv_s = 0.7, hsv_v = 0.4, translate = 0.1, scale = 0.5, fliplr = 0.5, mosaic = 1.0, mixup = 0.0, auto_augment = randaugment, erasing = 0.4

Table 4. Architecture I ablation of the S-Drone-YOLO model.

Model	Precision	Recall	mAP50-95	F1-Score	Params (M)	GFLOPs
YOLOv5 baseline	0.972	0.864	0.679	0.915	5.27	7.7
YOLOv5 + P2	0.983	0.925	0.684	0.954	5.40	19.5
YOLOv5 + P2 + C2fAttn	0.989	0.931	0.695	0.959	8.09	55.6

Table 5. Final-stage ablation of CAR-C2f and P2-QADH.

Model	Precision	Recall	mAP50-95	F1-Score	Params (M)	GFLOPs
Architecture I model	0.989	0.931	0.695	0.959	8.09	55.6
Architecture I + CAR-C2f	0.986	0.935	0.697	0.959	7.95	50.3
Architecture I + CAR-C2f + P2-QADH	0.988	0.939	0.699	0.962	6.45	31.3

Table 6. Comparison with different YOLO baselines on SIDD.

Model	Precision	Recall	mAP50-95	F1-Score	Params (M)	GFLOPs	FPS
S-Drone-YOLO (Architecture II)	0.988	0.939	0.699	0.962	6.45	31.3	232
YOLOv8s	0.982	0.871	0.696	0.923	11.1	28.4	476
YOLO11s	0.985	0.872	0.687	0.925	9.4	21.3	360
YOLO12s	0.979	0.859	0.683	0.915	9.2	21.2	237
YOLO26s	0.988	0.918	0.659	0.952	9.4	20.5	245

Table 7. S-Drone-YOLO (Architecture II) performance by object size on the SIDD test set.

Object Size	Test Count	Recall	Precision
Large, >96 × 96	0	-	-
Medium, 32 × 32 to 96 × 96	148	0.993	0.986
Small, <32 × 32	801	0.936	0.965

Table 8. Fine-tuning results of the Architecture II model on six external UAV datasets.

Dataset	Modality	Images	Precision	Recall	mAP50-95	F1-Score
TIB-Net	RGB	2850	0.923	0.960	0.431	0.941
UAVfly	RGB	10,281	1.000	0.999	0.897	0.999
LRDD v1	RGB	21,190	0.980	0.957	0.754	0.968
DUT Anti-UAV	RGB	10,000	0.974	0.925	0.710	0.948
Det-Fly	RGB	13,271	0.985	0.970	0.766	0.977
ThermalUAV2UAV	IR	3856	0.934	0.970	0.843	0.951

Table 9. S-Drone-YOLO performance by the SIDD background category.

Background	Precision	Recall	mAP50-95	F1-Score
Sky	1.000	1.000	0.837	1.000
Sea	0.992	0.993	0.615	0.992
City	0.995	0.963	0.797	0.979
Mountain	0.977	0.886	0.611	0.929

Table 10. Confusion-matrix-based error summary at the selected validation threshold.

Measure	Count or Value
True positives (TP)	897
False positives (FP)	29
False negatives (FN)	52
Precision from the matrix	0.969
Recall from the matrix	0.945
F1-score from the matrix	0.957

