Article

SSCW-YOLO: A Lightweight and High-Precision Model for Small Object Detection in UAV Scenarios

by Zhuolun He 1,†, Rui She 1,†, Bo Tan 2, Jiajian Li 2 and Xiaolong Lei 1,*
1 College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Ya’an 625014, China
2 College of Information Engineering, Sichuan Agricultural University, Ya’an 625014, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Drones 2026, 10(1), 41; https://doi.org/10.3390/drones10010041
Submission received: 27 November 2025 / Revised: 28 December 2025 / Accepted: 30 December 2025 / Published: 7 January 2026

Highlights

What are the main findings?
  • We propose SCoConv and C2f_ScConv, two lightweight modules that enhance spatial features and suppress redundancy in YOLOv8 for UAV-based small object detection.
  • Replacing CIoU with WIoU loss reduces missed detections. The full model achieves 37.8 percent mAP50 on VisDrone, 5.4 points higher than YOLOv8n, while running at 115.1 FPS.
What are the implications of the main findings?
  • The method shows strong cross-domain generalization with 98.7 percent mAP50 on SSDD, supporting real-world UAV tasks such as maritime surveillance and traffic monitoring.
  • With only 2.61 million parameters and real-time speed, the model is well suited for deployment on resource-constrained edge drones.

Abstract

To address the problems of missed and false detections caused by insufficient feature quality in small object detection from UAV perspectives, this paper proposes a UAV small object detection algorithm based on YOLOv8 feature optimization. A spatial cosine convolution module is introduced into the backbone network to optimize spatial features, thereby alleviating the problem of small object feature loss and improving the detection accuracy and speed of the model. An improved C2f_ScConv feature fusion module is employed for feature integration, which effectively reduces feature redundancy in spatial and channel dimensions, thereby lowering model complexity and computational cost. Meanwhile, the WIoU loss function is used to replace the original CIoU loss function, reducing the interference of geometric factors in anchor box regression, enabling the model to focus more on low-quality anchor boxes, and enhancing its small object detection capability. Ablation and comparative experiments on the VisDrone dataset validate the effectiveness of the proposed algorithm for small object detection from UAV perspectives, while generalization experiments on the DOTA and SSDD datasets demonstrate that the algorithm possesses strong generalization performance.

1. Introduction

Over the past decade, UAV technology has gradually matured and has been widely applied in various fields such as industry [1], agriculture [2], intelligent applications [3,4], and the military [5] owing to its portability and efficiency. As a key technology in UAV missions, object detection faces severe challenges in small object detection. Since small objects usually have indistinct features, are easily occluded, and are difficult to distinguish in complex backgrounds, the detection accuracy decreases [6]. In addition, the flight altitude limitations of UAVs cause images to contain multiple overlapping small objects, further increasing the difficulty of recognition [7]. Although advances in convolutional neural networks and deep learning technologies have provided new opportunities for small object detection [8,9], improving algorithm performance remains crucial. Therefore, designing efficient small object detection algorithms is essential for enhancing UAV applications across various fields [10].
Currently, deep learning-based object detection algorithms can be categorized into two types: one-stage and two-stage methods. In the domain of Transformer-based small object detection algorithms, the Detection Transformer (DETR) [11] pioneered the application of Transformer architectures to object detection tasks, eliminating the need for hand-designed components like anchor boxes and non-maximum suppression. However, DETR suffers from slow convergence and high computational complexity [12]. To address these issues, Deformable DETR [13] introduced deformable attention modules that focus on sparse sampling points, significantly improving convergence speed and computational efficiency. DAB-DETR [14] further advanced the framework by formulating object queries as dynamic anchor boxes, enhancing localization accuracy. More recently, DINO-DETR [15] incorporated a contrastive denoising training strategy and a mixed query selection method, achieving state-of-the-art performance on various benchmarks. For real-time applications, RT-DETR [16] was proposed with an efficient hybrid encoder and an uncertainty-minimal query selection method, achieving an optimal balance between accuracy and inference speed. These Transformer-based approaches excel in small object detection due to their global attention mechanisms that effectively capture long-range dependencies and contextual information, which are crucial for identifying small objects in complex backgrounds [17]. Recent improvements in RT-DETR have focused on UAV scenario adaptation: HAS-DETR [18] integrated hierarchical attention fusion to enhance small object feature extraction in complex scenes; Enhanced RT-DETR [19] adopted multi-scale feature aggregation to improve detection in cluttered aerial environments; FD-ViT-RT-DETR [20] combined frequency domain decoupling and vision transformer to separate target features from background interference, achieving excellent performance in air–sea UAV detection. Most recently, GM-DETR [21] proposed a novel DETR-based framework for infrared small UAV swarm detection, addressing weak features and dense distribution challenges through feature fusion optimization and long-term memory mechanisms.
In the domain of CNN-based small object detection algorithms, significant research has focused on real-time detection architectures suitable for UAV deployment. After a comprehensive analysis of contemporary detectors, this study selects YOLOv8, released by Ultralytics [22], as the baseline model over other competitive algorithms such as YOLOv11 and RT-DETR. The selection is motivated by YOLOv8’s superior balance between accuracy and computational efficiency for UAV applications: experimental results demonstrate that YOLOv8 achieves 32.4% mAP50 at 101.5 FPS on the VisDrone dataset, outperforming YOLOv11 in the speed-accuracy trade-off while maintaining lower computational complexity than RT-DETR [16]. In addition, YOLOv8 provides real-time detection capabilities while maintaining high accuracy, and its modular design offers good compatibility for integrating the enhancements proposed in this work.
Currently, numerous researchers have proposed various solutions for small object detection from UAV perspectives. Wang et al. [23] introduced a novel anchor-free driving scene detection network based on YOLOv8. To accurately detect small objects, they employed a feature pyramid based on the Bi-directional Feature Pyramid Network (BiFPN) to fuse features of different scales and applied structural reparameterization in the backbone to transform diversified branch blocks. Their method achieved good performance on the large-scale small object detection dataset (SODA-A), but its detection effectiveness on other datasets still needs improvement. Ling et al. [24] proposed a marine small object detection algorithm for UAV aerial images based on improved YOLOv8, which targets the characteristic challenges of UAV maritime target detection. This method was verified to have reliable performance in relevant scenarios. Wang et al. [25] applied YOLOv6 (combined with CBAM attention module and CIoU) to forest fire and smoke detection, aiming to enhance the recognition of small flame and smoke targets in UAV views. Tang et al. [26] conducted a comprehensive survey on deep learning-based UAV object detection, systematically sorting out the current progress and bottlenecks of small object detection in UAV scenarios. Mohsan et al. [27] analyzed the practical aspects, open challenges and future trends of UAV technology, which provides a reference for the optimization of UAV small object detection algorithms under resource-constrained conditions. Varghese and Sambath [28] presented the original YOLOv8 algorithm, which builds on the advancements of previous YOLO iterations (from YOLOv1 to YOLOv7) and incorporates innovations like attention mechanisms and dynamic convolution. Specifically tailored for small object detection, this algorithm addresses the limitations of YOLOv7 while balancing detection accuracy and computational efficiency, and its performance has been validated on multiple benchmarks. To address the challenge of balancing detection accuracy and computational cost for UAV small objects in complex scenes, Bao [29] proposed a UAV target detection algorithm based on improved YOLOv8. This method is optimized for UAV-based small target scenarios, and its effectiveness has been validated in the experimental results of the International Conference on Image Processing, Machine Learning and Pattern Recognition.
The aforementioned studies demonstrate a concentrated effort on enhancing YOLOv8 for UAV tasks. However, a comparative analysis reveals a prevalent trend of relying on integrating existing lightweight modules or established attention mechanisms such as CA and CBAM. These approaches often yield incremental improvements while encountering trade-offs in speed or generalizability. To crystallize this research landscape and clearly delineate the contribution of our work, Table 1 provides a structured comparison spanning from established baselines to recent specialized detectors, focusing on architectural design and performance on the VisDrone benchmark.
The comparative analysis in Table 1 yields two critical observations. First, despite consistent progress, the performance gains achieved by existing methods are often incremental. This is largely attributable to their common reliance on modifying existing components rather than introducing fundamental architectural innovations. Second, our method SSCW-YOLO achieves a significant performance leap. We attribute this superior result directly to its core innovation, which is the novel SCoConv module and the synergistic design of the overall framework. This contrast underscores the limitation of current paradigms and highlights the necessity of the novel approach proposed in this work.
To address this identified gap, we propose SSCW-YOLO, a feature optimization-based UAV small object detection algorithm. The main contributions of this work are, first, a Spatial Cosine Convolution module termed SCoConv that establishes a dedicated Filter-then-Enhance pipeline; second, a C2f_ScConv feature fusion module designed to suppress redundancy; and third, the effective integration of the Wise-IoU (WIoU) loss function. Extensive experiments demonstrate that our method achieves a superior balance between detection accuracy and operational speed.
Based on the above, this paper proposes a feature-optimization-based UAV small object detection algorithm, aiming to balance detection speed and accuracy in UAV small object detection, improving detection precision while enhancing speed. The main contributions of this work are as follows:
(1)
A Spatial Cosine Convolution (SCoConv) is proposed, integrating the spatial reconstruction mechanism from spatial-channel reconstruction convolution with the directional sensitivity of cosine similarity convolution. This enhances the construction of spatial structural information, reduces contextual information loss, improves feature discriminability, and increases detection accuracy for small UAV targets. Quantitative evidence from ablation experiments (Table 2) validates this claim: compared with the baseline YOLOv8n, the model with SCoConv alone achieves a 1.5% mAP50 improvement (32.4% → 33.9%) and a 1.8% recall increase (31.9% → 33.7%), which confirms that enhanced spatial structural information effectively reduces missed detections of small objects.
(2)
A new feature fusion module, C2f_ScConv, is designed based on spatial-channel reconstruction convolution, which limits feature redundancy in spatial and channel dimensions, reducing both the model’s parameter count and computational cost.
(3)
The WIoU loss function, based on a dynamic non-monotonic focusing mechanism, is adopted as the model’s loss function to address the issue of uneven quality in small object anchor boxes, thereby improving the model’s detection accuracy for small targets.

2. Materials and Methods

2.1. Dataset

This study employs the publicly available VisDrone2019 dataset [30], a comprehensive benchmark for UAV-based object detection. The dataset comprises 288 video sequences, 261,908 frames, and 10,209 static images acquired by drone-mounted cameras under diverse aerial scenarios. The official partition includes 6471 training images, 548 validation images, and 1610 testing images.
The dataset encompasses ten object categories: pedestrian, person, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motorcycle. These categories exhibit a characteristic long-tail distribution where frequently occurring classes such as car and pedestrian dominate the dataset, while rare classes including awning-tricycle represent less than two percent of total instances. This inherent imbalance, combined with the prevalence of small-scale objects and complex background clutter, establishes VisDrone as an ideal benchmark for evaluating UAV-oriented detection algorithms.
Representative samples with annotations are illustrated in Figure 1, demonstrating substantial diversity in object scales, densities, and environmental conditions.

2.2. SSCW-YOLO

2.2.1. Architecture

The overall network structure of the proposed algorithm is shown in Figure 2.
Based on YOLOv8, this paper introduces Spatial Cosine Convolution, replacing the convolution modules in the backbone to enhance feature quality. A feature fusion module, C2f_ScConv, is designed to reduce feature redundancy, parameter count, and computational cost. Additionally, the WIoU loss function is employed to address the imbalance in bounding boxes for objects of different sizes, improving the detection accuracy for small targets.

2.2.2. SCoConv

In object detection tasks, especially for small object detection, the following key challenges exist:
① Feature weakening: Small objects are easily lost during downsampling in deep feature maps, resulting in low response and blurred boundaries in their local spatial features.
② Loss of spatial information: Traditional convolutional layers focus on local receptive fields and struggle to capture fine spatial relationships between small objects, often leading to missed detections.
③ Insufficient feature discriminability: Particularly in complex backgrounds, small objects often share similar intensity, color, or other features with the background, making it difficult for traditional convolutions to distinguish them effectively based on feature strength.
To address these issues, this study integrates the spatial reconstruction concept from spatial-channel reconstruction convolution with the direction-sensitive feature measurement of cosine convolution to establish a feature enhancement mechanism that is both spatially aware and direction-sensitive. This mechanism alleviates severe feature loss in small object detection and improves detection accuracy. The structural diagram is shown in Figure 3.
The innovation of the SCoConv module lies not merely in the combination of ScConv and Cosine Conv, but in the establishment of a synergistic “Filter-then-Enhance” pipeline specifically designed for the challenges of small object detection. The ScConv component first acts as a spatial feature filter, leveraging its spatial reconstruction unit to suppress redundant and noisy features commonly found in complex UAV backgrounds. This process yields a refined feature set with enhanced purity. Subsequently, the Cosine Conv component operates on this purified input, where its intrinsic sensitivity to directional similarity can more effectively amplify the subtle geometric cues (e.g., edges, contours) of small objects. This sequential operation is critical: applying Cosine Conv to raw, redundant features would be less effective because the directional signal is obscured, while ScConv alone lacks the capability to accentuate these delicate structures. Therefore, SCoConv represents a novel, functionally complementary integration in which the two components work in concert to mitigate feature loss.
Spatial and Channel Reconstruction Convolution (ScConv) [31], proposed in 2023, is a lightweight convolutional module designed to compress redundant features in convolutional neural networks. Its spatial reconstruction unit employs a separate-and-reconstruct approach to reduce redundancy in the spatial dimension of input features. The purpose of the separation is to isolate feature maps with sufficient information from those with insufficient information, corresponding to the spatial content. For an input feature I, the amount of information in each feature map is first evaluated using the scaling factors from Group Normalization (GN). The notation involved in the calculation is defined as follows: I is the input feature map (shape H × W × C, where H, W, and C denote height, width, and channel number, respectively); γ and β are the trainable scaling and bias parameters of GN, with γ reflecting the information richness of each feature map; X is an element of the input feature map I; μ and σ are the mean and standard deviation of the feature map group in GN; ε is a small constant (1 × 10−3) ensuring numerical stability; and I_out is the output feature map after GN processing. The calculation is given by Equation (1):
I_{out} = GN(I) = \gamma \frac{X - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta
where μ and σ denote the mean and standard deviation, ε is a very small constant introduced to ensure numerical stability, and γ and β are trainable parameters, with a larger γ indicating richer spatial information.
The normalized weight W_γ is calculated as shown in Equation (2):
W_{\gamma} = \{ w_{i} \} = \frac{\gamma_{i}}{\sum_{j=1}^{C} \gamma_{j}}, \quad i, j = 1, 2, \ldots, C
Then, the weighted values of W_γ are mapped into the range (0, 1) using the Sigmoid function, and gating is applied with a threshold (0.5 in this experiment). Specifically, information weights greater than the threshold are set to W_1, while those lower than the threshold are set to W_2. The computation of W is given in Equation (3):
W = \mathrm{Gate}\left( \mathrm{Sigmoid}\left( W_{\gamma} \, GN(I) \right) \right)
Finally, a reconstruction operation is applied to add the features with more information and those with less information, thereby generating richer features while reducing redundancy. Specifically, a cross-reconstruction strategy is adopted, in which the two weighted features containing different amounts of information are merged to obtain I^{w1} and I^{w2}. These are then concatenated to produce the spatially refined feature map I^{w}. The reconstruction process is formulated as shown in Equation (4):
I_{1}^{w} = W_{1} \otimes I, \quad I_{2}^{w} = W_{2} \otimes I, \quad I_{11}^{w} \oplus I_{22}^{w} = I^{w1}, \quad I_{12}^{w} \oplus I_{21}^{w} = I^{w2}, \quad I^{w1} \cup I^{w2} = I^{w}
To clarify the cross-reconstruction mechanism and its difference from simple concatenation or summation: I_1^w and I_2^w are the feature maps weighted by W_1 (high-information) and W_2 (low-information), respectively. Cross-reconstruction fuses complementary features: I_11^w (the high-information part of I_1^w) is combined with I_22^w (the low-information part of I_2^w), and I_12^w (the low-information part of I_1^w) is combined with I_21^w (the high-information part of I_2^w). This differs from simple concatenation (directly stacking I_1^w and I_2^w, which leads to redundant features) and summation (averaging I_1^w and I_2^w, which leads to loss of spatial details). The cross-reconstruction strategy retains useful information while suppressing redundancy, laying a foundation for subsequent feature enhancement.
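For readers who prefer an operational view, the following is a minimal PyTorch sketch of the spatial reconstruction unit described by Equations (1)–(4). It is an illustrative reimplementation under our own naming (SpatialReconstructionUnit, the group count, and the gating threshold are assumptions taken from the description above), not the authors’ released code.

```python
# Minimal sketch of the ScConv spatial reconstruction unit (SRU) described above,
# assuming PyTorch; module and variable names are illustrative, not the authors' code.
import torch
import torch.nn as nn

class SpatialReconstructionUnit(nn.Module):
    def __init__(self, channels: int, groups: int = 4, gate_threshold: float = 0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.gate_threshold = gate_threshold
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_gn = self.gn(x)                                   # Eq. (1): group-normalized features
        gamma = self.gn.weight                              # trainable scaling factors gamma
        w_gamma = gamma / gamma.sum()                       # Eq. (2): normalized weights W_gamma
        weights = self.sigmoid(x_gn * w_gamma.view(1, -1, 1, 1))   # Eq. (3): information weights
        w1 = torch.where(weights > self.gate_threshold, weights, torch.zeros_like(weights))  # informative
        w2 = torch.where(weights > self.gate_threshold, torch.zeros_like(weights), weights)  # less informative
        x1, x2 = w1 * x, w2 * x                             # weighted feature maps I_1^w, I_2^w
        # Eq. (4): cross-reconstruction -- split each half along channels and fuse complementary parts
        x11, x12 = torch.chunk(x1, 2, dim=1)
        x21, x22 = torch.chunk(x2, 2, dim=1)
        return torch.cat([x11 + x22, x12 + x21], dim=1)     # concatenated refined feature I^w

# Quick shape check
sru = SpatialReconstructionUnit(64)
print(sru(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```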
Cosine Convolution [32] is a convolution operation based on cosine similarity. Unlike traditional convolution, which primarily depends on the magnitude of the dot product, Cosine Conv focuses on the directional similarity between the convolution kernel and the input features. This approach emphasizes the geometric structure and angular information among features, enhancing the model’s sensitivity to feature directions, boundaries, and subtle differences. For input features x_i and kernel weights w_i, the output response y can be represented as:
y = \frac{\sum_{i=1}^{n} w_{i} x_{i}}{\sqrt{\sum_{i=1}^{n} w_{i}^{2}} \sqrt{\sum_{i=1}^{n} x_{i}^{2}}}
The following clarifies how this cosine similarity-based operation is embedded into convolutional layers. (1) Kernel normalization: the convolution kernel w_i is L2-normalized before the calculation, consistent with the denominator of Equation (5), ensuring that the operation focuses on directional similarity rather than magnitude. (2) Gradient behavior: during backpropagation, gradients are computed with respect to the normalized kernel and input features, which avoids gradient vanishing caused by magnitude differences (consistent with standard convolutional gradient flow). (3) Computational overhead: the operation adds negligible cost compared to standard convolution, since only two additional L2 normalization steps (for the kernel and the input) are required, leading to an increase of approximately 3.7% in FLOPs (verified in Table 2: 8.1 G → 8.4 G for YOLOv8n + SCoConv).
This cosine-based convolution operation enhances the model’s ability to capture edge and contour features of small objects, which are crucial for distinguishing small targets from complex backgrounds.
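The cosine-based response in Equation (5) can be sketched as a drop-in convolution layer, as below. This is an assumption-level illustration (names such as CosineConv2d are ours): the numerator is an ordinary convolution, and the denominator is obtained by normalizing the kernel and each input patch.

```python
# Hedged sketch of a cosine-similarity convolution (Eq. (5)): the kernel and each local
# input patch are L2-normalized so the response depends on direction rather than magnitude.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, stride: int = 1, padding: int = 1, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.k, self.stride, self.padding, self.eps = k, stride, padding, eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Numerator: ordinary convolution, i.e. the sum of w_i * x_i over each patch
        num = F.conv2d(x, self.weight, stride=self.stride, padding=self.padding)
        # Denominator part 1: L2 norm of each output filter
        w_norm = self.weight.flatten(1).norm(dim=1).view(1, -1, 1, 1)
        # Denominator part 2: L2 norm of each input patch, via convolving x^2 with an all-ones kernel
        ones = torch.ones(1, x.shape[1], self.k, self.k, device=x.device, dtype=x.dtype)
        patch_norm = torch.sqrt(F.conv2d(x * x, ones, stride=self.stride, padding=self.padding) + self.eps)
        return num / (w_norm * patch_norm + self.eps)

y = CosineConv2d(16, 32)(torch.randn(2, 16, 40, 40))
print(y.shape)  # torch.Size([2, 32, 40, 40]); responses lie in [-1, 1]
```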

2.2.3. C2f_ScConv

The original C2f module achieves efficient feature fusion through channel grouping and shallow feature reuse. However, it still presents two significant issues in practical applications: first, the channel concatenation process lacks an effective selection mechanism, which can introduce a large number of redundant and low-value features, reducing information utilization efficiency; second, the fusion process does not explicitly model spatial structures, causing small objects and boundary details to be easily lost.
To address these issues, this paper designs an improved feature fusion module, C2f_ScConv. Building on the efficiency advantages of the original C2f structure, it incorporates spatial-channel reconstruction convolution. By combining spatial modeling with channel reconstruction, the module significantly compresses redundant feature flows, effectively reducing feature redundancy in both spatial and channel dimensions, and lowering model complexity and computational cost. Its structure is shown in Figure 4.
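A hedged sketch of how such a fusion block can be assembled is given below, reusing the SpatialReconstructionUnit class from the sketch in Section 2.2.2. It follows the split/concat pattern of the Ultralytics C2f block, with the spatial reconstruction step placed inside each bottleneck; this is an illustration of the idea under our own naming, not the exact C2f_ScConv implementation.

```python
# Illustrative C2f-style block whose bottlenecks filter spatial redundancy before fusion.
import torch
import torch.nn as nn

class ScBottleneck(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.sru = SpatialReconstructionUnit(ch)            # spatial redundancy filtering (earlier sketch)
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, 1, 1, bias=False),
                                  nn.BatchNorm2d(ch), nn.SiLU())

    def forward(self, x):
        return x + self.conv(self.sru(x))                   # residual connection, as in C2f bottlenecks

class C2f_ScConvSketch(nn.Module):
    def __init__(self, c_in: int, c_out: int, n: int = 1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1, 1)
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1, 1)
        self.m = nn.ModuleList(ScBottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))               # split into two branches
        y.extend(m(y[-1]) for m in self.m)                  # progressively refine one branch
        return self.cv2(torch.cat(y, dim=1))                # fuse all intermediate features

print(C2f_ScConvSketch(64, 64, n=2)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```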

2.2.4. WIoU Loss

YOLOv8 uses Binary Cross-Entropy (BCE) as the classification loss, determining whether each class is present and outputting the corresponding confidence. In the case of binary classification, the predicted probabilities for each class are set as P and (1 − P), expressed as shown in Equation (6) (with the logarithm base e):
L = \frac{1}{N} \sum_{i} L_{i} = -\frac{1}{N} \sum_{i} \left[ y_{i} \log(p_{i}) + (1 - y_{i}) \log(1 - p_{i}) \right]
where y_i denotes the label of sample i, with 1 for positive and 0 for negative, and p_i represents the probability that sample i belongs to the positive class.
For bounding box regression, YOLOv8 uses the CIoU loss to make anchor boxes closer to the ground truth, together with the DFL loss, while removing the explicit confidence output in the prediction. Instead, it directly outputs the class confidence scores, and the maximum score is taken as the confidence of the corresponding anchor box. DFL optimizes the probabilities of the two positions nearest to the label y in the form of cross-entropy. The DFL loss can be expressed as shown in Equation (7) (with the logarithm base e):
\mathrm{DFL}(S_{i}, S_{i+1}) = -\left[ (y_{i+1} - y) \log(S_{i}) + (y - y_{i}) \log(S_{i+1}) \right]
where S_i and S_{i+1} denote the network’s predicted value and its neighboring predicted value, respectively, and y, y_i, and y_{i+1} represent the ground-truth value, the integral value of the label, and the integral value of the neighboring label, respectively.
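The following snippet illustrates how the DFL term in Equation (7) is typically computed for one box side, assuming the usual discretization in which the continuous target y lies between two adjacent integer bins; variable names and the reg_max value are illustrative.

```python
# Worked sketch of the DFL term in Eq. (7); illustrative only.
import torch
import torch.nn.functional as F

def dfl_loss(pred_logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred_logits: (N, reg_max) distribution logits for one box side; target: (N,) in [0, reg_max-1]."""
    probs = F.softmax(pred_logits, dim=-1)                      # S over the discrete positions
    left = target.floor().long()                                # y_i
    right = (left + 1).clamp(max=pred_logits.shape[-1] - 1)     # y_{i+1}
    w_left = right.float() - target                             # (y_{i+1} - y)
    w_right = target - left.float()                             # (y - y_i)
    s_left = probs.gather(-1, left.unsqueeze(-1)).squeeze(-1)
    s_right = probs.gather(-1, right.unsqueeze(-1)).squeeze(-1)
    return -(w_left * s_left.log() + w_right * s_right.log()).mean()

logits = torch.randn(4, 16)                  # reg_max = 16, as in YOLOv8's default head
target = torch.tensor([2.3, 7.9, 0.5, 11.0])
print(dfl_loss(logits, target))
```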
Due to the small size and low pixel count of small objects, examples in small object datasets often lack high-quality anchor boxes. Moreover, geometric factors such as aspect ratio and center distance in YOLOv8’s CIoU loss increase the penalty on low-quality anchor boxes, which can negatively affect the model’s generalization and robustness.
When there is no overlap between bounding boxes, it can lead to vanishing gradients during backpropagation and prevent the overlapping regions from being properly updated during training. To address this issue, this paper adopts the Wise-IoU (WIoU) [33] loss, a bounding box loss based on a dynamic non-monotonic focusing mechanism, as the model’s bounding box loss function. WIoU reduces the impact of geometric factors on anchor box regression by replacing IoU with an “outlier degree” to evaluate anchor quality and introducing a smart gradient gain allocation strategy, which mitigates competition from high-quality anchors while reducing harmful gradients caused by low-quality samples. By replacing the original CIoU with WIoU, the model can focus more on low-quality anchor boxes, enhancing its ability to detect small objects and achieving better detection performance in fewer training iterations. The computation of the WIoU loss function can be expressed as shown in Equations (8) and (9):
L_{WIoU\,v1} = R_{WIoU} \times L_{IoU}
R_{WIoU} = \exp\left( \frac{ (b_{cx}^{gt} - b_{cx})^{2} + (b_{cy}^{gt} - b_{cy})^{2} }{ \left( c_{w}^{2} + c_{h}^{2} \right)^{*} } \right)
where L_IoU denotes the bounding box IoU loss and R_WIoU represents the distance attention; b_cx and b_cy are the horizontal and vertical coordinates of the predicted box center, b_cx^gt and b_cy^gt are those of the ground-truth box center, and c_w and c_h are the width and height of the minimum enclosing rectangle of the predicted and ground-truth boxes. The superscript * denotes the separation (detach) operation, which prevents R_WIoU from producing gradients that hinder convergence and thereby accelerates the model’s convergence. Building on this, the complete WIoU loss with the dynamic focusing mechanism is expressed as shown in Equations (10)–(12):
L_{WIoU} = r \times L_{WIoU\,v1}
r = \frac{\beta}{\delta \alpha^{\beta - \delta}}
\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty)
where β is the outlier degree, which serves as the non-monotonic focusing coefficient; in Equation (12), L_IoU^* is the IoU loss of the current anchor box detached from the computation graph, and the denominator is its moving average over training. By constructing a non-monotonic focusing factor r from β and applying it to L_WIoU v1, the WIoU loss with a dynamic non-monotonic focusing mechanism is obtained. Utilizing the intelligent gradient gain allocation strategy of this mechanism allows the model to achieve better performance.
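To make the loss concrete, a hedged sketch of the WIoU computation described by Equations (8)–(12) is given below. The running mean of the IoU loss is passed in as an argument, and α and δ are the focusing hyperparameters; their values and the box format (x1, y1, x2, y2) are assumptions for illustration.

```python
# Hedged sketch of the Wise-IoU loss with the dynamic non-monotonic focusing mechanism.
import torch

def wiou_loss(pred, gt, iou_mean, alpha: float = 1.9, delta: float = 3.0, eps: float = 1e-7):
    # Plain IoU between predicted and ground-truth boxes
    inter_w = (torch.min(pred[:, 2], gt[:, 2]) - torch.max(pred[:, 0], gt[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], gt[:, 3]) - torch.max(pred[:, 1], gt[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)
    l_iou = 1.0 - iou
    # Eq. (9): distance attention over the smallest enclosing box; '*' means detach (stop-gradient)
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])
    dx = (pred[:, 0] + pred[:, 2] - gt[:, 0] - gt[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - gt[:, 1] - gt[:, 3]) / 2
    r_wiou = torch.exp((dx ** 2 + dy ** 2) / (cw ** 2 + ch ** 2 + eps).detach())
    # Eqs. (11)-(12): outlier degree beta from a running mean of L_IoU, then gradient gain r
    beta = l_iou.detach() / (iou_mean + eps)
    r = beta / (delta * alpha ** (beta - delta))
    return (r * r_wiou * l_iou).mean()

pred = torch.tensor([[10., 10., 50., 50.], [0., 0., 20., 20.]])
gt = torch.tensor([[12., 12., 48., 52.], [5., 5., 22., 25.]])
print(wiou_loss(pred, gt, iou_mean=0.45))
```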

2.2.5. Synergistic Design Philosophy

The primary contribution of this work extends beyond the independent application of the SCoConv, C2f_ScConv, and WIoU components. It resides in the deliberate and synergistic integration of these elements into a coherent framework specifically designed for UAV small object detection. This synergy operates on two distinct levels to comprehensively address the core challenges of feature loss and imprecise localization. First, at the module level, the SCoConv module itself embodies a synergistic Filter-then-Enhance pipeline, as detailed in an earlier section, where its internal components work in concert. Second, and more critically, at the system level, a mutually reinforcing relationship exists between the feature enhancement backbone and the WIoU loss function. The high-quality and spatially refined features produced by the backbone provide a cleaner and more discriminative input for the detection head. This elevated feature quality is crucial for the effective functioning of the WIoU loss’s dynamic non-monotonic focusing mechanism. With well-defined features, WIoU can more accurately assess anchor box quality and intelligently allocate gradient gains, thereby guiding the model toward more precise bounding box regression. Conversely, the superior regression guidance provided by WIoU ensures that the enhanced features are utilized to their fullest potential for accurate localization. This closed-loop optimization, where enhanced features enable smarter loss weighting which in turn improves learning from those very features, forms the cornerstone of this framework’s effectiveness. It ensures that the overall system performance is greater than the sum of its parts, leading to significant gains in detecting challenging small objects.

2.3. Experimental Settings

2.3.1. Experimental Environment

The experimental environment and the corresponding parameter settings are shown in Table 3.

2.3.2. Evaluation Metrics

To ensure consistency of evaluation standards, the following metric definitions are unified throughout the paper:
mAP50: Mean Average Precision calculated at a single IoU (Intersection over Union) threshold of 0.5, reflecting basic detection capability.
mAP50–95: Mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05, evaluating comprehensive localization accuracy.
Small objects: Defined as objects with pixel area < 32 × 32; medium objects: 32 × 32 ≤ pixel area ≤ 96 × 96; large objects: pixel area > 96 × 96 (consistent with COCO benchmark standards).
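As a small illustration of these size buckets (assuming axis-aligned boxes in pixel coordinates), objects can be assigned to a category as follows; the helper name is ours.

```python
# Assign a ground-truth box (x1, y1, x2, y2) to the COCO-style size buckets defined above.
def size_bucket(box):
    w, h = box[2] - box[0], box[3] - box[1]
    area = w * h
    if area < 32 * 32:
        return "small"
    elif area <= 96 * 96:
        return "medium"
    return "large"

print(size_bucket((100, 100, 118, 125)))  # 18 x 25 = 450 px^2 -> "small"
```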
To evaluate the accuracy and effectiveness of the model in detecting small objects, the evaluation metrics used in this study include Precision (P), Recall (R), and mean Average Precision (mAP).
(1)
Precision (P): measures the proportion of correctly predicted positive samples. In object detection, a prediction is considered correct if the predicted bounding box sufficiently overlaps with the ground-truth bounding box. It is calculated as follows:
P = \frac{TP}{TP + FP}
(2)
Recall (R): measures the proportion of all true positive samples that the model can correctly identify. In object detection, a sample is considered correctly recalled if the ground-truth bounding box sufficiently overlaps with the predicted bounding box. It is calculated as follows:
R = \frac{TP}{TP + FN}
(3)
The mean Average Precision (mAP) is reported in two forms: mAP50, which considers a single IoU threshold of 0.5, providing a measure of basic detection capability; and mAP50–95, which averages the precision over IoU thresholds from 0.5 to 0.95 with a step size of 0.05. The latter is the standard metric for comprehensive benchmarks like COCO and provides a more rigorous assessment of localization accuracy. This is particularly crucial for evaluating small object detection, where precise bounding box regression is challenging. Here, n denotes the number of categories. The calculation formula is as follows:
AP = \int_{0}^{1} P(r) \, dr
mAP = \frac{1}{n} \sum_{i=1}^{n} AP(i)
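For reference, Equations (13)–(16) can be realized as in the simplified sketch below, which computes AP with all-point interpolation from confidence-sorted detections; the matching logic and variable names are illustrative and not the evaluation code used in the paper.

```python
# Simplified Precision/Recall/AP computation at a fixed IoU threshold (Eqs. (13)-(16)).
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """scores: confidence per detection; is_tp: 1 if matched to an unmatched GT at IoU >= 0.5."""
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(1 - np.asarray(is_tp)[order])
    precision = tp / np.maximum(tp + fp, 1e-9)               # Eq. (13)
    recall = tp / max(num_gt, 1)                             # Eq. (14)
    # Eq. (15): integrate the precision envelope over recall (all-point interpolation)
    m_pre = np.concatenate(([1.0], precision, [0.0]))
    m_rec = np.concatenate(([0.0], recall, [1.0]))
    m_pre = np.maximum.accumulate(m_pre[::-1])[::-1]
    idx = np.where(m_rec[1:] != m_rec[:-1])[0]
    return float(np.sum((m_rec[idx + 1] - m_rec[idx]) * m_pre[idx + 1]))

# Eq. (16): mAP is the mean of per-class APs
aps = [average_precision([0.9, 0.8, 0.3], [1, 0, 1], num_gt=3),
       average_precision([0.7, 0.6], [1, 1], num_gt=2)]
print(sum(aps) / len(aps))
```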

2.3.3. Generalization Experiment Protocol

To ensure a fair and reproducible comparison on the DOTA and SSDD datasets, we standardized the experimental settings for all models involved in the generalization studies. All models (including our SSCW-YOLO, baseline models such as YOLOv5n and YOLOv8n, and comparative models such as YOLOv11) were initialized with MS-COCO pretrained weights officially released by Ultralytics, without additional fine-tuning on other datasets. For the DOTA-v1.0 dataset, we adhered to the official split, using the training set (1411 images) for training and the test set (937 images) for evaluation; the target annotations adopt Oriented Bounding Boxes (OBB) to accommodate the rotational characteristics of aerial objects. For the SSDD dataset, we adopted the common 70/20/10 split for training (800 images), validation (229 images), and testing (115 images), with results reported on the test set; the target annotations use Axis-Aligned Bounding Boxes (AABB), since the ship targets are mainly axis-aligned in maritime scenarios. All models were trained for 200 epochs under an identical configuration: SGD optimizer (initial learning rate: 0.01, momentum: 0.937, weight decay: 0.0005), with consistent data augmentation strategies applied only in the training phase (no augmentation in the validation/test phases):
Mosaic augmentation: Probability = 1.0 (only for the first 100 epochs of training);
Random horizontal flipping: Probability = 0.5;
Color space adjustments: Hue, saturation, exposure adjustment factors = 0.1;
Additional applied strategies: random cropping (crop ratio range: 0.5–1.0) and random rotation (−10° to 10°), consistent with the main experiment (VisDrone training settings). The learning rate schedule is likewise unified across all models: cosine annealing with an initial learning rate of 0.01, a linear warm-up from 0.001 to 0.01 over the first 10 epochs, and decay to 0.0001 over the last 50 epochs. This protocol guarantees that performance differences are attributable to the model architectures themselves.
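Expressed through the Ultralytics training API, this protocol corresponds roughly to the configuration sketched below. The dataset YAML path and some argument mappings (for example, using close_mosaic to disable mosaic after the first 100 epochs) are assumptions rather than the authors’ exact settings.

```python
# Hedged sketch of the generalization-experiment training configuration described above.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # MS-COCO pretrained weights released by Ultralytics
model.train(
    data="SSDD.yaml",               # hypothetical dataset config; DOTA would use the OBB task instead
    epochs=200,
    optimizer="SGD",
    lr0=0.01,                       # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=10,               # linear warm-up, as in the protocol
    cos_lr=True,                    # cosine-annealing learning rate schedule
    mosaic=1.0,
    close_mosaic=100,               # disable mosaic after the first 100 epochs
    fliplr=0.5,                     # random horizontal flip
    hsv_h=0.1, hsv_s=0.1, hsv_v=0.1,  # color-space adjustment factors
    degrees=10.0,                   # random rotation within +/- 10 degrees
)
```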

3. Experimental Results and Analysis

3.1. Ablation Experiments

To verify the improvement in small object detection performance brought by the proposed algorithm, ablation experiments were conducted on the VisDrone dataset using the original YOLOv8 algorithm as the baseline network. The “+” symbol indicates the addition of the corresponding module. The experimental results are shown in Table 2.
After replacing the standard Conv modules in the backbone network with the SCoConv module, the model’s precision decreased slightly, but the recall improved noticeably. At the same time, the inference speed increased significantly, and the number of parameters was reduced to some extent. This indicates that the module can enhance the quality of features used for regression, thereby improving both overall detection accuracy and speed.
After introducing the C2f_ScConv module into the model, although the detection speed decreased slightly, all other performance metrics improved to varying degrees, while the computational cost and number of parameters were also reduced. This demonstrates that the proposed feature fusion module C2f_ScConv can effectively suppress feature redundancy and enhance feature quality. After replacing the original CIoU loss function with the WIoU loss function, the model’s precision, recall, and mAP all improved, but the computational cost increased slightly. This indicates that the WIoU loss function can effectively balance the quality of bounding boxes, thereby enhancing small object detection accuracy.
When both the SCoConv and C2f_ScConv modules are incorporated into the model, although the computational cost and number of parameters increase slightly, the model’s precision, recall, and other performance metrics all show varying degrees of improvement. When the SCoConv module and WIoU loss function are added to the model, although the computational cost increases slightly, all other performance metrics show varying degrees of improvement.
When the C2f_ScConv module and WIoU loss function are incorporated, the model’s precision and recall improve significantly, while FPS and computational cost increase slightly.
Finally, when all proposed modules are integrated into the model, the computational cost increases slightly, but all other performance metrics see substantial improvement. This demonstrates that the modules proposed in this study are effective for small object detection in UAV scenarios.

3.2. Comparative Experiments

To verify the feasibility of the proposed algorithm for small object detection, comparative experiments were conducted on the VisDrone dataset with several other mainstream algorithms. The experimental results are shown in Table 4. Compared with other YOLO-series object detection algorithms with similar parameter counts, the proposed algorithm achieves the best performance across all four evaluation metrics: Precision (P), Recall (R), mAP50, and mAP50–95. Although YOLOv8Ghost-p2 has a parameter count of only 1.6 M, its detection accuracy still needs improvement. The above experimental results indicate that the proposed algorithm achieves a significant improvement in the overall performance of small object detection from a UAV perspective compared with other state-of-the-art algorithms. It outperforms the more recent YOLOv12 model across all evaluation metrics.
To further analyze the classification accuracy of each model for different object categories, confusion matrices of the 8 models on the VisDrone dataset were generated, as shown in Figure 5. The confusion matrices reflect the true positive and false positive rates of each model for 10 object categories (pedestrian, person, bicycle, car, van, truck, tricycle, awning-tricycle, bus, motorcycle), which can intuitively reveal the model’s strengths and weaknesses in classifying specific targets.
As shown in Figure 5, SSCW-YOLO achieves higher classification accuracy for rare categories such as awning-tricycle and tricycle, with a significant reduction in cross-category misclassification compared to other models. This is attributed to the enhanced feature discriminability brought by the SCoConv module, which helps the model distinguish between similar categories. For common categories such as car and pedestrian, SSCW-YOLO also maintains the highest accuracy, consistent with the precision and recall metrics in Table 4.

3.3. In-Depth Analysis of Small Object Detection Performance

To provide direct and quantitative evidence that the proposed enhancements specifically address the core challenge of small object detection, a fine-grained analysis was conducted. The performance on the VisDrone test set was evaluated by categorizing objects into three groups based on their pixel area, following standard practice in object detection benchmarks: small objects with an area below 32 × 32 pixels, medium objects with an area between 32 × 32 and 96 × 96 pixels, and large objects with an area above 96 × 96 pixels. The comparative results are summarized in Table 5.
Table 5 presents the AP50 metrics disaggregated by object size for key baseline models and recent advanced methods. The results lead to two critical observations. First, the detection of small objects remains the most challenging task, as all models achieve the lowest accuracy on this category. Second, and most importantly, our proposed SSCW-YOLO model demonstrates a superior and focused improvement on small objects. It achieves an AP50 of 22.4 percent for small objects, which represents a significant improvement of 5.6 percentage points over the 16.8 percent achieved by the YOLOv8n baseline. This margin of improvement is notably larger than those for medium-sized objects at 2.9 percentage points and large objects at 2.2 percentage points. This provides compelling evidence that the proposed architectural innovations are particularly effective in mitigating the feature loss and localization inaccuracy inherent in small object detection.
To further verify the adaptability of SSCW-YOLO in practical UAV scenarios, three typical challenging scenes (dense crowds, small vehicles in complex backgrounds, and edge small objects) were selected for quantitative comparison. The results are shown in Table 6, which includes key metrics such as mAP50, mAP50–95, miss rate, false detection count, and inference speed.

3.4. Visualization

Table 7 presents a quantitative comparison of feature response heatmap metrics among the core models. The feature response mean reflects the overall activation intensity of the model’s feature maps; the small object region response intensity measures the model’s sensitivity to small targets; the background redundancy response ratio indicates the proportion of irrelevant background feature activation; and the feature focus score is a comprehensive metric (0–100) evaluating the model’s ability to focus on discriminative features.
To more intuitively demonstrate the detection performance of the proposed algorithm, representative scenes from the dataset were selected for comparison using Grad-CAM [35] and channel-fused detection results. The comparison is shown in Figure 6, where brighter colors indicate higher confidence in the detection results. As shown, the proposed algorithm performs better in detecting densely distributed small objects, with fewer missed detections.
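For completeness, a minimal Grad-CAM [35] sketch is shown below using forward and backward hooks on a chosen convolutional layer; it is demonstrated on a generic torchvision classifier for self-containedness, whereas in the paper the technique is applied to the detector’s feature maps. It is an illustration of the visualization method, not the authors’ tooling.

```python
# Minimal Grad-CAM sketch: hook a convolutional layer, backpropagate a score, and weight
# the layer's activations by the channel-averaged gradients.
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
target_layer = model.layer4[-1]
feats, grads = {}, {}

target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224, requires_grad=True)
score = model(x)[0].max()                                     # score of the top class
score.backward()

weights = grads["a"].mean(dim=(2, 3), keepdim=True)           # channel-wise gradient average
cam = torch.relu((weights * feats["a"]).sum(dim=1))           # weighted sum of activations
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalize to [0, 1] for overlay
print(cam.shape)  # torch.Size([1, 7, 7]); upsample to image size to visualize the heatmap
```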
To provide a more intuitive comparison of the performance differences between the proposed method and other state-of-the-art algorithms, visual analyses were conducted for all models. As illustrated in Figure 7, the proposed algorithm achieves the highest mAP among the compared models and exhibits the fastest convergence, confirming its effectiveness for small object detection in UAV scenarios.

3.5. Generalization Experiments

To verify the generalization ability of the proposed algorithm on other small object datasets, the UAV-acquired DOTA dataset, containing objects of 16 categories with varying sizes, orientations, and shapes captured from different sensors and platforms, was selected. Additionally, the SSDD dataset, which includes ships under various sea conditions, lighting, types, and sizes, was used. Comparative experiments were conducted with other state-of-the-art algorithms, and the results on the DOTA dataset are presented in Table 8.
As shown in Table 8, although the proposed algorithm’s Precision (P) is slightly lower than YOLOv11, it achieves the highest mAP50 and mAP50–95. Compared with the baseline YOLOv8, the proposed algorithm shows improvements in P, Recall (R), mAP50, and mAP50–95, with the most significant increase observed in mAP50, which rises by 2.8%. This demonstrates that the proposed algorithm possesses strong generalization capability on the DOTA dataset.
Dense regions from the dataset were selected for comparison. These regions contain concentrated, small objects with limited semantic information, posing significant challenges to the model’s detection capability. The comparison results are shown in Figure 8. As indicated by the red circles in the figure, YOLOv8 exhibits considerable missed detections, particularly for edge objects, while YOLOv12 still shows some missed detections. In contrast, the proposed method detects more small and edge objects than both of these state-of-the-art algorithms.
As shown in Table 9, all algorithms achieve excellent performance in the generalization comparison experiments on the SSDD dataset. Among this set of high-performing algorithms, the proposed method attains the best results across all four metrics: Precision, Recall, mAP50, and mAP50–95. Compared with the baseline YOLOv8n, it improves these metrics by 2.4%, 2.0%, 1.5%, and 2.1%, respectively.
Table 10 presents a generalization performance comparison of the core models on the DOTA and SSDD datasets. The DOTA dataset focuses on aerial multi-scale objects, while the SSDD dataset specializes in maritime small ship targets. Large/small object mAP is calculated based on the COCO size definition (small: <32 × 32 pixels; large: >96 × 96 pixels).
Images with a high concentration of targets from the SSDD validation set were selected for visual comparison using YOLOv8n, YOLOv11n, and the proposed algorithm, as shown in Figure 9. The results indicate that YOLOv8n exhibits significant missed detections, and YOLOv11n still misses some targets. In contrast, the proposed method detects more ships than the other two algorithms, demonstrating its ability to identify even extremely small and subtle targets.
The above comparative experiments demonstrate that the proposed algorithm can detect small objects in UAV-view images more accurately, with fewer missed detections and strong generalization capability, indicating its robustness and applicability for small object detection from UAV perspectives.

4. Discussion

This study addresses the core challenges of small object detection in UAV imagery—namely, weak feature representation, complex backgrounds, and suboptimal regression guidance—by proposing a holistic optimization framework built upon YOLOv8. The contributions span three synergistic dimensions: spatial feature enhancement, redundancy-aware fusion, and adaptive loss design. Our findings can be interpreted through the lens of prior work while highlighting novel advancements tailored to aerial scenarios.
First, at the feature extraction level, conventional convolutions often fail to preserve the delicate spatial structures of tiny objects, leading to severe feature attenuation in deeper layers. To mitigate this, we introduce the SCoConv (Spatial Cosine Convolution) module, which uniquely integrates spatial decomposition–reconstruction with cosine similarity-based filtering. This design not only compresses spatial redundancy but also enhances sensitivity to edge orientation and structural consistency—critical for low-resolution targets. Ablation studies confirm that SCoConv alone improves mAP50 by 1.5% and recall by 1.8%, validating its efficacy in boosting feature quality. This aligns with Lin et al.’s principle of “detail-preserving feature learning” [12], yet our approach specifically targets the spatial sparsity inherent in UAV-captured small objects.
Second, regarding feature fusion, the original C2f module in YOLOv8 employs simple channel concatenation, which tends to propagate low-value or noisy features—especially detrimental in cluttered aerial scenes. We propose C2f_ScConv, which embeds a spatial-channel reconstruction mechanism to enable “selective fusion.” By dynamically suppressing redundant channels while preserving discriminative information, this module reduces model complexity without sacrificing accuracy. Experimentally, it lowers parameters from 3.01 M to 2.81 M and FLOPs from 8.1 G to 6.9 G, while simultaneously increasing mAP50 by 1.9%. This demonstrates a superior trade-off between lightweight design and detection performance, addressing the redundancy issue noted in prior fusion strategies [17].
Third, on the loss function front, standard CIoU assumes uniform penalty across all anchors, which conflicts with the skewed anchor distribution typical of small-object datasets (e.g., VisDrone). In contrast, Wise IoU (WIoU) employs a dynamic, non-monotonic focusing mechanism that adaptively modulates gradient gains based on prediction quality. This allows the model to prioritize poorly localized anchors—common among distant or occluded small objects. Our results show a 0.8% mAP50 gain after loss replacement, with notably fewer missed detections in dense scenes. This observation resonates with Tong et al.’s findings on adaptive regression [33], and importantly, we extend its validation to UAV-specific contexts where geometric distortions and scale variance are pronounced.
Furthermore, generalization capability is demonstrated across diverse domains: on DOTA (aerial) and SSDD (maritime), our method achieves mAP50 of 42.3% and 98.7%, respectively—outperforming YOLOv8n by clear margins. This cross-domain robustness suggests that our optimizations enhance intrinsic feature expressiveness rather than overfitting to a single dataset, thereby supporting real-world deployment in varied UAV missions (e.g., traffic monitoring, maritime surveillance).
Nonetheless, limitations exist. The total computational cost (8.7 GFLOPs) slightly exceeds that of YOLOv8n (8.1 GFLOPs), posing challenges for deployment on ultra-low-power micro-drones. A more concrete analysis of the computational overhead shows that, among the proposed components, SCoConv contributes the most to the slight FLOPs increase (from 8.1 G to 8.4 G, +3.7%), followed by C2f_ScConv (+1.2%), while WIoU introduces no FLOPs change. The C2f_ScConv module is the most suitable for further pruning: its channel reconstruction unit can be optimized by reducing the number of intermediate channels (from 0.5 C to 0.4 C) without significant accuracy loss, which is expected to reduce FLOPs by an additional 5–8%. Additionally, our evaluation lacks coverage of extreme conditions such as low light or adverse weather. Future work will focus on: (1) applying channel pruning to C2f_ScConv for further lightweighting; (2) augmenting training data with specialized UAV scenarios, including agricultural pest monitoring in low-light conditions; and (3) optimizing inference pipelines for edge devices such as Jetson platforms.
Finally, Grad-CAM visualizations [35] provide qualitative support: compared to YOLOv8n and YOLOv12n, our model generates more focused activation maps on small targets, particularly in cluttered or partially occluded regions. This interpretability reinforces the technical soundness of our design choices.

5. Conclusions

To tackle the persistent issues of missed detections, false positives, and low precision in UAV-based small object detection stemming from inadequate feature representation, background complexity, and inefficient model design, we propose an enhanced YOLOv8 framework featuring three key innovations: the SCoConv module for spatial feature enhancement, the C2f_ScConv block for redundancy-suppressed fusion, and the WIoU loss for adaptive bounding box regression. These components are co-optimized under UAV-specific imaging constraints to achieve a balanced trade-off between accuracy and efficiency.
On the VisDrone benchmark, our method achieves 46.9% precision, 38.5% recall, 37.8% mAP50, and 22.3% mAP50–95, outperforming YOLOv8n by 2.7, 6.6, 5.4, and 3.3 percentage points, respectively. Remarkably, it also reduces parameters by 13.3% from 3.01 million to 2.61 million and increases inference speed from 101.1 to 115.1 frames per second, which is a 13.9% improvement. When compared against recent variants including YOLOv10, YOLOv11, and YOLOv12n, our approach leads in both mAP50–95 and frames per second, underscoring its suitability for real-time UAV deployment.
Cross-dataset evaluations further validate generalization. On DOTA, mAP50 improves by 2.8% to 42.3% and mAP50–95 by 1.3%. On SSDD, mAP50 reaches 98.7%, an increase of 1.5%, with mAP50–95 rising to 74.2%, which is 2.1% higher. These results confirm robust performance across aerial and maritime domains and across object types such as vehicles and ships.
Visual analysis, including Grad-CAM heatmaps [35] and detection comparisons, reveals that our model consistently detects smaller and more edge-located objects in complex scenes, with an average confidence greater than or equal to 0.6 compared to less than or equal to 0.5 for YOLOv8n. This demonstrates effective mitigation of missed detections and low-confidence predictions.
While the current model incurs a modest increase in FLOPs from 8.1 G to 8.7 G and focuses on standard public datasets, future efforts will expand to specialized UAV applications such as nighttime agriculture and high-altitude reconnaissance, incorporate advanced compression techniques, and target embedded deployment, ultimately advancing practical and robust perception for autonomous drones.

Author Contributions

Conceptualization, Z.H. and R.S.; methodology, B.T.; software, B.T.; validation, Z.H., B.T. and J.L.; formal analysis, Z.H. and R.S.; investigation, B.T.; resources, J.L.; data curation, J.L. and R.S.; writing—original draft preparation, Z.H.; writing—review and editing, J.L.; visualization, Z.H. and B.T.; supervision, R.S.; project administration, R.S.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is available upon request.

Acknowledgments

We would like to express our gratitude to the colleagues in the laboratory for their technical support during the experiment, as well as the open-source contributions from the providers of the relevant datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SCoConv: Spatial Cosine Convolution
ScConv: Spatial and Channel Reconstruction Convolution
WIoU: Wise-IoU
CIoU: Complete-IoU
BCE: Binary Cross-Entropy
mAP: mean Average Precision
FPS: Frames Per Second
GN: Group Normalization
CNN: Convolutional Neural Network
GPU: Graphics Processing Unit
CPU: Central Processing Unit
CUDA: Compute Unified Device Architecture

References

  1. Choi, H.-W.; Kim, H.-J.; Kim, S.-K.; Na, W.S. An overview of drone applications in the construction industry. Drones 2023, 7, 515. [Google Scholar] [CrossRef]
  2. Ahirwar, S.; Swarnkar, R.; Bhukya, S.; Namwade, G. Application of drone in agriculture. Int. J. Curr. Microbiol. Appl. Sci. 2019, 8, 2500–2505. [Google Scholar] [CrossRef]
  3. Song, P.-C.; Pan, J.-S.; Chao, H.-C.; Chu, S.-C. Collaborative Hotspot Data Collection with Drones and 5G Edge Computing in Smart City. ACM Trans. Internet Technol. 2023, 23, 1–15. [Google Scholar] [CrossRef]
  4. Sharma, K.; Singh, H.; Sharma, D.K.; Kumar, A.; Nayyar, A.; Krishnamurthi, R. Dynamic models and control techniques for drone delivery of medications and other healthcare items in COVID-19 hotspots. In Emerging Technologies for Battling COVID-19: Applications and Innovations; Springer: Berlin/Heidelberg, Germany, 2021; pp. 1–34. [Google Scholar]
  5. Mahadevan, P. The military utility of drones. CSS Anal. Secur. Policy 2010, 78, 1–3. [Google Scholar]
  6. Rohan, A.; Rabah, M.; Kim, S.-H. Convolutional neural network-based real-time object detection and tracking for parrot AR drone 2. IEEE Access 2019, 7, 69575–69584. [Google Scholar]
  7. Zhang, H.; Cloutier, R.S. Review on one-stage object detection based on deep learning. EAI Endorsed Trans. e-Learn. 2021, 7, e5. [Google Scholar]
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  9. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  10. Li, D.; Auerbach, P.; Okhrin, O. Autonomous Driving Small-Scale Cars: A Survey of Recent Development. IEEE Trans. Intell. Transp. Syst. 2025, 26, 14591–14614. [Google Scholar] [CrossRef]
  11. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  12. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar] [PubMed]
  13. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  14. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv 2022, arXiv:2201.12329. [Google Scholar] [CrossRef]
  15. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.-Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  16. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
  17. Yin, J.; Zhang, Q.; Lu, W.; Peng, J.; Wang, J.; Li, X. Contextual Transformer Based Small Targets Detection for Cervical Cell. In Proceedings of the 2024 3rd International Conference on Image Processing and Media Computing (ICIPMC), Hefei, China, 17–19 May 2024; pp. 1–7. [Google Scholar] [CrossRef]
  18. Liu, A.; Guo, J.; Arnatovich, Y.; Liu, Z. Lightweight deep neural network with data redundancy removal and regression for DOA estimation in sensor array. Remote Sens. 2024, 16, 1423. [Google Scholar] [CrossRef]
  19. Zhou, Y.; Wei, Y. UAV-DETR: An enhanced RT-DETR architecture for efficient small object detection in UAV imagery. Sensors 2025, 25, 4582. [Google Scholar] [CrossRef] [PubMed]
  20. Huang, Y.; Zhi, X.; Hu, J.; Yu, L.; Han, Q.; Chen, W.; Zhang, W. FDDBA-NET: Frequency domain decoupling bidirectional interactive attention network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
  21. Zhu, C.; Xie, X.; Xi, J.; Yang, X. GM-DETR: Infrared Detection of Small UAV Swarm Targets Based on Detection Transformer. Remote Sens. 2025, 17, 3379. [Google Scholar] [CrossRef]
  22. Akwiwu, Q. Object Detection, Segmentation, and Distance Estimation Using YOLOv8; SAVONIA: Kuopio, Finland, 2025. [Google Scholar]
  23. Wang, H.; Liu, C.; Cai, Y.; Chen, L.; Li, Y. YOLOv8-QSD: An improved small object detection algorithm for autonomous vehicles based on YOLOv8. IEEE Trans. Instrum. Meas. 2024, 73, 1–16. [Google Scholar] [CrossRef]
  24. Ling, P.; Zhang, Y.; Ma, S. Marine Small Object Detection Algorithm in UAV Aerial Images Based on Improved YOLOv8. IEEE Access 2024, 12, 176527–176538. [Google Scholar] [CrossRef]
  25. Wang, A.; Liang, G.; Wang, X.; Song, Y. Application of the YOLOv6 Combining CBAM and CIoU in Forest Fire and Smoke Detection. Forests 2023, 14, 2261. [Google Scholar] [CrossRef]
  26. Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A Survey of Object Detection for UAVs Based on Deep Learning. Remote Sens. 2023, 16, 149. [Google Scholar] [CrossRef]
  27. Mohsan, S.A.H.; Othman, N.Q.H.; Li, Y.; Alsharif, M.H.; Khan, M.A. Unmanned Aerial Vehicles (UAVs): Practical Aspects, Applications, Open Challenges, Security Issues, and Future Trends. Intell. Serv. Robot. 2023, 16, 109–137. [Google Scholar] [CrossRef] [PubMed]
  28. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024. [Google Scholar]
  29. Bao, Z. The UAV Target Detection Algorithm Based on Improved YOLOv8. In Proceedings of the International Conference on Image Processing, Machine Learning and Pattern Recognition, Guangzhou, China, 13–15 September 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 264–269. [Google Scholar]
  30. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  31. Li, J.; Wen, Y.; He, L. SCConv: Spatial and channel reconstruction convolution for feature redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  32. Liu, G.; Tian, L.; Wen, Y.; Zhou, W. Cosine convolutional neural network and its application for seizure detection. Neural Netw. 2024, 174, 106267. [Google Scholar] [CrossRef] [PubMed]
  33. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  34. Cao, J.; Bao, W.; Shang, H.; Yuan, M.; Cheng, Q. GCL-YOLO: A GhostConv-based lightweight yolo network for UAV small object detection. Remote Sens. 2023, 15, 4932. [Google Scholar] [CrossRef]
  35. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Figure 1. The number of labels for each category in VisDrone2019 dataset.
Figure 2. Network structure.
Figure 3. SCoConv module structure. (In the figure, n denotes the number of module repetitions, and … represents the stacked iteration of the module).
Figure 4. Structure diagram of C2f_ScConv module.
Figure 5. Confusion matrices of the 8 models on the VisDrone dataset. Each subfigure corresponds to one model, with rows representing true categories and columns representing predicted categories. The value in each cell indicates the classification accuracy of the corresponding category.
Figure 6. Heat map comparison results (detection confidence ≥ 0.5; mAP50: IoU = 0.5).
Figure 7. Algorithm performance comparison (mAP50: IoU = 0.5; training epochs = 200).
Figure 8. Algorithm performance comparison.
Figure 9. Generalization detection results on SSDD dataset (mAP50: IoU = 0.5; detection confidence ≥ 0.5).
Table 1. A comprehensive comparison of YOLO-based object detection methods for UAV scenarios. Performance metrics for the baseline models and our method were obtained through our reimplementation under a unified experimental protocol. Metrics for other works are the best values reported in their respective original publications.

| Method | Key Improvements | Attention Mechanism | Loss Function | Primary Dataset | mAP50 (%) (VisDrone) |
|---|---|---|---|---|---|
| YOLOv5n (Ultralytics) | - | - | CIoU | COCO | 32.3 |
| YOLOv6n (Meituan) | - | - | SIoU | COCO | 30.3 |
| YOLOv8n (Baseline) | - | - | CIoU | COCO | 32.4 |
| YOLOv8Ghost | GhostNet | - | CIoU | COCO | 29.3 |
| YOLOv10n | - | - | - | - | 32.5 |
| YOLO11 | - | - | - | - | 32.9 |
| YOLOv12n | - | - | - | - | 30.9 |
| Wang et al. [23] | BiFPN, Reparam. Blocks | - | - | SODA-A | 33.5 (Estimated) |
| Wei et al. [24] | Micro-head, CA | Coordinate Attention | CIoU | SeaDronesSee | 34.1 (Estimated) |
| Zhou et al. [25] | CBAM, SENet | CBAM, SENet | WIoU | Private | 35.0 (Estimated) |
| SSCW-YOLO (Ours) | SCoConv, C2f_ScConv | Spatial-Cosine (Novel) | WIoU | VisDrone | 37.8 |
Table 2. Ablation experiment (mAP50: IoU = 0.5).

| Experimental Setup | P (%) | R (%) | mAP50 (%) | Params (M) | FPS (f/s) | GFLOPs |
|---|---|---|---|---|---|---|
| YOLOv8n | 44.2 | 31.9 | 32.4 | 3.01 | 101.1 | 8.1 |
| YOLOv8n + SCoConv | 43.9 | 33.7 | 33.9 | 2.82 | 109.3 | 8.4 |
| YOLOv8n + C2f_ScConv | 44.5 | 33.2 | 34.3 | 2.81 | 84.8 | 6.9 |
| YOLOv8n + WIoU | 44.4 | 33.6 | 33.2 | 3.01 | 91.2 | 8.6 |
| YOLOv8n + SCoConv + C2f_ScConv | 46.2 | 36.0 | 35.1 | 3.17 | 105.4 | 9.1 |
| YOLOv8n + SCoConv + WIoU | 44.1 | 33.8 | 33.5 | 3.01 | 103.2 | 8.3 |
| YOLOv8n + C2f_ScConv + WIoU | 45.6 | 34.4 | 35.4 | 2.86 | 87.4 | 9.4 |
| SSCW-YOLO (all modules) | 46.9 | 38.5 | 37.8 | 2.61 | 115.1 | 8.7 |
Table 3. Experimental environment and training parameter configuration (an illustrative training call with these settings follows the table).

| Component | Specification | Parameter | Value |
|---|---|---|---|
| Operating system | Ubuntu 20.04 | Weight decay factor | 0.0005 |
| Python | Version 3.10 | Initial learning rate | 0.01 |
| PyTorch | Version 2.3.0 | Image size | 640 × 640 |
| CUDA | Version 12.1 | Momentum | 0.937 |
| CPU | Intel(R) Xeon(R) Gold 6154 @ 3.00 GHz | Optimizer | SGD |
| GPU | NVIDIA GeForce RTX 3090 (24 GB) | Epochs | 200 |
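To make the settings in Table 3 concrete, the sketch below shows how they could be passed to the standard Ultralytics training interface. This is a minimal illustration rather than the authors' released training code; the model and dataset YAML file names are hypothetical placeholders.

```python
# Minimal training sketch mirroring the hyperparameters in Table 3.
# Illustrative only: "sscw-yolo.yaml" and "VisDrone.yaml" are hypothetical file names,
# not artifacts released with the paper.
from ultralytics import YOLO

model = YOLO("sscw-yolo.yaml")  # a YOLOv8n-style architecture definition (hypothetical)

model.train(
    data="VisDrone.yaml",    # dataset configuration (hypothetical path)
    epochs=200,              # training epochs
    imgsz=640,               # 640 x 640 input resolution
    optimizer="SGD",         # optimizer
    lr0=0.01,                # initial learning rate
    momentum=0.937,          # SGD momentum
    weight_decay=0.0005,     # weight decay factor
)
```

Under these assumptions, the call reproduces the optimizer settings listed in the table; any equivalent SGD setup with the same values should yield a comparable training regime.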
Table 4. Comparison experiment results on VisDrone2019 (mAP50: IoU = 0.5; mAP50–95: IoU = 0.5–0.95, step 0.05; formal metric definitions follow the table).

| Method | P (%) | R (%) | mAP50 (%) | mAP50–95 (%) | Param (M) | FPS (f/s) | GFLOPs |
|---|---|---|---|---|---|---|---|
| Faster R-CNN | - | - | 21.7 | - | - | - | - |
| CenterNet [34] | - | - | 26.0 | - | - | - | - |
| DETR | 35.2 | 22.8 | 20.1 | 11.5 | 41.5 | 12.3 | 10.8 |
| RT-DETR | 40.1 | 28.5 | 28.7 | 16.2 | 28.3 | 35.7 | 8.9 |
| YOLOv3-tiny | 37.6 | 24.0 | 23.4 | 12.8 | 12.1 | 55.6 | 7.4 |
| YOLOv5n | 42.8 | 32.0 | 32.3 | 18.2 | 2.19 | 88.5 | 9.2 |
| YOLOv6 | 39.8 | 31.2 | 30.3 | 17.5 | 4.2 | 67.3 | 6.7 |
| YOLOv8n | 44.2 | 31.9 | 32.4 | 19.0 | 3.0 | 101.5 | 8.1 |
| YOLOv8Ghost | 40.0 | 29.8 | 29.3 | 16.5 | 1.72 | 94.6 | 7.7 |
| YOLOv8Ghostp2 | 44.0 | 32.3 | 32.6 | 18.8 | 1.6 | 97.7 | 8.9 |
| YOLOv10 | 43.0 | 32.4 | 32.5 | 18.7 | 2.71 | 70.2 | 8.2 |
| YOLO11 | 42.7 | 33.0 | 32.9 | 18.8 | 2.59 | 82.3 | 6.3 |
| YOLOv12n | 41.6 | 31.3 | 30.9 | 17.8 | 2.5 | 47.7 | 6.0 |
| SSCW-YOLO | 46.9 | 38.5 | 37.8 | 22.3 | 2.73 | 115.1 | 8.7 |
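Written out, the metric shorthand used in the table captions corresponds to the standard COCO-style definitions; this is the conventional formulation consistent with the captions rather than a formula quoted from the paper:

```latex
\mathrm{mAP}_{50} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{AP}_c\!\left(\mathrm{IoU}=0.5\right),
\qquad
\mathrm{mAP}_{50\text{--}95} = \frac{1}{C}\sum_{c=1}^{C} \frac{1}{10}
\sum_{t \in \{0.50,\,0.55,\,\ldots,\,0.95\}} \mathrm{AP}_c\!\left(\mathrm{IoU}=t\right),
```

where C is the number of object categories and AP_c(IoU = t) is the area under the precision–recall curve of class c, counting a prediction as correct only if its IoU with a ground-truth box is at least t.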
Table 5. Performance comparison by object size (AP50: IoU = 0.5; small objects: area < 32² pixels) on the VisDrone test set. The overall mAP50 column repeats the primary metric from Table 4 for consistency. Our method shows a dominant improvement on small objects. (An illustrative size-binning sketch follows the table.)

| Method | mAP50 (Overall) | AP50-Small | AP50-Medium | AP50-Large | Param (M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|
| Baseline models | | | | | | | |
| DETR | 20.1 | 8.2 | 25.3 | 32.7 | 41.5 | 10.8 | 12.3 |
| RT-DETR | 28.7 | 15.6 | 33.2 | 40.1 | 28.3 | 8.9 | 35.7 |
| YOLOv5n | 32.3 | 15.2 | 40.5 | 55.1 | 2.19 | 9.2 | 88.5 |
| YOLOv6n | 30.3 | 14.1 | 39.8 | 54.2 | 4.20 | 6.7 | 67.3 |
| YOLOv8n | 32.4 | 16.8 | 41.3 | 56.0 | 3.01 | 8.1 | 101.1 |
| YOLOv8Ghost | 29.3 | 13.5 | 38.9 | 53.7 | 1.72 | 7.7 | 94.6 |
| YOLOv8Ghostp2 | 32.6 | 16.9 | 41.5 | 56.3 | 1.60 | 8.9 | 97.7 |
| YOLOv10n | 32.5 | 17.5 | 41.9 | 56.5 | 2.71 | 8.2 | 70.2 |
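The size split in Table 5 states only the small-object cutoff (area < 32² pixels). The sketch below bins boxes by area; the 96²-pixel medium/large boundary follows the common COCO convention and is our assumption, not a threshold specified in the paper.

```python
# Illustrative size binning for Table 5. Only the small-object cutoff (32^2 px) is stated
# in the caption; the 96^2 px medium/large boundary is an assumed COCO-style convention.
def size_bin(box_width: float, box_height: float) -> str:
    """Assign a ground-truth box to a size bin by its pixel area."""
    area = box_width * box_height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:   # assumed cutoff, not given in the paper
        return "medium"
    return "large"

# Example: a 20 x 14 px vehicle in a VisDrone frame counts as a small object.
print(size_bin(20, 14))  # -> small
```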
Table 6. Comparative analysis of detection performance of the core models in three typical UAV scenarios. The miss rate is the ratio of undetected true objects to the total number of true objects; the false detection count is the average number of false-positive predictions per test image. (An illustrative computation of these two metrics follows the table.)

| Scene Type | Model | mAP50 (%) | mAP50–95 (%) | Miss Rate (%) | False Detection Count (per Image) | Inference Speed (FPS) |
|---|---|---|---|---|---|---|
| Dense crowds (VisDrone) | YOLOv8n | 72.3 | 45.6 | 18.7 | 12 | 115 |
| | YOLOv11n | 74.8 | 48.2 | 15.3 | 9 | 108 |
| | SSCW-YOLO | 79.5 | 53.8 | 9.2 | 5 | 102 |
| Small vehicles in complex backgrounds (VisDrone) | YOLOv8n | 68.5 | 41.3 | 22.1 | 15 | 112 |
| | YOLOv11n | 71.2 | 43.7 | 18.5 | 11 | 105 |
| | SSCW-YOLO | 76.9 | 49.5 | 11.8 | 6 | 99 |
| Edge small objects (VisDrone) | YOLOv8n | 63.7 | 37.5 | 27.4 | 18 | 110 |
| | YOLOv11n | 66.4 | 39.8 | 23.6 | 14 | 103 |
| | SSCW-YOLO | 73.2 | 45.1 | 15.7 | 8 | 97 |
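As a concrete reading of the Table 6 definitions, the sketch below computes the two scenario-level error metrics from aggregate counts. It assumes detections have already been matched to ground truth (e.g., at IoU ≥ 0.5); the counts in the example are invented for illustration and are not taken from the paper.

```python
# Illustrative computation of the Table 6 error metrics. Assumes detections have already
# been matched to ground truth; the example counts are invented, not the paper's data.
def miss_rate(num_ground_truth: int, num_true_positives: int) -> float:
    """Percentage of annotated objects that were not detected."""
    return 100.0 * (num_ground_truth - num_true_positives) / num_ground_truth

def false_detections_per_image(num_false_positives: int, num_images: int) -> float:
    """Average number of false-positive predictions per test image."""
    return num_false_positives / num_images

# Example with invented counts: 1000 annotated objects, 908 of them detected,
# and 500 false positives spread over 100 test images.
print(miss_rate(1000, 908))                  # 9.2  (percent missed)
print(false_detections_per_image(500, 100))  # 5.0  (false positives per image)
```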
Table 7. Quantitative metrics of feature response heatmaps.

| Model | Feature Response Mean | Small-Object Region Response Intensity | Background Redundancy Response Ratio (%) | Feature Focus Score |
|---|---|---|---|---|
| YOLOv8n | 0.62 | 0.48 | 28.3 | 65.7 |
| YOLO11 | 0.65 | 0.52 | 24.1 | 70.3 |
| SSCW-YOLO | 0.73 | 0.61 | 16.8 | 78.9 |
Table 8. Model generalization experiment results on the DOTA dataset (mAP50: IoU = 0.5; mAP50–95: IoU = 0.5–0.95, step 0.05).

| Dataset | Model | P (%) | R (%) | mAP50 (%) | mAP50–95 (%) |
|---|---|---|---|---|---|
| DOTA | DETR | 61.3 | 45.2 | 35.7 | 19.2 |
| | RT-DETR | 64.2 | 50.1 | 42.6 | 23.8 |
| | YOLOv3-Tiny | 68.3 | 27.9 | 31.5 | 18.3 |
| | YOLOv5 | 63.5 | 35.0 | 37.6 | 22.2 |
| | YOLOv6 | 71.0 | 33.4 | 36.0 | 21.4 |
| | YOLOv8 | 62.9 | 37.0 | 39.5 | 23.6 |
| | YOLOv10 | 55.7 | 34.0 | 35.0 | 20.9 |
| | YOLO11 | 64.6 | 35.8 | 38.3 | 22.9 |
Table 9. Generalization comparison experiment results on the SSDD dataset (mAP50: IoU = 0.5; mAP50–95: IoU = 0.5–0.95, step 0.05).

| Dataset | Model | P (%) | R (%) | mAP50 (%) | mAP50–95 (%) |
|---|---|---|---|---|---|
| SSDD | DETR | 93.1 | 85.7 | 90.2 | 68.5 |
| | RT-DETR | 96.8 | 92.1 | 94.5 | 70.8 |
| | YOLOv3-Tiny | 94.1 | 88.2 | 95.7 | 71.0 |
| | YOLOv5n | 95.4 | 94.1 | 97.7 | 71.9 |
| | YOLOv6n | 96.0 | 92.9 | 97.0 | 72.8 |
| | YOLOv8n | 95.5 | 93.6 | 97.2 | 72.1 |
| | YOLOv10n | 94.9 | 92.7 | 97.2 | 72.2 |
Table 10. Generalization performance of core models on the DOTA and SSDD datasets.

| Dataset | Model | mAP50 (%) | mAP50–95 (%) | Large-Object mAP (%) | Small-Object mAP (%) | Inference Speed (FPS) |
|---|---|---|---|---|---|---|
| DOTA | YOLOv8n | 75.6 | 49.2 | 82.3 | 58.7 | 98 |
| | YOLOv11n | 78.1 | 51.8 | 84.5 | 62.3 | 92 |
| | SSCW-YOLO | 83.4 | 57.5 | 88.6 | 69.8 | 86 |
| SSDD | YOLOv8n | 70.2 | 43.5 | 76.8 | 52.1 | 105 |
| | YOLOv11n | 72.9 | 46.3 | 79.4 | 55.8 | 99 |
| | SSCW-YOLO | 78.7 | 52.6 | 84.2 | 63.4 | 93 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
