DCS-YOLOv8: A Lightweight Context-Aware Network for Small Object Detection in UAV Remote Sensing Imagery

Zhao, Xiaozheng; Yang, Zhongjun; Zhao, Huaici

doi:10.3390/rs17172989

by Xiaozheng Zhao ¹, Zhongjun Yang ¹ and Huaici Zhao ²,*

¹ School of Information Engineering, Shenyang University of Chemical Technology, Shenyang 110142, China

² Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China

* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(17), 2989; https://doi.org/10.3390/rs17172989

Submission received: 28 June 2025 / Revised: 22 August 2025 / Accepted: 26 August 2025 / Published: 28 August 2025

(This article belongs to the Special Issue Advanced Artificial Intelligence and Deep Learning for Remote Sensing (3rd Edition))

Abstract

Small object detection in UAV-based remote sensing imagery is crucial for applications such as traffic monitoring, emergency response, and urban management. However, aerial images often suffer from low object resolution, complex backgrounds, and varying lighting conditions, leading to missed or false detections. To address these challenges, we propose DCS-YOLOv8, an enhanced object detection framework tailored for small target detection in UAV scenarios. The proposed model integrates a Dynamic Convolution Attention Mixture (DCAM) module to improve global feature representation and combines it with the C2f module to form the C2f-DCAM block. The C2f-DCAM block, together with a lightweight SCDown module for efficient downsampling, constitutes the backbone DCS-Net. In addition, a dedicated P2 detection layer is introduced to better capture high-resolution spatial features of small objects. To further enhance detection accuracy and robustness, we replace the conventional CIoU loss with a novel Scale-based Dynamic Balanced IoU (SDBIoU) loss, which dynamically adjusts loss weights based on object scale. Extensive experiments on the VisDrone2019 dataset demonstrate that the proposed DCS-YOLOv8 significantly improves small object detection performance while maintaining efficiency. Compared to the baseline YOLOv8s, our model increases precision from 51.8% to 54.2%, recall from 39.4% to 42.1%, $mAP_{0.5}$ from 40.6% to 44.5%, and $mAP_{0.5:0.95}$ from 24.3% to 26.9%, while reducing parameters from 11.1 M to 9.9 M. Moreover, real-time inference on RK3588 embedded hardware validates the model’s suitability for onboard UAV deployment in remote sensing applications.

Keywords:

UAV remote sensing; small object detection; YOLOv8; attention mechanism; dynamic loss; lightweight model; embedded deployment

1. Introduction

Rapid progress in Unmanned Aerial Vehicle (UAV) technology has facilitated its extensive utilization in various fields such as aerial photography, agricultural surveillance, space exploration, and others, due to its high precision, safety, and value for large-scale surveys [1,2]. The images captured by UAVs show a wide coverage, reduced target dimensions, and versatile viewing perspectives, offering crucial technical support to identify key areas, provide assistance, and monitor hazardous regions [3,4]. As a result, UAV aerial image object detection has significant potential in contemporary scientific research and practical applications. Deep learning-driven object detection surpasses conventional methods in accuracy, efficiency, and adaptability by enabling automated feature extraction, end-to-end optimization, and multiscale analysis, establishing itself as the predominant choice in both industrial applications (e.g., autonomous driving, security) and academia [5,6]. Despite its heavy reliance on extensive datasets and computational resources, the benefits of this approach far outweigh those of traditional methods in most real-world situations. However, challenges persist in detecting small targets, as these targets occupy a minimal pixel area in the image, possess limited detail information, and exhibit indistinct visual features. Similarly to human visual perception, where attention is naturally drawn to larger objects in images, object detectors tend to prioritize medium-to-large targets, leading to higher rates of missed and false detection for smaller objects [7,8]. Moreover, shallow layers in neural networks such as YOLOv8 can inadvertently filter out crucial spatial details necessary for detecting these small targets, resulting in data loss. Additionally, during the feature extraction process, small objects are at risk of being obscured by larger ones, leading to the loss of vital information required for accurate detection. 
Overcoming these obstacles is imperative for improving overall detection accuracy and reliability in practical applications.

Currently, the field of object detection is dominated by three main methodologies: two-stage models typified by the R-CNN series [9,10,11], end-to-end models like DETR-series [12,13,14,15], and single-stage models exemplified by the YOLO series [16,17,18]. Among the various iterations of YOLO, YOLOv8 has garnered attention because of its well-balanced performance in terms of accuracy and speed. Detection of targets in UAV images is challenging due to small targets, complex backgrounds, and varying altitudes. Traditional detectors struggle to maintain accuracy, especially for small objects. Missed detections in real-time UAV applications can lead to overlooking dangerous targets, while false positives increase manual review costs. Many recent approaches focus on feature extraction and post-processing, overlooking the spatial and scale characteristics of UAV images. To address these challenges, we propose DCS-YOLOv8, a lightweight detection framework with a new SCDown module and multi-scale feature enhancement for small target detection. Our method is optimized for embedded UAV deployment, balancing accuracy and efficiency. Based on these considerations, this study introduces several targeted innovations aimed at overcoming the above limitations and enhancing UAV small-target detection performance.

In response to these challenges, this article makes the following key contributions:

(1) We present the first lightweight architecture specifically designed for small object detection in UAV imagery: DCS-YOLOv8. Its ability to perceive small objects is jointly enhanced by three synergistic innovations:

P2 detection layer: A new high-resolution detection head (shallow feature map P2/4) is added to YOLOv8, which directly enhances the detail capture of targets smaller than 50 pixels.
C2f-DCAM module: For the first time, a Dynamic Convolution Attention Mixture (DCAM) module is embedded into the C2f structure. It jointly models local multi-scale convolution (Lepe branch) and global sparse attention (Attention branch), thus addressing the insufficient long-range dependence of traditional CNNs.
SCDown downsampling: A lightweight downsampling unit with spatial-channel decoupling is proposed, which reduces parameters while maintaining accuracy.

(2) SDBIoU Loss Function: SDBIoU is proposed as a replacement for CIoU, allowing for dynamic adjustment of loss weights based on the scale of targets. This approach addresses challenges associated with traditional loss functions, such as handling label noise, scale sensitivity, and ensuring convergence stability.

(3) Hardware Validation: The model is deployed on the RK3588 platform to achieve real-time small target detection, showcasing the practical efficacy of the proposed methodology in enhancing detection performance for real-world applications.

2. Materials and Methods

Object detection is a fundamental undertaking in computer vision that involves the precise localization and classification of objects in images or videos through the delineation of bounding boxes. The detection of small targets, a prominent area of interest and complexity in this domain, poses a significant challenge due to their limited spatial extent, resulting in a reduced pixel footprint. This diminutive size leads to a sparsity of features following convolutional processes, thereby heightening the likelihood of overlooking these targets during detection.

Conventional approaches such as R-CNN [10] and Fast R-CNN [9] are constrained by utilizing single-scale feature maps, thereby restricting their detection efficacy. The introduction of the Region Proposal Network (RPN) in Faster R-CNN [11] has addressed this limitation by automating the generation of candidate regions and sharing convolutional features among subsequent classification and regression networks. This innovation has led to enhancements in both speed and accuracy. End-to-end object detection frameworks, particularly the DETR series [12], have received increasing attention due to their elegant design and simplified pipeline. Variants such as Deformable DETR [13], DINO [14], and RT-DETR [15] improve convergence speed and detection accuracy, achieving competitive results on general benchmarks. However, these methods often suffer from degraded performance when detecting small objects, primarily due to insufficient local detail preservation and the inherent challenge of sparse object representations in UAV imagery. YOLO encounters challenges in representing global contextual features due to its reliance on convolutional layers, which inherently capture local neighborhood information. While the integration of multiple convolutional layers can enlarge the receptive field, they fall short in establishing long-range pixel relationships, in contrast to the self-attention mechanisms in Transformers [19,20,21,22], which directly model global dependencies. Notably, YOLO may struggle to exploit contextual correlations for joint reasoning when objects are spatially distant in an image. Furthermore, YOLO’s partitioning of images into fixed grids (e.g., 7 × 7 or denser) results in a lack of global information sharing across predictions within these grids. Consequently, small targets confined to a few grids suffer from inadequate global context, leading to increased instances of missed or erroneous detection.

In recent years, researchers have conducted extensive studies aimed at improving the accuracy of small target detection. For example, YOLOv3 [23] was the first YOLO version to adopt the Feature Pyramid Network (FPN) [24] design, enabling multi-scale object detection and feature fusion and thereby significantly enhancing small target detection capabilities. Nevertheless, global interaction across feature pyramid levels remains limited. The fusion of high-level semantic features (pertaining to large objects) and low-level detailed features (associated with small objects) relies primarily on simple operations such as upsampling or concatenation, rather than globally optimized integration. Subsequent iterations such as YOLOv4 and YOLOv5 integrated the Path Aggregation Network (PAN) to establish a bidirectional feature pyramid (PAN-FPN), employing concatenation (in YOLOv4) or weighted fusion (in YOLOv5) to combine multi-level features, thereby further enhancing the localization of small targets. Liu et al. [25] integrated Transformers with YOLO by incorporating self-attention layers into either the backbone network or detection heads, explicitly capturing global dependencies. Wei et al. [26] introduced Enhanced-YOLOv8, a model tailored for small target detection, which adds a specialized small target detection layer to the original effective feature layers of YOLOv8. Expanding on the conventional Convolutional Block Attention Module (CBAM), they introduced a Position Attention Module (PAM) and a Fusion Convolutional Block Attention Module (FCBAM), in conjunction with a Semantic Fusion Network (SFN) based on residual networks, resulting in a substantial enhancement in detection accuracy. Xu et al. [27] devised a small target detection algorithm for UAV images using an enhanced YOLOv8 model (YOLOv8-MPEB).
This method replaces the CSPDarknet53 backbone with a lightweight MobileNetV3 backbone, integrates an Efficient Multi-scale Attention (EMA) mechanism into the C2f module, and incorporates a Bidirectional Feature Pyramid Network (BiFPN) in the neck section. These adjustments effectively mitigate detection errors stemming from scale variations and complex scenes, thereby augmenting the model’s generalization capabilities.

YOLOv8 [18] is extensively utilized in various fields such as autonomous driving, medical imaging analysis, and security surveillance. It has emerged as a leading model in object detection due to enhancements in both architecture and algorithms. The model’s ability to achieve a balance between real-time performance and high precision makes it well suited for applications necessitating swift and accurate detection. Among the YOLO family, YOLOv8s was chosen as our baseline because it offers the best trade-off between accuracy and speed for resource-constrained UAV platforms, exposes modular components (C2f, PAN-FPN, anchor-free head) that facilitate the integration of our P2 layer and DCAM module, and serves as the de facto reference in recent UAV-oriented studies, ensuring fair and reproducible comparisons.

The network architecture of DCS-YOLOv8, illustrated in Figure 1, builds upon and enhances the YOLOv8s framework. The primary objective is to improve the detection performance of the YOLOv8s model for small objects in drone images, which remains a significant challenge. Our study introduces several key enhancements. Firstly, we incorporate a dedicated small-object detection layer to augment the model’s capability in capturing features of small objects. Secondly, we introduce the DCAM module, which is integrated with the C2f module to create the C2f-DCAM module. This integration enhances global context feature representation and strengthens the fusion of local and global features. Additionally, we implement the SCDown module to accelerate computation and reduce parameters. Lastly, we substitute CIoU with SDBIoU, which dynamically adjusts the loss weights for targets of varying sizes, thereby addressing the limitations of conventional loss functions related to label noise, scale sensitivity, and convergence stability.

2.1. P2: Small Object Detection Head

To enhance the model’s performance across various scales, particularly for small and medium-sized objects, we introduce an additional detection layer (P2) specifically designed for small targets. The P2 layer, functioning as a lower-level feature map, offers detailed information and high-resolution features, enabling the model to detect both large and small objects effectively, thus enhancing its capability in small object detection. A key challenge in object detection is the wide range of target sizes. By integrating with other layers such as P3 and P4, the P2 layer empowers the YOLOv8 model to deliver superior overall performance in handling objects of diverse scales. This hierarchical architecture allows YOLOv8 to detect targets spanning from minuscule to extremely large sizes while maintaining a high level of accuracy.
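To make the resolution argument concrete, the following minimal sketch computes the grid size of each detection level for a 640 × 640 input and the number of cells a small object spans. The stride of 4 for P2 follows the "P2/4" notation used in this article; the strides 8, 16, and 32 for P3 to P5 and the 16-pixel example object are assumptions based on the usual YOLOv8 pyramid convention, not values reported here.

```python
# Strides per detection level. P2 = 4 comes from the "P2/4" notation;
# P3-P5 strides are the conventional YOLOv8 values (an assumption).
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}


def grid_size(img_size: int, stride: int) -> int:
    """Number of cells along one side of the feature map."""
    return img_size // stride


def cells_covered(obj_px: float, stride: int) -> float:
    """Side length of an object measured in feature-map cells."""
    return obj_px / stride


for name, stride in STRIDES.items():
    g = grid_size(640, stride)
    span = cells_covered(16, stride)  # hypothetical 16-pixel-wide object
    print(f"{name}: stride {stride}, grid {g}x{g}, 16 px object spans {span:.2f} cells")
```

Under these assumptions, a 16-pixel object covers four cells per side on P2 but only half a cell on P5, which illustrates why the high-resolution head helps with small targets.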

2.2. DCS-Net Architecture

2.2.1. DCAM Module

Accurately modeling both local and global contextual information is critical for detecting small or obscured objects in UAV imagery. However, traditional convolutional architectures often fall short in this regard due to their inherently limited receptive fields. To address this challenge, we propose the Dynamic Convolution Attention Mixture (DCAM) module, as illustrated in Figure 2. DCAM is designed to simultaneously enhance local feature extraction and global dependency modeling within CNNs by integrating dilated convolution, multi-branch reparameterization, and multi-head attention mechanisms. The DCAM consists of two parallel branches, the Lepe branch and the Attention branch, each responsible for a different aspect of feature representation. The two branches are described below:

Lepe Branch

The Lepe branch is built upon the Dilated Reparam Block (DBR) [28], which fuses large-kernel convolutions with parallel small-kernel convolutions employing various dilation rates. This design enables the module to capture both fine-grained local structures and broader spatial patterns efficiently. During training, the outputs from the different branches are combined to enhance multi-scale representation. After training, these branches are consolidated into a single equivalent convolution via reparameterization, thus reducing inference complexity without sacrificing performance. Taking the input feature map $X^{input}$ and applying the convolution reparameterization block, we enhance the local multi-scale features, as given in Equation (1):

$$X_{Lepe} = \mathrm{BN}(\mathrm{DilatedReparamBlock}(X^{input})) \tag{1}$$

where $X_{Lepe}$ is the output of the Lepe branch and $X^{input}$ is the input feature. The Dilated Reparam Block (DBR) consists of multiple convolution branches with different dilation rates, which are merged into a single convolution operation through reparameterization. $\mathrm{BN}$ denotes batch normalization, which is employed to accelerate training convergence.
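The merging step can be illustrated in one dimension. The sketch below is a simplified assumption rather than the block used in this work: it omits batch normalization and uses hypothetical kernel values, but it shows why a dilated small-kernel branch can be folded into a large kernel, since a dilated kernel is equivalent to a dense kernel with zeros inserted between its taps.

```python
def conv1d_same(x, kernel, dilation=1):
    """Zero-padded 'same' 1D cross-correlation with an odd-length kernel."""
    k = len(kernel)
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - k // 2) * dilation
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def dilated_to_dense(kernel, dilation):
    """Insert (dilation - 1) zeros between taps to get the equivalent dense kernel."""
    dense = [0.0] * ((len(kernel) - 1) * dilation + 1)
    for j, w in enumerate(kernel):
        dense[j * dilation] = w
    return dense


def merge_branches(large, small, dilation):
    """Fold a dilated small-kernel branch into the large kernel (centre-aligned)."""
    dense = dilated_to_dense(small, dilation)
    assert len(dense) <= len(large), "dilated kernel must fit inside the large kernel"
    offset = (len(large) - len(dense)) // 2
    merged = list(large)
    for j, w in enumerate(dense):
        merged[offset + j] += w
    return merged


# Hypothetical signal and kernels, chosen only for illustration.
x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 4.0, 1.5]
large = [0.1, -0.2, 0.3, 0.4, 0.3, -0.2, 0.1]  # 7-tap large kernel
small = [0.5, 1.0, -0.5]                        # 3-tap kernel used with dilation 2

# Training-time view: two parallel branches whose outputs are summed.
two_branch = [a + b for a, b in zip(conv1d_same(x, large),
                                    conv1d_same(x, small, dilation=2))]

# Inference-time view: a single merged kernel.
merged = merge_branches(large, small, dilation=2)
single = conv1d_same(x, merged)
```

Both views produce the same output up to floating-point rounding, so the extra branches add no cost at inference time.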

Attention Branch

In the attention branch, we first expand the input feature tensor $X^{input} \in R^{B \times C \times H \times W}$ (a 2D image feature) to $F \in R^{B \times 3C \times 1 \times H \times W}$ with a $1 \times 1$ convolution and then generate the query (Q), key (K), and value (V) by depth-wise convolution (DWConv), with $Q, K, V \in R^{C \times H \times W}$. Next, we reshape Q into $Q^{*} \in R^{C \times HW}$ and K into $K^{*} \in R^{C \times HW}$, compute the inner product of $Q^{*}$ and $K^{*}$, apply dynamic temperature modulation with the temperature parameter $\tau$, and apply the Softmax function to generate the attention map $A \in R^{C \times C}$. This reduces the computational burden compared with computing the much larger regular attention map of size $HW \times HW$. Then, we apply the attention map to the reshaped value $V^{*}$ and project the result with a $1 \times 1$ convolution, as shown in Equations (2) and (3):

$$X_{att} = W_{1 \times 1}\,\mathrm{Attention}(Q^{*}, K^{*}, V^{*}) + X^{input} \tag{2}$$

$$\mathrm{Attention}(Q^{*}, K^{*}, V^{*}) = V^{*} \cdot \mathrm{Softmax}\left(\frac{Q^{*} (K^{*})^{T}}{\tau}\right) \tag{3}$$

where $X_{att}$ is the output of the attention branch, $W_{1 \times 1}$ denotes the $1 \times 1$ convolution, $\mathrm{Attention}(Q^{*}, K^{*}, V^{*})$ is the output of the scaled dot-product attention module, the superscript $T$ denotes transposition, and $\tau$ is the temperature scaling factor.
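The following sketch illustrates the channel-wise attention described above in plain Python. It is an illustration under stated assumptions, not the implementation used in this work: the tensors are tiny hand-written lists, the batch dimension and the $1 \times 1$ projection are omitted, $\tau$ is a fixed constant, and $V^{*}$ is stored as a $C \times HW$ matrix so that the attention map is applied as $A V^{*}$.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax over each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def channel_attention(q, k, v, tau=1.0):
    """q, k, v are C x HW matrices; returns the C x C map and the C x HW output."""
    scores = matmul(q, transpose(k))                    # C x C, not HW x HW
    scores = [[s / tau for s in row] for row in scores]
    attn = softmax_rows(scores)
    return attn, matmul(attn, v)


# Hypothetical example with C = 2 channels and HW = 4 spatial positions.
q = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
k = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
v = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
attn, out = channel_attention(q, k, v, tau=1.0)
```

The saving comes from the size of the attention map. For a hypothetical 80 × 80 feature map with 128 channels, the $C \times C$ map has 16,384 entries, whereas an $HW \times HW$ map would have 40,960,000.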

Finally, the output of the DCAM module is computed as:

$$X^{output} = X_{Lepe} + X_{att} \tag{4}$$

2.2.2. C2f-DCAM

As shown in Figure 3, we integrate the DCAM module into the C2f block of YOLOv8 to form the C2f-DCAM module. The original C2f module enhances feature reuse and gradient propagation through residual connections and a multi-branch structure. By embedding DCAM, we further boost its capacity to fuse local and global features, which is particularly advantageous for small object detection. This hybrid module enables richer spatial representations and deeper contextual understanding, allowing the network to better distinguish subtle object boundaries and fine details in cluttered aerial scenes.

2.2.3. SCDown Module

In standard YOLO architectures, downsampling is typically performed using 3 × 3 convolutions with stride 2, which increases channel depth while reducing spatial resolution. However, this approach introduces significant computational overhead. To address this, we design the SCDown module, a lightweight downsampling block that separates spatial and channel operations for better efficiency. SCDown comprises two sequential components:

Pointwise Convolution (1 × 1): Adjusts the channel dimension from $C_{1}$ to $C_{2}$, reducing redundancy and emphasizing salient features.

Depthwise Convolution (k × k, stride s): Performs channel-wise convolution for spatial downsampling, enabling the network to capture scale-specific information with a reduced parameter count.

This decoupled design minimizes information loss during downsampling by preserving fine-grained spatial cues while significantly lowering the computational burden. As a result, SCDown enhances the network’s ability to process multi-scale features effectively, which is vital for detecting small or overlapping targets in UAV imagery.
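A rough parameter count shows where the savings come from. The sketch below compares a standard 3 × 3 stride-2 convolution with the pointwise-plus-depthwise decomposition described above. The channel widths are hypothetical, and bias terms and normalization layers are ignored.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution (bias and BatchNorm ignored)."""
    return (c_in // groups) * c_out * k * k


def standard_down_params(c1, c2, k=3):
    """Single dense k x k convolution mapping C1 to C2 channels."""
    return conv_params(c1, c2, k)


def scdown_params(c1, c2, k=3):
    """1x1 pointwise (C1 -> C2) followed by a k x k depthwise convolution."""
    pointwise = conv_params(c1, c2, 1)
    depthwise = conv_params(c2, c2, k, groups=c2)  # one k x k filter per channel
    return pointwise + depthwise


c1, c2 = 256, 512  # hypothetical channel widths
dense = standard_down_params(c1, c2)
decoupled = scdown_params(c1, c2)
print(f"standard: {dense:,}  SCDown-style: {decoupled:,}  ratio: {dense / decoupled:.1f}x")
```

For these example widths the decoupled form needs 135,680 weights against 1,179,648 for the dense convolution. The stride does not affect the weight count, only the amount of computation.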

2.2.4. Overall Architecture

Figure 4 shows the overall network architecture of DCS-Net, which is designed for an input size of 640 × 640 and consists of four stages that extract and refine image features in turn. To further verify the correspondence between the structural design of each stage and the network’s ability to perceive small targets, we visualized the feature maps of each key stage and generated the heat maps shown in Figure 4. The results show significant differences in the spatial attention area and feature response intensity across stages, reflecting the synergy and complementarity between module functions:

Stage 1: The input image undergoes convolution and downsampling, reducing the feature map to 1/4 of the input resolution while increasing the number of channels to 64–128. This process performs initial low-level feature extraction, focusing on aspects like edges and textures. The resulting heat map primarily highlights the edge contours in the image, exhibiting heightened sensitivity to low-level features such as textures and boundaries. This stage captures intricate spatial details, establishing the groundwork for further feature extraction.
Stage 2: The feature map is further reduced to 1/8 of the input resolution to emphasize the extraction of local structural details. This process helps the model discern the boundary and shape characteristics of small targets. The heat map progressively narrows down to the target region, displaying pronounced highlights around small targets. This suggests that the network is differentiating foreground from background areas and developing semantic recognition capabilities.
Stage 3: The feature map is reduced further and processed by the C2f-DCAM module. The embedded Dynamic Convolution Attention Mixture (DCAM) module enhances the contextual semantic representation of small targets by combining parallel local enhancement (Lepe branch) and global dependency modeling (Attention branch). These mechanisms notably improve target detection in occluded, dense, or cluttered scenes. The heat map concentrates on specific areas: it significantly amplifies responses in regions containing small targets while preserving background structural information. This observation suggests that the DCAM module steers the network towards establishing long-range dependencies on critical regions via a global attention mechanism, thereby improving the discrimination of small targets against intricate backgrounds.
Stage 4: The feature map is compressed further, departing from conventional large-stride convolution. The SCDown module separates the operation into a pointwise convolution for channel adjustment and a depthwise convolution for spatial downsampling, reducing the parameter count while preserving essential spatial structures and limiting information loss. Despite the further reduction in heat map spatial resolution, high responsiveness to small target areas is preserved. This outcome is attributed to the SCDown module’s decoupled compression, which safeguards crucial spatial layout features and prevents excessive information loss. Finally, the SPPF (Spatial Pyramid Pooling-Fast) module fuses feature maps from different scales to improve adaptability to multi-scale objects, especially when large and small objects must be detected simultaneously.

The stages of DCS-Net have well-defined functions and are tightly integrated. The structural design significantly reduces model complexity while maintaining detection accuracy, enabling efficient real-time inference on edge devices such as the RK3588. The hierarchical arrangement of DCS-Net aligns closely with the spatial attention pattern of the heat maps. The design of each stage determines the attention span and discrimination capability for small targets, thereby jointly optimizing precise localization and efficient inference.

2.3. SDBIoU Loss Function

The YOLO loss function plays a crucial role in enhancing object detection accuracy by employing multi-task learning to refine bounding box localization, object confidence, and classification precision. While IoU-based loss functions are indispensable for assessing spatial overlap, their efficacy diminishes in the absence of overlap, resulting in minimal gradient signals. To mitigate this issue, several IoU-based loss functions have been devised, each characterized by distinct attributes and constraints.

DIoU (Distance-IoU) [29] serves as an enhanced regression loss function in the realm of object detection. It incorporates the concept of “centroid distance” into the loss computation, thereby augmenting the efficiency and precision of aligning predicted bounding boxes with their ground-truth counterparts. Nevertheless, it is important to note that DIoU primarily focuses on optimizing position and overlap metrics, neglecting the aspect of shape congruence. Consequently, this singular emphasis may result in bounding boxes that are aligned based on centroids but do not match in terms of shape, thereby compromising the overall quality of fitting. The DIoU formula is delineated in Equation (5):

$$L_{DIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} \tag{5}$$

where $IoU$ represents the intersection over union of the prediction and ground-truth boxes, $\rho(\cdot)$ denotes the Euclidean distance between their center points $b$ and $b^{gt}$, and $c$ indicates the diagonal length of the smallest enclosing box covering both prediction and ground-truth boxes.

CIoU (Complete-IoU) [30] is a loss function utilized in object detection for refining bounding box regression. It extends the conventional IoU metric by incorporating penalties for the distance between the centers of the predicted and ground-truth boxes, as well as for differences in the aspect ratio. This modification results in a more holistic evaluation of the discrepancy between the predicted and actual bounding boxes. Nevertheless, the dynamically adjusted weights in CIoU may fluctuate when IoU values are either very high or very low, thereby introducing instability during the training process. Furthermore, CIoU demonstrates reduced efficacy in accurately detecting objects with highly irregular shapes. The mathematical expression for CIoU is presented in Equation (6):

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\nu, \quad \nu = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega_{gt}}{h_{gt}} - \arctan\frac{\omega}{h}\right)^{2} \tag{6}$$

where $\alpha$ is the weight coefficient that controls the influence of $\nu$, and $\nu$ represents the aspect-ratio penalty, compensating for the difference in aspect ratios between prediction and ground-truth boxes. $(\omega, h)$ and $(\omega_{gt}, h_{gt})$ denote the width and height of the prediction and ground-truth boxes, respectively.

EIoU [31] is a refined loss function utilized in object detection for bounding-box regression, aiming to enhance both detection precision and training efficiency. It achieves this by breaking down geometric inaccuracies into three components: the intersection area, the distance between center points, and variations in width and height. Additionally, it incorporates a dynamic weighting mechanism to prioritize optimizing challenging instances. Despite its effectiveness in addressing size discrepancies, EIoU encounters difficulties related to anchor box expansion and slow convergence during regression. Its superior accuracy is counterbalanced by increased computational complexity, necessitating meticulous parameter adjustments and a higher implementation threshold. The calculation formula for EIoU is presented in Equation (7):

$$L_{EIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \frac{(\omega - \omega_{gt})^{2}}{\omega_{c}^{2}} + \frac{(h - h_{gt})^{2}}{h_{c}^{2}} \tag{7}$$

where $\omega_{c}$ and $h_{c}$ denote the width and height of the smallest enclosing box that covers both the prediction box and the ground-truth box.
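The three losses in Equations (5)–(7) can be compared side by side with a minimal sketch. Boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates, which is a convention chosen here for illustration. The weight $\alpha = \nu / ((1 - IoU) + \nu)$ follows the standard CIoU definition, which the text above does not spell out.

```python
import math


def box_geometry(p, g):
    """Shared terms for two boxes given as (x1, y1, x2, y2)."""
    pw, ph = p[2] - p[0], p[3] - p[1]
    gw, gh = g[2] - g[0], g[3] - g[1]
    iw = max(0.0, min(p[2], g[2]) - max(p[0], g[0]))
    ih = max(0.0, min(p[3], g[3]) - max(p[1], g[1]))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter)
    cw = max(p[2], g[2]) - min(p[0], g[0])  # enclosing box width
    ch = max(p[3], g[3]) - min(p[1], g[1])  # enclosing box height
    dx = (p[0] + p[2]) / 2 - (g[0] + g[2]) / 2
    dy = (p[1] + p[3]) / 2 - (g[1] + g[3]) / 2
    rho2 = dx * dx + dy * dy                # squared centre distance
    return iou, rho2, cw, ch, pw, ph, gw, gh


def diou_loss(p, g):
    iou, rho2, cw, ch, *_ = box_geometry(p, g)
    return 1 - iou + rho2 / (cw ** 2 + ch ** 2)


def ciou_loss(p, g):
    iou, rho2, cw, ch, pw, ph, gw, gh = box_geometry(p, g)
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0  # standard CIoU weight
    return 1 - iou + rho2 / (cw ** 2 + ch ** 2) + alpha * v


def eiou_loss(p, g):
    iou, rho2, cw, ch, pw, ph, gw, gh = box_geometry(p, g)
    return (1 - iou + rho2 / (cw ** 2 + ch ** 2)
            + (pw - gw) ** 2 / cw ** 2 + (ph - gh) ** 2 / ch ** 2)


# Two non-overlapping unit boxes: IoU is 0, but the distance term still
# provides a signal that plain IoU loss would lack.
pred, gt = (0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)
print(diou_loss(pred, gt), ciou_loss(pred, gt), eiou_loss(pred, gt))
```

In this example all three losses coincide because the boxes have the same width, height, and aspect ratio, so only the IoU and centre-distance terms contribute.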

We replace CIoU with SDBIoU (Scale-based Dynamic Balanced IoU) [32] to overcome challenges associated with traditional loss functions related to label noise, scale sensitivity, and convergence stability. SDBIoU dynamically adjusts the loss weights based on the scale of targets to address issues such as the disproportionate amplification of localization errors for small targets under fixed-weight loss calculations and the convergence oscillations caused by gradients from targets of different scales interfering with each other. Moreover, annotation errors, such as bounding box offsets, are disproportionately magnified by fixed-weight loss functions, compromising model robustness. The mathematical formulation of SDBIoU is presented in Equations (8)–(11), where it integrates IoU and centroid distance by introducing a dynamic coefficient $\delta$ to adjust the weight range. SDBIoU incorporates a scale-aware mechanism to calculate influence coefficients based on target area and implements collaborative optimization through separate scale loss (Sloss) and localization loss (Lloss) components.

$$L_{BS} = 1 - IoU + \alpha\nu, \quad L_{BL} = \frac{\rho^{2}(b_{p}, b_{gt})}{c^{2}} \tag{8}$$

where $L_{BS}$ and $L_{BL}$ are the scale loss and the localization loss, respectively, $IoU$ represents the intersection over union of the predicted and ground-truth bounding boxes (BBoxes), $\alpha\nu$ measures the aspect-ratio consistency of the BBoxes, $\rho(\cdot)$ is the Euclidean distance, $b_{p}$ and $b_{gt}$ are the centroids of the predicted BBox $B_{p}$ and the target BBox $B_{gt}$, and $c$ is the diagonal length of the smallest box enclosing the two BBoxes.

$$R_{oc} = \frac{\omega_{o} \times h_{o}}{\omega_{c} \times h_{c}} \tag{9}$$

where $R_{oc}$ represents the scaling factor between the original image and the current feature map, $(\omega_{o}, h_{o})$ denote the width and height of the original image, and $(\omega_{c}, h_{c})$ denote the width and height of the current feature map.

$$\beta_{B} = \min\left(\frac{B_{gt}}{B_{gtmax}} \times R_{oc} \times \delta,\ \delta\right) \tag{10}$$

where $\beta_{B}$ represents the influence coefficient of the BBox, $B_{gt}$ here denotes the area of the target BBox, $B_{gtmax} = 81$ is determined by the maximum size of an infrared small target (IRST) as defined by the International Society for Optics and Photonics (SPIE), and $\delta$ represents the dynamic coefficient.

$$\beta_{L_{BS}} = 1 - \delta + \beta_{B}, \quad \beta_{L_{BL}} = 1 + \delta - \beta_{B}, \quad L_{SDB} = \beta_{L_{BS}} \times L_{BS} + \beta_{L_{BL}} \times L_{BL} \tag{11}$$

where $\beta_{L_{BS}}$ and $\beta_{L_{BL}}$ are the influence factors of $L_{BS}$ and $L_{BL}$, respectively.
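Equations (8)–(11) can be traced step by step with the sketch below. It is a minimal illustration under several assumptions rather than the training implementation: boxes are (x1, y1, x2, y2) tuples, $B_{gt}$ is taken as the ground-truth box area in the same coordinates as the boxes, $\delta = 0.5$ is a hypothetical value, and $\alpha$ uses the standard CIoU weight.

```python
import math


def sdb_iou_loss(pred, gt, img_wh, feat_wh, delta=0.5, b_gt_max=81.0):
    """Return (loss, beta_bs, beta_bl) following Equations (8)-(11).

    delta=0.5 is a hypothetical setting; b_gt_max=81 is the value given in the text.
    """
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter)

    # Eq. (8): scale loss L_BS (IoU plus aspect-ratio term).
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0  # standard CIoU weight (assumption)
    l_bs = 1 - iou + alpha * v

    # Eq. (8): localization loss L_BL (normalized centre distance).
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    dx = (pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2
    l_bl = (dx * dx + dy * dy) / (cw ** 2 + ch ** 2)

    # Eq. (9): area ratio between the original image and the feature map.
    r_oc = (img_wh[0] * img_wh[1]) / (feat_wh[0] * feat_wh[1])

    # Eq. (10): influence coefficient, capped at delta.
    beta_b = min((gw * gh) / b_gt_max * r_oc * delta, delta)

    # Eq. (11): scale-dependent weights and the combined loss.
    beta_bs = 1 - delta + beta_b
    beta_bl = 1 + delta - beta_b
    return beta_bs * l_bs + beta_bl * l_bl, beta_bs, beta_bl


# Hypothetical tiny target (1 x 1) versus a large target (20 x 20), with R_oc = 1.
_, small_bs, small_bl = sdb_iou_loss((0, 0, 1, 1), (0.2, 0.2, 1.2, 1.2), (640, 640), (640, 640))
_, large_bs, large_bl = sdb_iou_loss((0, 0, 20, 20), (2, 2, 22, 22), (640, 640), (640, 640))
print(small_bs, small_bl, large_bs, large_bl)
```

With these settings the two weights always sum to 2. For the tiny target the localization term receives the larger weight, while for the large target $\beta_{B}$ saturates at $\delta$ and both weights equal 1.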

3. Results

This section begins by introducing the dataset utilized in the study, followed by descriptions of the experimental setting and training methodology. It also defines the evaluation metrics used to gauge model performance. The efficacy of the proposed approach is validated through comparative analysis against leading models, with YOLOv8s serving as the baseline. Furthermore, this section assesses the model’s efficacy in demanding real-world scenarios, including the detection of small objects situated far from the camera.

3.1. Dataset

The VisDrone2019 dataset [33], a renowned compilation of unmanned aerial vehicle (UAV) aerial photographs, was collaboratively developed by the Machine Learning and Data Mining Laboratory at Tianjin University and the AISKYEYE data mining group. This dataset consists of 288 video segments (amounting to 261,908 frames) and 10,209 still images. It was acquired using multiple cameras mounted on drones in over a dozen cities across China.

DCS-YOLOv8 was evaluated on the VisDrone2019 dataset, which offers a diverse and comprehensive collection of samples. The dataset’s image distribution, label count, and object size variability are depicted in Figure 5, illustrating several inherent complexities. Target detection is challenging because of dense clustering, motion blur, small target dimensions, and ambiguity between classes. For instance, distinguishing between branches and individuals can be difficult in certain scenes. Moreover, aerial imagery from UAVs introduces large variations in object size and frequent occlusion, further complicating detection. The dataset is also class-imbalanced: the Car category has 144,867 annotations, while the Awning-tricycle category has only 3246. This imbalance calls for a detection model that handles underrepresented classes robustly. Additionally, the distribution of anchor sizes reveals that although medium and large anchors predominate in categories like trucks, cars, and buses, the majority of objects are smaller than 100 × 100 pixels, and a significant portion are smaller than 50 × 50 pixels, which makes small object detection harder still.

A crucial feature of the VisDrone2019 dataset is the inclusion of numerous small objects of varying sizes, depicted from multiple viewpoints and within different contexts. This diversity increases the dataset’s intricacy and difficulty level, distinguishing it as a particularly demanding benchmark in the realm of computer vision.

Figure 5 illustrates the process of manually annotating objects in the VisDrone2019 dataset.

3.2. Experimental Environment and Training Strategy

The experiments were conducted on an Ubuntu 22.04 system equipped with an Intel i7-12800HX CPU, an NVIDIA A10 GPU, and 32 GB of RAM. The deep learning environment was based on PyTorch 2.3.1, with CUDA 12.1 for computational acceleration and Python 3.10 as the programming language. Training ran for 150 epochs with a batch size of 16 and an input image size of 640. All other hyperparameters were kept at their default values, and the baseline model was YOLOv8s.
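For readers who want to reproduce a comparable baseline run, the settings above could be expressed as in the sketch below. This assumes the Ultralytics Python package is used, which this section does not state; the dataset and model file names are likewise assumptions, and the helper name is hypothetical.

```python
# Settings stated in the text; everything else stays at the framework defaults.
TRAIN_SETTINGS = {
    "epochs": 150,  # number of training epochs
    "batch": 16,    # batch size
    "imgsz": 640,   # input image size
}


def train_baseline(data_yaml="VisDrone.yaml", model_cfg="yolov8s.yaml"):
    """Train a YOLOv8s baseline; both file names are assumptions."""
    # Imported lazily so the settings can be inspected without the package.
    from ultralytics import YOLO

    model = YOLO(model_cfg)
    return model.train(data=data_yaml, **TRAIN_SETTINGS)
```

Calling `train_baseline()` starts training only if the package and the dataset are available locally.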

3.3. Evaluation Metrics

To assess our enhanced model’s detection performance, we used several evaluation metrics: precision, recall, $mAP_{0.5}$, $mAP_{0.5:0.95}$, and the model parameter count. The formulas for these metrics are as follows:

Precision is the ratio of true positives to the total predicted positives, as shown in Equation (12):

\mathrm{Precision} = \frac{TP}{TP + FP}

(12)

True positives ($TP$) are instances where the model correctly predicts positive cases. False positives ($FP$) are cases where the model incorrectly predicts positive instances. False negatives ($FN$) refer to instances where the model fails to detect actual positive cases.

Recall measures the ratio of correctly predicted positive samples to all actual positive samples, as shown in Equation (13):

\mathrm{Recall} = \frac{TP}{TP + FN}

(13)

Average precision (AP) represents the area under the precision–recall curve, as shown in Equation (14):

AP = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d(\mathrm{Recall})

(14)

Mean average precision ($mAP$) is the average of $AP$ across all classes and reflects the model’s detection performance on the entire dataset, as shown in Equation (15):

mAP = \frac{1}{N} \sum_{i = 1}^{N} AP_{i}

(15)

where $AP_{i}$ denotes the average precision for the category indexed by $i$, and $N$ represents the total number of categories in the training dataset.

$mAP_{0.5}$ is calculated at an IoU threshold of 0.5.
$mAP_{0.5:0.95}$ is computed by averaging the AP values across IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05.
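The metrics in Equations (12)–(15) can be sketched in a few lines of Python. This is a simplified illustration: the integral in Equation (14) is approximated with the trapezoidal rule over given curve points, whereas common detection toolkits interpolate the precision values first, so the numbers will not match a real evaluator exactly. All function names are hypothetical.

```python
def precision(tp, fp):
    """Equation (12): TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (13): TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Equation (14): area under the precision-recall curve.

    Points must be sorted by increasing recall. The curve is extended
    horizontally from recall 0 to the first point, then integrated with
    the trapezoidal rule.
    """
    if not recalls:
        return 0.0
    area = 0.0
    r_prev, p_prev = 0.0, precisions[0]
    for r, p in zip(recalls, precisions):
        area += (r - r_prev) * (p + p_prev) / 2.0
        r_prev, p_prev = r, p
    return area


def mean_average_precision(ap_per_class):
    """Equation (15): mean of AP over the N classes."""
    return sum(ap_per_class) / len(ap_per_class)


def iou_thresholds():
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP_{0.5:0.95}."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

$mAP_{0.5:0.95}$ is then the mean of the $mAP$ values obtained at each threshold returned by `iou_thresholds()`.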

3.4. Experiment Results

This section provides an in-depth evaluation of the DCS-YOLOv8 model through a series of targeted experiments. We begin by comparing the proposed SDBIoU loss with several widely used IoU-based loss functions. The results demonstrate that SDBIoU effectively mitigates issues such as label noise and scale sensitivity, particularly benefiting small object detection. We then assess the model’s performance relative to different YOLO variants and state-of-the-art detectors, including Faster R-CNN [11], Cascade R-CNN [34], Swin Transformer [35], and CenterNet [36]. DCS-YOLOv8 consistently achieves a better trade-off between accuracy and model complexity, outperforming many larger models in both precision and recall while maintaining a lightweight structure. Ablation studies further validate the individual contributions of each enhancement module—namely, the P2 detection head, the DCAM attention mechanism, SCDown downsampling, and the SDBIoU loss. These components each offer measurable improvements, and their combined integration results in a significant overall performance boost. Finally, real-world deployment on RK3588 embedded hardware confirms that DCS-YOLOv8 supports real-time inference, validating its practical utility for UAV-based scenarios. The model demonstrates not only competitive accuracy but also strong adaptability to resource-constrained environments, confirming its value for real-time small object detection in aerial imagery.

3.4.1. Comparison of Loss Functions

We conducted a comprehensive comparison between the proposed Scale-based Dynamic Balanced IoU (SDBIoU) and several commonly used IoU-based loss functions within the YOLOv8 framework, including CIoU, DIoU, EIoU, and MPDIoU, using identical training settings to ensure fairness. As shown in Table 1, SDBIoU demonstrates competitive or superior detection performance across key evaluation metrics. While its precision shows a slight decrease compared to CIoU, it achieves notable gains in recall, indicating improved sensitivity to true positive detections—especially for small or low-resolution targets.

This trade-off ultimately leads to a net improvement in the overall detection capability. The enhanced performance of SDBIoU can be attributed to its dynamic loss weighting mechanism, which adjusts the influence of each sample based on object scale and spatial characteristics. By incorporating a scale-aware influence coefficient, SDBIoU effectively reduces the disproportionate penalization of small objects and mitigates instability caused by inconsistent IoU gradients. This enables the model to maintain more stable convergence and greater robustness during training, particularly in scenarios involving densely packed or variably scaled targets.

3.4.2. Comparison with Different Mainstream Models

To comprehensively verify the performance advantages of DCS-YOLOv8, we conducted comparative experiments with variants of the YOLO series and other mainstream object detection frameworks. The results in Table 2 and Table 3 show that the proposed model achieves a better balance between accuracy, efficiency, and robustness in small object detection for UAV remote sensing images.

Comparison with YOLO Series Models

YOLOv8 and previous versions: Compared with earlier models, such as YOLOv3, YOLOv5s, and YOLOv7, DCS-YOLOv8 significantly improves detection accuracy at a lower or comparable parameter scale. For example, although YOLOv7 is lightweight, with 3.1 M parameters, its $mAP_{0.5}$ is only 40.2%, while DCS-YOLOv8 reaches an $mAP_{0.5}$ of 44.5% with 9.9 M parameters, demonstrating that it balances compactness and detection reliability in complex aerial photography scenarios.
YOLOv10 series: YOLOv10n and YOLOv10s have 2.7 M and 8.1 M parameters, respectively, a clear advantage in model size, but their $mAP_{0.5}$ values are only 34.0% and 40.8%, well below the 44.5% of DCS-YOLOv8. This indicates that simply reducing parameters may sacrifice the ability to perceive small objects, whereas DCS-YOLOv8 improves accuracy while using 1.2 M fewer parameters than the YOLOv8s baseline (11.1 M to 9.9 M) through the SCDown module and the C2f-DCAM structure.
YOLOv11 series: YOLOv11s achieves an $mAP_{0.5}$ of 40.6% with 9.4 M parameters, while DCS-YOLOv8, with slightly more parameters (9.9 M), increases $mAP_{0.5}$ by 3.9 percentage points and $mAP_{0.5:0.95}$ from 24.8% to 26.9%. This gain comes from the preservation of high-resolution features by the P2 detection layer of DCS-YOLOv8 and the dynamic adaptation of the SDBIoU loss to the scale of small objects.
YOLOv12 series: The $mAP_{0.5}$ of YOLOv12s is 41.4%, lower than the 44.5% of DCS-YOLOv8, and its inference time (11.6 ms) is longer than the 10.1 ms of DCS-YOLOv8. Although both adopt lightweight designs, the SCDown module of DCS-YOLOv8 separates spatial and channel operations, which better preserves the spatial details of small objects while reducing computational overhead.

In addition, compared with models that are improved based on YOLOv8 (such as PVswin-YOLO and CoT-YOLO), DCS-YOLOv8 has more competitive overall performance. For example, the precision of PVswin-YOLO is 54.5%, but its $mAP_{0.5}$ (43.3%) and $mAP_{0.5:0.95}$ (26.4%) are both lower than those of DCS-YOLOv8, and its parameter scale (10.1 M) is larger, verifying the advantages of the proposed model in balancing accuracy and efficiency.

Comparison with Other Mainstream Detection Frameworks

Table 3 shows that DCS-YOLOv8 also performs excellently among non-YOLO series models:

Two-stage models: The $mAP_{0.5}$ values of Faster R-CNN and Cascade R-CNN are 36.6% and 39.4%, respectively, well below the 44.5% of DCS-YOLOv8. This is because two-stage models rely on region proposal mechanisms, which are prone to missed detections when dealing with dense and small-scale targets in UAV images.
Transformer-based models: The $mAP_{0.5}$ of Swin Transformer is 39.2%, and its window attention mechanism is prone to information discontinuity when the target scale changes drastically. In contrast, the DCAM module of DCS-YOLOv8 better captures global–local dependencies through the fusion of dynamic convolution and attention.
Single-stage anchor-free models: The $mAP_{0.5}$ of CenterNet is 39.7%, and its positioning accuracy for small targets in complex backgrounds is insufficient. DCS-YOLOv8 enhances the ability to distinguish low-resolution targets through the P2 layer and SDBIoU loss.

In conclusion, while remaining lightweight (9.9 M parameters), DCS-YOLOv8 surpasses mainstream models such as YOLOv11 and YOLOv12 through structural innovations (P2 detection layer, DCS-Net, SDBIoU loss). Its advantages in small object detection accuracy and real-time performance make it more suitable for UAV deployment scenarios with limited resources.

3.5. Ablation Experiments

To quantitatively assess the contribution of each proposed component in DCS-YOLOv8, we conducted a series of ablation experiments based on the YOLOv8s baseline model. The results, summarized in Table 4, demonstrate that each enhancement—namely, SDBIoU, the P2 detection layer, DCAM, and the DCS-Net backbone—provides measurable performance gains across multiple object categories, particularly for small and dense targets.

Initially, replacing the standard CIoU with SDBIoU yields a modest yet consistent improvement. Although the overall $mAP_{0.5}$ only increases from 40.6% to 40.8%, category-level gains are more evident, such as a 1.2-percentage-point increase in the Tricy category (from 28.5% to 29.7%) and a 3.1-point rise in Bus detection (from 57.8% to 60.9%). This highlights SDBIoU’s effectiveness in enhancing localization robustness for medium-to-large-scale objects and improving recall for hard-to-detect instances.

The integration of the P2 detection layer has a more substantial impact. When added on top of YOLOv8s-SDBIoU, it increases the overall $mAP_{0.5}$ from 40.8% to 43.3%, a gain of 2.5 percentage points. This enhancement particularly benefits small object categories: Pedestrian detection improves from 44.1% to 50.0%, Bicycle from 14.4% to 16.6%, Car from 79.6% to 83.3%, and Motorcycle from 44.9% to 50.6%. In this study, under the drone shooting conditions (flight altitude 60–120 m, GSD = 2.5–4.8 cm/pixel), pedestrians or bicycles typically appeared in sizes ranging from 8 × 20 to 35 × 80 pixels. These dimensions fall within the effective field of perception of our new small target detection layer (P2/4 scale). Consequently, the improved detection capability benefited small-object categories across the dataset without compromising accuracy at lower altitudes. These results confirm that the P2 layer effectively captures the high-resolution spatial features essential for detecting small-scale targets.
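To make the scale argument concrete, the sketch below converts the object sizes quoted above into feature-map cells at each detection stride. It assumes the usual YOLOv8 head strides of 8, 16, and 32 for P3 to P5 plus stride 4 for the added P2 layer, and it treats the pixel sizes as measured at the network input resolution; the helper name is hypothetical.

```python
# Detection-layer strides: P2 is the added layer, P3-P5 are the standard heads.
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}


def footprint_in_cells(width_px, height_px, stride):
    """Size of an object, in feature-map cells, at the given stride."""
    return width_px / stride, height_px / stride


# Smallest and largest pedestrian/bicycle sizes quoted in the text
for width_px, height_px in [(8, 20), (35, 80)]:
    for layer, stride in STRIDES.items():
        cells_w, cells_h = footprint_in_cells(width_px, height_px, stride)
        print(f"{width_px}x{height_px} px at {layer}: "
              f"{cells_w:.2f} x {cells_h:.2f} cells")
```

Under these assumptions, an 8 × 20 pixel target covers 2 × 5 cells at P2 but only 1 × 2.5 cells at P3 and less than one cell in width at P4 and P5, which is consistent with the gains reported for the small categories.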

Replacing the original YOLOv8 backbone with the proposed DCS-Net, which integrates the C2f-DCAM and SCDown modules, further improves detection performance. The full model (YOLOv8s-SDBIoU-P2 with DCS-Net) achieves an overall $mAP_{0.5}$ of 44.5%, the highest among all tested variants. Category-wise, additional improvements are observed in ‘People’ (42.5%, up from 34.3% in the baseline), ‘Van’ (48.5%, up from 45.5%), and ‘A-tricy’ (18.1%, up from 16.6%), while slight decreases are observed in ‘Truck’ (39.4%, down from 40.2%) and ‘Bus’ (59.8%, down from 60.2%). The $mAP_{0.5}$ of Truck and Bus decreases slightly because the improvements (the P2 small target detection layer and DCAM) mainly enhance the features of small targets, while these two classes are larger and have small feature differences in VisDrone. The new global–local fusion module offers limited help for large targets and slightly weakens their feature weights in feature assignment. This decline is an acceptable trade-off: the overall $mAP_{0.5}$ increases, and small target categories, such as pedestrians and bicycles, gain significantly. These results validate DCS-Net’s effectiveness in enhancing both local detail preservation and global context awareness.

Table 5 further highlights the cumulative effect of these enhancements. Compared to the baseline YOLOv8s, the final DCS-YOLOv8 model improves overall precision from 51.8% to 54.2%, recall from 39.4% to 42.1%, $mAP_{0.5}$ from 40.6% to 44.5%, and $mAP_{0.5:0.95}$ from 24.3% to 26.9%. Moreover, these improvements are achieved while reducing the parameter count from 11.1 million to 9.9 million and maintaining real-time inference with an average latency of 10.1 ms per image. The SCDown lightweight downsampling module sacrifices some fine-grained spatial information, which results in only a minor drop in overall accuracy. In return, it reduces the number of parameters, enhances inference speed, and improves metrics such as $mAP_{0.5}$ and $mAP_{0.5:0.95}$. In UAV small target scenarios, lightweight modules must balance speed and accuracy, and this trade-off is deemed acceptable.
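The efficiency figures quoted above can be put in more familiar units with a short calculation. The sketch uses only the reported numbers (11.1 M and 9.9 M parameters, 10.1 ms per image); the helper names are hypothetical, and the frame rate applies only to the hardware on which the latency was measured.

```python
def relative_change_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def frames_per_second(latency_ms):
    """Throughput implied by a per-image latency given in milliseconds."""
    return 1000.0 / latency_ms


param_change = relative_change_percent(11.1, 9.9)  # parameters, in millions
throughput = frames_per_second(10.1)               # images per second
print(f"parameter change: {param_change:.1f}%")
print(f"throughput: {throughput:.1f} images/s")
```

This corresponds to roughly an 11% reduction in parameters and a throughput of about 99 images per second at the reported latency.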

These findings confirm that the incremental integration of SDBIoU, the P2 layer, and DCS-Net modules not only enhances the detection accuracy—particularly for small or complex objects—but also improves model compactness and inference speed, which are critical for UAV-based applications.

Figure 6 presents the training dynamics of DCS-YOLOv8 compared to the baseline YOLOv8s over 150 epochs, with evaluation metrics including $mAP_{0.5}$ and $mAP_{0.5:0.95}$. Notably, DCS-YOLOv8 begins to outperform YOLOv8s in both metrics from approximately epoch 22. The performance gap steadily widens thereafter, and DCS-YOLOv8 stabilizes earlier, around epoch 50, indicating faster convergence and greater training stability.

3.6. Visual Assessment

To further analyze classification accuracy, we employed a confusion matrix comparing DCS-YOLOv8 and YOLOv8s across 10 object categories. As shown in Figure 7, the diagonal elements of DCS-YOLOv8’s matrix exhibit consistently darker shades compared to those of YOLOv8s, indicating a higher frequency of correct predictions. In contrast, the off-diagonal elements—especially in the final row corresponding to background misclassification—appear lighter, suggesting that DCS-YOLOv8 makes significantly fewer false negative errors.

For instance, in the ‘Pedestrian’ category, the accuracy of DCS-YOLOv8 increases from 44.2% to 51.5%, while ‘People’ improves from 34.3% to 42.5%, and ‘Motorcycle’ from 44.8% to 50.6%. These improvements underscore the model’s enhanced sensitivity to small-scale, human-related targets. Additionally, the confusion matrix reveals that misclassifications into the ‘background’ class are greatly reduced. This is particularly evident in categories such as ‘Bus’ and ‘Tricycle’, which often suffer from occlusion and scale ambiguity in UAV images.

Nevertheless, some challenges remain. Categories such as ‘Bicycle’ (17.2%), ‘Tricycle’ (33.3%), and ‘A-tricycle’ (18.1%) continue to exhibit notable confusion, frequently misidentified either among themselves or as background. This is likely due to similar visual structures and dense scene contexts. Even so, the overall improvement across most categories highlights DCS-YOLOv8’s superior feature discrimination and contextual reasoning capabilities.
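The per-class percentages read from a confusion matrix of this kind can be derived as in the sketch below. It assumes a layout in which rows are predicted classes, columns are true classes, and the last row and column stand for background, which matches the description of the final row above but is not otherwise specified here. The counts in the example are invented for illustration only.

```python
def normalize_by_true_class(matrix):
    """Divide each column by its sum so every true class sums to 1."""
    n = len(matrix)
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [
        [matrix[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n)]
        for r in range(n)
    ]


def hit_and_miss_rates(matrix, class_names):
    """Map each class to (correct rate, rate missed as background)."""
    norm = normalize_by_true_class(matrix)
    background_row = len(matrix) - 1
    return {
        name: (norm[i][i], norm[background_row][i])
        for i, name in enumerate(class_names)
    }


# Invented counts for two classes plus background (rows: predicted, cols: true)
counts = [
    [80, 5, 10],
    [10, 60, 5],
    [10, 35, 0],
]
rates = hit_and_miss_rates(counts, ["class_a", "class_b"])
```

In this layout, the diagonal entry is the share of true objects detected with the correct class, and the entry in the last row is the share missed entirely, which is what the lighter final row in Figure 7 reflects.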

As illustrated in Figure 8, we compared the feature activation heatmaps produced by the baseline model, YOLOv8s, and our proposed DCS-YOLOv8 across a variety of UAV-based detection scenarios. These include environments with heavy occlusion, densely cluttered backgrounds, and poor lighting conditions, all of which pose significant challenges to conventional object detection systems. DCS-YOLOv8 shows stronger and more focused feature activation in relevant semantic regions than the baseline model and YOLOv8s, indicating enhanced spatial attention and improved differentiation of target features from background noise. For small or partially occluded targets in particular, the heatmap from DCS-YOLOv8 reveals a sharper focus with less irrelevant or scattered activation. This suggests that DCS-YOLOv8 excels at extracting semantically meaningful features and achieving precise localization. These improvements are due to the DCS-Net, which enhances global and local feature representation, and the P2 detection layer, which maintains high-resolution detail at shallow feature levels. Together, these mechanisms enhance the model’s perception in complex aerial environments.

These results demonstrate that the proposed model effectively learns to allocate attention toward task-critical areas, particularly for small and hard-to-detect objects. The observed improvements can be attributed to the inclusion of the DCAM module, which enhances both global and local feature representations, and the P2 detection layer, which enables high-resolution detail preservation at shallow feature levels. Altogether, these mechanisms contribute to the model’s superior perception capability in complex aerial environments.

As shown in Figure 9, we qualitatively evaluated the detection performance of the baseline model, YOLOv8s, and our DCS-YOLOv8 across multiple challenging UAV scenarios. These scenarios include high-altitude surveillance views, night-time urban scenes with low illumination, crowded intersections with overlapping objects, and visually complex backgrounds with high texture similarity. In all scenarios, DCS-YOLOv8 demonstrated notable enhancements in detecting small, occluded, and distant objects that were frequently overlooked by the baseline and YOLOv8s models. Specifically, in scenes featuring distant targets, DCS-YOLOv8 exhibited improved detection of small distant targets in the first and second rows, with reduced false positives in the second row. Moreover, DCS-YOLOv8 consistently achieved robust detection performance characterized by higher Intersection over Union (IoU) values and fewer false positives. In low-light conditions depicted in the third row of the image, DCS-YOLOv8 exhibited enhanced detection confidence through improved alignment of bounding boxes, leading to the detection of a greater number of small objects. Furthermore, in densely populated environments, DCS-YOLOv8 effectively distinguished overlapping objects such as pedestrians and cyclists, ensuring high detection accuracy even at the image peripheries.

These improvements highlight the effectiveness of DCS-YOLOv8 in extracting semantically meaningful features and achieving fine-grained localization across diverse real-world conditions. The integration of structural innovations, namely the P2 detection layer, the DCAM attention mechanism, and the scale-sensitive SDBIoU loss, plays a pivotal role in enabling the model to generalize across varying object scales and densities. Collectively, the results confirm the enhanced adaptability and detection reliability of DCS-YOLOv8 in practical UAV applications.

3.7. Real-Time Object Detection

To validate the real-time feasibility of our model, we deployed DCS-YOLOv8 on an embedded UAV detection platform comprising the Orange Pi RK3588 board and the DJI Mavic 3 drone, as illustrated in Figure 10. The RK3588 is powered by a 64-bit octa-core Rockchip RK3588S processor and features an integrated 3D GPU, delivering a high-performance yet compact solution for edge AI inference. The DJI Mavic 3 drone, equipped with a 4/3 CMOS Hasselblad main camera and a 28× hybrid zoom telephoto lens, served as the image acquisition device for aerial traffic monitoring. This system architecture ensures low-latency data transmission and onboard processing, making it suitable for time-sensitive applications, such as intelligent traffic surveillance, emergency response, and security patrols.

Figure 11 presents UAV-captured target detection results across various scenes and flight altitudes, illustrating the model’s capability to recognize multi-scale targets in complex spatial environments.

The figure presents images captured from various angles and altitudes, covering scenarios such as dense traffic areas, urban streets, and crowded locations. The results indicate that DCS-YOLOv8 consistently delivers stable detection performance for large, well-defined targets like cars, vans, and trucks, regardless of flight altitude. This suggests the model’s robust scaling capabilities, with precise target boundaries and accurate positioning. Conversely, for smaller targets like pedestrians, people, and motorcycles, detection effectiveness is more altitude-dependent. At lower altitudes, higher resolution and detailed textures enhance detection, while at higher altitudes, reduced pixel proportions and blurred details lead to decreased accuracy, resulting in incomplete boundaries or increased missed detections. The figure illustrates DCS-YOLOv8’s stability in handling real-world interference factors like background complexity and target occlusion. In low-altitude images, despite challenges such as multi-target aggregation and angle tilt, the model effectively distinguishes and accurately identifies various targets, demonstrating high robustness in dense scenes. Even in high-altitude images, where target sizes are further reduced, the model accurately detects main vehicles, evidencing its strong generalization capability for small, distant targets.

Figure 11 demonstrates DCS-YOLOv8’s proficiency in detecting multi-scale targets across varying flight altitudes and scenarios, confirming the efficacy of its multi-scale feature fusion and small target perception mechanisms and reinforcing its utility in UAV monitoring tasks. On the RK3588 embedded platform, DCS-YOLOv8 is competitive among small-object detection models of similar complexity. However, its latency is still higher than desirable for latency-critical UAV applications, particularly those requiring high frame rates under strict power constraints. This indicates that while our architecture balances accuracy and efficiency, there is still room for further optimization to meet ultra-low-latency requirements on resource-limited hardware.

3.8. Generalization Test

To further validate the generalization capability of the proposed DCS-YOLOv8, we extended the evaluation to two additional remote sensing datasets, SSDD and NWPU VHR-10, using the same hyperparameters and training protocols as employed for the VisDrone 2019 dataset. The SSDD dataset is dedicated to ship detection in satellite imagery, containing 116 high-resolution images with 2456 annotated ship instances. The NWPU VHR-10 dataset comprises 800 very-high-resolution remote sensing images from Google Earth and Vaihingen, covering 10 object categories with 3651 instances, where the average object size accounts for approximately 6.4% of the total area.

Experimental results, as shown in Table 6, demonstrate that DCS-YOLOv8 outperforms YOLOv8s, YOLOv10s, YOLOv11s, and YOLOv12s in both $mAP_{0.5}$ and $mAP_{0.5:0.95}$ on both datasets. On the SSDD dataset, DCS-YOLOv8 achieves an $mAP_{0.5}$ of 97.2% and an $mAP_{0.5:0.95}$ of 64.4%. Compared with YOLOv11s, which has an $mAP_{0.5}$ of 96.5% and an $mAP_{0.5:0.95}$ of 62.7%, DCS-YOLOv8 is 0.7 and 1.7 percentage points higher, respectively. It also surpasses YOLOv12s, which has an $mAP_{0.5}$ of 96.6% and an $mAP_{0.5:0.95}$ of 61.1%, by 0.6 and 3.3 percentage points in the two metrics. On the NWPU VHR-10 dataset, DCS-YOLOv8 reaches an $mAP_{0.5}$ of 92.6% and an $mAP_{0.5:0.95}$ of 59.4%. In contrast, YOLOv11s has an $mAP_{0.5}$ of 87.4% and an $mAP_{0.5:0.95}$ of 54.1%, so DCS-YOLOv8 is 5.2 and 5.3 percentage points higher. Compared with YOLOv12s, which has an $mAP_{0.5}$ of 87.5% and an $mAP_{0.5:0.95}$ of 52.7%, DCS-YOLOv8 is 5.1 and 6.7 percentage points higher.

These results confirm the strong generalization ability of DCS-YOLOv8, indicating that the model’s enhancements—such as the P2 detection layer, DCS-Net architecture, and SDBIoU loss—are not limited to UAV imagery but can also effectively improve detection performance in other remote sensing scenarios involving small objects.

4. Discussion

The proposed DCS-YOLOv8 achieves competitive accuracy and favorable parameter efficiency, significantly improving small-object detection performance in UAV remote sensing imagery while maintaining suitability for embedded deployment. However, its inference latency on devices such as the RK3588 remains suboptimal for certain time-critical UAV missions. This limitation—primarily arising from the computational overhead introduced by multi-branch attention mechanisms (DCAM) and additional detection layers (P2)—is not fully addressed in the present work.

Future research will explore hardware-aware model compression, operator fusion, and sparsity-based acceleration to further reduce latency without sacrificing detection accuracy. In particular, the spatial distribution of small objects in UAV remote sensing imagery is often extremely sparse, with targets occupying only a tiny fraction of the image area. This characteristic has been successfully leveraged in prior studies to accelerate detection models. For example, QueryDet [43] adopts sparse proposal sampling with iterative post-refinement to reduce computational costs, while the HiEUM [44] variant of DETR introduces a sparsity-inducing attention mechanism to focus computation on likely object regions. In the field of remote sensing, the Sparsity framework [45] proposes a sparse feature pyramid optimization scheme using sparse convolution to skip large uniform background areas, substantially reducing computational overhead.

Incorporating such sparsity-aware strategies into DCS-YOLOv8 offers a promising avenue for further enhancement. Our lightweight backbone (SCDown) and small-object optimization modules (P2 detection layer and DCAM attention) already employ spatially selective computation, which could synergize effectively with sparse feature processing methods. By combining these approaches, it may be possible to achieve faster inference while preserving or even enhancing detection precision—particularly for small, occluded, or distant targets—thereby improving performance in demanding UAV-based applications, such as disaster response, security monitoring, and autonomous navigation.

Overall, while DCS-YOLOv8 provides a balanced solution between detection accuracy and computational efficiency, addressing inference latency through sparsity-aware optimization and hardware-adaptive design will be essential to fully unlock its potential for real-time, mission-critical UAV operations.

5. Conclusions

Unmanned Aerial Vehicle (UAV) image detection presents persistent challenges, primarily due to the prevalence of small-scale targets, complex and cluttered backgrounds, and variable lighting conditions. These factors often result in high rates of missed detections and false positives, limiting the effectiveness of conventional object detection frameworks. In response, this study proposes DCS-YOLOv8—a specialized object detection model optimized for UAV-based scenarios with an emphasis on small object recognition.

Built upon the YOLOv8 architecture, DCS-YOLOv8 integrates two core modules: C2f-DCAM, which fuses local and global contextual features via dynamic convolution and attention mechanisms; and SCDown, a lightweight spatial downsampling strategy that reduces computational overhead while preserving spatial detail. Additionally, the model incorporates a high-resolution detection head (P2) to enhance feature extraction for small targets. To improve loss optimization across diverse object scales, the traditional CIoU loss function is replaced with the proposed Scale-based Dynamic Balanced IoU (SDBIoU), enabling dynamic adjustment of loss weights based on target size and spatial distribution.

While experimental results on the VisDrone2019 dataset demonstrate significant improvements in precision, recall, and mAP metrics compared to baseline and state-of-the-art models, practical deployment still faces challenges. Specifically, although the model achieves competitive accuracy, its inference latency on embedded hardware platforms such as the RK3588 remains suboptimal, necessitating further optimization for low-resource environments. Looking ahead, future work will focus on three key directions: (1) extending the evaluation of DCS-YOLOv8 to additional aerial image datasets with varied geographic and environmental characteristics; (2) investigating the model’s robustness under adverse weather conditions, such as rain, fog, or low illumination; and (3) refining the DCAM module and backbone design to further reduce model complexity without compromising detection accuracy. These efforts aim to enhance the generalizability and deployment readiness of DCS-YOLOv8 for real-world UAV applications in smart cities, environmental monitoring, and emergency response.

Author Contributions

Conceptualization, X.Z. and Z.Y.; Methodology, X.Z. and Z.Y.; Software, X.Z. and Z.Y.; Validation, X.Z.; Formal analysis, X.Z.; Investigation, Z.Y.; Resources, Z.Y. and H.Z.; Writing—original draft, X.Z.; Writing—review and editing, Z.Y. and H.Z.; Supervision, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

This work was supported in part by the Liaoning Provincial Joint Science and Technology Program [2024-MSLH-380].

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Adil, M.; Song, H.; Jan, M.A.; Khan, M.K.; He, X.; Farouk, A.; Jin, Z. UAV-Assisted IoT Applications, QoS Requirements and Challenges with Future Research Directions. ACM Comput. Surv. 2024, 56, 35.
2. Cai, W.; Wei, Z. Remote Sensing Image Classification Based on a Cross-Attention Mechanism and Graph Convolution. IEEE Geosci. Remote Sens. Lett. 2020, 19, 8002005.
3. Peng, C.; Zhu, M.; Ren, H.; Emam, M. Small Object Detection Method Based on Weighted Feature Fusion and CSMA Attention Module. Electronics 2022, 11, 2546.
4. Feng, F.; Hu, Y.; Li, W.; Yang, F. Improved YOLOv8 algorithms for small object detection in aerial imagery. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102113.
5. Zhang, X.; Zhang, T.; Jiao, J.L. Remote Sensing Object Detection Meets Deep Learning: A metareview of challenges and advances. Geosci. Remote Sens. 2023, 11, 8–44.
6. Jiang, Y.; Xi, Y.; Zhang, L.; Wu, Y.; Tan, F.; Hou, Q. Infrared Small Target Detection Based on Local Contrast Measure With a Flexible Window. IEEE Geosci. Remote Sens. Lett. 2024, 21, 7001805.
7. Li, Z.; Dong, Y.; Shen, L.; Liu, Y.; Pei, Y.; Yang, H.; Zheng, L.; Ma, J. Development and challenges of object detection: A survey. Neurocomputing 2024, 598, 23.
8. Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A Survey of Object Detection for UAVs Based on Deep Learning. Remote Sens. 2024, 16, 29.
9. Girshick, R. Fast R-CNN. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015.
10. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
12. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers; Springer International Publishing: Cham, Switzerland, 2020.
13. Liu, X.; Li, H. A study on UAV target detection and 3D positioning methods based on the improved deformable DETR model and multi-view geometry. Adv. Mech. Eng. 2025, 17, 16878132251315505.
14. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Li, C.Y.; Yang, J.; Su, H.; Zhu, J.J. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv 2023, arXiv:2303.05499.
15. Wang, H.; Ma, J.; Chen, W.; Han, Q.; Lin, J.; Li, J.; Yao, Z. Personal Protective Equipment Detection for Industrial Environments: A Lightweight Model Based on RTDETR for Small Targets; IOP Publishing Ltd.: Bristol, UK, 2025.
16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the Computer Vision & Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
17. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger; IEEE: New York, NY, USA, 2017; pp. 6517–6525.
18. Terven, J.; Cordova-Esparza, D.M.; Romero-Gonzalez, J.A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
19. Bi, J.; Zhu, Z.; Meng, Q. Transformer in Computer Vision. In Proceedings of the 2021 IEEE International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), Fuzhou, China, 24–26 September 2021; pp. 178–188.
20. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision Transformer with Deformable Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
21. Shah, S.; Tembhurne, J. Object detection using convolutional neural networks and transformer-based models: A review. J. Electr. Syst. Inf. Technol. 2023, 10, 1–35.
22. Islam, S.; Elmekki, H.; Pedrycz, R.W. A comprehensive survey on applications of transformers for deep learning tasks. Expert Syst. Appl. 2024, 241, 122666.1–122666.48.
23. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
24. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
25. Liu, K.; Tang, H.; He, S.; Yu, Q.; Xiong, Y.; Wang, N. Performance Validation of Yolo Variants for Object Detection. In Proceedings of the BIC 2021: 2021 International Conference on Bioinformatics and Intelligent Computing, Harbin, China, 22–24 January 2021.
26. Wei, L.; Tong, Y. Enhanced-YOLOv8: A new small target detection model. Digit. Signal Process. 2024, 153, 104611.
27. Xu, W.; Cui, C.; Ji, Y.; Li, X.; Li, S. YOLOv8-MPEB small target detection algorithm based on UAV images. Heliyon 2024, 10, 18.
28. Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition; IEEE: New York, NY, USA, 2023.
29. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. arXiv 2019, arXiv:1911.08287.
30. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. arXiv 2020, arXiv:2005.03572.
31. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and Efficient IOU Loss for Accurate Bounding Box Regression. arXiv 2021, arXiv:2101.08158.
32. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped Convolution and Scale-based Dynamic Loss for Infrared Small Target Detection. arXiv 2024, arXiv:2412.16986.
33. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399.
34. Cai, Z.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1483–1498.
35. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
36. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6568–6577.
37. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662.
38. Li, Y.; Zhou, Z.; Pan, Y. YOLOv11-BSS: Damaged Region Recognition Based on Spatial and Channel Synergistic Attention and Bi-Deformable Convolution in Sanding Scenarios. Electronics 2025, 14, 1469.
39. Tanrıverdi, V.; Alemdar, K.D. Comparative Analysis of Data Augmentation Strategies Based on YOLOv12 and MCDM for Sustainable Mobility Safety: Multi-Model Ensemble Approach. Sustainability 2025, 17, 5638.
40. Tahir, N.U.A.; Long, Z.; Zhang, Z.; Asim, M.; Elaffendi, M. PVswin-YOLOv8s: UAV-Based Pedestrian and Vehicle Detection for Traffic Management in Smart Cities Using Improved YOLOv8. Drones 2024, 8, 84.
41. Wang, Y.; Pan, F.; Li, Z.; Xin, X.; Li, W. CoT-YOLOv8: Improved YOLOv8 for Aerial images Small Target Detection. In Proceedings of the 2023 China Automation Congress (CAC), Chongqing, China, 17–19 November 2023; pp. 4943–4948.
42. Zhang, H.; Li, G.; Wan, D.; Wang, Z.; Dong, J.; Lin, S.; Deng, L.; Liu, H. DS-YOLO: A dense small object detection algorithm based on inverted bottleneck and multi-scale fusion network. Microelectron. J. 2024, 4, 100190.
43. Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13668–13677.
44. Xiao, C.; An, W.; Zhang, Y.; Su, Z.; Li, M.; Sheng, W.; Pietikäinen, M.; Liu, L. Highly efficient and unsupervised framework for moving object detection in satellite videos. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11532–11539.
45. Wu, S.; Xiao, C.; Wang, Y.; Yang, J.; An, W. Sparsity-Aware Global Channel Pruning for Infrared Small-target Detection Networks. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5615011.

Figure 1. Architecture of the proposed DCS-YOLOv8 network. The network builds upon YOLOv8s by integrating the C2f-DCAM module in the backbone to enhance both local and global feature fusion. The SCDown module performs efficient downsampling while preserving spatial detail. An additional P2 detection layer is introduced to capture high-resolution features, thereby improving small object detection. Arrows indicate the flow of feature maps through the backbone, neck, and multi-scale detection heads. The overall design achieves a balance between lightweight efficiency and enhanced detection performance for small targets.

Figure 2. Schematic diagram of the proposed Dynamic Convolution Attention Mixture (DCAM) module. DCAM consists of two parallel branches: the Lepe branch, which uses a Dilated Reparameterization Block (DBR) to capture multi-scale local features, and the Attention branch, which applies a multi-head attention mechanism to establish long-range dependencies and strengthen global semantic representation. The fused output from both branches improves detection robustness in dense, occluded, and complex background scenarios.

Figure 3. Structure of the C2f-DCAM module. This module combines the original C2f block from YOLOv8 with the DCAM module, retaining the multi-branch residual connections and gradient propagation advantages of C2f while further enhancing the fusion of local and global features. C2f-DCAM enables richer spatial representations and sharper boundary delineation for small objects in cluttered UAV imagery.

Figure 4. Overall architecture of DCS-Net with heatmap visualization. The network processes 640 × 640 input images through four stages, each extracting and refining features. Heatmaps illustrate spatial attention changes: Stage 1 focuses on low-level features, such as edges and textures; Stage 2 highlights small-object regions; Stage 3, enhanced by C2f-DCAM, combines local and global features to emphasize targets in complex backgrounds; Stage 4 uses SCDown for efficient downsampling while retaining critical spatial structures. The results confirm the complementary roles of each module in enhancing small-object perception.

Figure 5. Information regarding the manual annotation process for objects in the VisDrone2019 dataset. (a) Category distribution. (b) Distribution of label frame length and width. (c) Spatial distribution of object centers. (d) Distribution of object width and height.

Figure 6. (a,b) show the ${mAP}_{0.5}$ and ${mAP}_{0.5 : 0.95}$ of DCS-YOLOv8 over the training epochs. The model performance was improved by gradually introducing improvement modules (SDB-IoU, P2 layer, DCAM, SCDown), and the final version performed best on ${mAP}_{0.5}$ and ${mAP}_{0.5 : 0.95}$.

Figure 7. (a) Confusion matrix of DCS-YOLOv8; (b) confusion matrix of YOLOv8s. A confusion matrix shows the agreement between a model’s predictions and the true labels in a classification task. Rows correspond to true categories and columns to predicted categories. Diagonal elements count correctly classified samples (True Positives), while off-diagonal elements count misclassifications (False Positives and False Negatives). The matrix makes it easy to assess classification accuracy and to identify error-prone categories.
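
The row/column reading described in the Figure 7 caption can be turned into per-class numbers. The following is a minimal Python sketch, assuming rows index true classes and columns index predicted classes as stated in the caption (some toolkits plot the transpose). The function name `per_class_precision_recall` and the 3 × 3 matrix are hypothetical illustrations, not values from the paper.

```python
def per_class_precision_recall(cm):
    """Return (precision, recall) lists for a square confusion matrix.

    Assumption: cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    precision, recall = [], []
    for k in range(n):
        tp = cm[k][k]                                # diagonal: correct predictions
        row_total = sum(cm[k])                       # all samples whose true class is k
        col_total = sum(cm[i][k] for i in range(n))  # all samples predicted as k
        recall.append(tp / row_total if row_total else 0.0)
        precision.append(tp / col_total if col_total else 0.0)
    return precision, recall


# Toy matrix for illustration only (not taken from Figure 7).
toy_cm = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 1, 9],
]
precision, recall = per_class_precision_recall(toy_cm)
for k, (p, r) in enumerate(zip(precision, recall)):
    print(f"class {k}: precision={p:.3f}, recall={r:.3f}")
```

Off-diagonal mass in a row lowers that class's recall (missed or confused objects), while off-diagonal mass in a column lowers its precision (false alarms attributed to that class).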

Figure 8. Feature activation maps generated by (a) the baseline model, (b) YOLOv8s, and (c) the proposed DCS-YOLOv8 under complex UAV scenarios, including dense occlusion, background clutter, and low illumination. Compared to (a) and (b), DCS-YOLOv8 exhibits more concentrated and context-aware attention, enabling more accurate identification of small and obscured objects.

Figure 9. Detection results produced by (a) the baseline model, (b) YOLOv8s, and (c) DCS-YOLOv8 across various real-world aerial scenarios, such as high-altitude views, occluded urban scenes, and low-light environments. DCS-YOLOv8 outperforms the other models in detecting small, overlapping, and visually ambiguous objects, while maintaining better localization accuracy and fewer false positives.

Figure 10. Hardware setup of the real-time UAV detection system integrating DCS-YOLOv8: (a) DJI Mavic 3 drone; (b) Orange Pi RK3588 hardware.

Figure 11. Detection results of DCS-YOLOv8 under different UAV flight altitudes. (a) contains dense, large-scale targets such as vehicles; (b) shows sparse, small-scale targets like pedestrians and riders. Large objects are detected reliably across altitudes, while small object detection benefits from lower-altitude imaging.

Table 1. Detection results of YOLOv8s with different bounding box loss functions, shown as percentages (best outcomes in bold).

Metrics	Precision	Recall	${mAP}_{0.5}$	${mAP}_{0.5 : 0.95}$
CIoU	51.8	39.4	40.6	24.3
DIoU	52	38.9	40.6	24.5
EIoU	49.8	39.4	40	24.3
MPDIoU [37]	52.1	39	40.7	23.9
SDBIoU (d = 0.5)	51.1	39.5	40.4	24.3
SDBIoU (d = 0.7)	51.2	39.3	40.2	24.1
SDBIoU (d = 0.3)	51.6	40.1	40.8	24.5

Table 2. Different YOLO models’ results, presented as percentages. (The best-performing outcomes are highlighted in bold).

Models	Precision	Recall	${mAP}_{0.5}$	${mAP}_{0.5 : 0.95}$	Time/ms	Parameter/ $10^{6}$
YOLOv3	53.8	43.1	42.2	23.2	210	18.4
YOLOv5s	46.7	34.9	34.5	19.4	14.1	12.0
YOLOv7	51.6	42.3	40.2	21.9	73.3	1.7
YOLOv8n	45.9	34.2	34.5	19.8	5.7	3.1
YOLOv8s	51.8	39.4	40.6	24.3	7.1	11.1
YOLOv8m	55.8	42.6	44.5	26.6	16.8	25.9
YOLOv10n	45.5	33.5	34	19.8	8	2.7
YOLOv10s	51	39.4	40.8	24.6	7.6	8.1
YOLOv11n [38]	45.9	33.4	34.3	20.1	4.6	2.6
YOLOv11s	52.1	39.4	40.6	24.8	8.0	9.4
YOLOv12n [39]	43.4	34.6	33.7	19.8	6.6	2.5
YOLOv12s	52.5	40.4	41.4	25	11.6	9.2
PVswin-YOLO [40]	54.5	41.8	43.3	26.4	8.8	10.1
CoT-YOLO [41]	53.2	41.1	42.7	25.7	12.2	10.6
DS-YOLO [42]	52.4	41.6	43.1	26.0	19.7	9.3
DCS-YOLOv8	54.2	42.1	44.5	26.9	10.1	9.9

Table 3. Results from different widely used models, presented as percentages. (The best-performing outcomes are highlighted in bold).

Models	${mAP}_{0.5}$	${mAP}_{0.5 : 0.95}$
Faster R-CNN [11]	36.6	21.1
Swin Transformer [35]	39.7	23.1
CenterNet [36]	39.2	22.7
Cascade R-CNN [34]	39.4	24.2
RT-DETR-R18 [15]	42.5	25.4
DINO [14]	41.3	24.1
DCS-YOLOv8	44.5	26.9

Table 4. Comparative experiments between the enhanced model and YOLOv8s across various categories. A: YOLOv8s; B: YOLOv8s-SDBIoU; C: YOLOv8s-SDBIoU-P2; D: YOLOv8s-SDBIoU-P2-DCAM; E: YOLOv8s-SDBIoU-P2-DCS-Net. Values are percentages (best-performing outcomes highlighted in bold).

Models	Ped	People	Bicycle	Car	Van	Truck	Tricy	A-Tricy	Bus	Motor	${mAP}_{0.5}$
A	44.2	34.3	13.9	80	45.5	40.2	28.5	16.6	57.8	44.8	40.6
B	44.1	34.0	14.4	79.6	45.8	38.4	29.7	15.8	60.9	44.9	40.8
C	50	40.7	16.6	83.3	46.7	39.7	29.1	16	60.5	50.6	43.3
D	51	40.8	16.9	83.8	47.5	39.3	32.3	17	59.4	50.5	43.9
E	51.5	42.5	17.2	83.8	48.5	39.4	33.3	18.1	59.8	50.6	44.5
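
As a sanity check on Table 4, the ${mAP}_{0.5}$ column should equal the mean of the ten per-class AP values, up to rounding. The Python sketch below copies rows A and E from the table, recomputes that mean, and ranks the per-class gains of E over A. The variable and function names are illustrative assumptions.

```python
classes = ["Ped", "People", "Bicycle", "Car", "Van",
           "Truck", "Tricy", "A-Tricy", "Bus", "Motor"]

# Per-class AP at IoU 0.5 (percent), copied from Table 4.
row_a = [44.2, 34.3, 13.9, 80.0, 45.5, 40.2, 28.5, 16.6, 57.8, 44.8]  # YOLOv8s
row_e = [51.5, 42.5, 17.2, 83.8, 48.5, 39.4, 33.3, 18.1, 59.8, 50.6]  # full model


def mean_ap(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


map_a = mean_ap(row_a)
map_e = mean_ap(row_e)
print(f"recomputed mAP: A={map_a:.2f}, E={map_e:.2f}")

# Per-class gain of E over A in percentage points, largest first.
gains = {name: round(e - a, 1) for name, a, e in zip(classes, row_a, row_e)}
ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
for name, gain in ranked:
    print(f"{name:8s} {gain:+.1f}")
```

The recomputed means agree with the reported 40.6 and 44.5 to within rounding. The ranking shows the largest gains on the small People and Ped categories, while Truck is the only category that drops slightly.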

Table 5. Detection results following the adoption of different improvement strategies, presented as percentages.

Baseline	SDBIoU	P2	DCAM	SCDown	Precision	Recall	${mAP}_{0.5}$	${mAP}_{0.5 : 0.95}$	Time/ms	Parameter/ $10^{6}$
✓					51.8	39.4	40.6	24.3	7.1	11.1
✓	✓				51.6	40.1	40.8	24.5	5.5	11.1
✓	✓	✓			53.9	41.2	43.3	26.1	6.6	10.6
✓	✓	✓	✓		54.8	41.8	43.9	26.5	9.7	11.3
✓	✓	✓	✓	✓	54.2	42.1	44.5	26.9	10.1	9.9
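
To make the contribution of each ablation step in Table 5 explicit, the following Python sketch computes the incremental ${mAP}_{0.5}$ gain of every added module and the overall change in parameter count. The numbers are copied from the table; the tuple layout and variable names are assumptions made for illustration.

```python
# (configuration, mAP@0.5 in percent, parameters in millions), from Table 5.
ablation = [
    ("Baseline (YOLOv8s)", 40.6, 11.1),
    ("+ SDBIoU", 40.8, 11.1),
    ("+ P2", 43.3, 10.6),
    ("+ DCAM", 43.9, 11.3),
    ("+ SCDown", 44.5, 9.9),
]

# Incremental gain of each step over the previous configuration.
steps = []
for (_, prev_map, _), (name, cur_map, _) in zip(ablation, ablation[1:]):
    steps.append((name, round(cur_map - prev_map, 1)))

# Overall change from the baseline to the full model.
total_gain = round(ablation[-1][1] - ablation[0][1], 1)
param_change_pct = round(
    100 * (ablation[-1][2] - ablation[0][2]) / ablation[0][2], 1
)

for name, gain in steps:
    print(f"{name:10s} {gain:+.1f} mAP@0.5")
print(f"total gain: {total_gain:+.1f} points, parameters: {param_change_pct:+.1f}%")
```

Under this reading, the P2 layer accounts for the largest single step (+2.5 points), and the full model gains 3.9 points of ${mAP}_{0.5}$ with roughly 10.8% fewer parameters than the baseline.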

Table 6. Comparative results on SSDD and NWPU VHR-10 datasets.

Dataset	Models	Precision	Recall	${mAP}_{0.5}$	${mAP}_{0.5 : 0.95}$
SSDD	YOLOv8s	95.2	92.4	95.8	62.9
	YOLOv10s	85.3	83.8	91	57.2
	YOLOv11s	89.1	92.9	96.5	62.7
	YOLOv12s	91.6	92	96.6	61.1
	DCS-YOLOv8	94.1	93.1	97.2	64.4
NWPU VHR-10	YOLOv8s	92.2	85.4	91.3	57.6
	YOLOv10s	67.5	69.5	72.8	43.9
	YOLOv11s	90.2	79.3	87.4	54.1
	YOLOv12s	88.7	80.3	87.5	52.7
	DCS-YOLOv8	91.2	88	92.6	59.4

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
