Article

DGBL-YOLOv8s: An Enhanced Object Detection Model for Unmanned Aerial Vehicle Imagery

1
Guangxi Universities Key Laboratory of Advanced Manufacturing and Automation Technology, Guilin University of Technology, Guilin 541006, China
2
College of Mechanical and Control Engineering, Guilin University of Technology, Guilin 541006, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(5), 2789; https://doi.org/10.3390/app15052789
Submission received: 13 February 2025 / Revised: 27 February 2025 / Accepted: 28 February 2025 / Published: 5 March 2025

Abstract

Unmanned aerial vehicle (UAV) imagery often suffers from significant object scale variations, high target density, and varying distances due to shooting conditions and environmental factors, leading to reduced robustness and low detection accuracy in conventional models. To address these issues, this study proposes DGBL-YOLOv8s, an improved object detection model tailored for UAV perspectives based on YOLOv8s. First, a Dilation-wise Residual (DWR) module is introduced to replace the C2f module in the backbone network of YOLOv8, enhancing the model’s capability to capture fine-grained features and contextual information. Second, the neck structure is redesigned by incorporating a Global-to-Local Spatial Aggregation (GLSA) module combined with a Bidirectional Feature Pyramid Network (BiFPN), which strengthens feature fusion. Third, a lightweight shared convolution detection head is proposed, incorporating shared convolution and group normalization techniques. Additionally, to further improve small object detection, a dedicated small-object detection head is introduced. Results from experiments on the VisDrone dataset reveal that DGBL-YOLOv8s enhances detection accuracy by 8.5% relative to the baseline model, alongside a 35.1% reduction in parameter count. The overall performance exceeds most current detection models, which confirms the advantages of the proposed improvements.

1. Introduction

In recent years, UAVs have become extensively utilized across military, commercial, and everyday sectors due to their small size, affordability, and high practicality. Among these applications, target detection from a UAV perspective has emerged as a critical research direction. Target detection primarily relies on high-resolution sensors (e.g., cameras, infrared cameras) mounted on UAVs to capture large volumes of image data, which are then automatically detected and recognized using computer vision and machine learning techniques [1,2].
Target detection methodologies can be broadly classified into two distinct categories: conventional image processing techniques and approaches leveraging deep learning frameworks. Conventional methods, such as frame difference and background subtraction, leverage the dynamic changes in moving targets relative to the background for detection. These methods offer advantages such as low computational complexity, fast processing speed, and simplicity. However, their accuracy and real-time performance are often insufficient to meet practical requirements. In contrast, deep learning-based methods autonomously learn useful information from image and video datasets through feature extraction, followed by target classification and recognition, offering higher accuracy and stronger generalization capabilities [3]. Deep learning-based target detection algorithms are broadly divided into two categories based on the detection stages: two-stage and one-stage methods. Two-stage methods, represented by algorithms such as R-CNN [4], Fast R-CNN [5], and Faster R-CNN [6], first generate candidate regions that may contain targets and then classify and regress bounding boxes using deep learning models to improve detection accuracy. However, these methods typically require substantial computational resources, resulting in slower processing speeds, which limits their suitability for real-time applications. On the other hand, one-stage methods integrate the target detection task directly into a single neural network, eliminating the need for candidate region generation and directly outputting target categories and location information. These methods exhibit faster inference speeds and stronger real-time performance. Representative one-stage algorithms include YOLO [7,8], SSD [9], and RetinaNet [10]. Compared to one-stage detection algorithms such as SSD and RetinaNet, the YOLO algorithm treats target detection as a regression problem, utilizing a single neural network for prediction. Its concise architecture offers faster inference speeds and lower requirements for hardware resources and computational power. It has undergone significant evolution since its initial introduction, with each iteration introducing improvements in accuracy, speed, and adaptability to various detection tasks. These advancements make it particularly suitable for real-time applications, including UAV-based target detection, where efficiency and accuracy are critical. However, in the domain of UAV-based target detection, the accurate identification of small targets, characterized by their limited pixel coverage and reduced feature discriminability, has emerged as a critical area of investigation in contemporary studies. Therefore, this study builds on YOLOv8s, the eighth generation of the YOLO family, and proposes DGBL-YOLOv8s, an improved object detection model tailored for UAV perspectives.
The main contributions of this study are as follows:
(1)
Based on the Dilation-wise Residual (DWR) module from the DWRSeg network [11], a C2f-DWR module is designed to enhance multi-scale contextual information acquisition. This module enhances the C2f module’s capacity to fuse multi-level feature maps, allowing the model to extract finer target details and richer contextual information.
(2)
A Global-to-Local Spatial Aggregation Bidirectional Feature Pyramid Network (GLSA-BiFPN) is designed for the neck structure. This structure combines the Bidirectional Feature Pyramid Network (BiFPN) [12] and the Global-to-Local Spatial Aggregation (GLSA) module [13] to strengthen feature fusion and improve small target detection capabilities.
(3)
A lightweight shared convolutional detection head (LSCDH) is constructed based on shared convolution and group normalization (GN) [14], with an additional target detection head. This design enhances the network’s ability to extract shallow information while reducing model parameters.

2. Related Works

2.1. Methods Based on Network Architecture Improvements

Numerous algorithmic innovations have been proposed for small target detection in drone images using one-stage methods. For instance, Liu et al. [15] improved the residual blocks in YOLOv3’s Darknet by linking two residual units of equal dimensions, thereby boosting the network’s transfer learning efficiency. Additionally, they introduced extra convolutional operations in the early layers to enrich spatial information, effectively broadening the network’s receptive field and markedly enhancing the detection of small targets in UAV applications. Bochkovskiy et al. [16] proposed an optimized object detection approach by integrating CSPDarknet53 into the backbone to strengthen feature extraction. They utilized SPP and PANet for multi-scale feature fusion, improving contextual information extraction. To balance speed and accuracy, they introduced Mish activation, CIoU loss, and enhanced data augmentation. Experiments show that YOLOv4 achieves superior performance across multiple datasets. Wang et al. [17] optimized YOLOv7 by introducing trainable Bag-of-Freebies, enhancing both accuracy and efficiency in real-time object detection. On the one hand, they improved gradient propagation by extending ELANs (Efficient Layer Aggregation Networks) and employed planned re-parameterized convolution to refine feature representation. On the other hand, their novel compound scaling strategy effectively balances depth and width, significantly boosting detection performance while maintaining high-speed inference.

2.2. Methods Based on Attention Mechanisms and Feature Enhancement

Li et al. [18] proposed an improved YOLOv5-based framework by incorporating Convolution-Swin Transformer Blocks (CSTBs) and the Convolutional Block Attention Module (CBAM), significantly enhancing localization accuracy. Additionally, the framework integrates Bidirectional Feature Pyramid Networks (BiFPNs) and Fast Spatial Pyramid Pooling (SPPF) to improve average precision while maintaining a compact model size. Experimental results indicate that the enhanced framework achieves superior detection accuracy compared to YOLOv5, while maintaining a comparable model size. Wang et al. [19] proposed CPDD-YOLOv8, an optimized model for small object detection in aerial images. The model integrates Cross-Scale Feature Fusion and Dense Feature Enhancement modules to improve feature extraction for small objects. Additionally, an attention-based detection head is introduced to enhance recognition capability in complex scenes, while the feature pyramid and dynamic anchor mechanisms are optimized to improve inference speed. These enhancements effectively improve both detection accuracy and computational efficiency.

2.3. Methods Based on Loss Function and Training Strategy Optimization

Lin et al. [10] addressed the class imbalance in dense object detection by proposing Focal Loss, which down-weights easy samples and emphasizes hard examples during training, thereby improving the model’s focus on challenging instances. Additionally, they integrated this loss function into RetinaNet, effectively enhancing the detection of small and heavily occluded objects while maintaining computational efficiency. Zhao et al. [20] improved UAV image target detection by enhancing YOLOv8 with Wise-IoU loss, which optimizes the training process for better localization accuracy. Additionally, they designed a context enhancement module and a spatial-channel filtering module to refine feature representation, effectively boosting the detection performance of small targets in UAV imagery. Experimental results on the VisDrone2019 dataset demonstrated a 7.3 percentage point increase in mAP@0.5 compared to the original YOLOv8s model.
The aforementioned models have provided new methods and insights for improving the accuracy and speed of small target detection in UAV applications. However, existing models still exhibit certain limitations. First, in traditional convolutional neural networks (CNNs), each convolutional layer typically employs fixed and static kernels to extract image features. While this approach performs well for ordinary scene images, it struggles with drone aerial images, where targets are usually tiny in scale and numerous. The limited receptive field of traditional kernels results in weak feature extraction capabilities, adversely affecting detection performance. Second, the YOLOv8 neck structure suffers from a lack of diversity in feature fusion methods, making it ineffective in integrating critical features for multi-scale and complex aerial images. Additionally, in the YOLOv8s model, the input image size is 640 × 640, with a minimum detection scale of 80 × 80. However, in UAV aerial scenarios, small targets are numerous and even smaller in scale, making it difficult for traditional networks to recognize target features within grids, thereby impairing small target detection capabilities.

3. Methods

3.1. Introduction of DGBL-YOLOv8 Object Detection Algorithm

YOLOv8, a high-performance object detection framework, was introduced by Ultralytics on 10 January 2023. It offers five versions—YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x—to cater to diverse computational requirements and application scenarios. Given the limited computational power of UAVs, this study selects YOLOv8s, which balances higher accuracy with lower computational demand, as the base model. The DGBL-YOLOv8 model is then proposed. Firstly, a C2f_DWR module is designed within the backbone to efficiently capture multi-scale contextual information across the network. Secondly, a GLSA-BiFPN network architecture is introduced in the neck to enhance feature fusion capabilities. Furthermore, four detection heads equipped with lightweight shared convolutions are integrated to strengthen the model’s small target detection capability while minimizing computational complexity. The network architecture of DGBL-YOLOv8 is illustrated in Figure 1.

3.1.1. The C2f_DWR Module

Targets in UAV imagery are frequently small in scale, numerous, and captured at long distances. These characteristics challenge the C2f module in the YOLOv8s model, limiting its ability to extract adequate effective information from multi-scale feature maps and reducing detection accuracy. To tackle this issue, this study introduces the C2f_DWR module as a replacement for the C2f module. This new module strengthens feature extraction and enhances the integration of multi-scale features. The design of the C2f_DWR module is depicted in Figure 2.
The DWR module employs a two-step approach to capture multi-scale contextual information. In the first step, Region Residualization generates a series of compact feature maps in the form of regions of varying sizes by utilizing a combination of 3 × 3 convolutions, batch normalization (BN) layers, and ReLU layers. This process transforms the input features into representations with distinct regional characteristics, thereby simplifying the subsequent extraction of multi-scale information. In the second step, Semantic Residualization applies morphological filtering to features of different region sizes using multi-rate dilated depth-wise convolutions. The multi-rate dilated depth-wise convolutions consist of three components: multi-rate, depth-wise separable [21], and dilated convolutions [22]. The initial phase of this stage utilizes dilated convolutions with diverse dilation rates to enhance the receptive field and capture richer contextual information. Following this, depth-wise separable convolutions are implemented to improve the model’s feature extraction capability by increasing the channel dimension, all while minimizing the number of parameters and the computational cost. The structural design of the DWR module is illustrated in Figure 3.
In standard convolution operations, the receptive field size depends on the kernel size and network depth. To obtain a larger receptive field, a model typically increases network depth or expands the convolution kernel size. However, this method results in elevated computational demands and a larger parameter volume, while substantially diminishing the spatial resolution of feature maps, ultimately compromising the model’s capacity to retain fine-grained image details and edge information. To address this issue, the C2f_DWR module employs multi-rate dilated convolution, which amplifies the receptive field without increasing computational complexity or reducing image resolution. This allows the model to capture an immense range of contextual information efficiently.
Specifically, dilated convolution enlarges the receptive field by inserting gaps (dilations) between kernel elements. The number of inserted gaps is determined by the dilation rate, which is calculated using the formula given in Equation (1):
$a' = a + (a - 1) \times (b - 1)$
In the equation, $a$ represents the original convolution kernel size (an $a \times a$ kernel), and $a'$ denotes the effective kernel size of the dilated convolution (an $a' \times a'$ kernel). Parameter $b$ refers to the dilation rate, which is set to b = 1, 3, 5 in this study, indicating the use of three different dilation rates for dilated convolution. By applying dilated convolution, the receptive field of the convolution operation is effectively expanded, enabling the model to capture a wider range of contextual information.
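As a quick check of Equation (1), assuming the 3 × 3 kernels typical of the DWR branches (a = 3), the three dilation rates give the following effective kernel sizes, so the parallel branches cover 3 × 3, 7 × 7, and 11 × 11 receptive fields while keeping the parameter count of a 3 × 3 kernel:

$b = 1:\; a' = 3 + (3-1)(1-1) = 3; \qquad b = 3:\; a' = 3 + (3-1)(3-1) = 7; \qquad b = 5:\; a' = 3 + (3-1)(5-1) = 11$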
Additionally, the C2f_DWR module employs depth-wise separable convolution to reduce computational complexity and parameter count while enhancing the model’s feature extraction capability. This convolution module consists of depth-wise convolution and point-wise convolution, which are responsible for extracting spatial and channel features, respectively. The formulas for computing the parameter count and computational complexity of each convolution operation are presented in Table 1.
In Table 1, P and Q represent the height and width of the input image, while Iin and Iout denote the number of input and output channels before and after the convolution operation, respectively. The parameter m refers to the convolution kernel size. As shown in the table, the parameter count and computational complexity of standard convolution are 4–5 times and 8–9 times greater, respectively, compared to depth-wise separable convolution. Therefore, employing this type of convolution effectively reduces model complexity and enhances computational efficiency.
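To make the structure concrete, the following is a minimal PyTorch sketch of a DWR-style block as described above: region residualization via a 3 × 3 convolution with BN and ReLU, followed by parallel depth-wise dilated branches with rates 1, 3, and 5, and a point-wise fusion. The channel widths, branch layout, and residual connection are illustrative assumptions rather than the exact C2f_DWR implementation.

```python
import torch
import torch.nn as nn


class DWRBlock(nn.Module):
    """Sketch of a Dilation-wise Residual (DWR) style block.

    Step 1 (region residualization): 3x3 conv + BN + ReLU produces regional features.
    Step 2 (semantic residualization): parallel depth-wise dilated convolutions
    (rates 1, 3, 5) enlarge the receptive field; a point-wise conv fuses the branches.
    """

    def __init__(self, channels: int, dilation_rates=(1, 3, 5)):
        super().__init__()
        self.region = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # One depth-wise 3x3 branch per dilation rate; padding=rate keeps spatial size.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=r,
                          dilation=r, groups=channels, bias=False),
                nn.BatchNorm2d(channels),
            )
            for r in dilation_rates
        )
        # Point-wise convolution fuses the concatenated branches back to `channels`.
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * len(dilation_rates), channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        r = self.region(x)
        multi_scale = torch.cat([branch(r) for branch in self.branches], dim=1)
        return self.act(self.fuse(multi_scale) + x)  # residual connection


if __name__ == "__main__":
    feats = torch.randn(1, 64, 80, 80)
    print(DWRBlock(64)(feats).shape)  # torch.Size([1, 64, 80, 80])
```

The depth-wise branches carry the multi-rate dilated convolutions described above, while the point-wise fusion plays the role of the channel-mixing step of a depth-wise separable convolution, keeping the parameter count low.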

3.1.2. Structure of the GLSA-BiFPN

The neck structure of YOLOv8 is designed using a combination of a Feature Pyramid Network (FPN) [23] and a Path Aggregation Network (PAN) [24]. While FPN enhances feature fusion through a top-down pathway, it neglects the reverse transmission of fine-grained shallow features, limiting its ability to capture small object details. PAN, although incorporating a bottom-up pathway, relies on fixed-weight feature fusion, making it unable to dynamically adjust the contributions of different feature levels based on task requirements. To address these limitations, researchers proposed the Bidirectional Feature Pyramid Network (BiFPN), which integrates bidirectional feature fusion, weighted fusion mechanisms, and a simplified structure to enhance feature information integration across different levels. This enhances both the speed and precision of object detection algorithms. However, when dealing with small-scale, densely distributed, and distant objects, BiFPN’s use of standard convolution exhibits limited capability in feature extraction and contextual information fusion.
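For reference, the learnable weighted fusion mentioned above (the fast normalized fusion of BiFPN [12]) can be sketched as follows. This is a generic illustration of the mechanism, not the exact neck implementation used in this paper; the node names in the usage example are hypothetical.

```python
import torch
import torch.nn as nn


class WeightedFusion(nn.Module):
    """Fast normalized fusion from BiFPN:
    O = sum_i(w_i * I_i) / (eps + sum_j w_j), with learnable non-negative weights."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.weights)          # keep weights >= 0
        w = w / (w.sum() + self.eps)          # normalize the contribution of each input
        return sum(wi * x for wi, x in zip(w, inputs))


if __name__ == "__main__":
    # Hypothetical fusion node combining a top-down feature with a lateral feature.
    p4_td = WeightedFusion(num_inputs=2)
    a, b = torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)
    print(p4_td([a, b]).shape)  # torch.Size([1, 64, 40, 40])
```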
To overcome this issue, this study builds upon the BiFPN architecture by introducing the Global-to-Local Spatial Aggregation (GLSA) module, which optimizes standard convolution to enhance feature extraction and information fusion. The resulting GLSA-BiFPN structure improves small object detection performance. The architecture of GLSA-BiFPN is illustrated in Figure 4.
The Global-to-Local Spatial Aggregation (GLSA) module consists of two significant components: the Global Spatial Attention (GSA) module and the Local Spatial Attention (LSA) module. These modules work together to enhance feature representation by integrating both global and local spatial information. The architecture of the GLSA module is illustrated in Figure 5.
Specifically, the 64-channel input feature map is first split evenly into two parts along the channel dimension. Each subset of feature maps is independently processed through the Global Spatial Attention (GSA) module and the Local Spatial Attention (LSA) module. Subsequently, the outputs are combined by a 1 × 1 convolutional layer along the channel dimension to generate the final feature map. The corresponding mathematical formulations are presented in Equations (2) and (3):
$A_{i1}, A_{i2} = \mathrm{Split}(A_i)$

$A_i' = C_{1 \times 1}\big(\mathrm{Concat}\big(X_{sa}(A_{i1}),\, Y_{sa}(A_{i2})\big)\big)$
In the equations, $X_{sa}$ and $Y_{sa}$ represent the Global Spatial Attention and Local Spatial Attention operations, respectively. $A_i$ denotes the 64-channel input feature map, with $\{A_i \mid i \in (2, 3, 4)\}$; $A_{i1}$ and $A_{i2}$ denote the two resulting 32-channel feature maps $\{A_{i1}, A_{i2} \mid i \in (2, 3, 4)\}$; and $A_i'$ denotes the output feature map.
The Global Spatial Attention (GSA) module focuses on capturing long-range relationships between pixels in the spatial domain, serving as a complement to Local Spatial Attention (LSA). The mathematical formulations for GSA are provided in Equations (4) and (5):
$Z_{ttX}(A_{i1}) = \mathrm{Softmax}\big(\mathrm{Transpose}\big(C_{1 \times 1}(A_{i1})\big)\big)$

$X_{sa}(A_{i1}) = \mathrm{MLP}\big(Z_{ttX}(A_{i1}) \otimes A_{i1}\big) + A_{i1}$
In the equations, ZttX represents the global attention operation, and C1×1 denotes a 1 × 1 convolution. The symbol ⊗ denotes matrix multiplication, and MLP(·) signifies a multi-layer perceptron comprising two fully connected layers, a ReLU activation function, and a normalization layer. The first layer of the MLP projects the input into a higher-dimensional space with an expansion factor of two, while the second layer restores the dimensionality to match the original input size.
The Local Spatial Attention (LSA) module extracts local features from regions of interest within the spatial dimensions of a given feature map. The mathematical formulations for LSA are provided in Equations (6) and (7):
$Z_{ttY}(A_{i2}) = \sigma\big(C_{1 \times 1}\big(A_c(A_{i2})\big) + A_{i2}\big)$

$Y_{sa}(A_{i2}) = Z_{ttY}(A_{i2}) \odot A_{i2} + A_{i2}$
In the equations, Ac(⋅) represents a sequence of three 1 × 1 convolution layers followed by a 3 × 3 depth-wise convolution layer. ZttY denotes the local attention operation, while σ(⋅) represents the Sigmoid activation function. The symbol ⊙ indicates element-wise multiplication.
The GLSA module effectively obtains both local spatial details and global semantic information from an image. Additionally, its dual-stream design enables the model to process local features while simultaneously capturing global relationships and dependencies. By separating channels, the module can also balance detection accuracy and computational efficiency, making it adaptable to different resource constraints.
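The following PyTorch sketch mirrors Equations (2)-(7): the 64-channel input is split into two 32-channel halves, one half passes through a global branch (GSA) and the other through a local branch (LSA), and the results are concatenated and fused with a 1 × 1 convolution. The internal layer choices (MLP expansion factor of two, composition of Ac, placement of the normalization layer) follow the description above, but the details are assumptions and may differ from the reference GLSA implementation.

```python
import torch
import torch.nn as nn


class GSA(nn.Module):
    """Global Spatial Attention: a softmax-normalized spatial map (Eq. 4) pools the
    input into a global descriptor, which a two-layer MLP refines before a residual
    add (Eq. 5)."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        self.to_map = nn.Conv2d(channels, 1, kernel_size=1)   # C1x1 in Eq. (4)
        self.mlp = nn.Sequential(                             # MLP in Eq. (5)
            nn.Linear(channels, channels * expansion),
            nn.ReLU(inplace=True),
            nn.Linear(channels * expansion, channels),
            nn.LayerNorm(channels),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        attn = self.to_map(x).view(b, 1, h * w).softmax(dim=-1)        # (B, 1, HW)
        ctx = torch.bmm(x.view(b, c, h * w), attn.transpose(1, 2))     # (B, C, 1) global context
        ctx = self.mlp(ctx.view(b, c)).view(b, c, 1, 1)
        return ctx + x                                                 # broadcast residual add


class LSA(nn.Module):
    """Local Spatial Attention: Ac (three 1x1 convs + a 3x3 depth-wise conv) feeds a
    sigmoid gate (Eq. 6) that re-weights the input element-wise (Eq. 7)."""

    def __init__(self, channels: int):
        super().__init__()
        self.ac = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
        )
        self.proj = nn.Conv2d(channels, channels, 1)          # C1x1 in Eq. (6)

    def forward(self, x):
        gate = torch.sigmoid(self.proj(self.ac(x)) + x)
        return gate * x + x


class GLSA(nn.Module):
    """Split the 64-channel input into two 32-channel halves (Eq. 2), apply GSA/LSA,
    then concatenate and fuse with a 1x1 convolution (Eq. 3)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        half = channels // 2
        self.gsa, self.lsa = GSA(half), LSA(half)
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        a1, a2 = torch.chunk(x, 2, dim=1)
        return self.fuse(torch.cat([self.gsa(a1), self.lsa(a2)], dim=1))


if __name__ == "__main__":
    print(GLSA(64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```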

3.1.3. Lightweight Shared Convolutional Detection Head

The general YOLOv8 model’s detection head employs a decoupled architecture, isolating classification and regression tasks to enable distinct feature extraction for each objective. This design reduces mutual interference between tasks and improves detection accuracy. Additionally, YOLOv8 employs an anchor-free detection strategy, replacing the traditional anchor-based approach, thereby reducing computational complexity, enhancing detection speed, and improving the detection capability for small and densely packed objects.
However, the detection head of YOLOv8 includes three detection scales: 80 × 80, 40 × 40, and 20 × 20. During multi-scale detection, the independent use of convolutional operations in each detection head significantly increases computational load and model parameters. Moreover, the batch normalization (BN) operation in the detection head, which relies on batch size, can lead to unstable normalization effects when trained with small batch sizes. This also results in suboptimal handling of sparse small-object features.
To address these problems, this study introduces a lightweight shared convolutional detection head (LSCDH) based on the principles of shared convolutions and group normalization. Additionally, to accommodate the characteristics of UAV-captured scenes, which often feature numerous small and distant targets, the LSCDH incorporates an extra detection head with a 160 × 160 scale to further enhance detection accuracy. The architecture of the LSCDH is illustrated in Figure 6.
As illustrated in the figure, the LSCDH first processes the four layers of features output from the neck through independent convolutional operations before feeding them into a shared convolutional layer. The shared convolutional layer facilitates information transfer across detection layers, leveraging multi-level features to predict target locations and categories. Subsequently, a Scale Layer adjusts the output features to meet the requirements of the four detection heads of different sizes. Notably, the LSCDH replaces batch normalization (BN) with group normalization (GN) in the convolutional operations to eliminate dependency on batch size. Group normalization (GN) partitions the channel dimension of input feature maps into several groups, calculating the mean and variance for normalization within each group. This method effectively mitigates the instability associated with normalization processes and improves the preservation of sparse features corresponding to small objects. Consequently, it enhances both the accuracy and robustness of the detection head.
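Before turning to the group normalization formulation, the shared-convolution idea can be illustrated with a minimal PyTorch sketch: one set of convolutional weights with GN serves all four feature levels, while per-level input projections, Scale layers, and a final 1 × 1 prediction layer adapt the shared features to each scale. The channel counts, kernel sizes, and the single combined prediction branch are simplified assumptions, not the full LSCDH.

```python
import torch
import torch.nn as nn


def conv_gn(c_in: int, c_out: int, k: int = 3, groups: int = 16) -> nn.Sequential:
    """Conv + GroupNorm + SiLU; GN removes the batch-size dependence of BN."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.GroupNorm(groups, c_out),
        nn.SiLU(inplace=True),
    )


class Scale(nn.Module):
    """Learnable per-level scalar that adapts the shared features to each detection scale."""

    def __init__(self, init: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init))

    def forward(self, x):
        return x * self.scale


class SharedHead(nn.Module):
    """One set of shared convolution weights serves all four feature levels
    (160x160, 80x80, 40x40, 20x20); only the input projections, Scale layers,
    and the 1x1 prediction layer differ per level (channel sizes are assumptions)."""

    def __init__(self, in_channels=(64, 128, 256, 512), hidden: int = 128, num_outputs: int = 10):
        super().__init__()
        self.align = nn.ModuleList(conv_gn(c, hidden, k=1) for c in in_channels)   # per-level projection
        self.shared = nn.Sequential(conv_gn(hidden, hidden), conv_gn(hidden, hidden))  # shared weights
        self.scales = nn.ModuleList(Scale() for _ in in_channels)
        self.predict = nn.Conv2d(hidden, num_outputs, kernel_size=1)

    def forward(self, feats):
        return [self.predict(scale(self.shared(align(f))))
                for f, align, scale in zip(feats, self.align, self.scales)]


if __name__ == "__main__":
    feats = [torch.randn(2, c, s, s) for c, s in zip((64, 128, 256, 512), (160, 80, 40, 20))]
    for out in SharedHead()(feats):
        print(out.shape)
```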
The computation process of group normalization (GN) is formulated as Equation (8):
$\hat{e}_h = \dfrac{1}{\lambda_h}\left(e_h - \theta_h\right)$
In the equation, $e$ represents the feature computed by a layer, and $h$ denotes the index. For a 2D image, $h = (h_R, h_S, h_T, h_U)$ is a four-dimensional index following the order (R, S, T, U), where R denotes the batch size, S denotes the number of channels, and T and U correspond to the spatial height and width, respectively.
In Equation (8), θ and λ represent the mean and standard deviation, respectively. Their computation is defined by Equations (9) and (10):
$\theta_h = \dfrac{1}{n}\sum_{j \in V_h} e_j$

$\lambda_h = \sqrt{\dfrac{1}{n}\sum_{j \in V_h}\left(e_j - \theta_h\right)^2 + \omega}$
In the equations, ω represents a small constant to ensure numerical stability. Vh denotes the set of pixels used for computing the mean and standard deviation, while n represents the size of this set.
Additionally, group normalization (GN) incorporates learnable γ and β parameters for each channel, which are defined in Equation (11):
$y_h = \gamma \hat{e}_h + \beta$
In the equation, γ and β represent trainable scaling and shift parameters, respectively. These parameters are always indexed by hS; however, for notation simplicity, the hS subscript is omitted.
The set of pixels $V_h$ over which group normalization (GN) computes its statistics is defined in Equation (12):
$V_h = \left\{\, j \;\middle|\; j_R = h_R,\ \left\lfloor \dfrac{j_S}{S/W} \right\rfloor = \left\lfloor \dfrac{h_S}{S/W} \right\rfloor \right\}$
In the equation, W represents the number of groups, which is a predefined hyperparameter (default W = 32). S/W denotes the number of channels per group, and ⌊·⌋ indicates the floor function, which rounds the value down to the nearest integer.
Assuming that channels in each group are stored sequentially along the S dimension, the condition $\lfloor j_S/(S/W)\rfloor = \lfloor h_S/(S/W)\rfloor$ ensures that indices h and j belong to the same group of channels. The computation of group normalization (GN) is performed along the (T, U) spatial dimensions and within each group of S/W channels to determine θ (mean) and λ (standard deviation).
Therefore, group normalization (GN) for Vh is collectively defined by Equations (8)–(11). Specifically, all pixels within the same group are normalized using the same θ (mean) and λ (standard deviation), while each channel learns its own γ (scaling factor) and β (shift parameter).
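Equations (8)-(12) can be checked numerically against PyTorch's built-in nn.GroupNorm. The sketch below normalizes each group of S/W channels over the (T, U) spatial dimensions exactly as described, with γ = 1, β = 0, and ω taken as 1e-5, and compares the result with the library implementation.

```python
import torch
import torch.nn as nn


def group_norm_manual(e: torch.Tensor, num_groups: int, eps: float = 1e-5) -> torch.Tensor:
    """Group normalization per Eqs. (8)-(12): mean and standard deviation are computed
    within each group of S/W channels over the spatial dimensions (gamma=1, beta=0)."""
    R, S, T, U = e.shape                                        # batch, channels, height, width
    g = e.reshape(R, num_groups, S // num_groups, T, U)         # group the channels, Eq. (12)
    theta = g.mean(dim=(2, 3, 4), keepdim=True)                 # Eq. (9)
    lam = torch.sqrt(g.var(dim=(2, 3, 4), unbiased=False, keepdim=True) + eps)  # Eq. (10)
    return ((g - theta) / lam).reshape(R, S, T, U)              # Eq. (8); Eq. (11) with gamma=1, beta=0


if __name__ == "__main__":
    x = torch.randn(2, 64, 20, 20)
    W = 32                                                      # number of groups (GN default)
    gn = nn.GroupNorm(num_groups=W, num_channels=64, eps=1e-5, affine=False)
    print(torch.allclose(group_norm_manual(x, W), gn(x), atol=1e-5))  # True
```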

4. Experiment and Result Analysis

4.1. VisDrone2019 Dataset

To demonstrate the efficacy of the enhanced DGBL-YOLOv8s model for UAV aerial image detection, experiments were conducted using the VisDrone2019 dataset as the benchmark. This dataset, created by the AISKYEYE team at Tianjin University, is one of the mainstream datasets for UAV aerial imagery and contains images captured under various environmental conditions, including different scenarios, object densities, weather conditions, and lighting conditions. It comprises 8629 images, distributed across training, validation, and test sets with 6471, 548, and 1610 images, respectively. It encompasses 10 distinct categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor.

4.2. Experimental Environment

The detailed information regarding the software, hardware, and experimental environment used in the experiments is presented in Table 2, while the relevant training parameters are summarized in Table 3.
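Under the settings in Table 3, a baseline training run can be reproduced with the Ultralytics API roughly as follows. The dataset YAML path "VisDrone.yaml" is a placeholder, and reproducing DGBL-YOLOv8s itself would additionally require a custom model YAML encoding the architecture changes of Section 3; this sketch only mirrors the listed hyperparameters.

```python
from ultralytics import YOLO

# "yolov8s.yaml" builds the baseline architecture without pretrained weights
# ("Pretrain: Closed" in Table 3); "VisDrone.yaml" is a placeholder dataset file.
model = YOLO("yolov8s.yaml")

model.train(
    data="VisDrone.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    workers=8,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,
    mosaic=1.0,          # mosaic data augmentation enabled
    pretrained=False,
)
```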

4.3. Evaluation Index

This study evaluates model performance using five metrics: precision (P), recall (R), mean average precision (mAP), number of parameters (Parameters, Para), and model size (Model Size).
Precision (P) represents the proportion of correctly classified positive samples, reflecting the model’s ability to classify accurately. Here, TP (True Positive) represents correctly classified positive samples, while FP (False Positive) denotes negative samples mistakenly identified as positive. The formula for precision is given by Equation (13):
$P = \dfrac{TP}{TP + FP}$
Recall (R) represents the proportion of correctly predicted positive samples out of the total number of actual positive samples, reflecting the model’s ability to comprehensively detect targets. It quantifies the samples missed during target identification. FN (False Negative) represents positive samples incorrectly classified as negative, i.e., missed positives. The formula for recall is given by Equation (14):
$R = \dfrac{TP}{TP + FN}$
mAP (mean average precision) refers to the average precision of all targets detected by the model, reflecting the ability of the predicted bounding boxes to overlap with the ground truth labels. A higher mAP value indicates better detection performance across different categories. AP (average precision) quantifies the model’s precision for each category at a given IoU (Intersection over Union) threshold, where n represents the number of categories. Note that mAP in this paper refers to mAP50, indicating the mAP at IoU = 0.5. The formula for mAP is given by Equation (15):
$mAP = \dfrac{1}{n}\sum_{i=1}^{n} AP_i, \qquad AP = \int_0^1 P(R)\,\mathrm{d}R$
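The sketch below computes precision, recall, and a single-class AP following Equations (13)-(15), integrating precision over recall for confidence-ranked predictions. It is a didactic illustration with toy inputs, not the VisDrone evaluation code; averaging the per-class AP values then yields mAP.

```python
import numpy as np


def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from Eq. (13) and Eq. (14)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


def average_precision(is_tp: np.ndarray, num_gt: int) -> float:
    """AP for one class (Eq. 15): integrate precision over recall, with predictions
    sorted by descending confidence; is_tp[i] = 1 if prediction i matches a
    ground-truth box at IoU >= 0.5."""
    tp_cum = np.cumsum(is_tp)
    fp_cum = np.cumsum(1 - is_tp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # Piecewise integration of P(R) over the recall increments.
    ap = float(recall[0] * precision[0])
    ap += float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
    return ap


if __name__ == "__main__":
    print(precision_recall(tp=8, fp=2, fn=4))                       # (0.8, 0.666...)
    # Toy example: 6 confidence-ranked detections against 5 ground-truth objects.
    print(average_precision(np.array([1, 1, 0, 1, 0, 1]), num_gt=5))
```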
Parameters refer to the total number of trainable parameters in the model, including the weights and biases of convolutional layers, as well as parameters from other specific layers, such as fully connected layers and normalization layers.
Model size is used to evaluate the complexity of the model. Generally, a smaller model size indicates that the model requires less computational power, making it easier to deploy on edge devices.

4.4. Process of Experiment

4.4.1. Ablation Experiment

To assess and analyze the effectiveness of the proposed enhancement strategies, ablation studies were performed using the VisDrone dataset, with the corresponding results detailed in Table 4.
As evidenced by the data presented in Table 4, the following observations can be made.
Experiment v8s-0 represents the original YOLOv8s algorithm, achieving a baseline mAP of 36.3%. This serves as the reference point for all subsequent improvements.
Experiment v8s-1 replaces the original C2f module with the proposed C2f-DWR module. Without increasing computational cost, the mAP improves by 0.9% (from 36.3% to 37.2%), demonstrating the effectiveness of the C2f-DWR module in enhancing feature representation.
Experiment v8s-2 replaces the neck part of the traditional YOLOv8 structure with the proposed GLSA-BiFPN module. Compared to v8s-0, the mAP improves by 1.7% (from 36.3% to 38.0%), while reducing the number of parameters by 28.8% and the model size by 27.9%. This indicates that the GLSA-BiFPN module not only enhances feature extraction, but also significantly reduces model complexity.
Experiment v8s-3 introduces the lightweight shared convolutional detection head (LSCDH) on the basis of v8s-0. Compared to v8s-0, the mAP improves by 6.3% (from 36.3% to 42.6%), proving that the LSCDH can better capture diverse semantic information in the features, leading to a substantial performance boost.
Experiment v8s-4 adopts the GLSA-BiFPN structure on the basis of v8s-1. Compared to v8s-0, the mAP improves by 1.6% (from 36.3% to 37.9%), while reducing the model size by 30.2%. This further validates the efficiency of the GLSA-BiFPN module in balancing performance and model complexity.
Experiment v8s-5 adds the lightweight shared convolutional detection head (LSCDH) on the basis of v8s-1. Compared to v8s-0, the mAP improves by 6.5% (from 36.3% to 42.8%), highlighting the significant contribution of the LSCDH to detection accuracy.
Experiment v8s-6 integrates both the GLSA-BiFPN module and the lightweight shared convolutional detection head (LSCDH) on the basis of v8s-2. Compared to v8s-0, the mAP improves by 8.4% (from 36.3% to 44.7%), while reducing the model size by 31.6%, demonstrating the synergistic effect of combining the GLSA-BiFPN module and the LSCDH. Finally, experiment v8s-7 combines all three improvements (C2f-DWR, GLSA-BiFPN, and LSCDH) and corresponds to the full DGBL-YOLOv8s model: the mAP improves by 8.5% (from 36.3% to 44.8%), while the parameter count drops to 7.2 M and the model size to 14.2 MB, achieving the best overall trade-off among all experiments.
The data recorded in the ablation experiments above indicate that each improvement proposed in this study enhances the model’s detection performance to a certain extent.

4.4.2. Comparative Experiments

To illustrate the superiority of the proposed improved algorithm, DGBL-YOLOv8s, comparative experiments were conducted on the VisDrone dataset against the two-stage algorithm Faster-RCNN and single-stage algorithms (SSD, YOLOv3-Tiny, YOLOv4, YOLOv5). The experimental results are shown in Table 5.
In the above table, the proposed DGBL-YOLOv8s outperforms several popular baseline models in key performance metrics, demonstrating significant improvements in both detection accuracy and model efficiency. Specifically, DGBL-YOLOv8s achieves an impressive mAP50 of 44.8%, which is a substantial improvement over YOLOv8s (36.3%) and other models such as YOLOv5 (32.9%) and YOLOv4 (15.7%). Additionally, it shows remarkable improvements in precision (P) and recall (R), reaching 53.6% and 42.6%, respectively, surpassing YOLOv8s (47.0% and 35.7%) and YOLOv5 (43.4% and 34.4%). In terms of computational efficiency, DGBL-YOLOv8s maintains a small model size (14.2 MB) and a low parameter count (7.2 M), making it more lightweight than many alternatives, like Faster-RCNN (260.8 MB, 136.7 M parameters) and YOLOv4 (118.2 MB, 61.9 M parameters), while still achieving superior performance.

4.5. Analysis of Results

To validate the robustness of the DGBL-YOLOv8s algorithm, a visual analysis of the effective receptive field in the backbone network was performed, comparing scenarios before and after the enhancements. Figure 7 illustrates this comparison, with the left image (a) depicting the baseline network and the right image (b) showcasing the network integrated with the C2f_DWR module:
To provide a clearer explanation of the changes in the receptive field before and after the improvements, a line chart is used to illustrate the variations in the receptive field of the model’s backbone network. (Here, “Thresh” represents the threshold, set to 0.2, 0.3, 0.5, and 0.99, respectively; “Rectangle Side Length” denotes the side length of the rectangle; and “Area Ratio” indicates the corresponding area ratio.) The comparative line chart is shown in Figure 8.
From the line chart, it is evident that replacing the backbone network with the C2f_DWR module enlarges the model’s receptive field, enabling it to capture richer feature information and significantly improving its feature extraction capabilities.
Subsequently, the VisDrone dataset was selected for visualization under various lighting conditions, different scenarios, and environments with dense targets and complex categories. The left image (a) displays the detection results of YOLOv8s, while the right image (b) presents those of DGBL-YOLOv8s. As illustrated in Figure 9, DGBL-YOLOv8s is capable of identifying more target categories during the daytime compared to the original model, indicating its superior detection capability. From the comparison of images in Figure 10I,II, it is evident that DGBL-YOLOv8s can detect more target categories and quantities in the dark, proving its effectiveness in target detection tasks under varying lighting conditions. Furthermore, as shown in the comparisons in Figure 11 and Figure 12, DGBL-YOLOv8s significantly reduces the rates of false positives and missed detections compared to the original model, demonstrating its robustness and accuracy.
To demonstrate the efficacy of the enhanced algorithm, this paper compares the mAP (mean average precision) of each category before and after the model improvement, as shown in Figure 13. From the figure, it can be observed that the improved DGBL-YOLOv8s model exhibits varying degrees of enhancement in detection accuracy across all categories compared to the original model.

5. Discussion

The experimental results indicate that the DGBL-YOLOv8s model has achieved significant improvements in scenarios captured by drones, particularly in detecting targets with small scales, large quantities, and long distances. However, despite the model’s impressive detection accuracy and performance enhancements, there are still some issues worthy of in-depth discussion. Firstly, the model exhibits certain limitations in detecting high-density target areas or extremely small objects. Especially in scenarios with intricate backgrounds or overlapping multiple targets, the improved model may experience some interference, leading to a decline in detection accuracy for small objects. Secondly, although this paper introduces the GLSA-BiFPN structure in the neck part and the concepts of lightweight shared convolution and group normalization in the detection head to reduce the model’s computational complexity, the data in Table 6 reveal that the addition of a P2-level small-target detection head significantly increases the model’s complexity. Consequently, deploying the model on devices with limited computational resources could impose a substantial computational burden. The computational complexity (GFLOPs) of the model after incorporating the improvements is shown in Table 6.
I: The baseline YOLOv8s model has a computational complexity of 28.5 GFLOPs.
II: The introduction of the DWR module reduces the computational cost slightly to 28.1 GFLOPs. This indicates that it enhances feature extraction and contextual information capture without significantly increasing computational complexity.
III: The GLSA-BiFPN module further reduces the computational cost to 26.7 GFLOPs. This demonstrates that its structure not only improves multi-scale feature fusion, but also optimizes computational efficiency.
IV: Due to the introduction of additional detection layers and new mechanisms in LSCDH to improve the detection accuracy of small objects, its computational overhead increases to 45.7 GFLOPs.
V: The final proposed model, DGBL-YOLOv8s, has a computational cost of 53.7 GFLOPs. This increase is primarily attributed to the combination of the DWR module, GLSA-BiFPN, and LSCDH. Despite the higher computational cost compared to the baseline, the DGBL-YOLOv8s model achieves significant improvements in detection accuracy (8.5% increase in mAP) and model efficiency (35.1% reduction in parameters and 33.9% decrease in model size).
Consequently, future research should focus on incorporating advanced model compression methods, including pruning, quantization, knowledge distillation, and lightweight architecture design, to minimize the model’s parameters and computational demands. This would facilitate its deployment on resource-constrained devices, such as drones and mobile platforms. Additionally, to address the challenges posed by complex drone-captured scenarios, it is worth considering the integration of reinforcement learning with meta-learning or online learning techniques. This would enable the model to autonomously adjust its parameters and strategies in dynamic environments, enhancing real-time performance and robustness. Furthermore, exploring multi-modal data fusion, such as combining visible light, infrared, and LiDAR data, could significantly improve detection performance in complex scenarios. Future studies should also prioritize small object detection by investigating higher-resolution inputs, refined feature extraction methods, and specialized loss functions. Finally, developing larger and more diverse datasets, along with comprehensive evaluation metrics, would provide a stronger foundation for validating model generalization capabilities in real-world applications.

6. Conclusions

This study proposes an improved object detection model, DGBL-YOLOv8s, designed to address challenges in drone-captured scenarios, such as weak feature extraction, poor information fusion, and low detection accuracy caused by small target scales, large quantities, and long distances. Firstly, the YOLOv8 backbone network is augmented by incorporating the Dilation-wise Residual (DWR) module into the C2f module, enhancing its feature extraction and contextual information capture capabilities. Secondly, the neck structure employs a Global-to-Local Spatial Aggregation Bidirectional Feature Pyramid Network (GLSA-BiFPN), which improves multi-scale feature fusion by effectively combining local spatial details with global spatial semantics. Additionally, the detection head incorporates the concepts of shared convolution and group normalization, along with the introduction of a small target detection head, which improves detection accuracy while reducing computational complexity and the number of parameters. The experimental results indicate that the DGBL-YOLOv8s model achieves an 8.5% increase in mean average precision (mAP) over the baseline model, alongside a 35.1% reduction in parameters (from 11.1 M to 7.2 M) and a 33.9% decrease in model size (from 21.5 MB to 14.2 MB). These improvements confirm the efficacy and advantages of the proposed enhancements for drone-based object detection. Future efforts will concentrate on further refining the model architecture to lower computational demands and investigating its applicability in more intricate scenarios.

Author Contributions

Conceptualization, C.W.; Methodology, C.W.; Software, C.W.; Validation, C.W.; Investigation, H.Y.; Resources, H.Y.; Writing—original draft, C.W.; Writing—review and editing, C.W.; Visualization, C.W.; Supervision, H.Y.; Project administration, H.Y.; Funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All of the data are contained within the article. To request the data and code, please send an email to the first or corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mittal, P.; Singh, R.; Sharma, A. Deep learning-based object detection in low-altitude UAV datasets: A survey. Image Vis. Comput. 2020, 104, 104046. [Google Scholar] [CrossRef]
  2. Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar] [CrossRef]
  3. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef]
  4. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  5. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  7. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
  8. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  9. Lyu, Z.; Jin, H.; Zhen, T.; Sun, F.; Xu, H. Small object recognition algorithm of grain pests based on SSD feature fusion. IEEE Access 2021, 9, 43202–43213. [Google Scholar] [CrossRef]
  10. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  11. Wei, H.; Liu, X.; Xu, S.; Dai, Z.; Dai, Y.; Xu, X. DWRSeg: Rethinking Efficient Acquisition of Multi-Scale Contextual Information for Real-Time Semantic Segmentation. arXiv 2022, arXiv:2212.01173. [Google Scholar]
  12. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  13. Tang, F.; Xu, Z.; Huang, Q.; Wang, J.; Hou, X.; Su, J.; Liu, J. DuAT: Dual-aggregation transformer network for medical image segmentation. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Xiamen, China, 13–15 October 2023; pp. 343–356. [Google Scholar]
  14. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  15. Liu, M.; Wang, X.; Zhou, A.; Fu, X.; Ma, Y.; Piao, C. Uav-yolo: Small object detection on unmanned aerial vehicle perspective. Sensors 2020, 20, 2238. [Google Scholar] [CrossRef] [PubMed]
  16. Bochkovskiy, A.; Wang, C.Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  17. Wang, C.Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  18. Li, Z.; Fan, B.; Xu, Y.; Sun, R. Improved YOLOv5 for aerial images based on attention mechanism. IEEE Access 2023, 11, 96235–96241. [Google Scholar] [CrossRef]
  19. Wang, J.; Gao, J.; Zhang, B. A small object detection model in aerial images based on CPDD-YOLOv8. Sci. Rep. 2025, 15, 770. [Google Scholar] [CrossRef] [PubMed]
  20. Zhao, J.D.; Zhen, G.Y.; Chu, C.Q. Unmanned Aerial Vehicle Image Target Detection Algorithm Based on YOLOv8. J. Comput. Eng. 2024, 50, 113–120. [Google Scholar]
  21. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  22. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding convolution for semantic segmentation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460. [Google Scholar]
  23. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  24. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
Figure 1. Architecture of the DGBL-YOLOv8.
Figure 2. C2f_DWR.
Figure 3. Architecture of the DWR module.
Figure 4. Architecture of the GLSA-BiFPN.
Figure 5. Architecture of the GLSA module.
Figure 6. Architecture of the LSCDH.
Figure 7. Comparison of the changes in the receptive field of the backbone network.
Figure 8. Line graph of receptive field contrast.
Figure 9. Daytime detection results: (a) unimproved detection results and (b) improved detection results.
Figure 10. (I) Night detection results: (a) unimproved detection results and (b) improved detection results. (II) Night detection results: (a) unimproved detection results and (b) improved detection results.
Figure 11. Comparison of false detection: (a) unimproved detection results and (b) improved detection results.
Figure 12. Comparison of missed detection: (a) unimproved detection results and (b) improved detection results.
Figure 13. mAP comparison between different model categories.
Table 1. Formulas for convolution parameters and computational complexity.

| Convolution Type | Parameters | Computation |
| Standard Convolution | Iin × Iout × m × m | P × Q × Iin × Iout × m × m |
| Depth-wise Convolution | Iin × m × m | P × Q × Iin × m × m |
| Point-wise Convolution | Iin × Iout | P × Q × Iin × Iout |
Table 2. Experimental environment.

| Environment | Type |
| Operation system | Ubuntu 11.4.0 |
| GPU | NVIDIA GeForce RTX 3090 |
| Programming language | Python 3.10.14 |
| CUDA | 12.1 |
| Module platform | Pytorch 2.2.2 |
| IDE platform | Pycharm |
Table 3. Experimental training parameters.

| Parameter Names | Settings |
| Epoch | 100 |
| Pretrain | Closed |
| Input images size | 640 × 640 |
| Batchsize | 16 |
| Workers | 8 |
| lr0 | 0.01 |
| Optimizer | SGD |
| Momentum | 0.937 |
| Weight_decay | 0.0005 |
| Data augmentation | mosaic |
Table 4. Ablation experiments.

| Baseline | C2f_DWR | GLSA-BiFPN | LSCDH | P/% | R/% | mAP50/% | Para/M | Model Size/MB |
| v8s-0 | | | | 47.0 | 35.7 | 36.3 | 11.1 | 21.5 |
| v8s-1 | ✓ | | | 48.9 | 36.4 | 37.2 | 10.9 | 21.0 |
| v8s-2 | | ✓ | | 49.0 | 37.5 | 38.0 | 7.9 | 15.5 |
| v8s-3 | | | ✓ | 52.0 | 41.6 | 42.6 | 9.6 | 18.7 |
| v8s-4 | ✓ | ✓ | | 49.3 | 37.2 | 37.9 | 7.7 | 15.0 |
| v8s-5 | ✓ | | ✓ | 50.6 | 41.7 | 42.8 | 9.4 | 18.3 |
| v8s-6 | | ✓ | ✓ | 53.9 | 42.9 | 44.7 | 7.4 | 14.7 |
| v8s-7 | ✓ | ✓ | ✓ | 53.6 | 42.6 | 44.8 | 7.2 | 14.2 |

Note: Bold letters indicate the best experimental results.
Table 5. Comparative experiments.

| Baseline | P/% | R/% | mAP50/% | Para/M | Model Size/MB |
| Faster-RCNN | 34.6 | 17.0 | 14.4 | 136.7 | 260.8 |
| SSD | 35.0 | 16.7 | 14.2 | 22.8 | 45.6 |
| YOLOv3-Tiny | 26.3 | 17.4 | 14.6 | 64.4 | 122.8 |
| YOLOv4 | 30.5 | 16.0 | 15.7 | 61.9 | 118.2 |
| YOLOv5 | 43.4 | 34.4 | 32.9 | 7.0 | 16.0 |
| YOLOv8s | 47.0 | 35.7 | 36.3 | 11.1 | 21.5 |
| DGBL-YOLOv8s | 53.6 | 42.6 | 44.8 | 7.2 | 14.2 |

Note: Bold letters indicate the best experimental results.
Table 6. Computational complexity of the improved model.

| Number | Baseline | GFLOPs |
| I | YOLOv8s | 28.5 |
| II | DWR | 28.1 |
| III | GLSA-BiFPN | 26.7 |
| IV | LSCDH | 45.7 |
| V | DGBL-YOLOv8s | 53.7 |
Citation: Wang, C.; Yi, H. DGBL-YOLOv8s: An Enhanced Object Detection Model for Unmanned Aerial Vehicle Imagery. Appl. Sci. 2025, 15, 2789. https://doi.org/10.3390/app15052789