To address the limitations of inaccurate localization, missed detections, and false positives in drone imagery object detection with the YOLOv8n baseline model, this paper proposes LMEC-YOLOv8—an enhanced YOLOv8n-based model for UAV target detection, whose architecture is depicted in
Figure 2. The methodology begins by designing a Lightweight Multi-scale Convolution Module (LMSCM) to replace the second convolution in the C2F bottleneck and integrating Partial Convolution (PConv), forming the Lightweight Multi-scale Fusion of PConv-based C2F (LMS-PC2F) and thereby improving feature representation accuracy. Subsequently, a Multi-scale Convolutional Block Attention Module (MSCBAM) is developed by enhancing CBAM and is embedded into the backbone network to improve detection precision, robustness, and multi-scale feature extraction. A further innovation is the Enhanced Spatial Pyramid Pooling Module (ESPPM), in which adaptive average pooling and a Large Separable Kernel Residual Attention Mechanism are introduced into the SPPF structure to preserve fine-grained image features while enabling efficient multi-scale feature fusion. Finally, the original neck network is replaced with a Cross-layer Bidirectional Feature Pyramid Network (CBiFPN), which facilitates hierarchical semantic interaction through weighted cross-level feature fusion, effectively addressing challenges posed by scale variations and occlusions. The research process is shown in
Figure 3.
The components of LMEC-YOLOv8 act synergistically: (1) LMS-PC2F reduces parameters while capturing multi-scale features; (2) MSCBAM suppresses noise via multi-scale attention; (3) ESPPM enhances long-range dependency modeling; and (4) CBiFPN resolves occlusion via cross-layer fusion. This unified design addresses UAV-specific challenges holistically.
3.2.1. LMS-PC2F Module
The C2F module in the YOLOv8n model enhances detection capabilities in complex backgrounds or for targets with rich details by improving feature fusion. However, due to the limited storage and computational capabilities of edge devices such as UAVs, detection models must prioritize lightweight design to maintain efficient object detection performance under constrained hardware resources. To address this, we propose the Lightweight Multi-Scale Convolution Module (LMSCM), as illustrated in
Figure 4.
The overall structure of the LMSCM resembles the pyramid architecture of SPPELAN [6]. Let C denote the number of input channels and X the input feature map. Within this module, C is divided into three components, C1, C2, and C3, and the feature map is hierarchically partitioned three times. First, X is split into X1 and X2, and X1 is fed into a 1 × 1 convolution to extract local features. Second, X2 is further divided into X21 and X22; X21 undergoes a 3 × 3 convolution to capture mid-scale features, balancing broader contextual information with fine local details. Finally, X22 is subdivided into X221 and X222, which are processed by 5 × 5 and 7 × 7 convolutions, respectively, to extract larger-scale features tailored for detecting large or blurry targets.
This pyramid structure employs multi-scale convolutional operations to simultaneously extract features at varying receptive fields, significantly enhancing the recognition of diverse targets. Despite multiple convolutional layers, parallel computing minimizes computational redundancy, improves overall efficiency, and ensures adaptability to the complexity and variability of environments.
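To make the hierarchical split concrete, the following PyTorch sketch illustrates an LMSCM-style block. The even split ratios (roughly C/2, C/4, C/8, C/8), the omission of normalisation and activation layers, and all module names are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LMSCM(nn.Module):
    """Sketch of the Lightweight Multi-scale Convolution Module:
    hierarchical channel splits feeding 1x1, 3x3, 5x5 and 7x7 branches."""
    def __init__(self, channels: int):
        super().__init__()
        c2, c4, c8 = channels // 2, channels // 4, channels // 8
        c_last = channels - c2 - c4 - c8
        self.conv1 = nn.Conv2d(c2, c2, kernel_size=1)               # local features
        self.conv3 = nn.Conv2d(c4, c4, kernel_size=3, padding=1)    # mid-scale features
        self.conv5 = nn.Conv2d(c8, c8, kernel_size=5, padding=2)    # larger-scale features
        self.conv7 = nn.Conv2d(c_last, c_last, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = x.shape[1]
        c2, c4, c8 = c // 2, c // 4, c // 8
        x1, x2 = x.split([c2, c - c2], dim=1)                       # first split
        x21, x22 = x2.split([c4, x2.shape[1] - c4], dim=1)          # second split
        x221, x222 = x22.split([c8, x22.shape[1] - c8], dim=1)      # third split
        # Parallel multi-scale branches with increasing receptive fields
        return torch.cat([self.conv1(x1), self.conv3(x21),
                          self.conv5(x221), self.conv7(x222)], dim=1)

# Example: y = LMSCM(64)(torch.randn(1, 64, 80, 80))  -> shape (1, 64, 80, 80)
```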
To enhance the detection performance of the C2F module, two key improvements are introduced. First, the second standard convolutional layer in the original bottleneck structure is replaced with the Lightweight Multi-Scale Convolution Module (LMSCM), forming the LMSCM-Bottleneck. This design leverages multi-scale convolutional kernels to capture rich spatial-contextual information, improving the model’s adaptability to target scale variations while maintaining parameter efficiency. It significantly boosts feature extraction capabilities without compromising computational resource constraints. Second, due to the complexity of UAV imaging environments, detected targets are often affected by occlusion. To address this, Partial Convolution (PConv) [
17] is integrated into the C2F module. PConv optimizes feature extraction by focusing on meaningful regions and ignoring sparse or noisy data (e.g., invalid zero-value areas), thereby eliminating redundant computations. This not only enhances robustness against occluded targets but also improves overall detection accuracy in UAV-based scenarios.
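The sketch below shows a FasterNet-style partial convolution, in which only a fraction of the channels is convolved and the remaining channels pass through untouched, eliminating redundant computation. Whether reference [17] uses this variant or a mask-based partial convolution is an assumption here, and the split ratio n_div = 4 is likewise illustrative.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Illustrative PConv: convolve only channels // n_div channels and
    concatenate them with the untouched remainder."""
    def __init__(self, channels: int, n_div: int = 4, kernel_size: int = 3):
        super().__init__()
        self.dim_conv = channels // n_div
        self.dim_untouched = channels - self.dim_conv
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)   # untouched channels are reused as-is
```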
The improved LMS-PC2F module is shown in
Figure 5. Combining the advantages of the LMSCM-Bottleneck and PConv, it captures and fuses multi-scale features more efficiently, ensures that the output of each feature block is fully integrated, and further enhances the model's representational power and feature learning ability. Additionally, the module improves occlusion handling in UAV imagery, increases detection precision, and reduces the parameter count, making it well suited for deployment on storage-constrained edge devices.
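One possible wiring of the LMSCM-Bottleneck inside LMS-PC2F, reusing the illustrative PartialConv and LMSCM sketches above, is shown below. The ordering of PConv relative to LMSCM and the use of a residual shortcut are assumptions, since the text does not fix them.

```python
import torch
import torch.nn as nn

class LMSPC2FBottleneck(nn.Module):
    """Hypothetical LMSCM-Bottleneck: PConv followed by LMSCM (which replaces
    the second standard convolution), with an optional residual shortcut."""
    def __init__(self, channels: int, shortcut: bool = True):
        super().__init__()
        self.pconv = PartialConv(channels)   # illustrative sketch above
        self.lmscm = LMSCM(channels)         # illustrative sketch above
        self.shortcut = shortcut

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.lmscm(self.pconv(x))
        return x + y if self.shortcut else y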
3.2.2. Lightweight Multi-Scale Attention Mechanism MSCBAM
In UAV-based object detection tasks, the complex environment and overlapping targets significantly increase the difficulty of the task. Although high-precision detection models can improve detection performance, they often come with a large number of parameters. This leads to slower computation speeds, prolonged processing times, and increased detection latency, imposing additional burdens on edge detection devices like UAVs. To address this, Wu et al. [
18] incorporated the CBAM attention mechanism into the backbone network to capture critical feature information and enhance the expressive power of the convolutional neural network. This approach effectively reduces parameter size while directing the model’s focus to essential information.
The CBAM [19] attention mechanism consists of two sequential submodules: a channel attention module and a spatial attention module. The basic framework of CBAM can be formulated as follows: the input feature map X is first processed by the channel attention module to obtain X′, which is then passed through the spatial attention module to obtain X″:

X′ = Mc(X) ⊗ X,
X″ = Ms(X′) ⊗ X′.

Here, Mc(X) represents the generated channel attention weights, Ms(X′) denotes the spatial attention weights, and ⊗ is element-wise multiplication. While CBAM effectively reduces parameter size, it relies on single-scale convolutional operations to capture spatial attention. This confines feature extraction to a fixed receptive field, limiting its performance in complex scenarios requiring multi-scale information capture, which ultimately leads to suboptimal results in UAV-based object detection. To address this, we propose MSCBAM, a lightweight multi-scale attention mechanism developed as an improvement on CBAM. It effectively captures multi-scale information, enhances model performance across varying target sizes and complex environments, improves robustness to diverse inputs, and reduces reliance on network depth for performance gains, thereby further minimizing parameter size. The MSCBAM architecture is illustrated in
Figure 6.
After the channel attention stage, a 1 × 1 convolution is applied to the channel-refined feature Xc to expand the fully connected layer in the spatial dimension, enabling cross-channel interaction and fusion. The channels are then split into four groups, each feeding one of four parallel branches. The first two branches employ sequential convolution operations with kernel sizes of 1 × 3 and 3 × 1, and 1 × 5 and 5 × 1, respectively. These horizontal and vertical convolutions help capture fine-grained local features and broader contextual information, with the larger kernels better handling global features. In the third branch, a 3 × 3 max pooling layer extracts the strongest local feature responses, highlighting critical edge and corner information; a subsequent 1 × 1 convolution adjusts the channel dimensions and integrates salient features, enriching the discriminative information for the final fusion. In the fourth branch, average pooling aggregates local details at smaller scales while suppressing noise, producing smoother features; a follow-up 1 × 1 convolution further refines channel relationships, extracts essential local patterns, and compresses the output channels.
The multi-scale module in MSCBAM enhances the model’s capability to represent multi-scale features by capturing and fusing feature information at different scales, thereby improving its adaptability to complex scenes and diverse targets. This enables MSCBAM to achieve superior performance when processing images with varying scales and complex backgrounds. The multi-scale module can be formulated as follows:
X0 = Conv1×1(Xc),
Xm = Concat(B1, B2, B3, B4),
X″ = Ms(Xm) ⊗ Xm.

Here, X0 is the output of Xc after applying a 1 × 1 convolution, B1, B2, B3, and B4 represent the outputs of the four branches, and Xm is obtained by concatenating the outputs from the four branches. X″ denotes the final output after applying spatial attention to Xm.
MSCBAM reduces parameters while retaining multi-scale information through its parallel, lightweight multi-branch design (e.g., 1 × 5 and 5 × 1 convolutions replacing a standard 5 × 5 convolution). Channel splitting (C → C/4 per branch) further reduces the computational load.
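As a concrete reading of the four-branch design and the C → C/4 channel split, here is a minimal PyTorch sketch of the multi-scale block inside MSCBAM. The 3 × 3 pooling windows, the branch ordering, the channel-preserving 1 × 1 expansion, and the omission of normalisation/activation layers are assumptions; spatial attention is applied afterwards as in CBAM.

```python
import torch
import torch.nn as nn

class MSCBAMMultiScaleBlock(nn.Module):
    """Sketch of the multi-scale branch module inside MSCBAM.
    Assumes the channel count is divisible by 4."""
    def __init__(self, channels: int):
        super().__init__()
        c = channels // 4
        self.expand = nn.Conv2d(channels, channels, 1)        # cross-channel interaction
        self.branch1 = nn.Sequential(                          # 1x3 + 3x1 strip convolutions
            nn.Conv2d(c, c, (1, 3), padding=(0, 1)),
            nn.Conv2d(c, c, (3, 1), padding=(1, 0)))
        self.branch2 = nn.Sequential(                          # 1x5 + 5x1 strip convolutions
            nn.Conv2d(c, c, (1, 5), padding=(0, 2)),
            nn.Conv2d(c, c, (5, 1), padding=(2, 0)))
        self.branch3 = nn.Sequential(                          # strongest local responses
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(c, c, 1))
        self.branch4 = nn.Sequential(                          # smoothed local details
            nn.AvgPool2d(3, stride=1, padding=1),
            nn.Conv2d(c, c, 1))

    def forward(self, xc: torch.Tensor) -> torch.Tensor:
        x0 = self.expand(xc)
        b1, b2, b3, b4 = torch.chunk(x0, 4, dim=1)             # C -> 4 x C/4
        return torch.cat([self.branch1(b1), self.branch2(b2),
                          self.branch3(b3), self.branch4(b4)], dim=1)
```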
3.2.3. ESPPM Module
(1) Large Separable Kernel Residual Attention Mechanism (LSKR)
Large Separable Kernel Attention (LSKA) [
20] is a technique that achieves efficient feature capture and long-range dependency modeling by decomposing large-kernel convolution operations. It splits traditional convolution into smaller kernels while retaining the advantages of large-kernel convolution in capturing global features, making it particularly effective for processing sequential data.
In
Figure 7, the left side illustrates the LSKA module, while the right side depicts the LSKR module (Large Separable Kernel Residual). The latter combines large-kernel decomposition with residual connections to mitigate the loss of critical information in long sequences.
Although LSKA excels at capturing long-range dependencies, its feature representation capability remains limited in complex scenarios, while residual connections have proven effective in mitigating the vanishing gradient problem in deep neural networks. Building on this, we propose the Large Separable Kernel Residual Attention Mechanism (LSKR). By integrating a residual structure [
21] into LSKA, LSKR maintains a direct link between the original input and output, addressing the limitations of neural networks in processing long-sequence data.
The LSKR module is defined as

LSKR(X) = Conv1×1(F(X) + X),

where F(⋅) represents the large separable kernel decomposition operation. Specifically, for a given large kernel size K (e.g., 15 × 15), we decompose it into two smaller kernels (e.g., 15 × 1 and 1 × 15). The operation can be expressed as

F(X) = Conv1×K(ConvK×1(X)).

Here, the two consecutive convolutions (first a vertical convolution and then a horizontal convolution) approximate the effect of a full K × K convolution but with reduced computational complexity. The output of F is then added to the original input X (residual connection), and a 1 × 1 convolution is applied to adjust the channel dimensions and integrate features.
This design allows LSKR to efficiently capture long-range dependencies while preserving the original feature information through the residual connection.
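A minimal sketch of LSKR following the formulation above; using depth-wise convolutions for the K × 1 and 1 × K factors (as in LSKA) is an assumption, as is the default kernel size K = 15.

```python
import torch
import torch.nn as nn

class LSKR(nn.Module):
    """Sketch of the Large Separable Kernel Residual block:
    LSKR(X) = Conv1x1(F(X) + X), F(X) = Conv1xK(ConvKx1(X))."""
    def __init__(self, channels: int, k: int = 15):
        super().__init__()
        self.vertical = nn.Conv2d(channels, channels, (k, 1),
                                  padding=(k // 2, 0), groups=channels)
        self.horizontal = nn.Conv2d(channels, channels, (1, k),
                                    padding=(0, k // 2), groups=channels)
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.horizontal(self.vertical(x))   # F(X): separable large-kernel convolution
        return self.fuse(f + x)                 # residual connection + 1x1 fusion
```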
(2) ESPPM Module
The Spatial Pyramid Pooling-Fast (SPPF) module captures multi-scale spatial information by performing parallel multi-scale max-pooling operations (e.g., 1 × 1, 3 × 3, 5 × 5, and 7 × 7 pooling windows) on feature maps. The pooled features are then fused to generate multi-scale representations.
To address the challenges of UAV imagery—high resolution, complex background textures, and multi-scale targets—we integrate the Large Separable Kernel Residual Attention Mechanism (LSKR) into SPPF and combine it with adaptive average pooling (AdaptiveAvgPool), forming the Enhanced Spatial Pyramid Pooling Module (ESPPM), as illustrated in
Figure 8.
The adaptive pooling branch generates multi-scale features by applying adaptive average pooling with different output sizes. For each scale k ∈ S, the operation is

Zk = Conv1×1(AdaptiveAvgPoolk×k(X)),

where AdaptiveAvgPoolk×k resizes the input feature map to k × k spatial dimensions by averaging over each region. The subsequent 1 × 1 convolution projects the pooled features to a lower-dimensional space (with C′ channels). To restore the spatial resolution for concatenation, each Zk is upsampled to the original size H × W using bilinear interpolation, denoted as Zk′.
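For a single scale k, the branch can be sketched as follows; this is a hedged illustration, with conv1x1 standing for an externally supplied 1 × 1 convolution that projects to C′ channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def adaptive_pool_branch(x: torch.Tensor, conv1x1: nn.Conv2d, k: int) -> torch.Tensor:
    """Compute Z_k' = Upsample_HxW(Conv1x1(AdaptiveAvgPool_kxk(X))) for one scale k."""
    h, w = x.shape[-2:]
    z = F.adaptive_avg_pool2d(x, output_size=k)   # pool the feature map to k x k
    z = conv1x1(z)                                # project to C' channels
    return F.interpolate(z, size=(h, w), mode="bilinear", align_corners=False)
```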
The ESPPM module can be mathematically represented as follows. Let X ∈ R^(C×H×W) denote the input feature map, where C is the number of channels and H and W are the height and width, respectively. The ESPPM module processes X through two parallel branches, the adaptive pooling branch and the LSKR branch, whose outputs are concatenated:

Y = Concat(Z1′, Z3′, Z5′, Z7′, LSKR(X)).

Here, S = {1, 3, 5, 7} represents the set of pooling kernel sizes (equivalent to the output sizes in adaptive pooling), AdaptiveAvgPoolk(⋅) performs adaptive average pooling to a fixed size k × k (followed by a 1 × 1 convolution to adjust the channels to C′, typically C′ = C/4 for each scale), LSKR(⋅) denotes the Large Separable Kernel Residual Attention Mechanism (detailed in Equation (12) above), and Concat concatenates the feature maps along the channel dimension. The concatenated feature map Y is then passed through a 1 × 1 convolution to reduce the channel dimension to C.
A 1 × 1 convolution is then applied to fuse the concatenated features and reduce the channel dimension to match the input channel count C:

Yout = Conv1×1(Y),

where Yout is the final output of the ESPPM module.
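Putting the two branches together, a minimal ESPPM sketch is given below, reusing the LSKR class and the adaptive_pool_branch helper sketched above, with C′ = C/4 and S = {1, 3, 5, 7} as stated. It is an illustration of the formulas above, not the authors' exact code.

```python
import torch
import torch.nn as nn

class ESPPM(nn.Module):
    """Adaptive-pooling branch + LSKR branch, concatenated and fused by a 1x1 conv."""
    def __init__(self, channels: int, scales=(1, 3, 5, 7)):
        super().__init__()
        self.scales = scales
        c_prime = channels // 4                                    # C' = C/4 per scale
        self.reduce = nn.ModuleList(nn.Conv2d(channels, c_prime, 1) for _ in scales)
        self.lskr = LSKR(channels)                                 # long-range dependency branch
        self.fuse = nn.Conv2d(c_prime * len(scales) + channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [adaptive_pool_branch(x, conv, k)                  # Z_k' for each scale
                 for conv, k in zip(self.reduce, self.scales)]
        feats.append(self.lskr(x))                                 # LSKR(X)
        return self.fuse(torch.cat(feats, dim=1))                  # Y_out = Conv1x1(Concat(...))
```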
The Enhanced Spatial Pyramid Pooling Module (ESPPM) leverages the advantages of the Large Separable Kernel Residual (LSKR) mechanism in long-sequence modeling. By decomposing large-kernel convolution operations, ESPPM maintains efficient feature extraction. The large-kernel convolution provides a broad receptive field, enabling the capture of long-range dependencies and global information, such as extensive backgrounds and multi-scale targets in UAV imagery. Small-kernel convolutions efficiently extract local features while reducing computational complexity. By splitting large kernels into smaller ones, LSKR achieves high precision while improving computational efficiency. The residual structure enhances model stability during the processing of long-sequence data and multi-scale feature aggregation. Adaptive average pooling operations perform pooling at different scales on the feature map, generating multi-scale information to ensure comprehensive representation of features across varying resolutions.
Within the ESPPM module, the multi-scale feature maps generated through adaptive average pooling are fused with the long-range dependency features produced by the LSKR module. This fusion strengthens the multi-scale feature aggregation capability of the feature maps. Through LSKR’s long-range dependency modeling, effective information interaction between features of different scales is ensured, optimizing the module’s ability to handle complex UAV imagery with diverse backgrounds, multi-scale targets, and high-resolution details.
The inference latency of ESPPM on NVIDIA Jetson TX2 is 4.3 ms (compared with 3.9 ms of SPPF), increasing by only 10%. However, due to the multi-scale feature gain, mAP50 improves by 5.1%.
3.2.4. Cross-Layer Weighted Bidirectional Feature Pyramid Network (CBiFPN)
The fusion structures of CBiFPN and the other three feature pyramid networks are shown in Figure 9. In real-world scenarios, UAV imagery often suffers from cluttered backgrounds and interference from other objects. To address these challenges, this study replaces the PANet in YOLOv8n with a Cross-layer Weighted Bidirectional Feature Pyramid Network (CBiFPN).
Figure 9a shows the Feature Pyramid Network (FPN) [
22], designed to extract multi-scale features. FPN introduces lateral connections and a top-down pathway into existing convolutional neural networks to construct a pyramid-like feature hierarchy. First, convolutional operations are applied to feature maps at different hierarchical levels. Next, features are laterally transferred into a top-down fusion pathway. Finally, upsampling merges high-level semantic features with low-level detailed features, enriching multi-scale representations.
Figure 9b depicts the Path Aggregation Network (PANet), which enhances traditional FPN with a more complex fusion mechanism combining top-down and bottom-up pathways. Bidirectional feature propagation and fusion allow prediction layers to integrate features from both high and low levels, shortening information paths and improving detection performance.
Figure 9c presents the Bidirectional Feature Pyramid Network (BiFPN) [
10]. BiFPN optimizes PANet by selectively retaining effective blocks for multi-scale weighted fusion and cross-scale operations, enabling more efficient feature fusion and propagation. This reduces computational and memory costs, improves network efficiency, enhances inter-layer feature transfer, and better preserves fine-grained details and semantic information across hierarchical features.
However, as network depth increases, the aforementioned feature fusion structures lead to partial loss of detailed features across network layers, resulting in insufficient accuracy in recognizing small target characteristics within complex scenes and ultimately lower detection precision. To address this issue, this study introduces a Cross-layer Weighted Bidirectional Feature Pyramid Network (CBiFPN) to replace PANet. Building upon BiFPN, CBiFPN incorporates an additional cross-layer data flow connection (as shown in
Figure 9d), which reduces the loss of fine-grained feature information for small targets and significantly enhances the network’s expressive capability.
Compared to BiFPN, CBiFPN demonstrates several distinct advantages. First, by introducing a cross-layer connection mechanism, it enables the cross-fusion of image features from different network hierarchies, enhancing feature diversity and richness, thereby achieving more thorough and balanced feature integration. Second, CBiFPN exhibits stronger information flow dynamics, effectively propagating and fusing low-level and high-level features. This addresses challenges such as target size variations, occlusions, blurred boundaries, and interference from other objects, significantly improving the detection performance for small targets in complex scenarios. Third, CBiFPN maintains lower computational complexity and memory consumption, enhancing the operational efficiency of the detection model while preserving performance. These innovations collectively optimize the model’s ability to handle cluttered UAV imagery with high precision and real-time responsiveness.
The structural details of CBiFPN (
Figure 9d) highlight its bidirectional pathways and cross-layer interactions, emphasizing its adaptability to resource-constrained edge devices like UAVs. CBiFPN fuses shallow details and deep semantics through cross-layer connections (such as P3 → P5) to alleviate the feature loss of occluding targets. Experiments show that the recall rate increases by 7.3% in occluded scenarios.
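The weighted cross-level fusion can be sketched with a BiFPN-style fast-normalised fusion node that additionally accepts a cross-layer input. The node wiring (which levels feed which node) is an assumption, and the inputs are assumed to have already been resized to a common resolution and channel count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast-normalised weighted fusion of several same-shaped feature maps, as in
    BiFPN; an extra input slot models the additional cross-layer connection of CBiFPN."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))   # learnable fusion weights
        self.eps = eps

    def forward(self, *features: torch.Tensor) -> torch.Tensor:
        w = F.relu(self.weights)                 # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)             # fast normalised fusion
        return sum(wi * fi for wi, fi in zip(w, features))

# Example P4-level node fusing the top-down feature, the same-level backbone feature,
# and a hypothetical cross-layer input (all already aligned in shape).
fuse_p4 = WeightedFusion(num_inputs=3)
out_p4 = fuse_p4(torch.randn(1, 256, 40, 40),
                 torch.randn(1, 256, 40, 40),
                 torch.randn(1, 256, 40, 40))
```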
CBiFPN reduces redundancy by sparsifying its cross-layer connections (fusing only key layers). Compared with BiFPN, the parameter count increases by only 0.2 M, while mAP50 improves by 3.5%.
For training stability, a cosine annealing learning-rate schedule (initial lr = 0.01) and gradient clipping (threshold = 1.0) were adopted. Loss-function weighting (classification:regression = 1:2) was used to balance small-object optimization, and no gradient explosion or overfitting was observed.
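A minimal training-loop sketch matching the reported settings (cosine annealing from lr = 0.01, gradient-norm clipping at 1.0, and a 1:2 classification-to-regression loss weighting) is given below. The model interface (returning classification and regression losses), the optimizer choice, and the epoch count are placeholders, not the authors' training code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train(model: nn.Module, train_loader: DataLoader, epochs: int = 100) -> None:
    """Training sketch: cosine-annealed SGD, gradient clipping, weighted detection loss."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)                  # initial lr = 0.01
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for images, targets in train_loader:
            cls_loss, reg_loss = model(images, targets)                       # placeholder loss heads
            loss = 1.0 * cls_loss + 2.0 * reg_loss                            # classification:regression = 1:2
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradients at 1.0
            optimizer.step()
        scheduler.step()                                                      # cosine annealing per epoch
```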