Article

LCW-YOLO: An Explainable Computer Vision Model for Small Object Detection in Drone Images

1 College of Physics, Mechanical and Electrical Engineering, Jishou University, Jishou 416000, China
2 College of Computer Science and Engineering, Jishou University, Jishou 416000, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9730; https://doi.org/10.3390/app15179730
Submission received: 11 August 2025 / Revised: 2 September 2025 / Accepted: 2 September 2025 / Published: 4 September 2025
(This article belongs to the Special Issue Explainable Artificial Intelligence Technology and Its Applications)

Abstract

Small targets in drone imagery are often difficult to locate and identify accurately because of scale imbalance, limited pixel representation, and dynamic environmental interference, and balancing detection accuracy against the model's resource consumption poses a further challenge. We therefore propose an explainable computer vision framework based on YOLOv12m, called LCW-YOLO. First, we adopt multi-scale heterogeneous convolutional kernels to build the Lightweight Channel-wise and Spatial Attention with Context (LA2C2f) structure, enhancing spatial perception while reducing computational load. Second, to strengthen feature fusion, we propose the Convolutional Attention Integration Module (CAIM), which fuses the original features across channels, spatial dimensions, and layers, thereby reinforcing contextual attention. Finally, the model incorporates Wise-IoU (WIoU) v3, which dynamically allocates loss weights to detected objects, allowing training to focus on samples of average quality according to object difficulty and improving generalization. Experimental results show that, compared with YOLOv12m, LCW-YOLO removes 0.4 M parameters and improves mAP@0.5 by 3.3% on the VisDrone2019 dataset, and it improves mAP@0.5 by 1.9% on the UAVVaste dataset. For small object detection with drones, LCW-YOLO serves as an explainable AI (XAI) model that provides visual detection results and effectively balances accuracy, lightweight design, and generalization capability.

1. Introduction

Small object detection is a key research direction in the field of drone image recognition. Its objective is to accurately locate and identify small objects in complex environments while addressing various interference conditions. For example, in search and rescue operations, this capability can directly improve the efficiency of locating trapped individuals or critical objects, thereby increasing the success rate of rescue missions [1]. However, small objects in images pose significant challenges to detection models due to their extremely small size (typically less than 32 × 32 pixels), low resolution, and irregular shapes. Early detection methods primarily relied on traditional machine learning algorithms, such as support vector machines [2,3] and AdaBoost [4], often combined with manually designed shallow features such as edge gradients and texture patterns for classification [5,6]. However, these methods suffer from inherent limitations, including the subjectivity of feature design and sensitivity to environmental complexity, which weaken the model’s ability to distinguish detection targets from the environment in drone images, severely limiting detection accuracy and generalization performance.
With the evolution of convolutional neural network (CNN) architectures and significant advances in processing hardware, new solutions have emerged for the long-standing problem of detecting small objects in drone images. Compared to traditional methods, which lack sufficient discriminative power when capturing small object features, CNN-based models can autonomously learn from large-scale image datasets to extract complex discriminative features, thereby significantly improving detection accuracy. In addition, the scalability of their architecture supports deployment requirements in a variety of practical application scenarios. It is noteworthy that CNNs demonstrate significant potential across various fine-grained perception tasks, such as epilepsy detection based on S-transform, EEG recognition utilizing group cosine convolutions, and spatio-temporal-frequency frameworks for brain–computer interfaces. These approaches leverage CNNs to effectively enhance the recognition capabilities of complex patterns [7,8,9].
To further enhance small object representation, Lin et al. proposed the Feature Pyramid Network (FPN) [10], which builds cross-level feature fusion by propagating high-level semantics through a top-down pathway and preserving fine detail via lateral connections from the bottom-up backbone, thereby improving the retention of features for low-resolution targets. To address background interference in drone images, Woo et al. proposed the Convolutional Block Attention Module (CBAM) [11], which uses a channel-spatial dual-path weighting mechanism to suppress redundant information and improve the recall of occluded objects. For edge deployment requirements, Howard et al.'s MobileNetV3 [12] uses neural architecture search to balance accuracy and speed, while Zhu et al.'s TPH-YOLOv5 [13] further integrates a Transformer encoder to enhance small object perception and improve detection accuracy.
Small object detection, as a core problem in computer vision, also depends heavily on the design of the loss function. When a model processes an image, it produces predictions; the difference between the predicted and true values is the loss, and minimizing this loss drives the predictions toward the ground truth, which is why researchers design dedicated loss functions. The Efficient Intersection over Union (EIoU) loss [14] proposed by Zhang et al. significantly improves localization accuracy by jointly optimizing the overlap rate and aspect ratio. To address misaligned bounding box directions in occluded scenes, Gevorgyan's SIoU [15] introduces a directional penalty mechanism, effectively enhancing the regression of rotated objects. These loss functions support end-to-end training, avoiding the error accumulation of traditional multi-stage methods, and significantly enhance the robustness and generalization of small object detection. However, most current research assumes that training examples are of high quality and focuses on strengthening the fitting capability of the bounding box loss. In practice, object detection training datasets contain low-quality examples, and blindly strengthening bounding box regression on such examples clearly harms detection performance.
In summary, although the above studies have achieved great success in the field of small object detection, model improvements also face new problems. In particular, the sparsity of small object features makes it difficult to effectively extract key information [16], and traditional attention mechanisms are unable to adapt to interference factors in complex backgrounds in multiple dimensions, thereby suppressing the shallow representation of small objects. Furthermore, the standard receptive field of convolutional kernels cannot adapt to the scale variations of small objects, leading to an uneven distribution of key categories in training samples, which further exacerbates the problem of missed detections. These issues collectively limit detection accuracy and robustness. This paper addresses the limitations of existing methods in terms of balancing accuracy and efficiency, interference suppression, scale adaptation, and sample imbalance, proposing an improved model based on the YOLOv12 [17] framework, named LCW-YOLO. The model significantly improves the accuracy, efficiency, and interpretability of small object detection while maintaining or even reducing model complexity, enabling researchers to make better decisions in actual drone application scenarios based on detection results [18]. The main innovations of this paper are as follows:
  • Heterogeneous multiscale convolution is applied to improve the original Area Attention (AAttn) module of YOLOv12, and the enhanced module is incorporated into the neck’s A2C2f. This integration yields a lightweight and more effective structure, termed Lightweight Channel-wise and Spatial Attention with Context (LA2C2f), which significantly enhances spatial perception for small targets while reducing model computational complexity.
  • The Convolutional Attentive Integration Module (CAIM) is proposed, deeply integrating a convolutional structure with an improved Residual Path-Guided Multi-dimensional Collaborative Attention Mechanism (RMCAM). This architecture enables the capture of contextual dependencies through convolution while facilitating enhanced information fusion across four dimensions containing channel, height, width, and original features, thereby enabling deep coupling of local and global features.
  • Introducing Wise-IoU (WIoU) v3 [19] with a dynamic non-monotonic focusing mechanism as the bounding box regression loss. By dynamically allocating gradient gains through outlier degree β , we suppress interference from low-quality samples and improve the model’s generalization performance in complex scenarios.
The subsequent sections of this paper are organized as follows: Section 2 details the network structure, design, and key technologies of LCW-YOLO. Section 3 introduces the experimental setup, dataset, and evaluation metrics, and presents comprehensive ablation and comparison experiments to validate the effectiveness of the proposed method. Section 4 summarizes the entire paper and discusses future research directions.

2. Principles and Innovations

2.1. YOLOv12 Model

This study employs YOLOv12 as the foundational model. The YOLO series is renowned in object detection for its rapid detection capabilities and end-to-end training framework [20]. As one of the cutting-edge versions of the series, YOLOv12 inherits its efficiency while introducing a real-time detection framework centered on attention mechanisms for the first time, breaking through the structural limitations of traditional convolutional neural networks. As shown in Figure 1, the area attention mechanism and residual efficient layer aggregation network in the core component A2C2f simplify the model structure while effectively reducing computational overhead, further improving detection accuracy and robustness.
Experimental results on the COCO dataset [21] demonstrate that YOLOv12 outperforms current mainstream real-time object detection models in terms of detection accuracy while maintaining competitive inference speed. YOLOv12 models of different scales outperform YOLOv11 in terms of accuracy and speed, demonstrating significant performance improvements. Additionally, in small object detection tasks in drone aerial photography scenarios, YOLOv12 exhibits good adaptability and strong interference resistance. However, YOLOv12 still has certain limitations in small object detection. For example, the 7 × 7 convolution used in its positional awareness module is more suitable for medium and large object detection, and it is insufficient for extracting and retaining low-level detail features of small objects, which may lead to missed detections or positioning errors. Additionally, the traditional CNN structure relies on local receptive fields to extract features layer by layer and lacks the ability to dynamically focus on key areas, limiting further improvements in small object detection performance [22].

2.2. Proposed Method

Building upon the core design philosophy of YOLOv12, which centers on attention mechanisms, this paper proposes an improved model tailored for small object detection in drones, named LCW-YOLO, as shown in Figure 2. The model continues to utilize the C3k2 module from YOLOv12 in its backbone network and neck structure. This module is an optimized version of the traditional C3 module, achieved by setting the parameter n = 2 to cascade two C3k modules, thereby further enhancing feature extraction capabilities while improving model runtime efficiency and stability.
To improve small object detection performance, the residual multi-dimensional collaborative attention module is introduced as the core component of the attention branch inside the CAIM module embedded in the backbone. This module offers efficient feature extraction and attention focusing, effectively addressing the occlusion and overlap that are common in small object detection. We also designed the LA2C2f module to replace the original A2C2f module of the base model; in this module, 3 × 3 and 5 × 5 convolutions replace the original 7 × 7 positional convolutions, enhancing the model's adaptability to targets of different scales. Additionally, the original C-IoU loss is replaced with the WIoU v3 dynamic bounding box regression loss to improve the robustness of small object boundary localization, particularly under occlusion or blurring, significantly enhancing detection stability and generalization.

2.2.1. Lightweight Channel-Wise and Spatial Attention with Context

As shown in Table 1, the 7 × 7 separable convolution used for area attention in the original YOLOv12 model may be limited by its fixed receptive field, leading to poor detection performance for small targets in drone applications. As shown in Figure 3, we propose a heterogeneous dual-branch convolutional architecture to jointly optimize multi-scale feature capture and model lightweighting. We select 3 × 3 convolutions, which offer high-frequency response advantages for extracting fine texture, and 5 × 5 convolutions, which enhance spatial context modeling by expanding the receptive field. The two branches are fused through channel concatenation and 1 × 1 convolution compression, and the resulting feature importance allocation helps area attention perceive small target locations. The computational complexity of this heterogeneous parallel architecture is significantly reduced; the FLOPs of the original 7 × 7 convolution are
$$\mathrm{FLOPs}_{\mathrm{orig}} = C_{\mathrm{in}} \times C_{\mathrm{out}} \times 49 \times H \times W$$
where $C_{\mathrm{in}}$ denotes the number of input channels, $C_{\mathrm{out}}$ the number of output channels, $H \times W$ the spatial dimensions of the feature map, and 49 the number of parameters in a 7 × 7 convolution kernel. The FLOPs of the new structure consist of three parts:
$$\mathrm{FLOPs}_{\mathrm{new}} = \underbrace{C_{\mathrm{in}} \times C_{\mathrm{mid}} \times 9 \times H \times W}_{3 \times 3\ \text{branch}} + \underbrace{C_{\mathrm{in}} \times C_{\mathrm{mid}} \times 25 \times H \times W}_{5 \times 5\ \text{branch}} + \underbrace{C_{\mathrm{mid}} \times C_{\mathrm{out}} \times 1 \times H \times W}_{1 \times 1\ \text{compression}}$$
where $C_{\mathrm{mid}} = C_{\mathrm{in}}/2$ denotes the middle channel compression coefficient. The numbers 9, 25, and 1 correspond to the number of parameters in the 3 × 3, 5 × 5, and 1 × 1 convolution kernels, respectively. When $C_{\mathrm{mid}}$ is set to $C_{\mathrm{in}}/2$, the resulting reduction in computational load is quantified below:
$$\eta = 1 - \frac{9\,(C_{\mathrm{in}}/2) + 25\,(C_{\mathrm{in}}/2) + (C_{\mathrm{in}}/2)}{49\,C_{\mathrm{in}}} = 1 - \frac{17.5}{49} \approx 38\%$$
We thus achieve a 38% reduction in the computational load of this key convolution module. The two branches jointly capture local and mid-range features through parallel processing, with an effective receptive field similar to that of the original convolution but with higher flexibility and fewer parameters. The method is formulated as follows:
$$F_{\mathrm{inter}} = \mathrm{Concat}\big(W_{3 \times 3} * F_{\mathrm{in}},\; W_{5 \times 5} * F_{\mathrm{in}}\big) \in \mathbb{R}^{B \times 2C \times H \times W}$$
$$F_{\mathrm{out}} = W_{1 \times 1} * \underbrace{\mathrm{Concat}\big(F_{3 \times 3},\; F_{5 \times 5}\big)}_{F_{\mathrm{inter}}} \in \mathbb{R}^{B \times C \times H \times W}$$
where $F_{\mathrm{inter}}$ denotes the intermediate fused feature obtained by concatenating the input features after the 3 × 3 and 5 × 5 convolutions; $F_{\mathrm{out}}$ denotes the final output feature; $F_{3 \times 3}$ and $F_{5 \times 5}$ are the feature maps produced by the corresponding convolution kernels; and $W_{1 \times 1}$ denotes the weights of the 1 × 1 convolution kernel. The first of these equations implements multi-scale feature extraction and preliminary fusion, while the second performs cross-channel feature integration and dimension adjustment through the 1 × 1 convolution.
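A minimal PyTorch sketch of this heterogeneous dual-branch convolution is given below. The class name, padding, and bias settings are illustrative assumptions; only the 3 × 3/5 × 5 parallel branches with $C_{\mathrm{mid}} = C_{\mathrm{in}}/2$, the channel concatenation, and the 1 × 1 compression follow the equations above.

```python
# Sketch of the dual-branch 3x3/5x5 convolution with 1x1 compression described above.
import torch
import torch.nn as nn

class DualBranchConv(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_mid = c_in // 2  # middle channel compression coefficient C_mid = C_in / 2
        # 3x3 branch: high-frequency response for fine texture detail
        self.branch3 = nn.Conv2d(c_in, c_mid, kernel_size=3, padding=1, bias=False)
        # 5x5 branch: larger receptive field for spatial context
        self.branch5 = nn.Conv2d(c_in, c_mid, kernel_size=5, padding=2, bias=False)
        # 1x1 compression: fuse the concatenated 2*c_mid channels back to c_out
        self.fuse = nn.Conv2d(2 * c_mid, c_out, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_inter = torch.cat([self.branch3(x), self.branch5(x)], dim=1)  # F_inter
        return self.fuse(f_inter)                                        # F_out

# Quick shape check on a dummy feature map (B=1, C=64, H=W=80).
x = torch.randn(1, 64, 80, 80)
print(DualBranchConv(64, 64)(x).shape)  # torch.Size([1, 64, 80, 80])
```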
As shown in Figure 4, we propose an improved A2C2f module with lightweight area attention (LAAttn) that applies multi-scale parallel convolution and achieves efficient feature extraction and fusion through a hierarchical structure. The module first performs channel adjustment via 1 × 1 convolution, followed by two cascaded ABlock units for basic feature extraction. After a further 1 × 1 convolution adjustment, feature interaction and fusion are achieved through the LAAttn attention mechanism and a multi-layer perceptron (MLP) layer. Finally, deep feature integration through an MLP layer and two consecutive 1 × 1 convolution layers (the last without an activation function) refines and dimension-reduces the features, yielding an enhanced feature representation. This hierarchical processing architecture preserves shallow-layer spatial detail while fully leveraging deep-layer semantic features, and the attention-guided feature fusion mechanism significantly enhances the model's feature expression capability.

2.2.2. Convolution and Attention Integration Module

The RMCAM module is an important supplement to the YOLOv12 backbone. It incorporates a Multi-dimensional Collaborative Attention Mechanism (MCAM) into the backbone, significantly enhancing the model's feature extraction capability and accuracy. As shown in Figure 5, the attention mechanism models three dimensions (channel, height, and width) and averages the outputs of the three branches [23]. To mitigate vanishing gradients in deep networks, preserve the original features, and avoid the loss of detail caused by over-focusing in traditional attention mechanisms, the residual learning concept from ResNet [24] is adopted: the original input features are routed directly to the output of the three attention branches through an identity mapping, forming independent residual paths. The outputs of the three residual branches are concatenated with the original (or adjusted) input along the channel dimension, and a 1 × 1 convolution performs cross-channel information interaction and dimension reduction. Finally, we draw on the multi-head dynamic fusion strategy of EfficientFormer [25], but adopt a lighter single-layer convolution to reduce computational complexity. The weights of the 1 × 1 convolution can be viewed as a weighted sum of multi-branch features, similar to the soft-hard attention hybrid in agent attention, enabling the model to adjust the contribution of each branch dynamically according to the input content [26]. These choices retain the feature integration capability while reducing the number of parameters. The formulas are as follows:
$$A_c = \sigma\big(W_c \cdot \mathrm{GAP}(F)\big)$$
$$A_h = \sigma\big(W_h \cdot \mathrm{GPool}_h(F)\big)$$
$$A_w = \sigma\big(W_w \cdot \mathrm{GPool}_w(F)\big)$$
$$F_{\mathrm{cat}} = \mathrm{Concat}\big(R(F_{\mathrm{in}}),\; A_c \odot F,\; A_h \odot F,\; A_w \odot F\big)$$
$$F_{\mathrm{out}} = W_{1 \times 1} * F_{\mathrm{cat}}, \qquad W_{1 \times 1} \in \mathbb{R}^{1 \times 1 \times 4C \times C}$$
where $A_c$, $A_h$, and $A_w$ represent the attention weights along the channel, height, and width directions, respectively; $W_c$, $W_h$, and $W_w$ represent the corresponding fully connected layer weights; GAP denotes global average pooling; $\mathrm{GPool}_h$ and $\mathrm{GPool}_w$ represent global pooling along the height and width directions, respectively; $F_{\mathrm{cat}}$ denotes the concatenated features; $R$ denotes the residual feature integrator; $F_{\mathrm{out}}$ denotes the final output feature; and $W_{1 \times 1}$ denotes the 1 × 1 convolution kernel weights (input channels $4C$, output channels $C$). We not only compute the joint spatial-channel attention weights but also fuse multi-dimensional features through attention-weighted concatenation, achieving feature compression and dimension adjustment.
As shown in Figure 6, we designed the CAIM module, which first applies a 1 × 1 convolution to the input features $X_{\mathrm{in}}$ to adaptively compress the channel dimension and align the feature space, providing a structured representation for the subsequent attention mechanism. Its output is passed through two cascaded RMCAM modules, in which the convolution-guided local receptive field and the attention-driven global dependency modeling work together: the former captures spatial detail, while the latter dynamically calibrates cross-channel semantic weights, forming an iterative feature optimization process. The optimized features then undergo a final 1 × 1 convolution for channel fusion and dimension reduction, producing the output $X_{\mathrm{out}}$. The architecture is built around symmetric 1 × 1 convolution layers as the skeleton and the dual attention mechanism as the core, achieving precise coupling between local perception and global context while simultaneously strengthening detail retention and semantic focus during feature reconstruction.
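A condensed PyTorch sketch of the RMCAM and CAIM structure described above follows. The class names, the use of 1 × 1 convolutions for $W_c$, $W_h$, and $W_w$, and the interpretation of the directional pooling operators are illustrative assumptions, not the authors' released implementation.

```python
# Sketch: channel/height/width attention branches, a residual identity path,
# channel-wise concatenation, and a 1x1 fusion convolution (RMCAM), wrapped by
# CAIM as 1x1 conv -> two RMCAM blocks -> 1x1 conv.
import torch
import torch.nn as nn

class RMCAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.w_c = nn.Conv2d(channels, channels, 1)      # W_c, channel attention
        self.w_h = nn.Conv2d(channels, channels, 1)      # W_h, height-wise attention
        self.w_w = nn.Conv2d(channels, channels, 1)      # W_w, width-wise attention
        self.fuse = nn.Conv2d(4 * channels, channels, 1) # 1x1 conv over 4C channels
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: global average pooling over H and W -> (B, C, 1, 1)
        a_c = self.sigmoid(self.w_c(x.mean(dim=(2, 3), keepdim=True)))
        # Height attention: pool along the width axis -> (B, C, H, 1)
        a_h = self.sigmoid(self.w_h(x.mean(dim=3, keepdim=True)))
        # Width attention: pool along the height axis -> (B, C, 1, W)
        a_w = self.sigmoid(self.w_w(x.mean(dim=2, keepdim=True)))
        # Residual path keeps the original features; concatenate all four branches.
        f_cat = torch.cat([x, a_c * x, a_h * x, a_w * x], dim=1)
        return self.fuse(f_cat)

class CAIM(nn.Module):
    """1x1 conv -> two cascaded RMCAM blocks -> 1x1 conv, as sketched in the text."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.pre = nn.Conv2d(c_in, c_out, 1)             # channel alignment / compression
        self.att = nn.Sequential(RMCAM(c_out), RMCAM(c_out))
        self.post = nn.Conv2d(c_out, c_out, 1)           # final channel fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.post(self.att(self.pre(x)))

x = torch.randn(1, 128, 40, 40)
print(CAIM(128, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```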

2.2.3. Wise-IoU

In object detection frameworks, the design of the bounding box regression loss critically governs convergence efficiency and localization precision. Conventional approaches presume high-quality training samples and predominantly optimize regression fitting capability. However, when substantial numbers of low-quality samples are present (e.g., blurred targets, severe occlusions, or misaligned anchor priors), exclusively strengthening regression may amplify noise interference and compromise localization robustness. To resolve this, we adopt the Wise-IoU (WIoU) dynamic loss, which incorporates a non-monotonic dynamic focusing mechanism to mitigate the adverse effects of suboptimal samples. Its core innovation is a gradient weight assignment strategy based on anchor regression quality: high-quality samples receive amplified gradient signals to refine fitting, while low-quality samples undergo gradient attenuation to suppress interference. In the implementation, we first define the IoU loss between the anchor box and the ground truth box:
$$\mathcal{L}_{IoU} = 1 - IoU = 1 - \frac{W_i H_i}{S_u}$$
where $W_i$ and $H_i$ denote the width and height of the intersection region between the predicted and ground-truth bounding boxes, respectively, with $S_u$ representing their union area. An effective loss function should reduce penalization when the anchor and target boxes exhibit sufficient overlap, while avoiding over-regularization that would compromise model generalization. To address this, we developed distance-sensitive attention through spatial metrics, culminating in WIoU v1 with its hierarchical attention architecture:
$$R_{WIoU} = \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\big(W_g^2 + H_g^2\big)^{*}}\right)$$
$$\mathcal{L}_{WIoUv1} = R_{WIoU} \cdot \mathcal{L}_{IoU}$$
where $\mathcal{L}_{WIoUv1}$ denotes the WIoU v1 loss. Here $\mathcal{L}_{IoU}$ significantly reduces $R_{WIoU}$ for high-quality anchor boxes, while $R_{WIoU}$ acts as a distance-based attention factor that significantly amplifies $\mathcal{L}_{IoU}$ for ordinary-quality anchor boxes. $(x, y)$ are the center coordinates of the predicted box and $(x_{gt}, y_{gt})$ those of the ground-truth box; $W_g$ and $H_g$ are the width and height of the ground-truth box, and the superscript $*$ on $(W_g^2 + H_g^2)$ marks this normalization term as excluded from gradient computation. Building on this, anchor quality is evaluated through the outlier degree $\beta$, defined from the IoU loss:
$$\beta = \frac{\mathcal{L}_{IoU}^{*}}{\overline{\mathcal{L}_{IoU}}} \in [0, +\infty)$$
where $\overline{\mathcal{L}_{IoU}}$ denotes the momentum-based moving average of the IoU loss. The outlier degree $\beta$ is therefore the ratio of the current IoU loss to this moving average. When $\beta$ is close to 1, the corresponding sample is regarded as of medium quality and receives a higher gradient gain. Conversely, low-quality samples with very large $\beta$ and high-quality samples with very small $\beta$ are suppressed. This dynamic mechanism lets the model focus on the samples that contribute most to training and blocks harmful gradients introduced by low-quality samples. Consequently, a non-monotonic focusing coefficient derived from $\beta$ is integrated into WIoU v1:
$$\mathcal{L}_{WIoUv3} = r \cdot \mathcal{L}_{WIoUv1}, \qquad r = \frac{\beta}{\delta\,\alpha^{\beta - \delta}}$$
where $\mathcal{L}_{WIoUv3}$ denotes the WIoU v3 loss and $r$ is the non-monotonic focusing coefficient. This design ensures that $r$ yields small gradient gains both for high-quality anchor boxes with small outlier degree $\beta$ and for low-quality anchor boxes with large outlier degree, so that training focuses on anchor boxes of average quality and the interference of outlier samples on model optimization is effectively suppressed. Since $\overline{\mathcal{L}_{IoU}}$ is dynamic, the quality classification criterion for anchor boxes is also dynamic, enabling WIoU v3 to adopt the most appropriate gradient gain allocation strategy at each moment.
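As a concrete illustration, a minimal PyTorch sketch of the WIoU v3 computation is given below. The function name, the box format (x1, y1, x2, y2), the caller-supplied running mean, and the hyperparameter values $\alpha = 1.9$ and $\delta = 3$ (taken from the defaults reported in [19]) are assumptions for illustration; the normalization term follows the ground-truth box size as defined above.

```python
# Sketch of the WIoU v3 bounding box loss outlined in Section 2.2.3.
import torch

def wiou_v3(pred, target, iou_mean, alpha=1.9, delta=3.0):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2); iou_mean: running mean of L_IoU."""
    # Plain IoU loss L_IoU = 1 - IoU.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = (area_p + area_t - inter).clamp(min=1e-7)
    l_iou = 1.0 - inter / union

    # R_WIoU: distance-sensitive attention from centre offsets, normalised by the
    # ground-truth box size (the '*' in the text marks this term as excluded from
    # gradient computation; ground-truth sizes carry no gradient here anyway).
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    w_g = target[:, 2] - target[:, 0]
    h_g = target[:, 3] - target[:, 1]
    r_wiou = torch.exp(((cxp - cxt) ** 2 + (cyp - cyt) ** 2)
                       / (w_g ** 2 + h_g ** 2).clamp(min=1e-7))

    # Outlier degree beta and the non-monotonic focusing coefficient r.
    beta = l_iou.detach() / iou_mean
    r = beta / (delta * alpha ** (beta - delta))
    return (r * r_wiou * l_iou).mean()

pred = torch.tensor([[10.0, 10.0, 50.0, 50.0]])
gt = torch.tensor([[12.0, 12.0, 48.0, 52.0]])
print(wiou_v3(pred, gt, iou_mean=torch.tensor(0.3)))
```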

3. Experiments

3.1. Performance Evaluation

To evaluate model performance, the experiments use two primary metrics: mAP@0.5 for an initial assessment and mAP@0.5:0.95 for a comprehensive analysis. The former requires predicted bounding boxes to reach an intersection-over-union (IoU) of 0.5 with the ground-truth annotations. The latter averages precision over IoU thresholds from 0.5 to 0.95, giving a holistic assessment of localization quality under varying criteria [27]. Supplementary metrics include precision and recall, while computational cost is quantified by the parameter count (Params) and giga floating-point operations (GFLOPs). Mean average precision (mAP) is obtained by averaging the category-specific AP values over all small-target images, and is expressed as
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where $AP_i$ denotes the average precision of the $i$-th category, obtained as the area under the precision-recall curve across confidence thresholds, and $N$ denotes the total number of target categories.
Precision (P) denotes the ratio of accurately identified small drone targets to all predicted target samples. A higher precision value signifies reduced false positives in intricate backgrounds. Recall (R) assesses the ratio of accurately identified small drone targets to the total number of real small targets [28]. A higher recall rate signifies decreased false negatives in densely populated, subtly textured environments. The equations for precision and recall are depicted below:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
where $TP$ represents the count of accurately identified targets, $FP$ the number of false positive detections, and $FN$ the number of targets that were missed.
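The short example below works through these definitions; the detection counts and per-class AP values are hypothetical numbers used only for illustration, not results from the paper.

```python
# Worked example of the precision, recall, and mAP definitions above.
tp, fp, fn = 80, 12, 25                        # illustrative detection counts for one class
precision = tp / (tp + fp)                     # P = TP / (TP + FP)
recall = tp / (tp + fn)                        # R = TP / (TP + FN)

ap_per_class = [0.62, 0.41, 0.18, 0.55]        # hypothetical AP_i values
map50 = sum(ap_per_class) / len(ap_per_class)  # mAP = (1/N) * sum(AP_i)

print(f"P={precision:.3f}  R={recall:.3f}  mAP@0.5={map50:.3f}")
```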

3.2. Experimental Setup

Dataset: We conducted quantitative experiments on two object detection datasets, VisDrone2019 [29] and UAVVaste [30]. The VisDrone2019 dataset includes 6471 training images, 548 validation images, and 3190 test images, all captured from drones at different locations and altitudes, providing over 2.6 million finely annotated instances across 10 object categories, including pedestrians, cars, and buses. A total of 72.3% of the objects have an area smaller than 32 × 32 pixels. With its dense object distribution (up to over 200 objects per frame) and complex background characteristics, it has become the standard benchmark for small object detection in drone scenarios. UAVVaste is a dataset specifically designed for aerial debris detection, comprising 772 street, park, and lawn scenes with 3716 annotated instances. The targets primarily consist of low-texture debris (such as plastic bags and cans), with 92.1% of target areas smaller than 20 × 20 pixels and an average signal-to-noise ratio below 2 dB. It is specifically designed to validate model robustness in low-contrast micro-target scenarios. When used together, the two datasets enable a comprehensive evaluation of a model’s detection capabilities, ranging from ordinary small targets to extremely tiny targets.
Experimental Details: Training was conducted on dual NVIDIA GeForce RTX 3080 GPUs under Ubuntu 22.04, using CUDA 11.3/cuDNN 8.5, PyTorch 2.0.0, and Python 3.10. The AdamW optimizer [31] was configured with a learning rate of 0.0001, a momentum of 0.9, and a batch size of 36, and training ran for 400 epochs. Data augmentation included mixup [32] with a probability of 0.2 and Mosaic [33] with a probability of 1.0.
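For reference, this configuration could be expressed roughly as follows with the Ultralytics training API; the checkpoint name, dataset YAML path, and device string are placeholders, and the authors' actual training script may differ.

```python
# Sketch of the training configuration described above (placeholder paths/names).
from ultralytics import YOLO

model = YOLO("yolo12m.pt")      # YOLOv12m baseline (placeholder weights file)
model.train(
    data="VisDrone.yaml",       # dataset definition (placeholder path)
    epochs=400,
    batch=36,
    optimizer="AdamW",
    lr0=0.0001,                 # initial learning rate
    momentum=0.9,
    mixup=0.2,                  # mixup probability
    mosaic=1.0,                 # Mosaic probability
    device="0,1",               # dual RTX 3080 GPUs
)
```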

3.3. Experimental Results

3.3.1. Training Results and Analysis

The experimental results in Figure 7 show that the improved model achieves significant gains over the baseline in the drone small object detection task. As training proceeds, the training bounding box loss (train/box_loss), validation bounding box loss (val/box_loss), and validation classification loss (val/cls_loss) all decline steadily, demonstrating stable convergence and effective parameter optimization. As shown in Figure 8, the mean average precision (mAP@0.5) across all categories reaches 44.7%, an improvement of 3.3% over the baseline's 41.4%. The gains are particularly pronounced for key small object categories: bicycle detection accuracy increases from 16.5% to 18.6%, tricycle from 29.0% to 34.0%, and awning-tricycle from 14.5% to 17.3%. Precision, recall, and mAP all improve, indicating that the model learns more refined features from the training data, thereby reducing errors and improving detection performance.

3.3.2. Comparative Experiment

As shown in Table 2, we validate the model's efficacy by comparing it with various state-of-the-art detectors trained on the same dataset, covering general lightweight detectors, drone-specific models, general end-to-end detectors, drone-specific end-to-end lightweight models, and our LCW-YOLO. On the VisDrone2019 dataset, LCW-YOLO outperforms the baseline YOLOv12m, with mAP@0.5:0.95 rising from 26.9% to 30.6% and mAP@0.5 from 46.0% to 49.3%. Meanwhile, GFLOPs decrease from 67.2 to 65.5 and the parameter count drops from 20.2 M to 19.8 M, providing a clear accuracy advantage at a lower computational cost.
The LCW-YOLO model proposed in this study demonstrates significant advantages in drone small target detection, delivering highly competitive accuracy with only 19.8 M parameters and a computational complexity of 65.5 GFLOPs. Experiments on the VisDrone2019 benchmark show that, compared with detectors of similar or higher computational cost, such as UAV-DETR-R18, the efficient end-to-end drone detection model released by Fudan University in January 2025 (20 M parameters, 77 GFLOPs), LCW-YOLO improves mAP@0.5:0.95 by 0.8% and mAP@0.5 by 0.5%. Even against models that rely on large-scale pre-training, such as PP-YOLOE-P2-Alpha-l with 54.1 M parameters and 111.4 GFLOPs, it retains an absolute advantage of 0.5% mAP@0.5:0.95 and 0.4% mAP@0.5.
To further demonstrate the flexibility of LCW-YOLO, the method was evaluated on the UAVVaste dataset, whose data scale is much smaller. As shown in Table 3, despite the low-texture targets and lighting reflections present in this dataset, the model's mAP@0.5 still increases by 1.9% and its mAP@0.5:0.95 by 1.6%. LCW-YOLO maintains robust performance with fewer parameters, validating its low dependency on labeled data and strong generalization. Through multi-scale feature fusion and its attention mechanism design, the model effectively addresses the challenges of small object detection in drone imagery, providing a solution that combines high accuracy with low resource consumption for edge-based real-time processing.

3.3.3. Ablation Experiment

To validate the effectiveness of the proposed improvement strategy, we conducted ablation experiments on the VisDrone2019 dataset within the LCW-YOLO framework, gradually introducing the three core modules and analyzing their contributions to detection accuracy. Table 4 compares performance under different configurations, where LA2C2f denotes the lightweight area attention module, CAIM denotes the convolution and attention integration module added to the base model, and WIoU v3 denotes the replacement weighted intersection-over-union loss. Introducing CAIM, LA2C2f, and WIoU v3 step by step improves performance progressively. Among them, CAIM, built around RMCAM, contributes the most, improving mAP@0.5 by 1.5%, indicating that the multi-dimensional collaborative attention effectively balances the optimization requirements of small targets. This demonstrates the cumulative impact of each module, while the modular structure also supports plug-and-play use and is compatible with various CNN/Transformer architectures. Ultimately, the complete model achieves a 3.3% mAP@0.5 improvement over the baseline while reducing the parameter count by 0.4 M (19.8 M vs. 20.2 M), validating the effectiveness of the modular collaborative optimization.
As shown in Table 5, embedding CAIM and RMCAM into the backbone network can reduce the number of parameters while maintaining accuracy. However, embedding CAIM with RMCAM into the neck and embedding CAIM with MCAM into the backbone network both result in performance degradation compared to our method. This suggests that low-level feature enhancement is more suitable for small object detection in UAVs, while the high-level feature attention mechanism in the neck may introduce noise.
As shown in Table 6, the combination of 3 × 5 parallel multi-scale convolutions and 1 × 1 compression significantly outperforms the other strategies. Compared with direct concatenation, the 1 × 1 convolution reduces feature redundancy, improving mAP@0.5:0.95 by 0.2%, and the parallel design achieves higher accuracy than the serial structure, demonstrating that cross-scale parallel branches model the multi-granularity features of small objects more effectively.
Based on the experimental results in Table 4, Table 5 and Table 6, the proposed model shows clear advantages in small target detection for unmanned aerial vehicles (UAVs). First, through a multi-scale feature enhancement strategy comprising 3 × 5 parallel convolutions and 1 × 1 feature compression, the model improves mAP@0.5 on the VisDrone2019 dataset, markedly strengthening its perception of the texture and contours of small targets. Second, thanks to its lightweight design, the final model contains only 19.8 M parameters with the computational load held at 65.5 GFLOPs, offering an efficient solution for real-time processing on drone edge devices. These results indicate that by jointly optimizing multi-scale feature fusion, attention mechanisms, and the dynamic loss function, the model achieves a balance between accuracy and efficiency for small target detection in drone images.

3.3.4. Visualization

A well-known limitation of deep learning models is their poor interpretability, which has hindered their development and application to some extent. To illustrate the detection performance of the proposed model intuitively, we conducted comparative experiments and analyzed detection behavior through heatmaps. Finally, to verify the generalizability of our method, we selected 16 representative images from each of the two datasets and ran inference experiments.
As shown in Figure 9, heatmaps for YOLOv12m and LCW-YOLO were generated on the VisDrone2019 dataset by backpropagating through the bounding box predictions. Compared with the baseline, LCW-YOLO locates small targets noticeably better: its heatmaps assign higher activation values to small objects, indicating a stronger ability to capture their features, and it places greater emphasis on the contextual information around them. The areas highlighted in yellow demonstrate the model's proficiency in detecting partially occluded objects. Importantly, these gains do not come at the cost of real-time performance. As shown in Table 7, we measured the frames per second (FPS) and average power consumption of YOLOv12m and LCW-YOLO using 32-bit floating-point precision in PyTorch. LCW-YOLO preserves the real-time performance of YOLOv12m while reducing average power consumption by 14%, which is crucial for persistent drone missions with limited battery capacity, and further frame-rate improvements can be expected after deployment on dedicated hardware platforms [48].
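As an illustration of how such attention heatmaps can be produced, the sketch below applies a generic Grad-CAM-style procedure: it backpropagates a detection score through a chosen feature layer and weights that layer's activations by the spatially averaged gradients. The tiny stand-in network, the layer choice, and the scalar score are placeholders; the paper does not specify the visualization tooling used for Figure 9.

```python
# Generic Grad-CAM-style heatmap sketch using plain PyTorch hooks.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(                        # stand-in for a detector backbone
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
target_layer = model[2]
feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

img = torch.randn(1, 3, 128, 128, requires_grad=True)
score = model(img).sum()                      # in practice: a box/objectness score
score.backward()

weights = grads["g"].mean(dim=(2, 3), keepdim=True)            # channel-wise weights
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted activation map
cam = F.interpolate(cam, size=img.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-7)       # normalise to [0, 1]
print(cam.shape)  # torch.Size([1, 1, 128, 128])
```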
To demonstrate the detection effectiveness of this method more intuitively, we selected representative scenes such as urban roads and traffic intersections from the VisDrone2019 dataset and green roads from the UAVVaste dataset as experimental data [49]. These scenes contain a large number of diverse small objects, making them suitable for inference experiments. The detection results produced by LCW-YOLO are shown in Figure 10 and Figure 11.
To validate the practical applicability of the proposed method, we deployed the model on a drone platform. Using the onboard camera, we captured real-time video streams for target detection and transmitted the results back to the ground station. Test data were collected from diverse traffic road scenarios featuring numerous small-scale and partially occluded targets, posing significant challenges to the model's inference capability. Real-time inference results are shown in Figure 12: Figure 12a demonstrates detection of moving vehicles, and Figure 12b shows detection of moving pedestrians. The results indicate that the proposed method maintains robust detection performance even in complex real-world scenarios.

4. Conclusions

This paper proposes an explainable computer vision model for drone small target detection based on YOLOv12m, named LCW-YOLO. To enhance spatial perception, a lightweight LA2C2f structure was designed, which significantly improves small object localization accuracy while reducing complexity through heterogeneous multi-scale convolution. The CAIM module combines convolutional local context modeling with the enhanced RMCAM module to fuse cross-dimensional features and strengthen the focus on key regions. Finally, overall robustness is improved through the dynamic, quality-aware gradient weighting of WIoU v3. Experimental results show that LCW-YOLO achieves 34.2% mAP@0.5:0.95 and 51.7% mAP@0.5, outperforming mainstream detectors such as UAV-DETR-R50 (31.5% mAP@0.5:0.95) and YOLOv8-L (26.1% mAP@0.5:0.95), while computational complexity is kept at 65.5 GFLOPs, more than 60% lower than that of UAV-DETR-R50. In real-time testing on edge devices, it reaches an inference speed of 80 frames per second with higher energy efficiency, making it suitable for deployment on resource-constrained embedded platforms. To address the poor explainability of deep learning, we performed comparative experiments and heatmap visualization analysis, demonstrating precise focusing on target regions and further validating the advantages of LCW-YOLO; the baseline model, in contrast, was more susceptible to background noise, leading to false positives and missed detections. Future work will focus on adaptive fusion and lightweight integration of LCW-YOLO for multimodal data spanning multispectral, thermal infrared, and visible light. By incorporating cross-modal attention, dynamic weight fusion, and hardware-software co-optimization, the model will further improve its perception of small targets in complex environments and its inference efficiency, supporting real-time applications in critical domains such as military reconnaissance, medical imaging, and ecological monitoring.

Author Contributions

Conceptualization, D.L.; methodology, D.L.; software, D.L. and Y.Z.; validation, D.L.; formal analysis, D.L. and C.H.; investigation, D.L. and R.B.; resources, X.T.; data curation, Y.Z.; writing—original draft preparation, D.L. and L.H.; writing—review and editing, X.T. and C.H.; visualization, R.B.; supervision, X.T.; project administration, B.L.; funding acquisition, X.T. and B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Research Foundation of Education Bureau of Hunan Province, China (Grant No. 24B0488), the Hunan Student’s Innovation and Entrepreneurship Training Program under grant no. S202410531096X, and the Hunan Student’s Innovation and Entrepreneurship Training Program under grant no. 202510531018.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jin, T.; Wang, W.; Sun, C.; Yu, Z.; Wu, Y.; Chen, X. TGC-YOLO: Detection Model for Small Objects in UAV Image Scene. In Proceedings of the 2024 IEEE International Conference on Cognitive Computing and Complex Data (ICCD), Qinzhou, China, 28–30 September 2024; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2024; pp. 119–124. [Google Scholar] [CrossRef]
  2. Guenther, N.; Schonlau, M. Support vector machines. Stata J. 2016, 16, 917–937. [Google Scholar] [CrossRef]
  3. Liu, M.; Jiang, Q.; Li, H.; Cao, X.; Lv, X. Finite-time-convergent support vector neural dynamics for classification. Neurocomputing 2025, 617, 128810. [Google Scholar] [CrossRef]
  4. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, 8–14 December 2001; Volume 1, p. I. [Google Scholar] [CrossRef]
  5. Zhang, Z.; He, Y.; Mai, W.; Luo, Y.; Li, X.; Cheng, Y.; Huang, X.; Lin, R. Convolutional Dynamically Convergent Differential Neural Network for Brain Signal Classification. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 8166–8177. [Google Scholar] [CrossRef]
  6. Qu, C.; Zhang, L.; Li, J.; Deng, F.; Tang, Y.; Zeng, X.; Peng, X. Improving feature selection performance for classification of gene expression data using Harris Hawks optimizer with variable neighborhood learning. Briefings Bioinform. 2021, 22, bbab097. [Google Scholar] [CrossRef] [PubMed]
  7. Liu, G.; Zhou, W.; Geng, M. Automatic Seizure Detection Based on S-Transform and Deep Convolutional Neural Network. Int. J. Neural Syst. 2020, 30, 1950024. [Google Scholar] [CrossRef] [PubMed]
  8. Liu, G.; Ren, S.; Wang, J.; Zhou, W. Efficient Group Cosine Convolutional Neural Network for EEG-Based Seizure Identification. IEEE Trans. Instrum. Meas. 2025, 74, 1–14. [Google Scholar] [CrossRef]
  9. Liu, G.; Zhang, R.; Tian, L.; Zhou, W. Fine-Grained Spatial-Frequency-Time Framework for Motor Imagery Brain–Computer Interface. IEEE J. Biomed. Health Inform. 2025, 29, 4121–4133. [Google Scholar] [CrossRef]
  10. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  11. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
  12. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv 2019, arXiv:1905.02244. [Google Scholar] [CrossRef]
  13. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar] [CrossRef]
  14. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and Efficient IOU Loss for Accurate Bounding Box Regression. arXiv 2022, arXiv:2101.08158. [Google Scholar] [CrossRef]
  15. Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression. arXiv 2022, arXiv:2205.12740. [Google Scholar] [CrossRef]
  16. Qin, Z.; Weian, G. Survey on deep learning-based small object detection algorithms. Appl. Res. Comput. 2025, 1–14. [Google Scholar] [CrossRef]
  17. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  18. Liu, G.; Zhang, J.; Chan, A.B.; Hsiao, J.H. Human attention guided explainable artificial intelligence for computer vision models. Neural Netw. Off. J. Int. Neural Netw. Soc. 2024, 177, 106392. [Google Scholar] [CrossRef]
  19. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  20. Zhang, M.; Ye, S.; Zhao, S.; Wang, W.; Xie, C. Pear Object Detection in Complex Orchard Environment Based on Improved YOLO11. Symmetry 2025, 17, 255. [Google Scholar] [CrossRef]
  21. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the 13th European Conference on Computer Vision (ECCV)—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; PT, V., Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2014; Volume 8693, pp. 740–755. [Google Scholar] [CrossRef]
  22. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  23. Yu, Y.; Zhang, Y.; Cheng, Z.; Song, Z.; Tang, C. MCA: Multidimensional collaborative attention in deep convolutional neural networks for image recognition. Eng. Appl. Artif. Intell. 2023, 126, 107079. [Google Scholar] [CrossRef]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  25. Li, Y.; Yuan, G.; Wen, Y.; Hu, J.; Evangelidis, G.; Tulyakov, S.; Wang, Y.; Ren, J. EfficientFormer: Vision Transformers at MobileNet Speed. arXiv 2022, arXiv:2206.01191. [Google Scholar]
  26. Han, D.; Ye, T.; Han, Y.; Xia, Z.; Pan, S.; Wan, P.; Song, S.; Huang, G. Agent Attention: On the Integration of Softmax and Linear Attention. arXiv 2024, arXiv:2312.08874. [Google Scholar]
  27. Badgujar, C.M.; Poulose, A.; Gan, H. Agricultural object detection with You Only Look Once (YOLO) Algorithm: A bibliometric and systematic literature review. Comput. Electron. Agric. 2024, 223, 109090. [Google Scholar] [CrossRef]
  28. Huang, J.; Wang, K.; Hou, Y.; Wang, J. LW-YOLO11: A Lightweight Arbitrary-Oriented Ship Detection Method Based on Improved YOLO11. Sensors 2025, 25, 65. [Google Scholar] [CrossRef]
  29. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399. [Google Scholar] [CrossRef]
  30. Kraft, M.; Piechocki, M.; Ptak, B.; Walas, K. Autonomous, Onboard Vision-Based Trash and Litter Detection in Low Altitude Aerial Images Collected by an Unmanned Aerial Vehicle. Remote Sens. 2021, 13, 965. [Google Scholar] [CrossRef]
  31. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101. [Google Scholar] [CrossRef]
  32. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. arXiv 2018, arXiv:1710.09412. [Google Scholar] [CrossRef]
  33. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  34. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. Computer Software. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 9 August 2025).
  35. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  36. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  37. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  38. Authors, P. Paddledetection: Object Detection and Instance Segmentation Toolkit Based on PaddlePaddle. 2019. Available online: https://github.com/PaddlePaddle/PaddleDetection (accessed on 29 August 2025).
  39. Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 13658–13667. [Google Scholar] [CrossRef]
  40. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered Object Detection in Aerial Images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8310–8319. [Google Scholar] [CrossRef]
  41. Xu, C.; Ding, J.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Dynamic Coarse-to-Fine Learning for Oriented Tiny Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7318–7328. [Google Scholar] [CrossRef]
  42. Tang, S.; Zhang, S.; Fang, Y. HIC-YOLOv5: Improved YOLOv5 For Small Object Detection. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 6614–6619. [Google Scholar] [CrossRef]
  43. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  44. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2021, arXiv:2010.04159. [Google Scholar]
  45. Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity. arXiv 2022, arXiv:2111.14330. [Google Scholar]
  46. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar] [CrossRef]
  47. Zhang, H.; Liu, K.; Gan, Z.; Zhu, G.N. UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery. arXiv 2025, arXiv:2501.01855. [Google Scholar]
  48. Minh, H.T.; Mai, L.; Minh, T.V. Performance Evaluation of Deep Learning Models on Embedded Platform for Edge AI-Based Real time Traffic Tracking and Detecting Applications. In Proceedings of the 2021 15th International Conference on Advanced Computing and Applications (ACOMP), Electr Network, Virtual, 24–26 November 2021; Le, L., Nguyen, H., Phan, T., Clavel, M., Dang, T., Eds.; IEEE: New York, NY, USA, 2021; pp. 128–135. [Google Scholar] [CrossRef]
  49. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A Small-Object-Detection Model Based on Improved YOLOv8 for UAV Aerial Photography Scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef]
Figure 1. YOLOv12 model architecture.
Figure 2. LCW-YOLO model architecture.
Figure 3. The 3 × 3 and 5 × 5 parallel dual separable convolution structure.
Figure 4. LA2C2f structure. The LAAttn module in the ABlock denotes lightweight area attention.
Figure 5. RMCAM structure.
Figure 6. CAIM structure. The RMCAM module denotes Residual Multi-dimensional Collaborative Attention.
Figure 7. LCW-YOLO model training results.
Figure 8. The comparison training results between YOLOv12m and LCW-YOLO.
Figure 9. YOLOv12-m and LCW-YOLO heatmaps. Brighter areas in the heatmap correspond to higher model attention. Our method demonstrates intensified focus on small targets and their surrounding contexts, with two critical improvements visualized: the yellow box highlights enhanced detection of occluded objects, while the red box shows suppressed attention to background noise.
Figure 10. Examples of LCW-YOLO detection results on the Visdrone dataset.
Figure 11. Examples of LCW-YOLO detection results on the UAVVaste dataset.
Figure 12. Example of real-time data detection results. (a) Model’s detection performance on moving vehicles; (b) model’s detection performance on moving pedestrians.
Table 1. Model accuracy and inference speed under different convolution kernel sizes.
Kernel | APval (50:95) | Latency
3 × 3 | 40.4 | 1.60
5 × 5 | 40.4 | 1.61
7 × 7 | 40.6 | 1.64
9 × 9 | 40.7 | 1.79
Table 2. The performance comparison of different detection methods on the VisDrone2019 dataset.
Model | Input Size | Params (M) | GFLOPs | mAP@0.5:0.95 | mAP@0.5
Real-time Object Detectors
YOLOv8-M [34] | 640 × 640 | 25.9 | 78.9 | 24.6 | 40.7
YOLOv8-L [34] | 640 × 640 | 43.7 | 165.2 | 26.1 | 42.7
YOLOv9-S [35] | 640 × 640 | 7.2 | 26.7 | 22.7 | 38.3
YOLOv9-M [35] | 640 × 640 | 20.1 | 76.8 | 25.2 | 42.0
YOLOv10-M [36] | 640 × 640 | 15.4 | 59.1 | 24.5 | 40.5
YOLOv10-L [36] | 640 × 640 | 24.4 | 120.3 | 26.3 | 43.1
YOLOv11-S [37] | 640 × 640 | 9.4 | 21.3 | 23.0 | 28.7
YOLOv11-M [37] | 640 × 640 | 20.0 | 67.7 | 25.9 | 43.1
YOLOv12-M [17] | 640 × 640 | 20.2 | 67.2 | 26.9 | 46.0
UAV-Specific Detectors
PP-YOLOE-P2-Alpha-l [38] | 640 × 640 | 54.1 | 111.4 | 30.1 | 48.9
QueryDet [39] | 2400 × 2400 | 33.9 | 212 | 28.3 | 48.1
ClusDet [40] | 1000 × 600 | 30.2 | 207 | 26.7 | 50.6
DCFL [41] | 1024 × 1024 | 36.1 | 157.8 | - | 32.1
HIC-YOLOv5 [42] | 640 × 640 | 9.4 | 31.2 | 26.0 | 44.3
End-to-end Object Detectors
DETR [43] | 1333 × 750 | 60 | 187 | 24.1 | 40.1
Deformable DETR [44] | 1333 × 800 | 40 | 173 | 27.1 | 42.2
Sparse DETR [45] | 1333 × 800 | 40.9 | 121 | 27.3 | 42.5
RT-DETR-R18 [46] | 640 × 640 | 20 | 60.0 | 26.7 | 44.6
RT-DETR-R50 [46] | 640 × 640 | 42 | 136 | 28.4 | 47.0
Real-time E2E Detectors for UAV
UAV-DETR-EV2 [47] | 640 × 640 | 13 | 43 | 28.7 | 47.5
UAV-DETR-R18 [47] | 640 × 640 | 20 | 77 | 29.8 | 48.8
UAV-DETR-R50 [47] | 640 × 640 | 42 | 170 | 31.5 | 51.5
Proposed UAV Detector
LCW-YOLO (ours) | 640 × 640 | 19.8 | 65.5 | 30.6 | 49.3
Table 3. The performance comparison of different detection methods on the UAVVaste dataset.
Model | Params (M) | GFLOPs | mAP@0.5:0.95 | mAP@0.5
YOLOv11-S [37] | 9.4 | 21.3 | 27.8 | 63.0
HIC-YOLOv5 [42] | 9.4 | 31.2 | 30.5 | 65.1
RT-DETR-R18 [46] | 20.0 | 57.3 | 36.3 | 72.6
RT-DETR-R50 [46] | 42.0 | 129.9 | 37.4 | 73.5
UAV-DETR-EV2 [47] | 13 | 43 | 37.1 | 70.6
UAV-DETR-R18 [47] | 20 | 77 | 37.0 | 74.0
UAV-DETR-R50 [47] | 42 | 170 | 37.5 | 75.9
YOLOv12-M [17] | 20.2 | 67.2 | 35.7 | 73.2
LCW-YOLO (ours) | 19.8 | 65.5 | 37.3 | 75.1
Table 4. The detection effect of the improvements in the proposed method.
Baseline | CAIM | LA2C2f | WIoU v3 | Params (M) | GFLOPs | mAP@0.5:0.95 | mAP@0.5
✓ | - | - | - | 20.2 | 67.2 | 26.9 | 46.0
✓ | ✓ | - | - | 19.8 | 65.8 | 28.4 | 47.5
✓ | ✓ | ✓ | - | 19.8 | 65.5 | 29.8 | 48.5
✓ | ✓ | ✓ | ✓ | 19.8 | 65.5 | 30.6 | 49.3
Note: “✓” denotes the module used in the configuration.
Table 5. The performance comparison of different attention modules and RMCAM placement.
Model | CAIM Location | Params (M) | GFLOPs | mAP@0.5:0.95 | mAP@0.5 | Recall
YOLOv12-M [17] (Baseline) | - | 20.2 | 67.2 | 26.9 | 46.0 | 59.9
YOLOv12-M + MCAM [23] | Backbone | 21.0 | 67.8 | 27.2 | 46.4 | 60.7
YOLOv12-M + RMCAM | Neck | 19.7 | 65.5 | 27.8 | 47.0 | 59.9
YOLOv12-M + RMCAM | Backbone | 19.8 | 65.8 | 28.4 | 47.5 | 61.1
Table 6. The performance comparison of different convolution structure and fusion strategies.
Model | Convolution Type | Feature Fusion | Params (M) | GFLOPs | mAP@0.5:0.95 | mAP@0.5
YOLOv12-M + RMCAM (Backbone) | 7 × 7 convolution | - | 19.8 | 65.8 | 28.4 | 47.5
+ 3 × 5 Parallel | 3 × 5 Parallel convolution | Direct splicing | 19.8 | 65.8 | 29.6 | 48.3
+ 3 × 5 Sequential | 3 × 5 Sequential convolution | Direct splicing | 19.8 | 65.8 | 29.1 | 48.0
+ 3 × 5 Parallel + Concat 1 × 1 | 3 × 5 Parallel convolution | 1 × 1 Convolution compression | 19.8 | 65.5 | 29.8 | 48.5
Table 7. Comparison of model performance metrics.
Model | Params (M) | GFLOPs | FPS | Avg. Power (W) | FPS/W
YOLOv12-m [17] | 20.2 | 67.2 | 76 | 12.5 | 6.08
LCW-YOLO (ours) | 19.8 | 65.5 | 80 | 10.8 | 7.41
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
