Article

MD-YOLO: A Multi-Scale Adaptive and Dual-Attention Enhanced YOLOv11 for Small Object Detection

School of Artificial Intelligence and Computer Science, Jiangsu Normal University, Xuzhou 221116, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(10), 2099; https://doi.org/10.3390/electronics15102099
Submission received: 14 April 2026 / Revised: 11 May 2026 / Accepted: 12 May 2026 / Published: 14 May 2026

Abstract

Recent YOLO-based object detection methods have demonstrated strong performance in real-time applications due to their efficient end-to-end architecture. However, in complex scenarios such as VisDrone2019, existing methods still face limitations in small object detection and multi-scale feature modeling. These bottlenecks stem partly from model-level constraints, namely the loss of low-level spatial details during progressive downsampling and the insufficient preservation of fine-grained structural information in high-level semantic representations during feature propagation, which together limit multi-scale feature representation and fusion; they are also influenced by data-level factors, including long-tailed distributions and spatial distribution bias. To address these limitations, this paper proposes an improved model named MD-YOLO. First, a Multi-scale Adaptive Channel (MAC) module is introduced into the backbone to replace conventional stride-based downsampling, enhancing multi-scale feature representation while preserving fine-grained information. Second, a Dual Attention Feature Fusion (DAFA) module is designed to align features across different resolutions and further enhance fused representations using both channel and spatial attention mechanisms. Furthermore, a high-resolution P2 detection head is incorporated to improve the detection of dense small objects. Experimental results on the VisDrone2019 dataset demonstrate that the proposed method substantially outperforms the YOLOv11s baseline, improving mAP@0.5 from 38.5% to 45.6% and mAP@0.5:0.95 from 22.8% to 27.1%, while maintaining a reasonable computational cost.

1. Introduction

Object detection has been a fundamental task in computer vision [1,2], aiming to accurately localize and classify objects within images. With the rapid development of deep learning, convolutional neural networks (CNNs) have substantially improved detection performance compared with traditional handcrafted feature-based methods, effectively handling complex backgrounds and large variations in object appearance and scale. Two-stage detectors, such as Faster R-CNN [3], achieve high accuracy by generating region proposals, but suffer from relatively slow inference speed. In contrast, single-stage detectors, including SSD [4] and RetinaNet [5], eliminate the proposal stage and achieve a better balance between accuracy and efficiency.
Despite these advances, detecting small and densely distributed objects in complex environments remains a challenging problem. Unmanned Aerial Vehicle (UAV)-based datasets such as VisDrone2019 contain numerous tiny objects with severe occlusions and large scale variations, which greatly increase the difficulty of detection. In such scenarios, existing methods often fail to effectively preserve fine-grained spatial details during downsampling and suffer from insufficient multi-scale representation, leading to inaccurate localization and missed detections.
To enhance multi-scale feature representation capability, the Feature Pyramid Network (FPN) [6] leverages the hierarchical structure of convolutional neural networks to integrate high-level semantic information with low-level spatial details, effectively improving multi-scale feature representation and substantially boosting detection performance. Building upon this idea, subsequent methods such as PANet [7] and EfficientDet [8] further improve feature aggregation strategies, enhancing cross-scale feature propagation and bidirectional information interaction. However, despite these advances, existing pyramid-based detection frameworks still primarily rely on cross-layer feature aggregation and pay insufficient attention to intra-layer multi-scale modeling. In addition, flexible and effective cross-layer feature interaction remains limited in complex scenarios, which restricts the preservation of fine-grained spatial information and the effective representation of scale-aware discriminative features.
In recent years, YOLO-based detectors [9] have attracted extensive attention due to their efficient end-to-end architecture and real-time inference capability. However, even advanced variants such as YOLOv11s [10] still struggle to simultaneously balance multi-scale feature representation, cross-layer semantic fusion, and high-resolution object perception in dense small object scenarios. To address these challenges, this paper systematically improves YOLOv11s from three perspectives: feature extraction, feature fusion, and detection prediction, aiming to enhance small object detection performance in complex UAV scenarios.
Accordingly, this paper proposes an improved object detection framework, namely MD-YOLO. The proposed method enhances the model through three stages, including a multi-scale feature enhancement module, an adaptive feature fusion module, and a high-resolution detection head. These improvements collectively strengthen multi-scale feature representation, cross-layer information interaction, and small object perception capability, enabling more accurate and robust detection in complex environments.
The main contributions of this paper are summarized as follows:
(1)
A Multi-scale Adaptive Channel (MAC) module is proposed to enhance feature representation by capturing multi-scale contextual information while preserving fine-grained details during downsampling.
(2)
A Dual Attention Feature Fusion (DAFA) module is proposed to perform explicit cross-scale feature alignment through multi-resolution unification, followed by adaptive recalibration using channel and spatial attention, which jointly improves feature consistency and enhances multi-scale representation capability.
(3)
A high-resolution detection head (P2) is introduced to improve the localization accuracy of dense small objects by leveraging finer spatial features.

2. Related Work

2.1. Small Object Detection and Detection Paradigms

Small object detection remains a challenging problem in computer vision, particularly in UAV scenarios, where objects are typically small in size, densely distributed, and embedded in complex backgrounds. These characteristics often result in limited spatial resolution and weak semantic representations, which may lead to missed detections and inaccurate localization.
Existing object detection methods can be broadly categorized into two groups: two-stage detectors and one-stage detectors. Two-stage methods, such as Faster R-CNN [3], perform region proposal generation followed by classification and regression in a sequential manner. They generally achieve high detection accuracy in complex scenarios, but often suffer from high computational cost and relatively slow inference speed. In contrast, one-stage methods, such as SSD [4] and YOLO [10], unify object classification and bounding box regression within an end-to-end framework, substantially improving inference efficiency and making them more suitable for real-time UAV applications.
However, this efficient detection paradigm relies on shared feature representations for dense prediction, which introduces inherent limitations for small object detection. Specifically, repeated downsampling in deep networks may weaken the representation of small objects and reduce discriminative information. Meanwhile, complex backgrounds and dense object distributions further increase detection difficulty. In addition, bounding box regression for small objects is highly sensitive to feature quality, making precise localization more challenging.
To address these issues, existing studies mainly focus on enhancing feature representation and improving cross-scale information interaction to strengthen discriminative capability and localization accuracy. Therefore, while maintaining detection efficiency, improving small object representation and enhancing the effectiveness of cross-scale feature utilization remain important challenges in UAV-based detection scenarios.

2.2. Multi-Scale Feature Extraction Methods

In object detection tasks, objects at different scales exhibit significant differences in spatial structure and semantic representation. In UAV scenarios, small objects often occupy only a few pixels, and their discriminative features tend to degrade or be lost during deep feature extraction. Therefore, effectively preserving fine-grained spatial information while enhancing multi-scale representation capability has become an important research direction for improving small object detection performance.
In recent years, multi-scale feature modeling has attracted increasing attention. Traditional convolutional neural networks typically rely on fixed receptive fields, which makes it difficult to simultaneously capture objects at different scales. To mitigate this issue, various approaches have been proposed, including multi-scale convolution kernels, dilated convolutions, and attention mechanisms, to improve the network’s adaptability to scale variations and enhance feature representation quality.
In UAV-based object detection, several methods have further explored explicit multi-scale modeling strategies. For instance, SMFF-YOLO [11] enhances feature extraction by incorporating a spatial pyramid structure, while MFA-YOLO [12] improves detection performance in complex scenarios through multi-feature aggregation, alleviating feature degradation in lightweight networks.
Although these methods have achieved certain improvements in multi-scale representation, many of them still rely on fixed-scale convolutions or post-hoc cross-layer feature aggregation. As a result, explicit multi-scale receptive field modeling within a single layer remains insufficient. This limitation makes it difficult for networks to fully capture discriminative features that integrate both local details and contextual semantics, especially under large scale variations.
To address this issue, a Multi-scale Adaptive Channel (MAC) module is proposed. It constructs parallel multi-scale convolution branches and integrates a channel attention mechanism, enabling explicit multi-scale receptive field modeling within a single layer, thereby improving feature representation and preserving fine-grained information of small objects.

2.3. Multi-Scale Feature Fusion Methods

Cross-layer feature fusion plays a crucial role in improving multi-scale detection performance in modern object detectors. Its primary goal is to effectively combine fine-grained spatial information from shallow layers with rich semantic information from deep layers. However, due to inherent differences in spatial resolution and semantic representation across feature levels, direct fusion may reduce information utilization efficiency and degrade performance in complex scenarios.
Existing multi-scale feature fusion methods can be broadly categorized into three types: bottom-up enhancement methods, bidirectional fusion strategies, and adaptive fusion methods. Bottom-up methods, such as PANet [7], introduce additional pathways to strengthen the propagation of shallow spatial information. Bidirectional methods, such as BiFPN [8], construct both top-down and bottom-up feature flows with learnable weighting mechanisms to enable more efficient cross-scale interaction. In addition, adaptive fusion methods, such as AFPN [13], dynamically adjust the contribution of different feature levels to improve fusion flexibility.
Meanwhile, attention mechanisms have also been widely adopted in feature fusion. Channel and spatial attention modules, such as CBAM [14,15], jointly model channel dependencies and spatial importance to enhance feature selection. Lightweight channel attention methods, such as ECA-Net [16], further improve channel interaction with relatively low computational cost.
Although these approaches have achieved notable progress in cross-scale feature fusion, modeling of scale alignment and dynamic interaction across feature levels still has room for improvement. In particular, in dense small object scenarios, the complementary relationship between shallow details and deep semantic features remains insufficiently exploited, which may limit the quality of fused representations.
To address this issue, a Dual Attention Feature Fusion (DAFA) module is proposed. It first performs resolution matching to align feature scales across different levels, and then applies channel and spatial attention mechanisms to enhance fused features adaptively, thereby improving cross-layer interaction and alleviating semantic inconsistency.

2.4. High-Resolution Detection Strategy

In single-stage object detection frameworks, the resolution of the detection head plays a critical role in small object perception. As the network progressively downsamples feature maps, spatial details of extremely small objects may gradually degrade, making it difficult for deep detection heads to preserve sufficient discriminative information. This may result in reduced localization accuracy and missed detections.
To improve small object detection performance, recent studies have explored the use of high-resolution prediction branches or shallow detection layers, enabling the model to perform predictions on higher-resolution feature maps. This strategy can enhance fine-grained perception and improve localization accuracy.
However, simply introducing high-resolution branches often increases computational cost. Moreover, without sufficient cross-layer semantic support, the utilization of shallow features may still be limited. Therefore, balancing detection efficiency while fully leveraging high-resolution features remains an important research direction.
Based on this motivation, a high-resolution P2 detection layer is introduced into the original YOLOv11s three-scale detection head, forming a four-scale prediction structure. This design enhances the model’s ability to perceive dense small objects and further reduces missed detections in complex UAV scenarios.

3. Proposed Method

3.1. Overall Architecture

To address the challenges of small object detection in complex scenes, this paper proposes an enhanced detection framework, termed MD-YOLO, based on the YOLOv11 architecture. The proposed model follows the classical three-stage paradigm, consisting of a backbone, a neck, and a detection head, while introducing targeted modifications at each stage to improve multi-scale feature representation, feature fusion, and small object perception.
The overall architecture of MD-YOLO is illustrated in Figure 1.

3.2. MAC Module

3.2.1. Motivation of MAC Module

In object detection tasks, significant variations in object scale play a critical role in determining feature representation quality and overall detection performance. As a lightweight detector, YOLOv11s primarily relies on the Feature Pyramid Network (FPN) and path aggregation structures to achieve multi-scale feature fusion. Although such architectures can partially integrate high-level semantic information with low-level spatial details, they essentially operate as passive cross-level fusion mechanisms and lack explicit modeling capabilities for objects at different scales.
Specifically, during the feature fusion process, the inherent discrepancy between feature levels in terms of semantic strength and spatial resolution leads to imbalanced feature representations. Shallow features preserve rich spatial details but suffer from weak semantic representation, whereas deep features contain strong semantic information but experience substantial loss of fine-grained spatial structures due to repeated downsampling. Under this asymmetric condition, the fused features are often dominated by high-semantic responses, which suppress low-response small-object features along the channel dimension, making it difficult to preserve fine-grained information.
Furthermore, conventional upsampling and concatenation operations fail to adaptively regulate features across different scales and ignore distributional discrepancies among feature maps. This limitation becomes more pronounced in complex backgrounds or densely populated small-object scenarios, where small-object features are easily overwhelmed by background noise or large-object semantics, thereby degrading detection accuracy.
Although existing approaches attempt to alleviate these issues by deepening the network or introducing attention mechanisms, they often involve a trade-off between computational complexity and performance improvement, and still struggle to effectively model multi-scale features within a single layer. Therefore, enhancing multi-scale feature representation while maintaining a lightweight architecture remains a critical challenge.
To address these issues, a Multi-scale Adaptive Channel (MAC) module is proposed and embedded into the downsampling stages of the backbone network. Unlike conventional single-scale convolutions, the MAC module constructs multi-scale receptive fields within a single layer and incorporates a channel attention mechanism to adaptively recalibrate feature responses. This design not only enhances multi-scale representation capability but also suppresses redundant information. In addition, a residual connection is introduced to preserve original feature information and improve training stability, thereby providing more reliable feature representations for subsequent feature fusion.

3.2.2. Structure of MAC Module

The Multi-scale Adaptive Channel (MAC) module enhances multi-scale feature representation of UAV targets and addresses the limitations of the baseline YOLOv11s. Unlike YOLOv11s, which relies on single-scale convolutions and FPN-based feature fusion, MAC integrates multi-scale receptive fields within a single layer, reducing scale ambiguity and preserving discriminative information. The module combines multi-scale feature processing, channel attention, and residual connections with optional downsampling, as illustrated in Figure 2.

3.2.3. Multi-Scale Branch

To enhance the representation capability of features with different receptive fields, the proposed MAC module introduces a multi-scale feature processing branch to replace the standard convolution in YOLOv11s. This branch explicitly models fine-grained details, local contextual information, and global receptive features within a unified structure, thereby improving the discrimination ability for objects of different scales.
Given the input feature map
$$x \in \mathbb{R}^{B \times C \times H \times W}$$
where $B$, $C$, $H$, and $W$ denote batch size, channel dimension, height, and width, respectively.
The multi-scale feature representation is formulated as:
$$F_{ms} = \mathrm{SiLU}\big(\mathrm{BN}\big(\mathrm{Concat}(\mathrm{Conv}_{1 \times 1}(x),\ \mathrm{Conv}_{3 \times 3}(x),\ \mathrm{Conv}_{3 \times 3}^{d=3}(x))\big)\big)$$
where $\mathrm{Conv}_{1 \times 1}(x)$, $\mathrm{Conv}_{3 \times 3}(x)$, and $\mathrm{Conv}_{3 \times 3}^{d=3}(x)$ represent point-wise convolution, standard convolution, and dilated convolution with dilation rate 3, respectively.
Specifically, the three convolutional branches capture complementary information at different scales:
  • 1 × 1 convolution focuses on fine-grained feature refinement;
  • 3 × 3 convolution models local spatial context;
  • Dilated convolution enlarges the receptive field to capture broader semantic context.
By integrating multi-scale features within a single stage, the proposed branch reduces scale ambiguity and provides richer feature representations for subsequent attention enhancement.
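For concreteness, a minimal PyTorch sketch of this branch is given below. The per-branch channel width (C per branch, 3C after concatenation), the module name, and the layer ordering are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class MultiScaleBranch(nn.Module):
    """Parallel 1x1, 3x3, and dilated 3x3 convolutions, concatenated and
    normalized, following the multi-scale branch described above.
    The channel layout (C per branch, 3C after concat) is an assumption."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.conv3x3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        # Dilation rate 3 with padding 3 keeps the spatial resolution unchanged.
        self.conv3x3_d3 = nn.Conv2d(channels, channels, kernel_size=3,
                                    padding=3, dilation=3, bias=False)
        self.bn = nn.BatchNorm2d(3 * channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the three receptive-field branches along the channel axis.
        f_ms = torch.cat([self.conv1x1(x), self.conv3x3(x), self.conv3x3_d3(x)], dim=1)
        return self.act(self.bn(f_ms))
```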

3.2.4. Channel Attention Branch

To suppress background interference and enhance discriminative feature channels, a lightweight channel attention branch is introduced on top of the multi-scale feature representation F m s . Unlike conventional SE-like mechanisms [17] that operate on single-scale features, the proposed design applies channel attention after multi-scale aggregation, enabling scale-aware channel reweighting.
Given the multi-scale feature map $F_{ms} \in \mathbb{R}^{B \times 3C \times H \times W}$, the channel attention is formulated as:
$$F_{att} = F_{ms} \odot \mathrm{Reshape}\big(\sigma(\mathrm{ReLU}(W \cdot \mathrm{GMP}(F_{ms}) + b))\big)$$
where $\mathrm{GMP}(\cdot)$ denotes global max pooling, $W$ and $b$ are learnable parameters, $\mathrm{ReLU}(\cdot)$ is the activation function, and $\sigma(\cdot)$ represents the sigmoid function. The operator $\odot$ denotes channel-wise multiplication.
Specifically, global max pooling aggregates spatial information into a compact channel descriptor, which emphasizes the most salient responses across each channel. The learned attention weights are then used to adaptively recalibrate the importance of different channels.
Through this mechanism, channels associated with small objects are selectively enhanced, while background-related responses are suppressed, leading to improved robustness in complex scenes.
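A minimal sketch of this branch is shown below, assuming that the learnable parameters $W$ and $b$ correspond to a single fully connected layer, as written in the equation above.

```python
import torch
import torch.nn as nn

class ChannelAttentionGMP(nn.Module):
    """Channel attention over the 3C-channel multi-scale feature map:
    global max pooling -> linear -> ReLU -> sigmoid -> channel reweighting.
    A single linear layer is assumed; the paper only names W and b."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(channels, channels)
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f_ms: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f_ms.shape
        # Global max pooling keeps the most salient response per channel.
        desc = f_ms.amax(dim=(2, 3))                       # (B, C)
        weights = self.sigmoid(self.relu(self.fc(desc)))   # (B, C)
        # Reshape to (B, C, 1, 1) and recalibrate each channel.
        return f_ms * weights.view(b, c, 1, 1)
```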

3.2.5. Residual Connection Branch

To preserve information flow and stabilize gradient propagation, a residual connection is introduced to fuse the attention-enhanced features with the original input. The proposed design further integrates channel compression and optional downsampling to ensure efficient feature transformation and dimensional consistency.
Given the attention-refined feature $F_{att} \in \mathbb{R}^{B \times 3C \times H \times W}$ and the input feature $x \in \mathbb{R}^{B \times C \times H \times W}$, the fusion process is defined as:
$$Z = \hat{Y} + R$$
where
$$\hat{Y} = \begin{cases} \mathrm{AvgPool}_{s}\big(\mathrm{Conv}_{1 \times 1}(\mathrm{Concat}(F_{att}, x))\big), & s > 1 \\ \mathrm{Conv}_{1 \times 1}(\mathrm{Concat}(F_{att}, x)), & s = 1 \end{cases}$$
$$R = \begin{cases} \mathrm{Conv}_{1 \times 1}^{s}(x), & C \neq C_{out} \ \text{or} \ s > 1 \\ x, & \text{otherwise} \end{cases}$$
Here, $\mathrm{Concat}(\cdot)$ denotes channel-wise concatenation, and $\mathrm{Conv}_{1 \times 1}(\cdot)$ is used to compress the concatenated features from $4C$ to $C_{out}$. The operator $\mathrm{AvgPool}_{s}(\cdot)$ represents average pooling with stride $s$, enabling spatial downsampling when required.
All in all, the main branch first aggregates the attention-enhanced features and the original input via concatenation, followed by channel compression. When spatial resolution reduction is needed, average pooling is applied to the fused features. Meanwhile, the residual branch performs identity mapping or linear projection to match spatial and channel dimensions.
Through this design, the proposed module maintains efficient information propagation while ensuring structural flexibility, which contributes to stable optimization and improved feature representation.
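The sketch below puts the fusion and residual branches together under the channel layout implied by the equations above ($F_{att}$ with 3C channels, $x$ with C channels); the class name, argument names, and pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class MACFusion(nn.Module):
    """Concat(F_att, x) -> 1x1 compression (4C -> C_out) -> optional average
    pooling, plus a residual path with identity or strided 1x1 projection."""
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        # Main branch: compress the 4C concatenated features to C_out.
        self.compress = nn.Conv2d(4 * in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()
        # Residual branch: project when channels or resolution change.
        if in_channels != out_channels or stride > 1:
            self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                      stride=stride, bias=False)
        else:
            self.shortcut = nn.Identity()

    def forward(self, f_att: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        y = self.pool(self.compress(torch.cat([f_att, x], dim=1)))
        return y + self.shortcut(x)
```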

3.3. DAFA Module

3.3.1. Motivation of DAFA Module

In YOLOv11s, multi-scale feature fusion is primarily achieved through FPN and PAN structures, where features from different levels are directly fused via upsampling and concatenation. However, such operations ignore differences in spatial resolution and feature distribution, leading to suboptimal feature aggregation.
Shallow features provide high-resolution spatial details but lack semantic richness, whereas deeper features contain strong semantic information but lack fine-grained spatial details. Direct fusion without spatial resolution unification may introduce scale inconsistency and semantic conflict, which adversely affects small object detection.
Moreover, conventional fusion strategies lack adaptive mechanisms to emphasize informative regions and suppress background noise, resulting in limited discriminability in complex UAV scenarios.
To address these limitations, the proposed DAFA module rescales multi-scale features to a unified spatial resolution, followed by channel and spatial attention mechanisms to enhance feature representation.

3.3.2. Structure of DAFA Module

To further exploit fine-grained information from shallow features and achieve effective multi-scale feature fusion, a Dual Attention Feature Fusion (DAFA) module is proposed and embedded into the detection head, as illustrated in Figure 3. The module takes multi-level features as input, rescales them to a unified spatial resolution to construct consistent feature representations, and introduces dual attention mechanisms to enhance discriminative information while suppressing background noise. The DAFA module consists of three components: multi-scale resolution unification and feature fusion, channel attention, and spatial attention.

3.3.3. Resolution Unification and Feature Fusion

Given multi-level input features $\{X_i\}_{i=1}^{N}$, where $X_i \in \mathbb{R}^{B \times C_i \times H_i \times W_i}$, the resolution unification process is defined as:
$$F = \mathrm{Concat}(\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_N)$$
where
$$\tilde{X}_i = \begin{cases} X_i, & (H_i, W_i) = (H^{*}, W^{*}) \\ \mathrm{Interp}_{\mathrm{bilinear}}(X_i, H^{*}, W^{*}), & \text{otherwise} \end{cases}$$
and $(H^{*}, W^{*})$ denotes the reference spatial resolution.
The fused feature is $F \in \mathbb{R}^{B \times C_{tot} \times H^{*} \times W^{*}}$, where $C_{tot} = \sum_{i=1}^{N} C_i$.
Unlike YOLOv11s, which directly concatenates multi-scale features without explicit resolution unification, the proposed strategy ensures resolution consistency before fusion, thereby reducing representation inconsistency. This design enables effective integration of high-resolution details from shallow layers and semantic information from deeper layers, resulting in more coherent and discriminative feature representations.
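A minimal sketch of the resolution unification step is given below, assuming bilinear interpolation to the reference resolution as in the equation above; the function name and the choice of reference size are assumptions.

```python
import torch
import torch.nn.functional as F

def unify_and_fuse(features, ref_size):
    """Bilinearly rescale each feature map to (H*, W*) and concatenate
    along the channel axis; ref_size is assumed to be the target resolution."""
    resized = [
        f if f.shape[-2:] == ref_size
        else F.interpolate(f, size=ref_size, mode="bilinear", align_corners=False)
        for f in features
    ]
    return torch.cat(resized, dim=1)  # (B, sum(C_i), H*, W*)
```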

3.3.4. Channel Attention

To enhance the discriminative representation of fused multi-scale features, a channel attention mechanism is introduced to adaptively recalibrate channel-wise responses.
Given the fused feature map $F \in \mathbb{R}^{B \times C_{tot} \times H \times W}$, the channel attention process is formulated as:
$$F_c = F \odot \sigma\big(W_2\, \delta(W_1\, \mathrm{GAP}(F))\big)$$
where $\mathrm{GAP}(\cdot)$ denotes global average pooling, $\delta(\cdot)$ is the ReLU activation function, and $\sigma(\cdot)$ is the sigmoid function. $W_1$ and $W_2$ are learnable parameters, and $\odot$ denotes channel-wise multiplication.
Unlike YOLOv11s, which lacks explicit channel-wise feature recalibration in its detection head, the proposed mechanism dynamically models inter-channel dependencies to enhance informative feature responses while suppressing redundant activations. Compared with SE attention applied on single-scale features, the proposed channel attention is performed after resolution unification and feature fusion, enabling more effective interaction between cross-scale semantic representations.
Through this design, channel-wise discriminative features associated with small objects are strengthened, thereby improving robustness in complex UAV environments.
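A minimal sketch of this SE-style recalibration is shown below; the reduction ratio is an assumption, since the text only names the learnable matrices $W_1$ and $W_2$.

```python
import torch
import torch.nn as nn

class DAFAChannelAttention(nn.Module):
    """GAP -> W1 -> ReLU -> W2 -> sigmoid -> channel-wise reweighting of the
    fused multi-scale feature map. The reduction ratio r is an assumption."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        w = self.fc(f.mean(dim=(2, 3)))   # global average pooling -> (B, C)
        return f * w.view(b, c, 1, 1)
```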

3.3.5. Spatial Attention

To further enhance spatial localization ability, a spatial attention mechanism is applied to the channel-refined features to emphasize target-relevant regions.
Given the channel-enhanced feature $F_c \in \mathbb{R}^{B \times C_{tot} \times H \times W}$, the spatial attention is defined as:
$$F_{out} = F_c \odot \sigma\big(\mathrm{Conv}_{7 \times 7}([\mathrm{Avg}(F_c); \mathrm{Max}(F_c)])\big)$$
where $[\cdot\,;\cdot]$ denotes channel-wise concatenation, and $\mathrm{Avg}(\cdot)$ and $\mathrm{Max}(\cdot)$ represent average pooling and max pooling along the channel dimension, respectively.
Unlike YOLOv11s, which does not explicitly model spatial attention, the proposed mechanism adaptively highlights spatially salient regions while suppressing background interference. Compared with CBAM-based spatial attention applied on single-scale features, the proposed approach operates on multi-scale fused features after resolution unification, enabling more accurate localization of small objects in complex UAV scenes.
By jointly modeling spatial saliency and channel-wise importance, the proposed module effectively improves feature discriminability and object localization precision.
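A CBAM-style sketch of this spatial attention step is given below, following the 7×7 convolution over the concatenated channel-wise average and max maps described above; the class name is an assumption.

```python
import torch
import torch.nn as nn

class DAFASpatialAttention(nn.Module):
    """Channel-wise average and max maps are concatenated and passed through
    a 7x7 convolution to produce a single-channel spatial mask."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f_c: torch.Tensor) -> torch.Tensor:
        avg_map = f_c.mean(dim=1, keepdim=True)   # (B, 1, H, W)
        max_map = f_c.amax(dim=1, keepdim=True)   # (B, 1, H, W)
        mask = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return f_c * mask
```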

3.4. Detection Head with P2 Layer

To further enhance the detection performance of small objects in UAV scenarios, a high-resolution P2 detection layer is introduced into the detection head of the proposed MD-YOLO framework. Compared with YOLOv11s, which adopts a three-scale detection paradigm (P3, P4, P5), the proposed method extends the detection hierarchy to four scales (P2, P3, P4, P5), enabling the network to exploit higher-resolution feature representations for small object localization.
Formally, the multi-scale detection outputs are defined as:
$$Y = \mathrm{Detect}(F_{P2}, F_{P3}, F_{P4}, F_{P5})$$
where $F_{P2}$ denotes the newly introduced high-resolution feature map with a downsampling stride of 4, while $F_{P3}$, $F_{P4}$, and $F_{P5}$ are inherited from the original YOLOv11s detection head.
The introduction of the P2 layer effectively improves the representation capacity for small objects, as higher-resolution feature maps preserve more fine-grained spatial details, which are often lost in deeper layers. This design is particularly beneficial for UAV-based detection tasks where objects typically occupy only a few pixels in the image.
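For orientation, the snippet below lists the feature-map resolutions of the four prediction scales for a 640 × 640 input, using the stated stride of 4 for P2 and the standard YOLO strides for P3–P5.

```python
# Feature map resolutions for a 640x640 input under the four-scale head.
# P2 (stride 4) is the added high-resolution branch; P3-P5 follow YOLOv11s.
strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
for level, s in strides.items():
    print(f"{level}: stride {s:>2} -> {640 // s}x{640 // s} feature map")
# P2: 160x160, P3: 80x80, P4: 40x40, P5: 20x20
```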
Because the P2 detection branch operates on higher-resolution feature maps, it increases sensitivity to extremely small objects but also risks capturing redundant background responses. To strengthen the discriminative power of these high-resolution features, the MAC module is integrated into the backbone. It refines multi-scale feature representations and enhances informative channel responses, providing more discriminative inputs before propagation to the high-resolution detection head. This collaborative design enables the model to fully leverage the fine-grained spatial details of the P2 branch while improving localization stability and detection robustness in complex scenarios.
Overall, the proposed four-scale detection head enables a better trade-off between small object sensitivity and computational efficiency, leading to improved recall and overall detection accuracy in complex UAV environments.

3.5. Overall Method Summary

In this section, the overall architecture of the proposed MD-YOLO is summarized. The model follows a unified design philosophy that enhances feature representation, feature fusion, and detection capability in a progressive manner, specifically tailored for small object detection in UAV scenarios.
At the backbone stage, the proposed MAC module strengthens multi-scale feature extraction by integrating multi-branch convolutions, channel attention, and residual learning within a single layer. This design improves the representational capacity of shallow features while preserving fine-grained spatial details, thereby providing more discriminative feature inputs for subsequent layers.
At the neck stage, the DAFA module performs explicit multi-scale feature alignment followed by dual attention refinement. By enforcing resolution consistency prior to fusion and applying both channel and spatial attention mechanisms, DAFA effectively mitigates feature misalignment and semantic inconsistency issues commonly observed in FPN-based architectures, leading to more robust multi-scale feature aggregation.
At the detection head stage, a four-scale prediction strategy is introduced by incorporating an additional P2 detection layer. This design extends the original YOLOv11s three-scale paradigm and enables the network to leverage high-resolution feature maps for improved small object localization, thereby substantially enhancing recall performance in dense and complex scenes.
Overall, the three components are tightly coupled and mutually complementary: MAC enhances feature extraction, DAFA improves feature fusion quality, and the P2 layer strengthens detection sensitivity. This unified design enables MD-YOLO to achieve a better balance between feature discriminability and computational efficiency, making it particularly suitable for UAV-based small object detection tasks.

4. Experiment Settings

4.1. Datasets

The proposed method is evaluated on the VisDrone2019 dataset, a widely used benchmark for UAV-based object detection. The dataset contains 10 object categories, including cars, pedestrians, people, trucks, vans, buses, motorcycles, bicycles, tricycles, and awning tricycles, representing complex urban traffic scenarios characterized by dense small objects and significant scale variations.
To systematically analyze the dataset characteristics, statistical analysis is conducted from three perspectives: category distribution, spatial distribution, and object scale, as illustrated in Figure 4.
From the perspective of category distribution, the dataset exhibits a pronounced long-tail characteristic. Categories such as car and pedestrian dominate in terms of sample quantity, whereas tricycle, awning-tricycle, and bus are relatively underrepresented. This severe class imbalance may bias the model toward high-frequency categories during training, thereby degrading overall detection performance.
In terms of spatial distribution, object centers are primarily concentrated in the lower-middle regions of the images, while objects near the image boundaries are relatively sparse. This indicates a spatial distribution bias, which may negatively affect detection performance in edge regions.
Regarding object scale, the majority of targets belong to the small-object category. Their normalized width and height are predominantly distributed within the range of 0–0.3, indicating compact object sizes with weak feature representations. This substantially increases the difficulty of accurate detection.
Overall, the VisDrone2019 dataset is characterized by a high proportion of small-scale objects, severe class imbalance, and spatial distribution bias. These characteristics pose substantial challenges to conventional object detection models. On the one hand, small objects are prone to information loss during downsampling. On the other hand, semantic inconsistency and redundancy across multi-scale features hinder effective feature fusion.

4.2. Training Settings

All experiments are conducted on a workstation equipped with one NVIDIA RTX 4090 GPU (24 GB memory) and an Intel Xeon Gold 6430 CPU with 16 vCPUs, running Ubuntu 22.04. The implementation is based on Python 3.10, PyTorch 2.1.0, and CUDA 12.1. The input image size is set to 640 × 640, and the batch size is 16. The model is trained using the SGD optimizer with an initial learning rate of 0.01 for a total of 200 epochs.
The experiments are conducted following the official VisDrone2019 benchmark partition protocol, which consists of the VisDrone2019-DET-train subset with 6471 images, the validation subset with 548 images, and the test-dev subset with 1610 images.
During training, the training and validation losses exhibit a consistent convergence trend without significant divergence, indicating good generalization ability. The loss components decrease rapidly in the early stages and gradually stabilize in later epochs, demonstrating that the model effectively learns feature representations and reaches convergence.
Meanwhile, performance metrics such as Precision, Recall, and mAP steadily improve throughout the training process and tend to stabilize after approximately 150 epochs, suggesting that the model has reached convergence. Further training beyond this point yields marginal performance gains. Therefore, setting the total number of training epochs to 200 provides a reasonable balance between model performance and computational efficiency.
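A minimal training invocation matching these settings is sketched below, assuming the Ultralytics training interface and a hypothetical md-yolo.yaml model definition; the argument names follow that library's documented options, and the exact configuration used by the authors may differ.

```python
from ultralytics import YOLO

# Hypothetical MD-YOLO model definition; settings follow the paper:
# 640x640 input, batch size 16, SGD with initial learning rate 0.01, 200 epochs.
model = YOLO("md-yolo.yaml")
model.train(
    data="VisDrone.yaml",   # dataset config following the official split
    imgsz=640,
    batch=16,
    epochs=200,
    optimizer="SGD",
    lr0=0.01,
)
```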

4.3. Evaluation Metrics

To comprehensively evaluate the performance of MD-YOLO for UAV-based small object detection, several widely used metrics are adopted, including Precision, Recall, F1-score, and mean Average Precision (mAP).
In object detection, prediction results are categorized into four fundamental cases: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN).
Specifically:
  • TP denotes correctly detected objects
  • FP represents background regions incorrectly classified as objects
  • FN denotes missed ground-truth objects
  • TN refers to correctly classified background regions
(1)
Precision and Recall
Precision measures the proportion of correctly predicted objects among all predicted positives:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall measures the proportion of correctly detected objects among all ground-truth objects:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Precision reflects the model’s ability to suppress false detections, while Recall indicates its capability to reduce missed detections.
(2)
F1-score
F1-score is defined as the harmonic mean of Precision and Recall:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
It provides a balanced evaluation metric, especially suitable for imbalanced and small object detection scenarios.
(3)
Average Precision (AP) and mean AP (mAP)
For a single class, Average Precision (AP) is defined as the area under the Precision–Recall curve:
$$AP = \int_{0}^{1} P(R)\, dR$$
where $P(R)$ denotes precision as a function of recall.
For multi-class detection tasks, mAP is defined as:
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where $N$ denotes the number of categories and $AP_i$ represents the average precision of the $i$-th class.
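These definitions can be checked numerically with a small helper; the detection counts below are made up purely for illustration.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

def mean_ap(ap_per_class):
    """mAP is the mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)

# Illustrative numbers only.
p, r, f1 = detection_metrics(tp=80, fp=20, fn=40)
print(f"precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")  # 0.80, 0.67, 0.73
```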
In this work, two variants of mAP are reported:
  • mAP@0.5: evaluated at IoU = 0.5
  • mAP@0.5:0.95: averaged over IoU thresholds from 0.5 to 0.95 with step 0.05, providing a stricter evaluation of localization accuracy

5. Experiments

5.1. Comparative Experiments

To comprehensively evaluate the effectiveness of the proposed MD-YOLO in UAV-based object detection scenarios, extensive comparative experiments are conducted against several state-of-the-art detection models, as reported in Table 1.
Compared with the YOLOv11s baseline, MD-YOLO improves mAP@0.5 from 38.5% to 45.6%, corresponding to an increase of 7.1 percentage points, while mAP@0.5:0.95 increases from 22.8% to 27.1%, yielding a gain of 4.3 percentage points. These results indicate improved detection accuracy and stronger localization capability under stricter IoU thresholds.
Compared with lightweight one-stage detectors, including YOLOv5s, YOLOv8s, and YOLOX-s, MD-YOLO achieves higher detection accuracy with only a moderate increase in model complexity. Relative to the two-stage detector Faster R-CNN, the proposed method achieves a more favorable trade-off between detection accuracy and real-time inference efficiency.
Although MD-YOLO achieves slightly lower accuracy than SMFF-YOLO on certain evaluation metrics, the number of parameters is reduced from 99.1 M to 13.2 M, and the computational complexity is reduced from 257.7 GFLOPs to 76.5 GFLOPs, providing greater suitability for resource-constrained deployment scenarios.

5.2. Overall Performance and Scale-Wise Analysis

As shown in Figure 5 and Figure 6, from a category-level perspective, the model exhibits significant performance imbalance across different object scales.
For large objects such as car and bus, the model demonstrates strong feature representation capability. The car category achieves an AP of 0.795, maintaining consistently high precision across the recall range, indicating effective activation of high-level semantic features for large-scale targets.
In contrast, small object categories such as bicycle and awning-tricycle show substantially degraded performance, with AP values of 0.126 and 0.143, respectively, becoming the primary bottleneck of overall performance.
This phenomenon can be attributed to three main factors: information degradation caused by repeated downsampling, which leads to loss of spatial detail for small objects; background interference in complex UAV scenes, which suppresses weak object responses; and optimization bias induced by long-tailed data distribution, which favors frequent classes during training.
Although MAC and DAFA enhance multi-scale feature representation and fusion, and the P2 detection head improves high-resolution feature utilization, small object detection remains challenging under extreme scale variation and severe class imbalance.

5.3. Confidence-Driven Decision Behavior Analysis

The Precision–Confidence and Recall–Confidence curves, reported in Figure 7 and Figure 8, illustrate the behavioral dynamics of the model under different confidence thresholds.
As the confidence threshold increases, precision exhibits a monotonic increasing trend, indicating more reliable predictions under stricter conditions. Meanwhile, recall decreases accordingly, reflecting the inherent trade-off between detection completeness and prediction confidence.
Further analysis shows that large object categories remain stable across different confidence levels, whereas small object categories are highly sensitive to threshold variations, with a much faster decline in recall. This indicates that small objects generally exhibit lower confidence distributions, making it difficult for the model to produce stable high-confidence predictions.
Therefore, the performance bottleneck is not only caused by insufficient feature representation but is also closely related to confidence calibration limitations.

5.4. Error Analysis and Confusion Mechanism

As illustrated in Figure 9, the confusion matrix further reveals the error distribution patterns of the model.
For large objects such as car, the model demonstrates strong discriminative ability, with only slight confusion with visually similar classes such as van, indicating effective learning of global structural features.
In contrast, small object categories exhibit significant error concentration. Pedestrian, people, and bicycle are frequently misclassified as background, while inter-class confusion is mainly observed among semantically and structurally similar categories.
From a mechanistic perspective, this issue arises from three factors: insufficient feature activation for small objects, leading to weak discriminative signals; overlap in embedding space, resulting in unclear decision boundaries; and gradient dominance of background samples, which biases optimization toward suppressing false positives rather than improving small object recognition.
These results indicate that the core limitation lies in insufficient feature space separability rather than detector head design.

5.5. Ablation Study and Synergistic Analysis

To further analyze the impact of each component on both detection accuracy and computational efficiency, ablation studies are conducted from two perspectives, as reported in Table 2 and Table 3.
From the accuracy perspective, introducing the MAC module leads to consistent improvements in detection performance, indicating enhanced multi-scale feature extraction capability. Building upon this, the incorporation of the DAFA module further improves detection accuracy, demonstrating its effectiveness in mitigating cross-layer semantic conflicts. Finally, adding the P2 detection head achieves the best performance on small object detection, reaching 45.6% mAP@0.5, confirming the importance of high-resolution feature representation.
Regarding efficiency, the MAC module increases the parameter count from 9.6 M to 12.7 M and GFLOPs from 21.7 to 67.3, while inference speed decreases from 211.9 FPS to 141.9 FPS, yet remains suitable for real-time applications. This change is attributed to the replacement of standard convolutions with a multi-branch structure.
The DAFA module introduces a marginal increase in computational cost from 67.3 to 68.0 GFLOPs with negligible impact on inference efficiency. The P2 detection layer further increases GFLOPs to 76.5 and slightly reduces inference speed, while yielding improved detection accuracy.

5.6. Visualization

To further assess the efficacy of the proposed MD-YOLO, representative samples from the VisDrone2019 dataset were selected for a qualitative visual comparison against the baseline YOLOv11s model, as presented in Figure 10. These samples encompass several formidable challenges inherent in drone-view imagery, including high-density object clustering, heterogeneous scale variations, significant background clutter, and partial occlusions.
As illustrated in Figure 10, MD-YOLO consistently demonstrates enhanced performance in detection recall and localization precision across diverse environments. In traffic scenarios characterized by dense object distributions, such as those shown in row (b) of Figure 10, the baseline model YOLOv11s appears more prone to missed detections and imprecise bounding box alignment when faced with severe mutual overlap or scale ambiguity. In contrast, MD-YOLO achieves a higher degree of detection completeness, particularly for small-scale objects, while providing more stable bounding box regression results.
Furthermore, in crowded pedestrian areas with substantial occlusion, as seen in rows (a) and (c) of Figure 10, MD-YOLO exhibits improved discriminative capabilities. It successfully identifies overlapping targets that are often overlooked by the baseline, suggesting that the proposed architectural enhancements effectively mitigate the loss of semantic information in complex, cluttered scenes. These visual results indicate that MD-YOLO possesses stronger robustness for multi-scale object detection in aerial surveillance applications.
Figure 11 presents feature heatmaps for small objects in typical scenarios from the VisDrone2019 dataset. Compared with the YOLOv11s baseline, MD-YOLO demonstrates a substantially enhanced capability for localizing densely clustered small objects and mitigating complex background interference. In the heatmaps generated by MD-YOLO, small objects exhibit noticeably higher and more concentrated activation values, indicating that the model captures the discriminative features of these extremely small targets more effectively.

5.7. Discussion

Comprehensive experimental results demonstrate that MD-YOLO achieves competitive performance in challenging UAV detection scenarios. To further validate its effectiveness, comparisons are conducted with representative state-of-the-art methods, including SMFF-YOLO, MSEF-YOLO11 [25], and the approach presented in Learning Temporal Distribution and Spatial Correlation Towards Universal Moving Object Segmentation [26], from the perspectives of multi-scale feature modeling, cross-layer feature fusion, and accuracy–complexity trade-off.
SMFF-YOLO improves multi-scale object detection in UAV scenarios by introducing a bidirectional feature fusion pyramid, adaptive atrous spatial pyramid pooling, and an enhanced detection head. It achieves 54.3% mAP@0.5 and 33.7% mAP@0.5:0.95 on VisDrone2019. These improvements are mainly attributed to strengthened cross-layer information propagation and contextual aggregation. However, its design still relies heavily on pyramid-based aggregation and pooling operations for cross-scale compensation, while explicit intra-layer multi-scale representation remains relatively limited. In addition, its large model size of 99.1 M parameters and 257.7 GFLOPs imposes considerable deployment costs in resource-constrained environments.
In contrast, MD-YOLO adopts a collaborative multi-stage optimization strategy. The MAC module explicitly constructs intra-layer multi-scale receptive fields through parallel convolutional branches and dilated convolutions, enabling scale-aware feature extraction at the feature generation stage. The DAFA module further enhances cross-layer feature interaction through resolution matching and joint channel–spatial attention, improving semantic consistency across different feature levels. Meanwhile, the P2 detection head introduces high-resolution feature supervision to strengthen fine-grained small-object perception. With only 13.2 M parameters and 76.5 GFLOPs, MD-YOLO achieves 45.6% mAP@0.5 and 27.1% mAP@0.5:0.95, demonstrating a more favorable balance between detection accuracy and computational efficiency.
MSEF-YOLO11 integrates a lightweight partial multi-scale module, a multi-scale boundary semantic alignment module, and a shared detection head, improving mAP@0.5 from 32.0% to 38.6% over YOLOv11s. Its gains mainly stem from enhanced multi-scale feature aggregation and boundary-aware semantic refinement. However, its multi-scale modeling still largely depends on cross-layer compensation mechanisms. By comparison, MAC explicitly constructs multi-scale receptive fields within a single feature level, reducing scale ambiguity at the feature extraction stage, while DAFA further enhances complementary interactions between shallow spatial details and deep semantic information through joint attention modeling. Despite using only 13.2 M parameters compared with 13.8 M in MSEF-YOLO11, MD-YOLO achieves superior detection performance, reaching 45.6% mAP@0.5 versus 38.6%, which further validates the effectiveness of the proposed design.
Compared with the method presented in Learning Temporal Distribution and Spatial Correlation Towards Universal Moving Object Segmentation, both approaches address robust feature representation in complex visual environments through structured feature learning. The latter mainly relies on temporal distribution modeling and spatial correlation learning to enhance continuous target perception in video scenes, whereas MD-YOLO is designed for single-frame UAV detection and improves small-object detection performance through explicit multi-scale feature modeling, dual-attention cross-level feature fusion, and high-resolution detection. Although the task settings differ, both highlight the importance of structured feature modeling in complex visual perception tasks.
Overall, most existing approaches focus on isolated optimization at specific stages, such as feature fusion enhancement or detection head refinement. In contrast, MD-YOLO establishes a unified optimization framework spanning feature extraction, feature fusion, and detection stages, forming a complete multi-scale enhancement pipeline. Experimental results indicate that the proposed method achieves improved detection performance with moderate computational overhead, demonstrating its potential for practical UAV-based object detection applications.

6. Conclusions

This paper addresses the limitation of small object detection in complex UAV scenarios by proposing an improved model, termed MD-YOLO, based on the YOLOv11 architecture. The proposed approach systematically enhances the network from three key aspects: feature extraction, feature fusion, and detection prediction, through the integration of the MAC module, DAFA module, and a high-resolution P2 detection head.
In the feature extraction stage, the MAC module employs multi-scale convolutional branches combined with a lightweight channel attention mechanism to improve the representation capability across different object scales, particularly enhancing semantic representation of small objects in shallow layers. In the feature fusion stage, the DAFA module first performs feature resolution unification across multi-scale inputs, followed by a channel–spatial joint attention mechanism for adaptive feature recalibration. This improves cross-layer interaction consistency and enhances the discriminative and stable representation of fused features. In the detection stage, the P2 detection head leverages high-resolution feature maps to enable more precise localization at finer spatial granularity, substantially improving the detection accuracy and recall of small objects.
Experimental results on the VisDrone2019 dataset demonstrate that MD-YOLO consistently outperforms the baseline model across all evaluation metrics. Specifically, the proposed method achieves 45.6% mAP@0.5 and improves recall by 6.8%, validating its effectiveness and robustness in complex background and small object detection tasks.
Overall, the MAC, DAFA, and P2 detection head collaboratively enhance the model from feature representation, feature fusion, and detection scale perspectives, forming an efficient multi-scale detection framework.

Author Contributions

Conceptualization, W.Z.; Methodology, W.Z.; Software, W.Z.; Validation, W.Z.; Formal analysis, W.Z.; Investigation, W.Z.; Resources, G.G.; Data curation, W.Z.; Writing—original draft, W.Z.; Writing—review and editing, G.G.; Visualization, W.Z.; Supervision, G.G.; Project administration, G.G.; Funding acquisition, G.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Chinese Key Project for Innovation and Entrepreneurship of College Students (202410320173P) and in part by the National Natural Science Foundation of China (62401235).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Acknowledgments

The authors are grateful to the editor and the anonymous reviewers for their useful comments and advice, which were vital for improving the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhao, Z.-Q.; Zheng, P.; Xu, S.-T.; Wu, X. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232.
2. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276.
3. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
4. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37.
5. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
6. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2017; pp. 936–944.
7. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 8759–8768. Available online: https://openaccess.thecvf.com/content_cvpr_2018/html/Liu_Path_Aggregation_Network_CVPR_2018_paper.html (accessed on 28 March 2026).
8. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 10781–10790. Available online: https://openaccess.thecvf.com/content_CVPR_2020/html/Tan_EfficientDet_Scalable_and_Efficient_Object_Detection_CVPR_2020_paper.html (accessed on 28 April 2026).
9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016; pp. 779–788.
10. Zhang, R.; Pan, C.; Chen, Z.; Fan, J.; Chen, Z.; Yuan, B. An improved lightweight YOLOv11 algorithm for weld surface defect detection. Sci. Rep. 2026, 16, 11440.
11. Wang, Y.; Zou, H.; Yin, M.; Zhang, X. SMFF-YOLO: A Scale-Adaptive YOLO Algorithm with Multi-Level Feature Fusion for Object Detection in UAV Scenes. Remote Sens. 2023, 15, 4580.
12. Li, S.; Chen, C. MFA-YOLO: A multi-feature aggregation approach for small-object detection method in drone imagery. Sci. Rep. 2025, 16, 2484.
13. Wang, C.; Zhong, C. Adaptive Feature Pyramid Networks for Object Detection. IEEE Access 2021, 9, 107024–107032.
14. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–19.
15. Huang, M.-L.; Yang, Y.-T. HybridCBAMNet: Enhancing time series binary classification with convolutional recurrent networks and attention mechanisms. Measurement 2025, 241, 115746.
16. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2020; pp. 11531–11539.
17. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 7132–7141. Available online: https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper (accessed on 3 May 2026).
18. Wang, C.Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2023; pp. 7464–7475.
19. Liu, Q.; Zhang, J.; Zhao, Y.; Bu, X.; Hanajima, N. A YOLOX Object Detection Algorithm Based on Bidirectional Cross-scale Path Aggregation. Neural Process. Lett. 2024, 56, 35.
20. Yuan, X.; Fu, Z.; Zhang, B.; Xie, Z.; Gan, R. Research on lightweight algorithm for gangue detection based on improved Yolov5. Sci. Rep. 2024, 14, 6707.
21. Giri, K.J. SO-YOLOv8: A novel deep learning-based approach for small object detection with YOLO beyond COCO. Expert Syst. Appl. 2025, 280, 127447.
22. Li, J.; Feng, Y.; Shao, Y.; Liu, F. IDP-YOLOV9: Improvement of Object Detection Model in Severe Weather Scenarios from Drone Perspective. Appl. Sci. 2024, 14, 5277.
23. Gao, X.; Yu, A.; Tan, J.; Gao, X.; Zeng, X.; Wu, C. GSD-YOLOX: Lightweight and more accurate object detection models. J. Vis. Commun. Image Represent. 2024, 98, 104009.
24. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 16965–16974. Available online: https://openaccess.thecvf.com/content/CVPR2024/html/Zhao_DETRs_Beat_YOLOs_on_Real-time_Object_Detection_CVPR_2024_paper.html (accessed on 4 May 2026).
25. Zhang, K.; Zhang, P.; Ullah, F.; Zhao, Y. MSEF-YOLO11s: A multi-scale extraction and fusion network for small target detection in drone imagery. J. Supercomput. 2026, 82, 72.
26. Dong, G.; Zhao, C.; Pan, X.; Basu, A. Learning Temporal Distribution and Spatial Correlation Toward Universal Moving Object Segmentation. IEEE Trans. Image Process. 2024, 33, 2447–2461.
Figure 1. Overall architecture of the proposed MD-YOLO framework. The black arrows represent the sequential data flow throughout the network, from the backbone to the detection heads. The upward arrows in the neck denote upsampling paths for multi-scale feature fusion, and horizontal arrows indicate the flow of feature maps to the decoupled detection heads (P2–P5) for final classification and regression.
Figure 2. Overview of the MAC module architecture, including multi-scale convolutions, channel attention, and residual connections. The colored blocks represent feature maps processed by different convolutional layers or pooling operations. Arrows indicate the direction of data flow. The circled symbols C, S, and × denote operation nodes for concatenation, Sigmoid function, and channel-wise multiplication, respectively.
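To make the data flow in Figure 2 concrete, the following is a minimal PyTorch-style sketch of a multi-scale, channel-gated block with a residual connection, i.e., the concatenation (C), Sigmoid (S), and channel-wise multiplication (×) nodes depicted in the figure. The class name MACBlock, the kernel sizes, the reduction ratio, and the channel widths are illustrative assumptions rather than the authors' implementation, and the module's downsampling role in the backbone is omitted for brevity.

```python
import torch
import torch.nn as nn


class MACBlock(nn.Module):
    """Illustrative multi-scale channel-gated block (hypothetical, modeled on Figure 2)."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7), reduction: int = 4):
        super().__init__()
        branch_channels = channels // len(kernel_sizes)
        # parallel convolutions with different receptive fields (multi-scale branches)
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, branch_channels, k, padding=k // 2) for k in kernel_sizes]
        )
        fused = branch_channels * len(kernel_sizes)
        hidden = max(fused // reduction, 1)
        # channel attention: global average pooling -> bottleneck -> Sigmoid gate
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, fused, 1),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(fused, channels, 1)  # restore the input channel width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([b(x) for b in self.branches], dim=1)   # concatenation node "C"
        gated = multi_scale * self.channel_gate(multi_scale)            # "S" and "×" nodes
        return x + self.project(gated)                                  # residual connection


# quick shape check: output resolution and channels match the input
# MACBlock(64)(torch.randn(1, 64, 80, 80)).shape -> torch.Size([1, 64, 80, 80])
```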
Figure 3. Overview of the DAFA module architecture, including feature alignment, channel attention and spatial attention.
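Similarly, the sketch below illustrates the three stages named in the Figure 3 caption: resolution/channel alignment of a deeper feature map to a shallower one, followed by CBAM-style channel and spatial attention [14] on the fused result. The class name DAFABlock, the additive fusion, and the attention hyperparameters are assumptions for illustration only, not the paper's exact DAFA design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DAFABlock(nn.Module):
    """Illustrative dual-attention feature-fusion block (hypothetical, after Figure 3)."""

    def __init__(self, high_channels: int, low_channels: int, reduction: int = 16):
        super().__init__()
        # align the deep (low-resolution) feature to the shallow feature's channel width
        self.align = nn.Conv2d(low_channels, high_channels, 1)
        hidden = max(high_channels // reduction, 1)
        # channel attention: global pooling -> bottleneck MLP -> Sigmoid
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(high_channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, high_channels, 1),
            nn.Sigmoid(),
        )
        # spatial attention: 7x7 conv over pooled channel descriptors -> Sigmoid
        self.spatial_att = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # spatial alignment: upsample the deep feature to the shallow feature's resolution
        low = F.interpolate(self.align(low), size=high.shape[-2:], mode="nearest")
        fused = high + low                                   # simple additive fusion (assumed)
        fused = fused * self.channel_att(fused)              # channel attention
        pooled = torch.cat(
            [fused.mean(dim=1, keepdim=True), fused.amax(dim=1, keepdim=True)], dim=1
        )
        return fused * self.spatial_att(pooled)              # spatial attention
```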
Figure 4. Object types and size distribution in the VisDrone2019 dataset. (Top-left) Class distribution where each color corresponds to a specific category; (Top-right) Visualization of initial anchor boxes, color-coded by class; (Bottom) Spatial distribution and height-width scatter plots, where darker shading indicates a higher density of object instances.
Figure 5. F1-score versus confidence threshold curve.
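As a reading aid for Figure 5, the plotted F1 score is the harmonic mean of precision and recall at each confidence threshold. Using the full model's precision (54.2%) and recall (44.4%) reported in Table 2 as an example operating point (the exact threshold behind those values is an assumption here):

\[
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.542 \times 0.444}{0.542 + 0.444} \approx 0.488
\]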
Figure 6. Precision–Recall curve on VisDrone2019 dataset.
Figure 7. Precision variation with confidence threshold.
Figure 8. Recall variation with confidence threshold.
Figure 9. Confusion matrix of detection results.
Figure 10. Qualitative comparison of detection results between the baseline YOLOv11s and the proposed MD-YOLO on the VisDrone2019 dataset. (a) Comparison in a dense pedestrian and vehicle street scene; (b) Comparison in an urban area with extremely small and clustered objects; (c) Comparison in a low-light nighttime environment with complex illumination. Different bounding box colors represent distinct object categories, and the confidence score is displayed above each box.
Figure 11. Visual heatmap analysis of YOLOv11s and MD-YOLO. From left to right: original images, heatmaps of YOLOv11s, and heatmaps of MD-YOLO. (a) Dense street scene; (b) Distant small pedestrian; (c) Clustered objects in a sports court. Compared to YOLOv11s, MD-YOLO shows more precise feature activation on targets with significantly reduced background noise. Warmer colors indicate higher model attention.
Table 1. Comparison with state-of-the-art object detection methods on VisDrone2019 dataset.
| Model | mAP50/% | mAP50-95/% | Params (M) | GFLOPs (G) | FPS |
|---|---|---|---|---|---|
| YOLOv7 [18] | 40.9 | 24.0 | 37.2 | 105.1 | 114 |
| Faster R-CNN [3] | 38.3 | 23.9 | 41.0 | 137.1 | 12 |
| RetinaNet [19] | 26.2 | 11.2 | 36.3 | 145.7 | 17 |
| SSD [4] | 36.9 | 21.1 | 24.5 | 87.9 | 34 |
| YOLOX-s [19] | 38.5 | 19.7 | 6.3 | 12.9 | 102 |
| YOLOv7-tiny | 35.6 | 22.9 | 6.0 | 13.3 | 152 |
| YOLOv5s [20] | 36.3 | 19.5 | 7.3 | 16.5 | 124 |
| YOLOv8s [21] | 39.1 | 23.4 | 11.2 | 28.8 | 143 |
| YOLOv9s [22] | 41.6 | 25.3 | 7.2 | 26.7 | 113 |
| Gold-YOLO [23] | 40.4 | 24.2 | 75.1 | 151.5 | — |
| SMFF-YOLO | 54.3 | 33.7 | 99.1 | 257.7 | — |
| MFA-YOLOs | 42.6 | 23.9 | 9.6 | 34.0 | 85 |
| RT-DETR-R18 [24] | 47.5 | 29.3 | 19.9 | 57.0 | 130 |
| YOLO26s | 38.8 | 23.0 | 9.5 | 20.5 | 233 |
| YOLOv11s | 38.5 | 22.8 | 9.6 | 21.7 | 211.9 |
| MD-YOLO (Ours) | 45.6 | 27.1 | 13.2 | 76.5 | 123.9 |
Table 2. Ablation study of different components in the proposed MD-YOLO.
| Model | Precision/% | Recall/% | mAP50/% | mAP50-95/% |
|---|---|---|---|---|
| YOLOv11s (baseline) | 49.3 | 37.6 | 38.5 | 22.8 |
| +MAC | 53.1 | 39.8 | 40.8 | 23.9 |
| +DAFA | 51.6 | 39.0 | 39.8 | 23.3 |
| +P2 | 53.0 | 40.3 | 42.1 | 25.4 |
| +MAC + DAFA | 52.9 | 40.4 | 41.8 | 24.6 |
| +MAC + DAFA + P2 (ours) | 54.2 | 44.4 | 45.6 | 27.1 |
Table 3. Computational complexity and inference speed comparison of different models.
| Model | Params (M) | GFLOPs (G) | FPS |
|---|---|---|---|
| YOLOv11s (baseline) | 9.6 | 21.7 | 211.9 |
| +MAC | 12.7 | 67.3 | 141.9 |
| +DAFA | 9.7 | 22.0 | 184.6 |
| +P2 | 9.6 | 29.8 | 163.6 |
| +MAC + DAFA | 13.0 | 68.0 | 130.1 |
| +MAC + DAFA + P2 (ours) | 13.2 | 76.5 | 123.9 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
