1. Introduction
Unmanned Aerial Vehicle (UAV) technology has advanced rapidly and is now widely used in traffic monitoring, disaster response, agricultural inspection, and military reconnaissance [1]. UAVs carry high-resolution cameras that capture aerial images in real time. Object detection is a core visual task and is critical for UAV perception and autonomous decision-making [2]. However, drone images present dramatic variations in object scale, cluttered backgrounds, and dense small objects. Moreover, the limited computing power and battery capacity of airborne platforms impose strict constraints on detection algorithms [3].
Object detection methods generally fall into two categories: two-stage and single-stage detectors. Two-stage detectors like Faster R-CNN [4] generate region proposals followed by classification and regression. They achieve high accuracy but incur heavy computation. Single-stage detectors such as the YOLO series perform dense prediction directly on feature maps, offering superior real-time performance [5]. The YOLO family has evolved continuously from YOLOv1 to YOLOv12 [6]. YOLOv8 combined CSPNet-style blocks with a PANet neck for a better trade-off between accuracy and speed [7]; YOLOv10 further streamlined the architecture for higher efficiency [8]; YOLOv11 demonstrated improved performance through architectural upgrades [9]. Nevertheless, when applied to drone aerial images, these models’ backbones still lack sufficient multi-scale perception, and their feature pyramids respond weakly to small objects.
Many improvements have been made for drone-view detection. FPN [10] and its improved version PANet [11] enhance detection performance through multi-scale feature fusion. HRDNet uses multiple resolution branches to enhance multi-scale features [12]. Lightweight backbones such as MobileNet [13] and ShuffleNet [14] reduce computation via depthwise separable convolutions, but their accuracy degrades in complex aerial scenes. Attention mechanisms have been shown to improve feature discrimination. SENet [15] generates channel weights through global average pooling; CBAM [16] adds spatial attention; GSoP-Net [17] models channel relationships via covariance matrices. More recent works explore lightweight attention and efficient feature extraction for drone scenarios [18,19,20]. However, they typically add computational overhead. Recent surveys on real-time aerial object detection have also highlighted the importance of onboard performance optimization [21].
For drone-view object detection, existing lightweight YOLO variants exhibit specific deficiencies. YOLOv8n uses CSPNet to lower computation, but its long gradient path can weaken fine-grained features of small objects. YOLOv10n reduces channel width, which weakens cross-channel interactions needed for cluttered backgrounds. YOLOv11n improves feature aggregation with C3k2 bottlenecks but still lacks an adaptive channel recalibration mechanism. As shown later in our experiments (see Section 3.4), on the VisDrone dataset, these models achieve only 32.1% (YOLOv8n), 32.2% (YOLOv10n), and 32.4% (YOLOv11n) mAP@0.5, confirming their limited performance for small object detection. These observations motivate our design of ARB and ORFM as plug-in enhancements that address these gaps without significant parameter overhead. Recent works like RTUAV-YOLO [22] also explore lightweight enhancements.
Other recent works have also advanced UAV-based perception. Alshehri et al. [23] developed an integrated neural network framework for multi-object detection and recognition from UAV imagery, combining preprocessing, segmentation, detection, tracking, counting, trajectory prediction, and classification. In the context of small-object detection in UAV images, Sun et al. [24] proposed a scale-adaptive aggregation and multi-domain feature fusion architecture that includes a dedicated P2 detection head. However, these frameworks either involve multiple tasks beyond single-stage detection or require extra modules (e.g., segmentation, tracking), whereas MSG-YOLO focuses on lightweight RGB object detection through synergistic ARB + ORFM design.
Despite these advances, existing lightweight models still have two major shortcomings in small-object detection for drones. One is the lack of an efficient recalibration mechanism in feature extraction, which makes it difficult to focus on key targets in complex backgrounds. Another is insufficient multi-scale context aggregation, especially for small objects. Maintaining a lightweight design while improving small-object detection remains a key challenge. Our design differs from existing lightweight detectors in three ways. First, ARB embeds channel scaling directly into the residual path, avoiding extra layers. Second, ORFM uses learnable multi-branch dilated convolutions with dynamic fusion, enabling scale-adaptive receptive fields. Third, the P2 head uses a lightweight C3k2 module with a reduced expansion factor (e = 0.25) to control parameter growth. Ablation experiments validate these design choices.
We choose YOLOv11n as the baseline because it strikes a favorable balance between accuracy (32.4% mAP@0.5 on VisDrone) and model size (2.58M parameters), and its architecture is representative of mainstream lightweight single-stage detectors. In contrast, YOLOv12n achieves lower accuracy (29.1%) under the same training setting, and its newly introduced A2C2f modules may interfere with the independent evaluation of our proposed ARB and ORFM.
To tackle these problems, we propose MSG-YOLO, a lightweight detection network. Our main contributions are:
An Attention Refinement Bottleneck (ARB) that builds a feature recalibration mechanism via learnable channel scaling factors, enhancing focus on informative features with negligible parameter increase.
An Omni-Receptive Field Module (ORFM) that captures multi-scale context via parallel dilated convolutions and incorporates second-order channel attention. As shown later, ARB and ORFM exhibit a strong synergy.
A P2 small-object detection head (160 × 160) added to the feature pyramid. It fuses shallow spatial details with deep semantics, significantly improving small-object detection.
The rest of this paper is organized as follows. Section 2 details the MSG-YOLO architecture, including ARB, ORFM, and the P2 head. Section 3 describes the experimental setup, datasets, ablation studies, and comparisons. Section 4 discusses limitations, and Section 5 concludes the paper.
2. Methodology
We adopt YOLOv11n as the baseline. To handle scale variation, background clutter, and small-object detection in drone imagery, we improve three aspects: feature extraction, multi-scale perception, and detection head design.
Figure 1 shows the overall architecture. The backbone replaces the original C3k2 blocks with ARB-based C3k2_Attn modules. The SPPF module at the backbone end is replaced by ORFM. In the neck, a P2 detection head (160 × 160) is added. These three modules work together to improve accuracy while keeping the model light.
2.1. Attention Refinement Bottleneck (ARB)
The original C3k2 module in YOLOv11 consists of stacked standard residual bottlenecks that treat all channels equally; thus, it lacks feature recalibration. In complex backgrounds, this structure cannot effectively emphasize important targets. To address this, we propose the Attention Refinement Bottleneck (ARB) and arrange it alternately with ordinary bottlenecks to form C3k2_Attn, embedding channel attention without heavy computation.
Figure 2 illustrates the ARB structure. Given an input feature map, a convolution first expands the channel dimension according to the module's expansion factor. The output is split along the channel axis into two parts with the same number of channels. One part passes through a sequence of bottlenecks in which even indices use the ARB core and odd indices use a plain bottleneck. Each bottleneck keeps the same number of input and output channels and has a residual connection. The bottleneck outputs are concatenated with the other part and then fed to a final convolution that produces the output with the target number of channels.
The ARB core (inside Figure 2) works as follows. The input first goes through a convolution that compresses the channels to a hidden width. A depthwise separable convolution block then performs a depthwise convolution (groups equal to the number of hidden channels) followed by a pointwise convolution. Each convolution is followed by batch normalization and SiLU activation. Compared with a standard convolution of the same kernel size, the depthwise separable form needs only a small fraction of the parameters. Next, a convolution restores the original channel dimension. Finally, a learnable channel scaling factor α (one value per channel, initialized to 0.1) is applied element-wise to the restored feature, and the result is added to the input via a residual connection.
With only one parameter per channel, α introduces channel-wise attention into the residual branch. During training, α adaptively scales each channel, helping the network emphasize important features and suppress redundant ones. It is initialized to 0.1 to avoid disrupting pretrained features early in training. In related work, RLD-YOLO [25] improves small-target feature capture by incorporating large kernel attention.
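The channel-scaled residual at the end of the ARB core can be illustrated with a few lines of plain Python. This is only a minimal sketch, not the authors' implementation: it assumes a feature stored as one flat list of values per channel, and the function name `scaled_residual` and the toy numbers are hypothetical. It shows that with α = 0.1 the branch contributes a tenth of its activation at initialization, so the output stays close to the input.

```python
def scaled_residual(x, branch, alpha):
    """Return x + alpha * branch, using one alpha value per channel.

    x, branch: list of channels, each a flat list of floats (H*W values).
    alpha:     list with one scaling factor per channel.
    """
    if not (len(x) == len(branch) == len(alpha)):
        raise ValueError("x, branch, and alpha need the same channel count")
    out = []
    for x_c, b_c, a_c in zip(x, branch, alpha):
        # The same scalar a_c scales every spatial position of this channel.
        out.append([xv + a_c * bv for xv, bv in zip(x_c, b_c)])
    return out


# Toy example: 2 channels with 3 spatial positions each (illustrative values).
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
branch = [[10.0, 10.0, 10.0], [-20.0, 0.0, 20.0]]
alpha = [0.1, 0.1]  # initial value stated in the text

y = scaled_residual(x, branch, alpha)
print(y)
```

Because each channel has its own factor, training can raise α for informative channels and shrink it for redundant ones without adding any extra layers.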
Figure 2. Attention Refinement Bottleneck (ARB).
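The parameter saving of the depthwise separable block used in the ARB core can be checked with simple arithmetic. The sketch below counts convolution weights only (biases and batch normalization are ignored); the channel count of 64 and the 3 × 3 kernel are illustrative assumptions, not values taken from the paper. For equal input and output channels C and kernel size k, the ratio to a standard convolution is 1/C + 1/k².

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution without bias."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k conv (groups = c_in) followed by a 1 x 1 pointwise conv."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)
    pointwise = conv_params(c_in, c_out, 1)
    return depthwise + pointwise


# Illustrative sizes (assumptions for this example only).
channels, kernel = 64, 3
standard = conv_params(channels, channels, kernel)
separable = depthwise_separable_params(channels, channels, kernel)
ratio = separable / standard

print(standard, separable, round(ratio, 4))
```

For wide layers the 1/C term becomes negligible and the ratio approaches 1/k².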
2.2. Omni-Receptive Field Module (ORFM)
The original YOLOv11 backbone ends with an SPPF module that stacks max-pooling layers for multi-scale aggregation. However, its receptive field is fixed and channel relationships are not modeled, making it difficult to adapt to the large scale changes in drone images. We therefore propose the Omni-Receptive Field Module (ORFM), which employs multi-branch dilated convolutions to capture context at multiple scales and adds second-order channel attention to enhance feature discriminability.
Figure 3 shows the ORFM structure. The input first passes through a convolution that reduces the channel count from C to C_m = C/4, producing a compressed feature. Three parallel depthwise separable convolutions [26] with the same kernel size and dilation rates of 1, 2, and 3 (groups equal to the number of compressed channels) then generate three multi-scale features. To fuse these branches adaptively [27], we apply global average pooling, followed by a convolution and Softmax, to obtain dynamic weights for the branches. The fused feature is the sum of the branch outputs weighted by these coefficients, and it then undergoes batch normalization and SiLU activation. For better feature discrimination, we add a second-order channel attention module [18]. It first applies adaptive average pooling to reduce the spatial size, then computes the covariance matrix and extracts its diagonal as the variance vector. The global average pooling result (first-order descriptor) is retained. The element-wise product of the two descriptors gives a joint statistical feature, which passes through two convolutions and a Sigmoid to produce channel attention weights. These weights are multiplied element-wise with the fused feature to obtain the channel-refined feature.
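The joint first- and second-order descriptor used by the channel attention can be sketched as follows. This is a simplified illustration under stated assumptions: each channel is a flat list of spatial values, the variance is the population variance (the diagonal of the channel covariance matrix), and the two learned convolutions and the Sigmoid that follow in ORFM are omitted. The function names are hypothetical.

```python
def channel_statistics(feature):
    """Per-channel mean (first order) and variance (covariance diagonal)."""
    means, variances = [], []
    for channel in feature:
        n = len(channel)
        mean = sum(channel) / n
        var = sum((v - mean) ** 2 for v in channel) / n
        means.append(mean)
        variances.append(var)
    return means, variances


def joint_descriptor(feature):
    """Element-wise product of the mean vector and the variance vector."""
    means, variances = channel_statistics(feature)
    return [m * v for m, v in zip(means, variances)]


# Toy feature with 3 channels and 4 spatial positions (illustrative values).
feature = [
    [1.0, 1.0, 1.0, 1.0],  # constant channel: mean 1, variance 0
    [0.0, 2.0, 0.0, 2.0],  # mean 1, variance 1
    [0.0, 4.0, 0.0, 4.0],  # mean 2, variance 4
]
descriptor = joint_descriptor(feature)
print(descriptor)
```

A channel with no spatial variation receives a zero joint descriptor here, whereas a channel that is both active and spatially varied receives a large one, which is the kind of signal the subsequent attention layers can exploit.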
Next, a lightweight spatial gating module (a convolution followed by Sigmoid) generates spatial weights, which are multiplied element-wise with the channel-refined feature. A final convolution restores the original channel dimension, and a residual connection is added.
Used alone, ORFM does not improve accuracy, but when combined with ARB, a strong synergy emerges, as shown in the ablation study. In related work, CMA-Net [28] uses an efficient bi-directional feature pyramid and dual-dimensional channel attention to improve small-object detection on edge devices.
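The dynamic fusion of the three dilated branches can be sketched as a Softmax-weighted sum. The example below is a minimal illustration, assuming one scalar weight per branch and branch outputs flattened to lists; in the actual module the logits would come from pooled features through a learned convolution, which is replaced here by hand-picked numbers. All names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(branches, logits):
    """Weighted sum of same-length branch outputs using softmax weights."""
    length = len(branches[0])
    if any(len(b) != length for b in branches) or len(branches) != len(logits):
        raise ValueError("branches must share a length and match the logits")
    weights = softmax(logits)
    fused = [sum(w * b[i] for w, b in zip(weights, branches)) for i in range(length)]
    return fused, weights


# Three branch outputs (dilation 1, 2, 3), flattened to two values each.
branches = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

# Equal logits give equal weights, so the fusion is a plain average.
fused_equal, w_equal = fuse_branches(branches, [0.0, 0.0, 0.0])

# A large logit for the third branch makes it dominate the fused output.
fused_wide, w_wide = fuse_branches(branches, [0.0, 0.0, 10.0])
print(fused_equal, fused_wide)
```

Since the weights always sum to one, the fused feature stays on the same scale as the individual branches regardless of which receptive field is favored.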
Figure 3. Omni-Receptive Field Module (ORFM).
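To make the effect of the dilation rates concrete, the spatial extent covered by a single dilated kernel can be computed as k + (k - 1)(d - 1). The sketch below evaluates this for the dilation rates 1, 2, and 3 used in ORFM; the kernel size of 3 is an assumption made for illustration.

```python
def effective_kernel_size(kernel, dilation):
    """Spatial extent covered by one k x k kernel with the given dilation."""
    if kernel < 1 or dilation < 1:
        raise ValueError("kernel and dilation must be positive")
    return kernel + (kernel - 1) * (dilation - 1)


KERNEL = 3  # assumed kernel size for this example
extents = {d: effective_kernel_size(KERNEL, d) for d in (1, 2, 3)}
print(extents)  # {1: 3, 2: 5, 3: 7}
```

Under this assumption the three branches cover 3 × 3, 5 × 5, and 7 × 7 neighborhoods with the same number of weights per branch, which is how the module widens context without growing the parameter count.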
2.3. P2 Small-Object Detection Head
Small objects account for a large proportion of drone images. The original YOLOv11 has three detection heads (P3: 80 × 80, P4: 40 × 40, P5: 20 × 20), which are not sensitive enough to extremely small targets. We therefore add a P2 detection head (160 × 160) to the feature pyramid, dedicated to capturing fine details of small objects.
Figure 4 illustrates the construction of the P2 head. Starting from the neck’s P3 layer (layer 16, 80 × 80), we apply 2× upsampling to obtain a 160 × 160 feature map. This map is then concatenated with the backbone’s second layer (P2, 160 × 160), merging shallow spatial information with deeper semantics. To control parameter growth, a lightweight C3k2 module (128 channels, n = 1, e = 0.25) refines the features, and its output serves as the P2 detection head. Thus, the network performs detection on four scale levels, greatly improving small-object response. MSCM-YOLO [24] uses a dedicated P2 detection head to preserve high-resolution features for small targets.
The P2 head adds about 0.08M parameters and 3.5 GFLOPs. Ablation shows that adding the P2 head alone increases mAP by 2.4 points. When combined with ARB and ORFM, it improves mAP by a further 2.0 points over the ARB + ORFM combination, confirming its effectiveness for small objects. The expansion factor e = 0.25 was selected because it offers the best trade-off between parameter increase and detection gain.
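The grid sizes of the four detection heads follow directly from the input resolution and each head's stride. The sketch below assumes the 640 × 640 input used in the experiments and strides of 4, 8, 16, and 32, which reproduce the 160, 80, 40, and 20 grids quoted above; the function name is hypothetical.

```python
def head_grid_sizes(input_size, strides):
    """Return {head: (grid side, number of cells)} for a square input."""
    grids = {}
    for name, stride in strides.items():
        if input_size % stride:
            raise ValueError(f"{input_size} is not divisible by stride {stride}")
        side = input_size // stride
        grids[name] = (side, side * side)
    return grids


# Strides assumed from the grid sizes stated in the text.
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
grids = head_grid_sizes(640, STRIDES)

cells_without_p2 = sum(cells for name, (_, cells) in grids.items() if name != "P2")
cells_with_p2 = sum(cells for _, cells in grids.values())

for name, (side, cells) in grids.items():
    print(f"{name}: {side} x {side} = {cells} cells")
print(cells_without_p2, cells_with_p2)
```

Under these assumptions the P2 grid alone has about three times as many cells as the other three heads combined, which helps explain both the finer localization of small objects and the extra computation reported for this head.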
3. Experiments and Results Analysis
3.1. Datasets
We evaluate MSG-YOLO on VisDrone2019-DET as the main dataset and use HazyDet for generalization tests.
VisDrone2019-DET [3] is a widely used benchmark for drone-view object detection. It contains 10 categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor, with image resolutions of up to 2000 × 1500 pixels. The dataset has 8629 images: 6471 for training, 548 for validation, and 1610 for testing. The images were captured in urban, rural, and campus environments under various lighting and weather conditions. It contains many small, densely packed objects with large variations in scale, along with occlusions, illumination changes, and motion blur.
To evaluate low-visibility robustness, we also use the public dataset HazyDet [29], which was designed for drone-view object detection in hazy scenes. It contains approximately 383,000 real-world instances, including both real hazy images and clear images with synthetic fog. We select 1000 representative images of cars, trucks, and buses under haze, rain, and low-light conditions. These images have low contrast and blurry boundaries, making them suitable for testing our attention-based background suppression and the P2 head’s performance under adverse conditions.
3.2. Experimental Setup and Evaluation Metrics
All experiments were run under the same environment (Table 1). System: Linux, Intel Xeon Silver 4310 (6 vCPUs @ 2.10 GHz), NVIDIA GeForce RTX 3090 (24 GB). Framework: PyTorch 2.4.1, Python 3.10.18, CUDA 11.8. To find the best learning rate, we tested two values: 0.01 and 0.001 (Figure 5). The model with 0.01 converged more steadily and gave higher validation accuracy than the one with 0.001, which made convergence too slow. We therefore set the learning rate to 0.01, under which the training curve was smooth with no obvious oscillations.
We also examined the effect of training epochs by comparing 100, 150, and 200 epochs (Figure 6). The results showed that the baseline model had essentially converged by the 150th epoch, and continuing training to 200 epochs yielded only limited performance improvements. Considering both training efficiency and performance, the number of training epochs was ultimately set to 150. To ensure fairness in subsequent experimental comparisons, both the baseline and the improved models were trained under identical settings. The batch size was set to 32, and the AdamW optimizer was used [30]. Specific parameter configurations are shown in Table 2.
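To comprehensively evaluate model performance, precision, recall, average precision (AP), and mean average precision (mAP) are used as evaluation metrics for detection accuracy [31,32]. The calculation formulas are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{AP} = \int_{0}^{1} P(R)\,\mathrm{d}R$$
$$\mathrm{mAP} = \frac{1}{C}\sum_{i=1}^{C} \mathrm{AP}_i$$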
Here, TP represents the number of correctly detected positive samples, FP represents the number of negative samples incorrectly classified as targets, and precision reflects the accuracy of the detection results; FN represents the number of missed positive samples, and recall reflects the model’s ability to detect positive samples. AP is obtained by integrating the precision-recall curve and comprehensively evaluates detection performance for a single class. mAP is the average of the APs across all classes, and C is the total number of classes. This paper further adopts the COCO-style evaluation metric as the primary performance benchmark [33]. This metric calculates AP across multiple levels of Intersection over Union (IoU) thresholds ranging from 0.5 to 0.95 with a step size of 0.05, then takes the average, thereby providing a more comprehensive reflection of the model’s detection capabilities under varying localization accuracy requirements.
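The metric definitions above can be turned into a short, self-contained sketch. It assumes that detections have already been matched to ground truth at a fixed IoU threshold and sorted by descending confidence, and it integrates the precision-recall curve with the common all-point interpolation; this is one widely used convention and not necessarily the exact evaluation code behind the reported numbers. All function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_tp, num_gt):
    """Area under the precision-recall curve for one class.

    is_tp:  detection outcomes sorted by descending confidence
            (True = matched a ground-truth box, False = false positive).
    num_gt: number of ground-truth boxes of this class.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    tp = fp = 0
    recalls, precisions = [], []
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing from right to left (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP = mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# IoU thresholds 0.50, 0.55, ..., 0.95 used by the COCO-style metric.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

p, r = precision_recall(tp=8, fp=2, fn=2)
ap_example = average_precision([True, False, True], num_gt=2)
map_example = mean_average_precision([ap_example, 0.5])
print(p, r, round(ap_example, 4), round(map_example, 4), IOU_THRESHOLDS)
```

For the COCO-style metric, the same AP computation would be repeated at each threshold in `IOU_THRESHOLDS` and the results averaged.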
Regarding model complexity, this paper employs three metrics: the number of parameters (Params), the number of floating-point operations (FLOPs), and the inference speed (FPS). Params is the total number of trainable parameters in the model, reported in millions (M), and characterizes the model’s storage footprint and memory requirements. FLOPs count the floating-point operations required for a single forward pass, reported in GFLOPs, and reflect the model’s computational complexity. FPS is measured on an NVIDIA RTX 3090 GPU with batch size 1 and indicates how many frames the model can process per second; it directly reflects the model’s suitability for real-time drone applications.
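A frames-per-second figure of this kind is typically obtained by timing repeated single-image forward passes after a warm-up phase. The sketch below shows the general pattern with a dummy workload standing in for the model; the function names, warm-up count, and iteration count are assumptions for illustration and do not describe the authors' measurement script. When timing on a GPU, the device would also need to be synchronized before reading the clock.

```python
import time


def measure_fps(run_once, warmup=5, iters=50):
    """Time `iters` calls of `run_once` after `warmup` untimed calls."""
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("timer resolution too coarse for this workload")
    return iters / elapsed


def dummy_inference():
    """Stand-in for one forward pass with batch size 1."""
    return sum(i * i for i in range(1000))


fps = measure_fps(dummy_inference)
print(f"{fps:.1f} FPS")
```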
3.3. Ablation Studies
To gain a deeper understanding of the role and performance of each improvement, ablation experiments were conducted on VisDrone2019. The baseline model is YOLOv11n, which achieves an mAP@0.5 of 32.4% on the validation set, with 2.58 million parameters and a computational cost of 6.3 GFLOPs. The experiments recorded the performance changes after sequentially adding each module; the specific results are shown in Table 3.
Table 3 shows that adding ARB alone raises mAP@0.5 by 1.9 points (from 32.4% to 34.3%), confirming the value of channel attention for feature recalibration. In contrast, using ORFM by itself slightly hurts performance: mAP@0.5 drops to 32.1% while parameters decrease by 0.13M. Without the guidance of channel attention, ORFM’s aggressive dimensionality reduction (C_m = C/4) likely discards some useful information, and its receptive field benefits cannot compensate for this loss.
When ARB and ORFM are combined, however, a clear synergy appears. The joint model reaches 34.7% mAP@0.5, a 2.3-point gain over the baseline and 0.4 points higher than ARB alone. Moreover, the parameter count falls to 2.61M, which is lower than with ARB alone (2.74M), and the computational cost also drops slightly. ARB applies learnable channel scaling factors to modulate feature maps, suppressing redundant channels and highlighting informative ones. This refined feature representation provides a solid foundation for ORFM’s multi-scale processing. ORFM then uses parallel dilated convolutions and second-order channel attention to capture context at appropriate scales, recovering details that would otherwise be lost during dimension reduction.
These results indicate that ARB and ORFM play complementary roles. ARB improves channel-wise feature selection, whereas ORFM enhances multi-scale feature aggregation. Their combined performance exceeds the sum of their individual gains, and the parameter reduction comes from ORFM’s efficient design (replacing SPPF with a larger channel reduction ratio). This complementary design explains why the ARB + ORFM combination achieves both higher accuracy and lower complexity.
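The synergy claim can be verified directly from the mAP@0.5 values quoted in this section. The short check below uses only those numbers; the variable and function names are hypothetical.

```python
BASELINE_MAP50 = 32.4  # YOLOv11n, mAP@0.5 in percent

# mAP@0.5 values quoted in the ablation discussion.
MAP50 = {"ARB": 34.3, "ORFM": 32.1, "ARB+ORFM": 34.7}


def gain(name):
    """Gain over the baseline in percentage points, rounded to one decimal."""
    return round(MAP50[name] - BASELINE_MAP50, 1)


sum_of_individual_gains = round(gain("ARB") + gain("ORFM"), 1)
combined_gain = gain("ARB+ORFM")
synergy = round(combined_gain - sum_of_individual_gains, 1)

print(gain("ARB"), gain("ORFM"), sum_of_individual_gains, combined_gain, synergy)
```

The individual gains sum to 1.6 points, while the combined configuration gains 2.3 points, so about 0.7 points come from the interaction of the two modules.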
Finally, adding the P2 head on top of ARB + ORFM lifts mAP@0.5 to 36.7%, a further 2.0-point improvement. This confirms that the P2 head supplies high-resolution spatial details that the other two modules cannot provide, and thus enhances multi-scale object detection. FPA-YOLOv8s [34] introduces a P2-like detection head and achieves notable gains on VisDrone, while DRONet [35] further confirms the importance of high-resolution features for small and occluded objects.
Table 3 also reports the inference speed (FPS) measured on an NVIDIA RTX 3090 GPU with batch size 1. The baseline YOLOv11n achieves 194.13 FPS. Adding ARB alone slightly raises the speed to 196.14 FPS. ORFM alone lowers it to 185.41 FPS because of its multi-branch dilated convolutions. When ARB and ORFM are combined, the speed drops further to 184.00 FPS, which is still well above the 30 FPS needed for real-time drone applications. Finally, adding the P2 head increases the computational load and lowers the speed to 150.02 FPS. Although this is a notable reduction, the model remains fully capable of real-time processing on high-end GPUs, and the accuracy gain (+2.0 points) justifies this trade-off. For resource-constrained edge devices, one can optionally remove the P2 head to recover a higher speed while still benefiting from ARB + ORFM.
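The FPS figures are easier to compare against a real-time budget when converted to per-frame latency. The sketch below uses the speeds quoted above and the 30 FPS threshold mentioned in the text; the configuration labels and helper name are hypothetical.

```python
# FPS values quoted in the text (RTX 3090, batch size 1).
FPS = {
    "YOLOv11n": 194.13,
    "+ARB": 196.14,
    "+ORFM": 185.41,
    "+ARB+ORFM": 184.00,
    "+ARB+ORFM+P2": 150.02,
}
REALTIME_FPS = 30.0


def latency_ms(fps):
    """Average time per frame in milliseconds."""
    return 1000.0 / fps


budget_ms = latency_ms(REALTIME_FPS)
for name, fps in FPS.items():
    print(f"{name}: {latency_ms(fps):.2f} ms per frame (budget {budget_ms:.2f} ms)")

extra_ms = latency_ms(FPS["+ARB+ORFM+P2"]) - latency_ms(FPS["YOLOv11n"])
print(f"added latency of the full model: {extra_ms:.2f} ms")
```

On this GPU the full model adds roughly 1.5 ms per frame over the baseline, which is small relative to the 33 ms budget implied by 30 FPS; the margin on embedded hardware would need to be measured separately.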
3.4. Model Comparison Experiments
We compare MSG-YOLO with several mainstream lightweight object detection networks, including YOLOv5n, YOLOv6n, YOLOv7-tiny, YOLOv8n, YOLOv10n, YOLOv11n, YOLOv12n, and YOLOv26n.
All YOLO-series baselines (YOLOv5n, YOLOv6n, YOLOv7-tiny, YOLOv8n, YOLOv10n, YOLOv11n, YOLOv12n, YOLOv26n) and our MSG-YOLO were trained from scratch on the VisDrone2019 training set using identical settings: AdamW optimizer, initial learning rate 0.01, weight decay 0.0005, momentum 0.937, batch size 32, input resolution 640 × 640, and 150 epochs. The results are shown in Table 4.
The results in Table 4 show that the proposed MSG-YOLO achieves 36.7% on mAP@0.5, ranking first among lightweight models with fewer than 3 million parameters. Compared to YOLOv11n, it achieves a 4.3 percentage point improvement; compared to YOLOv26n, it is 3.9 percentage points higher; and compared to YOLOv10n, it is 4.5 percentage points higher.
Compared with earlier YOLO versions, MSG-YOLO also shows clear advantages. It outperforms YOLOv5n (31.7% mAP@0.5) by 5.0 points while using only 0.19M more parameters. YOLOv6n, despite having a larger parameter count (4.23M), reaches only 29.0% mAP@0.5, 7.7 points lower than MSG-YOLO. YOLOv7-tiny, with 6.03M parameters (more than twice that of MSG-YOLO), achieves 33.4% mAP@0.5, still 3.3 points behind.
For models with similar parameter counts, YOLOv26n has only 2.37 million parameters, 0.32 million fewer than MSG-YOLO, but its mAP@0.5 is 3.9 percentage points lower. YOLOv10n has 2.26 million parameters, 0.43 million fewer than MSG-YOLO, yet its mAP@0.5 is 4.5 percentage points lower. This indicates that a modest parameter increase can bring significant accuracy gains, and MSG-YOLO effectively balances accuracy and efficiency. Comparative studies of YOLO variants for UAV-based infrastructure inspection also show that lightweight models can achieve competitive accuracy with appropriate design [36]. Benchmarking lightweight YOLO models on edge devices confirms that favorable speed-accuracy trade-offs can be achieved for small-object detection [37]. EBAD-YOLO [38] achieves 35.9% mAP@0.5 on VisDrone using a bidirectional adaptive dense FPN. The ESO-Det [39] detector achieves strong performance on VisDrone through a dense cross-branch complementary module.
Overall, MSG-YOLO achieves the highest detection accuracy among the compared models with fewer than 3M parameters at a modest cost in parameters and computation, which makes it a practical candidate for deployment.
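The comparison in this section can be reproduced from the figures quoted in the text. In the sketch below, two entries are derived by arithmetic from stated differences rather than quoted directly (the YOLOv5n parameter count and the YOLOv26n mAP@0.5), and YOLOv8n and YOLOv12n are omitted because their parameter counts are not given in the text. The helper name is hypothetical.

```python
# (parameters in millions, mAP@0.5 in percent)
MODELS = {
    "YOLOv5n": (2.50, 31.7),      # params derived: 2.69 - 0.19
    "YOLOv6n": (4.23, 29.0),
    "YOLOv7-tiny": (6.03, 33.4),
    "YOLOv10n": (2.26, 32.2),
    "YOLOv11n": (2.58, 32.4),
    "YOLOv26n": (2.37, 32.8),     # mAP derived: 36.7 - 3.9
    "MSG-YOLO": (2.69, 36.7),
}


def best_under(models, max_params_m):
    """Name of the most accurate model below a parameter budget."""
    eligible = {n: v for n, v in models.items() if v[0] < max_params_m}
    return max(eligible, key=lambda n: eligible[n][1])


best = best_under(MODELS, 3.0)
margins = {
    name: round(MODELS["MSG-YOLO"][1] - acc, 1)
    for name, (_, acc) in MODELS.items()
    if name != "MSG-YOLO"
}
print(best, margins)
```

The margins printed by this check match the differences stated in the text, for example 4.3 points over YOLOv11n and 7.7 points over YOLOv6n.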
3.5. Scale-Wise Performance Analysis
To investigate the effectiveness of the proposed modules on objects of different sizes, we compare the per-category AP50 improvements between the baseline YOLOv11n and MSG-YOLO. The results are presented in Table 5. The categories are grouped into small, medium, and large scales following the COCO definition.
As shown in the table, the proposed MSG-YOLO achieves substantial gains on small objects. For instance, pedestrian AP50 increases by 8.5 percentage points (from 34.8% to 43.3%), people by 7.4 points (27.3% → 34.7%), and motor by 6.5 points (36.1% → 42.6%). The average improvement across the six small-object categories (pedestrian, people, bicycle, tricycle, awning-tricycle, motor) is approximately 4.7 percentage points, whereas the average improvement for medium-sized objects (car, van, bus) is 3.5 points, and for the large object (truck) it is 4.7 points. These results indicate that the P2 detection head, together with ARB and ORFM, is particularly beneficial for small-scale targets, which are the most challenging in drone-captured scenes. The gain on medium-sized objects is positive but smaller, and the gain on the single large category matches the small-object average, confirming that the proposed modules do not sacrifice performance on larger instances while significantly boosting small-object detection.
This scale-wise analysis further supports the design rationale of MSG-YOLO: the P2 head provides high-resolution spatial details essential for small objects, and the synergistic effect of ARB and ORFM enhances multi-scale feature representation, leading to balanced improvements across all object sizes.
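Because mAP@0.5 is the mean of the per-category AP50 values, the group averages above should be consistent with the overall 4.3-point gain. The check below weights each group's average gain by its number of categories; the group gains are the rounded values quoted in the text, so the result is only approximate.

```python
# (number of categories, average AP50 gain in percentage points)
GROUPS = {
    "small": (6, 4.7),
    "medium": (3, 3.5),
    "large": (1, 4.7),
}
REPORTED_MAP50_GAIN = 4.3  # MSG-YOLO versus YOLOv11n


def category_weighted_mean(groups):
    """Mean gain over all categories, reconstructed from group averages."""
    total = sum(count for count, _ in groups.values())
    return sum(count * gain for count, gain in groups.values()) / total


overall = category_weighted_mean(GROUPS)
print(round(overall, 2), REPORTED_MAP50_GAIN)
```

The reconstructed mean of about 4.34 points agrees with the reported overall gain to within rounding.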
3.6. Generalization Experiments
To verify MSG-YOLO’s generalization capabilities in low-visibility environments, generalization experiments were conducted on the HazyDet public dataset. YOLOv11n and MSG-YOLO models used identical parameter settings; the experimental results are shown in Table 6.
In tests on the HazyDet dataset, which features blurry images and low contrast and is specifically designed to evaluate detection performance under adverse weather conditions, MSG-YOLO achieved an mAP@0.5 of 71.5%, 3.1 percentage points higher than YOLOv11n, indicating that the model can still detect objects reliably under low-visibility conditions. This validates the attention-enhancing effect of the ARB module on low-contrast targets, as well as the effectiveness of the ORFM’s multi-scale context aggregation in foggy and rainy environments. The P2 detection head’s ability to capture small, blurred targets was also confirmed by the experiments. UFDT-YOLO [40] addresses hazy conditions by integrating an image restoration module and feature attention mechanism, achieving robust detection on HazyDet. On this dataset, the inference speed of our model drops from 226.04 FPS to 179.15 FPS, which remains well above real-time requirements (30 FPS), confirming its practical usability even under challenging weather conditions.
3.7. Heatmap Analysis
To visually validate the improved model’s performance in feature focusing, we visualized the feature maps using the Grad-CAM method [41]. Three images were selected: two from the VisDrone validation set and one low-visibility image from HazyDet. The comparison was made between the baseline YOLOv11n and our MSG-YOLO, with heatmaps extracted from the Detect output layer. The images are presented in pseudocolor, where deeper red indicates higher model attention.
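For reference, the standard Grad-CAM computation behind such heatmaps can be written as follows. The symbols are introduced here for illustration: y^c is the score for class c, A^k is the k-th feature map of the chosen layer with Z spatial positions, and w_k^c is the resulting channel weight (written w rather than the usual α to avoid confusion with the ARB scaling factor).

```latex
% Channel weight: spatial average of the gradient of the class score
w_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A^{k}_{ij}}

% Heatmap: ReLU of the weighted sum of feature maps
L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} w_k^{c} \, A^{k} \right)
```

The ReLU keeps only the locations that contribute positively to the class score, which is why background regions with no supporting evidence appear cold in the pseudocolor maps.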
Figure 7 presents a comparison of the three sets of heatmaps. As shown in the figure, the baseline model exhibits relatively low heatmap responses, with responses also appearing in background areas, indicating that it is easily affected by environmental noise and rarely responds to distant objects. In contrast, the improved model’s heatmap responses are primarily concentrated on the objects, with extremely high responses even for distant targets.
Figure 7a shows a scene with pedestrians and vehicles of varying sizes. The baseline model exhibits extremely low overall response and is easily affected by environmental noise, whereas the improved model shows very high response to both near and far targets.
Figure 7b depicts a low-light, high-traffic scene. The baseline model has a low response to near targets and virtually no response to far targets, while the improved model demonstrates significantly enhanced response to small-sized targets.
Figure 7c presents a low-visibility image. Due to blurred edges, the baseline model generates a large number of low-response targets in the hazy areas, whereas the improved model still manages to concentrate the heatmap on the vehicles. This demonstrates that the ARB module’s attention enhancement for low-contrast targets and the ORFM’s multi-scale context aggregation remain effective under adverse weather conditions.
The above heatmap comparison confirms the improved model’s advantages in feature focusing and small-object response. These findings are corroborated by data from the ablation experiments, further validating the synergistic effects of ARB and ORFM, as well as the effectiveness of the P2 head for small-object detection.
3.8. Analysis of Visualization Results
To visually demonstrate the actual detection performance of MSG-YOLO, we selected three representative images from the VisDrone2019 validation set and one low-visibility image from HazyDet for comparative analysis.
Figure 8 compares the detection results of the baseline YOLOv11n and our complete model in the same scene.
Figure 8a shows a complex street scene. The baseline model missed pedestrians and vehicles in the distance, and the car front in the lower-left corner was also missed; our model successfully detected these objects.
Figure 8b depicts a low-light, densely populated vehicle scene. Our improved model not only detected small objects in the distance that the baseline model missed, but also achieved higher accuracy for nearby objects.
Figure 8c presents a complex environment. The baseline model misclassified a nearby awning as a truck; our model did not produce such misclassifications and detected more small distant targets.
Figure 8d shows a comparison under strong lighting. In this setting, the baseline model exhibited numerous omissions and misclassifications, whereas the improved model did not suffer from these issues and achieved higher accuracy.
5. Conclusions
To handle large scale variation, background clutter, and deployment constraints in drone imagery, we propose MSG-YOLO, a lightweight detector for small and dense objects. We tackle these challenges through three architectural changes. First, we designed an Attention Refinement Bottleneck (ARB) that integrates depthwise separable convolutions and learnable channel scaling factors into the residual structure, adding channel attention to the network with a minimal number of parameters. Second, we designed an Omni-Receptive Field Module (ORFM) that uses multi-branch dilated convolutions to capture context at different scales, further enhanced by second-order channel attention. Third, we added a 160 × 160 detection layer within the feature pyramid specifically to handle small objects.
Experiments show that the combined use of ARB and ORFM produces a synergistic effect. ARB alone improved mAP@0.5 by 1.9 points and ORFM alone did not improve it, whereas combining the two gave a further 0.4 percentage point increase in mAP@0.5 over ARB alone, with 0.13 million fewer parameters. On VisDrone2019, MSG-YOLO has 2.69 million parameters, a computational cost of 11.6 GFLOPs, and an mAP@0.5 of 36.7%, which is 4.3 percentage points higher than YOLOv11n. Ablation experiments also confirmed that ARB, ORFM, and the P2 head each contribute within the full model. In terms of generalization, the model outperformed the baseline by 3.1 percentage points on the HazyDet public dataset, validating its robustness in low-visibility scenarios.
Future work includes exploring lighter architectures, incorporating temporal information, and evaluating real-time performance on embedded platforms [42].