MSG-YOLO: A Multi-Scale Statistical-Guided Network for Drone Object Detection

Liu, Jianhua; Wang, Yanfei

doi:10.3390/app16126050


by Jianhua Liu and Yanfei Wang *

School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450046, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 6050; https://doi.org/10.3390/app16126050 (registering DOI)

Submission received: 12 May 2026 / Revised: 3 June 2026 / Accepted: 9 June 2026 / Published: 15 June 2026


Abstract

Object detection in drone-captured images must handle large variations in object scale, cluttered backgrounds, and limited onboard computing power. To address these issues, this paper presents a lightweight detector named MSG-YOLO. The improvements lie in three aspects: feature extraction, multi-scale perception, and detection head design. We introduce an Attention Refinement Bottleneck (ARB) that combines depthwise separable convolutions with learnable channel scaling factors to achieve effective feature recalibration at very low parameter cost. We design an Omni-Receptive Field Module (ORFM) that employs multi-branch dilated convolutions to capture multi-scale context and incorporates second-order channel attention for better feature discrimination. A P2 detection head (160 × 160) is further added to the feature pyramid to enhance small-object response. Experiments show that ARB and ORFM work synergistically: each alone brings limited or no gain, but together they improve mAP@0.5 by 0.4 percentage points over ARB alone while reducing parameters by 0.13M. On VisDrone2019, MSG-YOLO achieves 36.7% mAP@0.5, outperforming the baseline YOLOv11n by 4.3 points while using only 2.69M parameters. On the HazyDet benchmark, the method outperforms the baseline by 3.1 points, confirming robustness under low visibility. Ablation studies and visual analyses further reveal how each component contributes.

Keywords:

object detection; lightweight network; attention mechanism; multi-scale fusion; synergistic effect

1. Introduction

Unmanned Aerial Vehicle (UAV) technology has advanced rapidly and is now widely used in traffic monitoring, disaster response, agricultural inspection, and military reconnaissance [1]. UAVs carry high-resolution cameras that capture aerial images in real time. Object detection is a core visual task and is critical for UAV perception and autonomous decision-making [2]. However, drone images present dramatic variations in object scale, cluttered backgrounds, and dense small objects. Moreover, the limited computing power and battery capacity of airborne platforms impose strict constraints on detection algorithms [3].

Object detection methods generally fall into two categories: two-stage and single-stage detectors. Two-stage detectors like Faster R-CNN [4] generate region proposals followed by classification and regression. They achieve high accuracy but incur heavy computation. Single-stage detectors such as the YOLO series perform dense prediction directly on feature maps, offering superior real-time performance [5]. The YOLO family has evolved continuously from YOLOv1 to YOLOv12 [6]. YOLOv8 introduced CSPNet and PANet for a better trade-off between accuracy and speed [7]; YOLOv10 further streamlined the architecture for higher efficiency [8]; YOLOv11 demonstrated improved performance through architectural upgrades [9]. Nevertheless, when applied to drone aerial images, these models’ backbones still lack sufficient multi-scale perception. Moreover, their feature pyramids respond weakly to small objects.

Many improvements have been made for drone-view detection. FPN [10] and its improved version PANet [11] enhance detection performance through multi-scale feature fusion. HRDNet uses multiple resolution branches to enhance multi-scale features [12]. Lightweight backbones such as MobileNet [13] and ShuffleNet [14] reduce computation via depthwise separable convolutions, but their accuracy degrades in complex aerial scenes. Attention mechanisms have been shown to improve feature discrimination. SENet [15] generates channel weights through global average pooling; CBAM [16] adds spatial attention; GSoP-Net [17] models channel relationships via covariance matrices. More recent works explore lightweight attention and efficient feature extraction for drone scenarios [18,19,20]. However, they typically add computational overhead. Recent surveys on real-time aerial object detection have also highlighted the importance of onboard performance optimization [21].

For drone-view object detection, existing lightweight YOLO variants exhibit specific deficiencies. YOLOv8n uses CSPNet to lower computation, but its long gradient path can weaken fine-grained features of small objects. YOLOv10n reduces channel width, which weakens cross-channel interactions needed for cluttered backgrounds. YOLOv11n improves feature aggregation with C3k2 bottlenecks but still lacks an adaptive channel recalibration mechanism. As shown later in our experiments (see Section 3.4), on the VisDrone dataset, these models achieve only 32.1% (YOLOv8n), 32.2% (YOLOv10n), and 32.4% (YOLOv11n) mAP@0.5, confirming their limited performance for small object detection. These observations motivate our design of ARB and ORFM as plug-in enhancements that address these gaps without significant parameter overhead. Recent works like RTUAV-YOLO [22] also explore lightweight enhancements.

Other recent works have also advanced UAV-based perception. Alshehri et al. [23] developed an integrated neural network framework for multi-object detection and recognition from UAV imagery, combining preprocessing, segmentation, detection, tracking, counting, trajectory prediction, and classification. In the context of small-object detection in UAV images, Sun et al. [24] proposed a scale-adaptive aggregation and multi-domain feature fusion architecture that includes a dedicated P2 detection head. However, these frameworks either involve multiple tasks beyond single-stage detection or require extra modules (e.g., segmentation, tracking), whereas MSG-YOLO focuses on lightweight RGB object detection through synergistic ARB + ORFM design.

Despite these advances, existing lightweight models still have two major shortcomings in small-object detection for drones. One is the lack of an efficient recalibration mechanism in feature extraction, which makes it difficult to focus on key targets in complex backgrounds. Another is insufficient multi-scale context aggregation, especially for small objects. Maintaining a lightweight design while improving small-object detection remains a key challenge. Our design differs from existing lightweight detectors in three ways. First, ARB embeds channel scaling directly into the residual path, avoiding extra layers. Second, ORFM uses learnable multi-branch dilated convolutions with dynamic fusion, enabling scale-adaptive receptive fields. Third, the P2 head uses a lightweight C3k2 module with a reduced expansion factor (e = 0.25) to control parameter growth. Ablation experiments validate these design choices.

We choose YOLOv11n as the baseline because it strikes a favorable balance between accuracy (32.4% mAP@0.5 on VisDrone) and model size (2.58M parameters), and its architecture is representative of mainstream lightweight single-stage detectors. In contrast, YOLOv12n achieves lower accuracy (29.1% mAP@0.5) under the same training setting, and its newly introduced A2C2f modules may interfere with the independent evaluation of our proposed ARB and ORFM.

To tackle these problems, we propose MSG-YOLO, a lightweight detection network. Our main contributions are:

An Attention Refinement Bottleneck (ARB) that builds a feature recalibration mechanism via learnable channel scaling factors, enhancing focus on informative features with negligible parameter increase.
An Omni-Receptive Field Module (ORFM) that captures multi-scale context via parallel dilated convolutions and incorporates second-order channel attention. As shown later, ARB and ORFM exhibit a strong synergy.
A P2 small-object detection head (160 × 160) added to the feature pyramid, which fuses shallow spatial details with deep semantics and significantly improves small-object detection.

The rest of this paper is organized as follows. Section 2 details the MSG-YOLO architecture, including ARB, ORFM, and the P2 head. Section 3 describes the experimental setup, datasets, ablation studies, and comparisons. Section 4 discusses limitations, and Section 5 concludes the paper.

2. Methodology

We adopt YOLOv11n as the baseline. To handle scale variation, background clutter, and small-object detection in drone imagery, we improve three aspects: feature extraction, multi-scale perception, and detection head design. Figure 1 shows the overall architecture. The backbone replaces the original C3k2 blocks with ARB-based C3k2_Attn modules. The SPPF module at the backbone end is replaced by ORFM. In the neck, a P2 detection head (160 × 160) is added. These three modules work together to improve accuracy while keeping the model light.

2.1. Attention Refinement Bottleneck (ARB)

The original C3k2 module in YOLOv11 consists of stacked standard residual bottlenecks that treat all channels equally; thus, it lacks feature recalibration. In complex backgrounds, this structure cannot effectively emphasize important targets. To address this, we propose the Attention Refinement Bottleneck (ARB) and arrange it alternately with ordinary bottlenecks to form C3k2_Attn, embedding channel attention without heavy computation.

Figure 2 illustrates the ARB structure. Given input $X \in \mathbb{R}^{C_{1} \times H \times W}$, a $1 \times 1$ convolution expands the channel dimension to $2 C_{hid}$, where $C_{hid} = \lfloor C_{2} \cdot e \rfloor$ and the expansion factor is $e = 0.5$. The output is split along channels into $Y_{0}$ and $Y_{1}$, each with $C_{hid}$ channels. $Y_{1}$ passes through a sequence $M$ of $n$ bottlenecks: even indices use the ARB core, odd indices use a plain bottleneck. Each bottleneck has input and output channels $C_{hid}$ with a residual connection. The outputs $B_{1}, \dots, B_{n}$ are concatenated with $Y_{0}$ and then fed to a $1 \times 1$ convolution $cv_{2}$ that produces the final output with $C_{2}$ channels.
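The channel bookkeeping of C3k2_Attn can be traced with a few lines of Python. This is a minimal sketch based only on the description above: the function name `c3k2_attn_layout` is hypothetical, indices are assumed to start at 0 (so the first bottleneck is an ARB core), and the concatenation is assumed to contain exactly $Y_{0}$ and $B_{1}, \dots, B_{n}$.

```python
from math import floor


def c3k2_attn_layout(c2: int, n: int, e: float = 0.5) -> dict:
    """Channel bookkeeping for one C3k2_Attn block as described in the text."""
    c_hid = floor(c2 * e)  # C_hid = floor(C2 * e)
    # Assumption: indices are counted from 0, so block 0 is an ARB core.
    blocks = ["ARB" if i % 2 == 0 else "plain" for i in range(n)]
    return {
        "expand_out": 2 * c_hid,             # first 1x1 conv output, split into Y0 and Y1
        "c_hid": c_hid,                      # width of Y0, Y1 and every bottleneck
        "blocks": blocks,                    # alternating ARB / plain bottlenecks
        "concat_channels": (n + 1) * c_hid,  # Y0 plus B_1..B_n
        "out_channels": c2,                  # after the final 1x1 conv cv2
    }


layout = c3k2_attn_layout(c2=128, n=2)
print(layout)
```

With `c2=128` and `n=2`, the hidden width is 64 channels and the block sequence is one ARB core followed by one plain bottleneck.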

The ARB core (inside Figure 2) works as follows. Input $X_{b} \in \mathbb{R}^{C_{hid} \times H \times W}$ goes through a $1 \times 1$ convolution $cv_{b1}$ that compresses channels to $C_{mid} = \lfloor C_{hid} \cdot e \rfloor$, yielding $Y_{b1}$. Then a depthwise separable convolution block $cv_{b2}$ performs a $3 \times 3$ depthwise convolution (groups = $C_{mid}$) and then a $1 \times 1$ pointwise convolution. Each convolution is followed by batch normalization and SiLU activation, producing $Y_{b2}$. Depthwise separable convolution reduces parameters to about $1 / C_{mid} + 1 / 9$ of a standard $3 \times 3$ convolution. Next, a $1 \times 1$ convolution $cv_{b3}$ restores the channel dimension to $C_{hid}$, giving $Y_{b3}$. Finally, a learnable channel scaling factor $\alpha \in \mathbb{R}^{C_{hid}}$ (shape $(1, C_{hid}, 1, 1)$, initialized to 0.1) is applied element-wise to $Y_{b3}$ and then added to the input via a residual connection:

$$Y_{b,out} = Y_{b3} \odot \alpha + X_{b} \tag{1}$$

With only $C_{hid}$ parameters, $\alpha$ introduces channel-wise attention into the residual branch. During training, $\alpha$ adaptively scales each channel, helping the network emphasize important features and suppress redundant ones. RLD-YOLO [25] further improves small-target feature capture by incorporating large kernel attention. The learnable scaling factor $\alpha$ is initialized to 0.1 to avoid disrupting pretrained features early in training.
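The parameter claims for the ARB core can be checked with simple arithmetic. The sketch below is illustrative only: it counts convolution weights and ignores batch normalization and biases, it assumes the pointwise convolution in $cv_{b2}$ keeps $C_{mid}$ channels, and the helper names (`arb_core_conv_params`, `separable_ratio`, `scale_and_add`) are hypothetical.

```python
from math import floor


def arb_core_conv_params(c_hid: int, e: float = 0.5) -> dict:
    """Convolution weights in one ARB core (BatchNorm and biases ignored)."""
    c_mid = floor(c_hid * e)
    return {
        "cv_b1": c_hid * c_mid,            # 1x1, C_hid -> C_mid
        "cv_b2_depthwise": 9 * c_mid,      # 3x3, groups = C_mid
        "cv_b2_pointwise": c_mid * c_mid,  # 1x1, C_mid -> C_mid (assumption)
        "cv_b3": c_mid * c_hid,            # 1x1, C_mid -> C_hid
        "alpha": c_hid,                    # one learnable scale per channel
    }


def separable_ratio(c_mid: int) -> float:
    """Depthwise + pointwise weights relative to a standard 3x3 convolution."""
    separable = 9 * c_mid + c_mid * c_mid
    standard = 9 * c_mid * c_mid
    return separable / standard  # algebraically 1/C_mid + 1/9


def scale_and_add(y_b3, alpha, x_b):
    """Eq. (1) at one spatial position: per-channel scaling plus residual."""
    return [y * a + x for y, a, x in zip(y_b3, alpha, x_b)]


params = arb_core_conv_params(c_hid=64)
ratio = separable_ratio(32)
# Two hypothetical channels with alpha at its initial value of 0.1.
out = scale_and_add([1.0, -2.0], [0.1, 0.1], [0.5, 0.5])
print(params, round(ratio, 4), out)
```

With α at 0.1, the branch output only perturbs the residual input slightly, which matches the stated motivation for that initial value.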

Figure 2. Attention Refinement Bottleneck (ARB).

2.2. Omni-Receptive Field Module (ORFM)

The original YOLOv11 backbone ends with an SPPF module that stacks max-pooling layers for multi-scale aggregation. However, its receptive field is fixed and channel relationships are not modeled, making it difficult to adapt to the large changes in object scale found in drone images. We therefore propose the Omni-Receptive Field Module (ORFM), which employs multi-branch dilated convolutions to capture context at multiple scales and adds second-order channel attention to enhance feature discriminability.

Figure 3 shows the ORFM structure. Input $X \in \mathbb{R}^{C \times H \times W}$ first passes through a $1 \times 1$ convolution $cv_{1}$ that reduces the channel count to $C_{m} = C / 4$, producing $F$. Then three parallel depthwise separable convolutions [26] with kernel size $3 \times 3$ and dilation rates 1, 2, and 3 (groups = $C_{m}$) generate multi-scale features $F_{1}, F_{2}, F_{3} \in \mathbb{R}^{C_{m} \times H \times W}$. To fuse these branches adaptively [27], we apply global average pooling to $F$, followed by a $1 \times 1$ convolution and Softmax to obtain dynamic weights $w_{1}, w_{2}, w_{3}$. The weighted sum is

$$F_{fuse} = w_{1} F_{1} + w_{2} F_{2} + w_{3} F_{3} \tag{2}$$
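Two parts of this step lend themselves to a small worked example: the spatial extent covered by each dilated branch, and the Softmax-weighted fusion of Eq. (2). The following sketch assumes one scalar weight per branch (the text does not say whether the weights are per branch or per channel) and uses made-up branch values; all function names are hypothetical.

```python
from math import exp


def effective_kernel(k: int, dilation: int) -> int:
    """Spatial extent covered by a k x k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]


def fuse(branches, logits):
    """Eq. (2): weighted sum of branch features with Softmax weights."""
    w = softmax(logits)
    length = len(branches[0])
    return [sum(w[b] * branches[b][i] for b in range(len(branches)))
            for i in range(length)]


extents = [effective_kernel(3, d) for d in (1, 2, 3)]
# Hypothetical branch responses and gating logits, for illustration only.
f1, f2, f3 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
weights = softmax([0.0, 0.0, 0.0])
fused = fuse([f1, f2, f3], [0.0, 0.0, 0.0])
print(extents, weights, fused)
```

The three 3 × 3 branches with dilation 1, 2, and 3 span 3, 5, and 7 pixels respectively, and equal logits reduce the fusion to a plain average of the branches.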

The fused feature undergoes batch normalization and SiLU activation, yielding $F_{bn}$. For better feature discrimination, we add a second-order channel attention module [18]. It first applies adaptive average pooling to $F_{bn}$ to reduce the spatial size to $8 \times 8$, then computes the covariance matrix and extracts its diagonal as the variance vector $v$. The global average pooling $g$ (first-order descriptor) is retained. The element-wise product $s = g \odot v$ gives a joint statistical feature, which passes through two $1 \times 1$ convolutions and a Sigmoid to produce channel attention weights $a$. The weights $a$ are multiplied element-wise with $F_{bn}$ to get $F_{att}$.
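The joint statistic $s = g \odot v$ can be illustrated on a toy input. This is a sketch under stated assumptions, not the authors' implementation: each channel is represented by its flattened 8 × 8 pooled map, the variance is the population variance (the diagonal of the channel covariance matrix), $g$ is computed from the same pooled map for simplicity, and the two learned $1 \times 1$ convolutions and the Sigmoid are omitted.

```python
def channel_statistics(pooled):
    """pooled[c] holds the flattened 8x8 (64-value) map of channel c."""
    g, v = [], []
    for values in pooled:
        mean = sum(values) / len(values)
        # Population variance; the normalisation used in the paper is not stated.
        var = sum((x - mean) ** 2 for x in values) / len(values)
        g.append(mean)   # first-order descriptor
        v.append(var)    # diagonal entry of the covariance matrix
    s = [gi * vi for gi, vi in zip(g, v)]  # element-wise product g * v
    return g, v, s


# Two hypothetical channels: one flat, one with strong spatial contrast.
flat = [1.0] * 64
contrast = [0.0, 2.0] * 32
g, v, s = channel_statistics([flat, contrast])
print(g, v, s)
```

Both channels have the same mean, so a first-order descriptor alone cannot separate them; the variance term is what distinguishes the high-contrast channel.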

Next, a lightweight spatial gating module (a $3 \times 3$ convolution followed by Sigmoid) generates spatial weights $s_{p}$. Multiplying $s_{p}$ element-wise with $F_{att}$ yields $F_{gated}$. A final $1 \times 1$ convolution $cv_{2}$ restores the channel dimension to $C$, producing $F_{out}$. A residual connection is added:

$$Y = F_{out} + X \tag{3}$$

ORFM alone brings only limited improvement, but when combined with ARB, a strong synergy emerges, as shown in the ablation study. CMA-Net [28] uses an efficient bi-directional feature pyramid and dual-dimensional channel attention to improve small-object detection on edge devices.

Figure 3. Omni-Receptive Field Module (ORFM).

2.3. P2 Small-Object Detection Head

Small objects account for a large proportion of drone images. The original YOLOv11 has three detection heads (P3: 80 × 80, P4: 40 × 40, P5: 20 × 20), which are not sensitive enough to extremely small targets. We therefore add a P2 detection head (160 × 160) to the feature pyramid, dedicated to capturing fine details of small objects.

Figure 4 illustrates the construction of the P2 head. Starting from the neck’s P3 layer (layer 16, 80 × 80), we apply 2× upsampling to obtain a 160 × 160 feature map. This map is then concatenated with the backbone’s second layer (P2, 160 × 160), merging shallow spatial information with deeper semantics. To control parameter growth, a lightweight C3k2 module (128 channels, n = 1, e = 0.25) refines the features, and its output serves as the P2 detection head. Thus, the network performs detection on four scale levels, greatly improving small-object response. MSCM-YOLO [24] uses a dedicated P2 detection head to preserve high-resolution features for small targets.

The P2 head adds about 0.08M parameters and 3.5 GFLOPs. Ablation shows that adding the P2 head alone increases mAP@0.5 by 2.4 points. When added on top of ARB and ORFM, it improves mAP@0.5 by a further 2.0 points over the ARB + ORFM combination, confirming its effectiveness for small objects. The expansion factor e = 0.25 for the P2 head was selected because it offers the best trade-off between parameter increase and detection gain.
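The grid sizes quoted for the four heads follow from the input resolution and the stride of each pyramid level. The short sketch below assumes a 640 × 640 input and the usual strides of 4, 8, 16, and 32 for P2 to P5; the function name `head_grids` is hypothetical.

```python
def head_grids(input_size: int = 640, strides=None) -> dict:
    """Side length of each detection grid for a square input."""
    strides = strides or {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
    return {name: input_size // s for name, s in strides.items()}


grids = head_grids()
cells = {name: side * side for name, side in grids.items()}
original_cells = cells["P3"] + cells["P4"] + cells["P5"]
print(grids)
print(cells["P2"], original_cells)
```

The P2 grid alone has 25,600 cells, roughly three times the 8,400 cells of the original three heads combined, which is consistent with the extra computation reported for this head.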

3. Experiments and Results Analysis

3.1. Datasets

We evaluate MSG-YOLO on VisDrone2019-DET as the main dataset and use HazyDet for generalization tests.

VisDrone2019-DET [3] is a widely used benchmark for drone-view object detection. It contains 10 categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor, with a resolution of 2000 × 1500 pixels. The dataset has 8629 images: 6471 for training, 548 for validation, and 1610 for testing. The images were captured in urban, rural, and campus environments under various lighting and weather conditions. It contains many small, densely packed objects with large variations in scale, along with occlusions, illumination changes, and motion blur.

To evaluate low-visibility robustness, we also use the public dataset HazyDet [29], which was designed for drone-view object detection in hazy scenes. It contains approximately 383,000 real-world instances, including both real hazy images and clear images with synthetic fog. We select 1000 representative images of cars, trucks, and buses under haze, rain, and low-light conditions. These images have low contrast and blurry boundaries, making them suitable for testing our attention-based background suppression and the P2 head’s performance under adverse conditions.

3.2. Experimental Setup and Evaluation Metrics

All experiments were run under the same environment (Table 1). System: Linux, Intel Xeon Silver 4310 (6 vCPUs @ 2.10 GHz), NVIDIA GeForce RTX 3090 (24 GB). Framework: PyTorch 2.4.1, Python 3.10.18, CUDA 11.8. To find the best learning rate, we tested two values: 0.01 and 0.001 (Figure 5). The model with 0.01 converged more steadily and gave higher validation accuracy than the one with 0.001, which made convergence too slow. We therefore set the learning rate to 0.01, under which the training curve was smooth with no obvious oscillations.

We also examined the effect of training epochs by comparing 100, 150, and 200 epochs (Figure 6). The results showed that the baseline model had essentially converged by the 150th epoch, and continuing training to 200 epochs yielded only limited performance improvements. Considering both training efficiency and performance, the training epoch was ultimately set to 150. To ensure fairness in subsequent experimental comparisons, both the baseline and the improved models were trained under identical settings. The batch size was set to 32, and the optimizer AdamW was used [30]. Specific parameter configurations are shown in Table 2.

To comprehensively evaluate model performance, precision, recall, average precision (AP), and mean average precision (mAP) are used as evaluation metrics for detection accuracy [31,32]. The calculation formulas are as follows:

$$P = \frac{TP}{TP + FP} \tag{4}$$

$$R = \frac{TP}{TP + FN} \tag{5}$$

$$AP = \int_{0}^{1} P(R) \, dR \tag{6}$$

$$mAP = \frac{1}{C} \sum_{c = 1}^{C} AP_{c} \tag{7}$$

Here, TP is the number of correctly detected positive samples, FP is the number of negative samples incorrectly classified as targets, and FN is the number of missed positive samples. Precision reflects the accuracy of the detection results, and recall reflects the model’s ability to find positive samples. AP is obtained by integrating the precision-recall curve and evaluates detection performance for a single class. mAP is the average of the APs across all classes, and C is the total number of classes. This paper further adopts the COCO-style evaluation metric (mAP@0.5:0.95) as the primary performance benchmark [33]. This metric calculates AP at Intersection over Union (IoU) thresholds ranging from 0.5 to 0.95 with a step size of 0.05 and then takes the average, thereby reflecting the model’s detection capability under varying localization accuracy requirements.
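Equations (4) to (7) translate directly into code. The sketch below is a simplified illustration: AP is approximated by trapezoidal integration over a handful of given precision-recall points, whereas practical evaluators typically interpolate the curve first, and the numbers in the example are made up. All function names are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(recalls, precisions) -> float:
    """Trapezoidal approximation of the integral of P(R) over R in [0, 1]."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


def mean_ap(ap_per_class) -> float:
    return sum(ap_per_class) / len(ap_per_class)


# IoU thresholds for the COCO-style average: 0.50, 0.55, ..., 0.95
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.0])
m = mean_ap([ap, 0.5])
print(p, r, ap, m, iou_thresholds)
```

The COCO-style score repeats the AP computation at each of the ten IoU thresholds and averages the results.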

Regarding model complexity, this paper employs three metrics: the number of parameters (Params), the number of floating-point operations (FLOPs), and the inference speed (FPS). Params is the total number of trainable parameters in the model, measured in millions (M), and characterizes the model’s storage footprint and memory requirements. FLOPs is the number of floating-point operations required for a single forward pass, reported in GFLOPs, and reflects the model’s computational complexity. FPS is measured on an NVIDIA RTX 3090 GPU with batch size 1 and indicates how many frames the model can process per second; it directly reflects the model’s suitability for real-time drone applications.

3.3. Ablation Studies

To gain a deeper understanding of the role and performance of each improvement, ablation experiments were conducted on VisDrone2019. The baseline model is YOLOv11n, which achieves an mAP@0.5 of 32.4% on the validation set, with 2.58 million parameters and a computational cost of 6.3 GFLOPs. The experiments recorded the performance changes after sequentially adding each module; the specific results are shown in Table 3.

Table 3 shows that adding ARB alone raises mAP@0.5 by 1.9 points (from 32.4% to 34.3%), confirming the value of channel attention for feature recalibration. In contrast, using ORFM by itself slightly hurts performance: mAP@0.5 drops to 32.1% while parameters decrease by 0.13M. Without the guidance of channel attention, ORFM’s aggressive dimensionality reduction (C_m = C/4) likely discards some useful information, and its receptive field benefits cannot compensate for this loss.

When ARB and ORFM are combined, however, a clear synergy appears. The joint model reaches 34.7% mAP@0.5—a 2.3-point gain over baseline and 0.4 points higher than ARB alone. Moreover, the parameter count falls to 2.61M, which is even lower than using ARB alone (2.74M), and the computational cost also drops slightly. ARB applies learnable channel scaling factors to modulate feature maps, suppressing redundant channels and highlighting informative ones. This refined feature representation provides a solid foundation for ORFM’s multi-scale processing. ORFM then uses parallel dilated convolutions and second-order channel attention to capture context at appropriate scales, recovering details that would otherwise be lost during dimension reduction.

These results indicate that ARB and ORFM play complementary roles. ARB improves channel-wise feature selection, whereas ORFM enhances multi-scale feature aggregation. Their combined performance exceeds the sum of their individual gains, and the parameter reduction comes from ORFM’s efficient design (replacing SPPF with a larger channel reduction ratio). This complementary design explains why the ARB + ORFM combination achieves both higher accuracy and lower complexity.
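The synergy claim can be verified from the mAP@0.5 values quoted in this section. The snippet below is a worked check using only those figures; the dictionary keys are labels chosen for this illustration.

```python
# mAP@0.5 (%) values quoted in the ablation discussion.
map50 = {
    "baseline": 32.4,
    "ARB": 34.3,
    "ORFM": 32.1,
    "ARB+ORFM": 34.7,
}

# Gain of each configuration over the YOLOv11n baseline, in percentage points.
gain = {name: round(value - map50["baseline"], 1) for name, value in map50.items()}
sum_of_individual = round(gain["ARB"] + gain["ORFM"], 1)
synergy = round(gain["ARB+ORFM"] - sum_of_individual, 1)
print(gain)
print(sum_of_individual, synergy)
```

ARB contributes +1.9 points and ORFM −0.3 points on their own, for a sum of +1.6, while the combination reaches +2.3, leaving about 0.7 points attributable to their interaction.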

Finally, adding the P2 head on top of ARB + ORFM lifts mAP@0.5 to 36.7%, a further 2.0-point improvement. This confirms that the P2 head supplies high-resolution spatial details that the other two modules cannot provide, and thus enhances multi-scale object detection. FPA-YOLOv8s [34] introduces a P2-like detection head and achieves notable gains on VisDrone, while DRONet [35] further confirms the importance of high-resolution features for small and occluded objects.

Table 3 also reports the inference speed (FPS) measured on an NVIDIA RTX 3090 GPU with batch size 1. The baseline YOLOv11n achieves 194.13 FPS. Adding ARB alone slightly raises the speed to 196.14 FPS. ORFM alone lowers it to 185.41 FPS because of its multi-branch dilated convolutions. When ARB and ORFM are combined, the speed drops further to 184.00 FPS, which is still well above the 30 FPS needed for real-time drone applications. Finally, adding the P2 head increases the computational load and lowers the speed to 150.02 FPS. Although this is a notable reduction, the model remains fully capable of real-time processing on high-end GPUs, and the accuracy gain (+2.0 points) justifies the trade-off. For resource-constrained edge devices, one can optionally remove the P2 head to recover a higher speed while still benefiting from ARB + ORFM.
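Frame rates are easier to compare as per-frame latency. The following conversion uses only the FPS values quoted above; the configuration labels are chosen for this illustration.

```python
# FPS values quoted in the text (RTX 3090, batch size 1).
fps = {
    "baseline": 194.13,
    "ARB": 196.14,
    "ORFM": 185.41,
    "ARB+ORFM": 184.00,
    "ARB+ORFM+P2": 150.02,
}

latency_ms = {name: 1000.0 / value for name, value in fps.items()}
extra_ms = latency_ms["ARB+ORFM+P2"] - latency_ms["baseline"]
realtime_budget_ms = 1000.0 / 30  # 30 FPS threshold mentioned in the text

for name, ms in latency_ms.items():
    print(f"{name}: {ms:.2f} ms per frame")
print(f"extra cost of the full model: {extra_ms:.2f} ms")
```

On this GPU the full model costs roughly 1.5 ms more per frame than the baseline, against a real-time budget of about 33 ms.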

3.4. Model Comparison Experiments

We compare MSG-YOLO with several mainstream lightweight object detection networks, including YOLOv5n, YOLOv6n, YOLOv7-tiny, YOLOv8n, YOLOv10n, YOLOv11n, YOLOv12n, and YOLOv26n.

All YOLO-series baselines (YOLOv5n, YOLOv6n, YOLOv7-tiny, YOLOv8n, YOLOv10n, YOLOv11n, YOLOv12n, YOLOv26n) and our MSG-YOLO were trained from scratch on the VisDrone2019 training set using identical settings: AdamW optimizer, initial learning rate 0.01, weight decay 0.0005, momentum 0.937, batch size 32, input resolution 640 × 640, and 150 epochs. The results are shown in Table 4.
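For reproducibility, the shared training settings can be collected in one configuration. The sketch below is an assumption-laden illustration: the key names mirror the Ultralytics training arguments, but the text does not state which training tool was used, and the iteration count assumes that the last partial batch of the 6471 training images is kept.

```python
# Hyperparameters as listed in the text. The key names follow the Ultralytics
# training arguments; that mapping is an assumption, not stated in the paper.
train_cfg = {
    "epochs": 150,
    "batch": 32,
    "imgsz": 640,
    "optimizer": "AdamW",
    "lr0": 0.01,
    "weight_decay": 0.0005,
    "momentum": 0.937,
    "pretrained": False,  # "trained from scratch"
}

train_images = 6471  # VisDrone2019-DET training split
# Ceiling division: number of batches needed to cover the training set once.
iterations_per_epoch = -(-train_images // train_cfg["batch"])
total_iterations = iterations_per_epoch * train_cfg["epochs"]
print(iterations_per_epoch, total_iterations)
```

Under these assumptions each model sees 203 batches per epoch, or 30,450 optimizer steps over the full schedule.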

The results in Table 4 show that the proposed MSG-YOLO achieves 36.7% on mAP@0.5, ranking first among lightweight models with fewer than 3 million parameters. Compared to YOLOv11n, it achieves a 4.3 percentage point improvement; compared to YOLOv26n, it is 3.9 percentage points higher; and compared to YOLOv10n, it is 4.5 percentage points higher.

Compared with earlier YOLO versions, MSG-YOLO also shows clear advantages. It outperforms YOLOv5n (31.7% mAP@0.5) by 5.0 points while using only 0.19M more parameters. YOLOv6n, despite having a larger parameter count (4.23M), reaches only 29.0% mAP@0.5, 7.7 points lower than MSG-YOLO. YOLOv7-tiny, with 6.03M parameters (more than twice that of MSG-YOLO), achieves 33.4% mAP@0.5, still 3.3 points behind.

For models with similar parameter counts, YOLOv26n has only 2.37 million parameters, 0.32 million fewer than MSG-YOLO, but its mAP@0.5 is 3.9 percentage points lower. YOLOv10n has 2.26 million parameters, 0.43 million fewer than MSG-YOLO, yet its mAP@0.5 is 4.5 percentage points lower. This indicates that a modest parameter increase can bring significant accuracy gains, and MSG-YOLO effectively balances accuracy and efficiency. Comparative studies of YOLO variants for UAV-based infrastructure inspection also show that lightweight models can achieve competitive accuracy with appropriate design [36]. Benchmarking lightweight YOLO models on edge devices confirms that favorable speed-accuracy trade-offs can be achieved for small-object detection [37]. EBAD-YOLO [38] achieves 35.9% mAP@0.5 on VisDrone using a bidirectional adaptive dense FPN. The ESO-Det [39] detector achieves strong performance on VisDrone through a dense cross-branch complementary module.

Overall, MSG-YOLO achieves the highest detection accuracy among all compared models with fewer than 3M parameters while keeping both parameter count and computational cost low, which makes it well suited to practical deployment.

3.5. Scale-Wise Performance Analysis

To investigate the effectiveness of the proposed modules on objects of different sizes, we compare the per-category AP50 improvements between the baseline YOLOv11n and MSG-YOLO. The results are presented in Table 5. The categories are grouped into small, medium, and large scales following the COCO definition.

As shown in the table, the proposed MSG-YOLO achieves substantial gains on small objects. For instance, pedestrian AP50 increases by 8.5 percentage points (from 34.8% to 43.3%), people by 7.4 points (27.3% → 34.7%), and motor by 6.5 points (36.1% → 42.6%). The average improvement across the six small-object categories (pedestrian, people, bicycle, tricycle, awning-tricycle, motor) is approximately 4.7 percentage points, whereas the average improvement for medium-sized objects (car, van, bus) is 3.5 points, and for the large object (truck) it is 4.7 points. These results indicate that the P2 detection head, together with ARB and ORFM, is particularly beneficial for small-scale targets, which are the most challenging in drone-captured scenes and show the largest per-category gains. The gains on medium objects are positive but less pronounced, and the gain on the single large category (truck) matches the small-object average, confirming that the proposed modules do not sacrifice performance on larger instances while significantly boosting small-object detection.

This scale-wise analysis further supports the design rationale of MSG-YOLO: the P2 head provides high-resolution spatial details essential for small objects, and the synergistic effect of ARB and ORFM enhances multi-scale feature representation, leading to balanced improvements across all object sizes.

3.6. Generalization Experiments

To verify MSG-YOLO’s generalization capabilities in low-visibility environments, generalization experiments were conducted on the HazyDet public dataset. YOLOv11n and MSG-YOLO models used identical parameter settings; the experimental results are shown in Table 6.

In tests on the HazyDet dataset, which features blurry images and low contrast and is specifically designed to evaluate detection performance under adverse weather conditions, MSG-YOLO achieved an mAP@0.5 of 71.5%, 3.1 percentage points higher than YOLOv11n, indicating that the model can still detect objects reliably under low-visibility conditions. This validates the attention-enhancing effect of the ARB module on low-contrast targets, as well as the effectiveness of the ORFM’s multi-scale context aggregation in foggy and rainy environments. The P2 detection head’s ability to capture small, blurred targets was also confirmed by the experiments. UFDT-YOLO [40] addresses hazy conditions by integrating an image restoration module and feature attention mechanism, achieving robust detection on HazyDet. On this dataset, the inference speed of our model drops from 226.04 FPS to 179.15 FPS, which remains well above real-time requirements (30 FPS), confirming its practical usability even under challenging weather conditions.

3.7. Heatmap Analysis

To visually validate the improved model’s performance in feature focusing, we visualized the feature maps using the Grad-CAM method [41]. Three images were selected: two from the VisDrone validation set and one low-visibility image from HazyDet. The comparison was made between the baseline YOLOv11n and our MSG-YOLO, with heatmaps extracted from the Detect output layer. The images are presented in pseudocolor, where deeper red indicates higher model attention.

Figure 7 presents a comparison of the three sets of heatmaps. As shown in the figure, the baseline model exhibits relatively low heatmap responses, with responses also appearing in background areas, indicating that it is easily affected by environmental noise and rarely responds to distant objects. In contrast, the improved model’s heatmap responses are primarily concentrated on the objects, with extremely high responses even for distant targets. Figure 7a shows a scene with pedestrians and vehicles of varying sizes. The baseline model exhibits extremely low overall response and is easily affected by environmental noise, whereas the improved model shows very high response to both near and far targets. Figure 7b depicts a low-light, high-traffic scene. The baseline model has a low response to near targets and virtually no response to far targets, while the improved model demonstrates significantly enhanced response to small-sized targets. Figure 7c presents a low-visibility image. Due to blurred edges, the baseline model generates a large number of low-response targets in the hazy areas, whereas the improved model still manages to concentrate the heatmap on the vehicles. This demonstrates that the ARB module’s attention enhancement for low-contrast targets and the ORFM’s multi-scale context aggregation remain effective under adverse weather conditions.

The above heatmap comparison confirms the improved model’s advantages in feature focusing and small-object response. These findings are corroborated by data from the ablation experiments, further validating the synergistic effects of ARB and ORFM, as well as the effectiveness of the P2 head for small-object detection.

3.8. Analysis of Visualization Results

To visually demonstrate the actual detection performance of MSG-YOLO, we selected three representative images from the VisDrone2019 validation set and one low-visibility image from HazyDet for comparative analysis. Figure 8 compares the detection results of the baseline YOLOv11n and our complete model in the same scene.

Figure 8a shows a complex street scene. The baseline model missed distant pedestrians and vehicles, as well as the front of the car in the lower-left corner; our model detected all of these objects. Figure 8b depicts a low-light scene with dense vehicle traffic. Our model not only detected distant small objects that the baseline model missed, but also achieved higher accuracy for nearby objects. Figure 8c presents a complex environment. The baseline model misclassified a nearby awning as a truck; our model made no such error and detected more small distant targets. Figure 8d shows a comparison under strong lighting. In this setting, the baseline model exhibited numerous missed detections and misclassifications, whereas the improved model avoided these errors and achieved higher accuracy.

4. Discussion

Although MSG-YOLO achieves strong performance on drone-view object detection, it has several limitations. First, the model is designed only for visible-light images; its applicability to multispectral or thermal data has not been verified. Second, in scenes with extreme motion blur or drastic illumination changes, the P2 head may still miss some tiny objects because of the low signal-to-noise ratio. Third, the current design focuses on static image detection; temporal information from video streams, which could improve robustness, is not exploited and is left for future work.

5. Conclusions

To handle large scale variation, background clutter, and deployment constraints in drone imagery, we propose MSG-YOLO, a lightweight detector for small and dense objects. Its contributions are threefold. First, we design an Attention Refinement Bottleneck (ARB) that integrates depthwise separable convolutions and learnable channel scaling factors into the residual structure, adding channel attention to the network at minimal parameter cost. Second, we design an Omni-Receptive Field Module (ORFM) that uses multi-branch dilated convolutions to capture context at different scales, further enhanced by second-order channel attention. Third, we add a 160 × 160 P2 detection layer to the feature pyramid specifically to handle small objects.
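The parameter savings behind these design choices can be illustrated with simple arithmetic. The sketch below counts weights for a standard versus a depthwise separable convolution, the spatial extent of a dilated kernel, and the stride implied by a 160 × 160 output. The channel width (64), the dilation rates (1, 2, 3), and the 640 × 640 input size are assumptions chosen for illustration and are not taken from the paper.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def dw_separable_params(c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel) plus a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


def effective_kernel(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


c = 64                                     # illustrative channel width (assumption)
standard = conv_params(c, c, 3)            # 3 * 3 * 64 * 64
separable = dw_separable_params(c, c, 3)   # 3 * 3 * 64 + 64 * 64
channel_scale = c                          # one learnable scale factor per channel
extents = [effective_kernel(3, d) for d in (1, 2, 3)]  # illustrative dilation rates
stride_p2 = 640 // 160                     # assumes a 640 x 640 network input

print(standard, separable, channel_scale, extents, stride_p2)
```

Under these assumptions the separable block uses roughly one eighth of the weights of the standard convolution, and the per-channel scaling adds only one parameter per channel, which is consistent with the low parameter cost described above.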

Experiments show that ARB and ORFM act synergistically. Used separately, each module brought limited gains; adding ORFM on top of ARB raised mAP@0.5 by a further 0.4 percentage points while reducing the parameter count by 0.13 M (Table 3). On VisDrone2019, MSG-YOLO has 2.69 M parameters and a computational cost of 11.6 GFLOPs, and reaches 36.7% mAP@0.5, which is 4.3 percentage points higher than YOLOv11n. The ablation experiments also confirm that each component contributes, with ARB, ORFM, and the P2 head each being most effective at its respective object scales. In terms of generalization, the model outperforms the baseline by 3.1 percentage points in mAP@0.5 on the public HazyDet dataset, supporting its robustness in low-visibility scenarios.

Future work includes exploring lighter architectures, incorporating temporal information, and evaluating real-time performance on embedded platforms [42].

Author Contributions

Conceptualization, J.L. and Y.W.; methodology, J.L.; software, Y.W.; validation, J.L. and Y.W.; formal analysis, J.L.; investigation, Y.W.; resources, J.L.; data curation, Y.W.; writing—original draft preparation, J.L.; writing—review and editing, Y.W.; visualization, Y.W.; supervision, J.L.; project administration, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable (the study does not involve humans or animals).

Informed Consent Statement

Not applicable (the study does not involve human participants).

Data Availability Statement

The VisDrone2019 dataset used in this study is publicly available at https://github.com/VisDrone/VisDrone-Dataset (accessed on 13 October 2025). The HazyDet dataset can be accessed at https://github.com/GrokCV/HazyDet (accessed on 13 October 2025). The code and models are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Colomina, I.; Molina, P. Unmanned Aerial Systems for Photogrammetry and Remote Sensing: A Review. ISPRS J. Photogramm. Remote Sens. 2014, 92, 79–97.
2. Zhao, Z.-Q.; Zheng, P.; Xu, S.; Wu, X. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232.
3. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019.
4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
5. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788.
6. Sapkota, R.; Flores-Calero, M.; Qureshi, R.; Badgujar, C.; Nepal, U.; Poulose, A.; Zeno, P.; Vaddevolu, U.B.P.; Khan, S.; Shoman, M. YOLO Advances to Its Genesis: A Decadal and Comprehensive Review of the You Only Look Once (YOLO) Series. Artif. Intell. Rev. 2025, 58, 274.
7. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 1–6.
8. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
9. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
10. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; IEEE: Honolulu, HI, USA, 2017; pp. 936–944.
11. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 8759–8768.
12. Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-Resolution Detection Network for Small Objects. arXiv 2020, arXiv:2006.07607.
13. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
14. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Salt Lake City, UT, USA, 2018; pp. 6848–6856.
15. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Salt Lake City, UT, USA, 2018; pp. 7132–7141.
16. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–19.
17. Gao, Z.; Xie, J.; Wang, Q.; Li, P. Global Second-Order Pooling Convolutional Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 3024–3033.
18. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
19. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2021; pp. 11863–11874.
20. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 2778–2788.
21. Habash, N.; Alqumsan, A.A.; Zhou, T. Recent Real-Time Aerial Object Detection Approaches, Performance, Optimization, and Efficient Design Trends for Onboard Performance: A Survey. Sensors 2025, 25, 7563.
22. Zhang, R.; Hou, J.; Li, L.; Zhang, K.; Zhao, L.; Gao, S. RTUAV-YOLO: A Family of Efficient and Lightweight Models for Real-Time Object Detection in UAV Aerial Imagery. Sensors 2025, 25, 6573.
23. Alshehri, M.; Xue, T.; Mujtaba, G.; AlQahtani, Y.; Almujally, N.A.; Jalal, A.; Liu, H. Integrated Neural Network Framework for Multi-Object Detection and Recognition Using UAV Imagery. Front. Neurorobot. 2025, 19, 1643011.
24. Sun, Z.; Zhang, G.; Xing, Y.; Liu, Y. A Scale-Adaptive Aggregation and Multi-Domain Feature Fusion Architecture for Small-Target Detection in UAV Aerial Imagery. Sensors 2026, 26, 1610.
25. Wu, X.; Ablameyko, S.V. RLD-YOLO: New Method for Object Detection in Unmanned Aerial Vehicle Images Using YOLOv11 Neural Network. J. Belarusian State Univ. Math. Inform. 2025, 2, 105–117.
26. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2015, arXiv:1511.07122.
27. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; IEEE: Long Beach, CA, USA, 2019; pp. 510–519.
28. Yang, J.; Yue, X.; Wu, L. A Collaborative Multi-Attention Network for Real-Time Small Object Detection in UAV Imagery. Sci. Rep. 2026, 16, 5852.
29. Feng, C.; Chen, Z.; Li, X.; Wang, C.; Yang, J.; Cheng, M.-M.; Dai, Y.; Fu, Q. HazyDet: Open-Source Benchmark for Drone-View Object Detection with Depth-Cues in Hazy Scenes. arXiv 2024, arXiv:2409.19833.
30. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101.
31. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
32. Padilla, R.; Netto, S.L.; Da Silva, E.A. A Survey on Performance Metrics for Object-Detection Algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niterói, Brazil, 1–3 July 2020; IEEE: New York, NY, USA, 2020; pp. 237–242.
33. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755.
34. Nan, H.; Li, C. Fpa-Yolov8s: An Efficient Small Object Detection Algorithm for Drone Aerial Imagery. Pattern Anal. Appl. 2025, 28, 187.
35. Qian, J.; Tao, C.; Luo, X.; Gao, Z.; Wang, T.; Xiao, F.; Cao, F.; Zhang, Z. DRONet: Occlusion-Mastering Multi-Object Detection Tailored for Unmanned Aerial Vehicles. Displays 2026, 93, 103388.
36. Yang, Z.; Lan, X.; Wang, H. Comparative Analysis of YOLO Series Algorithms for UAV-Based Highway Distress Inspection: Performance and Application Insights. Sensors 2025, 25, 1475.
37. Shakya, B.; El-Gayar, O.; Kaabi, J.; Ahmed, K.M. Small-Object Detection at the Edge: A Pareto-Efficient Benchmark of Lightweight YOLO Models on UAV and Overhead Datasets. IEEE Access 2025, 14, 528–548.
38. Hu, S.; Xing, R.; Han, L.; Liu, T. EBAD-YOLO: Efficient Bidirectional Adaptive Dense Network for UAV Small-Object Detection. J. Real-Time Image Process. 2026, 23, 40.
39. Deng, H.; Zhou, S.; Yang, W. ESO-Det: An Efficient Small Object Detector for Real-Time UAV Perception. Sensors 2026, 26, 1512.
40. Zhao, M.; Chen, J.; Zheng, J.; Li, Y.; Bao, C. UFDT-YOLO: Robust Small Object Detection from UAV Perspectives in Foggy Environments. Digit. Signal Process. 2026, 171, 105826.
41. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Venice, Italy, 2017; pp. 618–626.
42. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 2815–2823.

Figure 1. MSG-YOLO network architecture.

Figure 4. Small-object detection head.

Figure 5. Experimental results with different learning rates. (a) 0.01; (b) 0.001.

Figure 6. Experimental results with different training epochs. (a) 100; (b) 150; (c) 200.

Figure 7. Heatmap comparison. (a) Complex street scene; (b) low-light dense traffic; (c) hazy low-visibility scene. Warmer colors (red/yellow) indicate higher model attention.

Figure 8. Detection performance comparison. (a) Complex street scene; (b) low-light dense vehicles; (c) complex environment; (d) strong lighting scene.

Table 1. Experimental environment configuration.

Settings	Parameters
CPU	6 vCPU Intel(R) Xeon(R) Silver 4310 CPU @ 2.10 GHz
GPU	NVIDIA GeForce RTX 3090
Operating system	Linux
Deep learning framework	PyTorch 2.4.1
Language	Python 3.10.18

Table 2. Hyperparameter settings.

Settings	Parameters
lr0	0.01
Momentum	0.9
Optimizer	AdamW
Epochs	150
Batch size	32

Table 3. Ablation study on VisDrone2019 validation set (%).

Baseline	ORFM	ARB	Head	mAP@0.5 (%)	mAP@0.5:95 (%)	Params (M)	GFLOPs	FPS
√				32.4	18.8	2.58	6.3	194.13
√	√			32.1	18.6	2.45	6.2	185.41
√		√		34.3	20.1	2.74	8.0	196.14
√	√	√		34.7	20.4	2.61	7.9	184.00
√	√	√	√	36.7	22.0	2.69	11.6	150.02
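The differences quoted in the text can be recomputed directly from Table 3. The following is a small sketch that copies the table rows and reports each configuration's change in mAP@0.5 and parameter count relative to a chosen reference row. The row labels and the helper name `deltas` are illustrative choices, not names used by the authors.

```python
# Rows copied from Table 3:
# (configuration, mAP@0.5 %, mAP@0.5:95 %, params M, GFLOPs, FPS)
ABLATION = [
    ("baseline",          32.4, 18.8, 2.58,  6.3, 194.13),
    ("+ORFM",             32.1, 18.6, 2.45,  6.2, 185.41),
    ("+ARB",              34.3, 20.1, 2.74,  8.0, 196.14),
    ("+ORFM+ARB",         34.7, 20.4, 2.61,  7.9, 184.00),
    ("+ORFM+ARB+P2 head", 36.7, 22.0, 2.69, 11.6, 150.02),
]


def deltas(rows, ref_name):
    """Return {name: (delta mAP@0.5, delta params)} relative to the row named ref_name."""
    ref = next(r for r in rows if r[0] == ref_name)
    return {
        r[0]: (round(r[1] - ref[1], 2), round(r[3] - ref[3], 2))
        for r in rows
    }


vs_baseline = deltas(ABLATION, "baseline")
vs_arb = deltas(ABLATION, "+ARB")

for name, (d_map, d_par) in vs_baseline.items():
    print(f"{name:18s} dmAP@0.5 = {d_map:+.1f} pts, dParams = {d_par:+.2f} M")
```

Read this way, the 0.4-point gain and 0.13 M parameter reduction correspond to adding ORFM to the ARB-only configuration, while the full model is 4.3 points above the baseline with 0.11 M more parameters.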

Table 4. Comparison with state-of-the-art detectors on VisDrone2019.

Model	mAP@0.5 (%)	mAP@0.5:95 (%)	Params (M)	GFLOPs
YOLOv5n	31.7	18.2	2.50	7.1
YOLOv6n	29.0	16.6	4.23	11.7
YOLOv7-tiny	33.4	17.1	6.03	13.3
YOLOv8n	32.1	18.5	3.00	8.1
YOLOv10n	32.2	18.6	2.26	6.5
YOLOv11n	32.4	18.8	2.58	6.3
YOLOv12n	29.1	16.6	2.55	6.3
YOLOv26n	32.8	18.7	2.37	5.2
Ours	36.7	22.0	2.69	11.6
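One way to read Table 4 is as an accuracy versus compute trade-off. The sketch below is an illustrative analysis that is not part of the original experiments: it finds the models that are not dominated in the (mAP@0.5, GFLOPs) plane and computes the compute ratio of our model to the YOLOv11n baseline. The helper name `pareto_front` is hypothetical.

```python
# Values copied from Table 4: name -> (mAP@0.5 %, GFLOPs)
MODELS = {
    "YOLOv5n":     (31.7, 7.1),
    "YOLOv6n":     (29.0, 11.7),
    "YOLOv7-tiny": (33.4, 13.3),
    "YOLOv8n":     (32.1, 8.1),
    "YOLOv10n":    (32.2, 6.5),
    "YOLOv11n":    (32.4, 6.3),
    "YOLOv12n":    (29.1, 6.3),
    "YOLOv26n":    (32.8, 5.2),
    "Ours":        (36.7, 11.6),
}


def pareto_front(models):
    """Names of models that no other model beats on both mAP (higher) and GFLOPs (lower)."""
    front = []
    for name, (m, f) in models.items():
        dominated = any(
            (m2 >= m and f2 <= f) and (m2 > m or f2 < f)
            for other, (m2, f2) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(MODELS)
flops_ratio = MODELS["Ours"][1] / MODELS["YOLOv11n"][1]
print(front, round(flops_ratio, 2))
```

This makes the cost of the accuracy gain explicit: the proposed model has the highest mAP@0.5 in the table but needs roughly 1.8 times the GFLOPs of YOLOv11n, largely because of the added high-resolution detection layer (see Table 3).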

Table 5. Per-category AP@0.5 comparison and improvement on the VisDrone2019 validation set.

Category	YOLOv11n AP@0.5 (%)	MSG-YOLO AP@0.5 (%)	Improvement (points)
Overall	32.4	36.7	+4.3
Pedestrian	34.8	43.3	+8.5
People	27.3	34.7	+7.4
Bicycle	8.0	9.8	+1.8
Car	75.5	80.2	+4.7
Van	38.5	40.8	+2.3
Truck	26.5	31.2	+4.7
Tricycle	20.4	23.2	+2.8
Awning-tricycle	12.2	13.3	+1.1
Bus	44.3	47.8	+3.5
Motor	36.1	42.6	+6.5
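As a consistency check, the overall mAP@0.5 should equal the unweighted mean of the ten per-category values. The sketch below recomputes both overall figures and the per-category gains from the numbers in Table 5; the helper name `mean_ap` is illustrative.

```python
# Per-category AP@0.5 (%) copied from Table 5: category -> (YOLOv11n, MSG-YOLO)
PER_CLASS = {
    "Pedestrian":      (34.8, 43.3),
    "People":          (27.3, 34.7),
    "Bicycle":         (8.0, 9.8),
    "Car":             (75.5, 80.2),
    "Van":             (38.5, 40.8),
    "Truck":           (26.5, 31.2),
    "Tricycle":        (20.4, 23.2),
    "Awning-tricycle": (12.2, 13.3),
    "Bus":             (44.3, 47.8),
    "Motor":           (36.1, 42.6),
}


def mean_ap(values):
    """mAP as the unweighted mean of per-category AP values."""
    values = list(values)
    return sum(values) / len(values)


baseline_map = mean_ap(b for b, _ in PER_CLASS.values())
msg_map = mean_ap(m for _, m in PER_CLASS.values())
gains = {name: round(m - b, 1) for name, (b, m) in PER_CLASS.items()}

print(round(baseline_map, 1), round(msg_map, 1))
print(max(gains, key=gains.get), min(gains, key=gains.get))
```

The recomputed means round to the reported overall values, and the largest gains fall on the Pedestrian and People categories, which are dominated by small objects.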

Table 6. Generalization results on the HazyDet dataset.

Method	mAP@0.5 (%)	mAP@0.5:95 (%)	P	R	FPS
YOLOv11n	68.4	47.4	0.794	0.6	226.04
Ours	71.5	51.0	0.78	0.648	179.15
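Table 6 shows precision falling slightly while recall rises. A single-number summary such as F1 can help interpret that trade-off, and FPS can be converted to per-image latency. The sketch below derives both from the table values; note that F1 and latency are not reported in the original and are computed here only for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def latency_ms(fps):
    """Average time per image (ms) implied by a throughput in frames per second."""
    return 1000.0 / fps


# Values copied from Table 6: name -> (P, R, FPS)
HAZYDET = {
    "YOLOv11n": (0.794, 0.6, 226.04),
    "Ours":     (0.78, 0.648, 179.15),
}

summary = {
    name: (f1_score(p, r), latency_ms(fps))
    for name, (p, r, fps) in HAZYDET.items()
}

for name, (f1, ms) in summary.items():
    print(f"{name:9s} F1 = {f1:.3f}, latency = {ms:.2f} ms/image")
```

On these numbers the recall gain outweighs the small precision loss, so the derived F1 rises from about 0.68 to about 0.71, at a cost of roughly one extra millisecond per image.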

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

