SODet-YOLO: A Small Object Detection Algorithm for UAV Aerial Photography Perspective

Zeng, Ke; Yu, Wangsheng; Long, Siyu; Qin, Xianxiang; Wang, Peng; Hou, Zhiqiang; Ma, Sugang; Li, Tianxin

doi:10.3390/rs18111714


by Ke Zeng ¹, Wangsheng Yu ²,*, Siyu Long ¹, Xianxiang Qin ², Peng Wang ², Zhiqiang Hou ³, Sugang Ma ³ and Tianxin Li ¹

¹ Graduate School, Air Force Engineering University, Xi’an 710051, China

² School of Information and Navigation, Air Force Engineering University, Xi’an 710077, China

³ School of Computer Science, Xi’an University of Posts and Telecommunications, Xi’an 710061, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1714; https://doi.org/10.3390/rs18111714

Submission received: 19 April 2026 / Revised: 22 May 2026 / Accepted: 23 May 2026 / Published: 26 May 2026

(This article belongs to the Special Issue Artificial Intelligence Algorithm for Remote Sensing Imagery Processing (5th Edition))


Highlights

Addressing the issues of missed detection and false detection in small object detection for UAV aerial photography scenarios, this study optimizes the YOLO11n model comprehensively from the dimensions of network architecture and loss functions, and proposes the SODet-YOLO algorithm. A novel neck network structure named FGA-AFPN is proposed to fully fuse multi-level features and suppress background interference. An improved C3k2_IDC module is proposed, and the PPA module is introduced to further boost model performance. Finally, we designed the MPDInterpIoU loss function to accelerate the training process.

What are the main findings?

First, the FGA-AFPN structure is proposed to replace the original FPN+PAN architecture; it suppresses cluttered backgrounds, highlights objects, and fully fuses multi-level features. The loss function is optimized using MPDInterpIoU to accelerate early-stage regression, and the IDC is introduced to enhance the C3k2 module, enlarging the receptive field and capturing contextual information. Finally, the PPA module is introduced to preserve critical object features during downsampling. These strategies jointly improve model performance.
Second, the improved YOLO11n achieves 41.487% mAP@0.5 on the VisDrone2019 dataset, an improvement of 8.92 percentage points over the baseline model.

What are the implications of the main findings?

Firstly, this paper presents a new method for small object detection in UAV aerial scenarios and verifies that the combination of non-adjacent feature fusion, attention mechanisms and optimized loss functions can significantly enhance model detection performance.
Secondly, the presented detection algorithm can be utilized in UAV-borne tasks, including surveillance, rescue, and environmental monitoring, effectively reducing the incidence of missed and false detections.

Abstract

To tackle the frequent missed and false detections arising from the tiny scale of objects and strong background clutter in UAV aerial photography scenarios, this paper proposes a novel algorithm named SODet-YOLO for UAV aerial imagery. First, to effectively extract the features of aerial objects and alleviate background interference, we integrate a high-resolution detection head denoted as P2 into YOLO11n, which is connected to the feature layer from the second downsampling stage of the Backbone and Neck networks. We also design the Fine-Grained Aggregation-Asymptotic Feature Pyramid Network (FGA-AFPN) to realize adequate fusion of feature information at different levels. Second, we redesign the original C3k2 module by embedding the Inception Depthwise Convolution (IDC). This design effectively expands the receptive field, enriches multi-scale contextual feature extraction, and mitigates adverse interference from complex background clutter. In addition, a novel IoU loss function named MPDInterpIoU is proposed by combining InterpIoU with MPDIoU. This function promotes faster convergence at the early learning stage and optimizes detection-related performance. Finally, the Parallelized Patch-aware Attention (PPA) is incorporated before the downsampling module to preserve the key features of small objects throughout multiple downsampling steps. The experimental findings validate that SODet-YOLO achieves an mAP@0.5 score of 41.487% on the VisDrone2019 object detection dataset, an improvement of 8.92 percentage points over the baseline YOLO11n model. However, the computational cost increases moderately, with the number of parameters increasing by 1.08 M, the computational complexity increasing by 26.1 GFLOPs, and the average inference time growing by 34.7 ms.

Keywords:

YOLO11n; UAV; attention mechanism; IoU; small object detection; convolution

1. Introduction

Driven by the swift advancement and widespread application of UAV technology, object detection in aerial remote sensing images has become a core research field in modern computer vision. Over the past few years, object detection approaches powered by deep learning have consistently boosted the precision of object localization and the efficiency of detection. These methods can be broadly categorized into one-stage and two-stage detection architectures. Concretely, two-stage detection approaches realize high-accuracy object detection by first producing region proposals and subsequently executing regression and classification tasks on these proposals. Several classical detection models, such as R-CNN [1], Fast R-CNN [2] and Faster R-CNN [3], are widely adopted. Conversely, one-stage detection approaches can directly output object categories and bounding box coordinates, enabling faster inference and making them popular for real-time detection tasks. Typical instances of such algorithms include the SSD [4], YOLO [5], and Retina-Net [6].

However, the core dilemma of UAV aerial image object detection stems from its unique scene characteristics, which exhibit fundamental differences from those of ground-level perspectives. First and foremost, extreme scale variations and the overwhelming proportion of small objects constitute the most prominent challenge in this field [7]. UAVs typically operate at altitudes ranging from tens to hundreds of meters, causing common ground objects such as pedestrians and vehicles to occupy only a tiny fraction of pixels in the captured images. These small objects lack sufficient texture and detail features, making them highly susceptible to feature degradation and information loss during the multi-layer convolutions in deep neural networks [8]. Consequently, models struggle to distinguish objects from background noise, leading to a significant increase in both the missed detection rate and the false positive rate. Meanwhile, dynamic changes in UAV flight altitude cause the same object to exhibit scale differences of several or even tens of times across different frames, further raising the requirements for the scale robustness of detection models.

To address these ubiquitous challenges, this paper proposes a novel small object detection algorithm named SODet-YOLO. This algorithm primarily tackles the critical issues of insufficient detail features in small objects and the severe missed detection and false positive problems induced by intense background interference. Its main technical contributions are as follows:

(1): We embed a high-resolution detection head, termed P2, into YOLO11n to mitigate the loss of detailed feature information caused by downsampling. Meanwhile, the Fine-Grained Aggregation-Asymptotic Feature Pyramid Network (FGA-AFPN) is developed to achieve a thorough fusion of feature information at diverse hierarchical levels. AFPN realizes asymptotic feature fusion spanning from two feature layers to three layers. By comparison, FGA-AFPN further fuses feature maps of adjacent layers in the intermediate process to achieve better feature fusion performance. The feature layers involved in fusion include P2, P3, and P4. To avoid excessive computational costs and considering that the P5 layer contributes little to small object detection, we exclude this layer from feature fusion. Within the proposed architecture, we introduce the ASFF module and design the novel SCFM to implement feature fusion. The ASFF module fuses each input feature layer with all other input feature layers, while the SCFM module embeds features from two adjacent input layers into the current feature layer. Both modules ensure that the channel dimensions of the output features are consistent with those of the input features. By adopting adaptive spatial feature fusion and attention coefficient weighting mechanisms, the two modules can fully integrate multi-layer features and highlight discriminative representative features. Therefore, this approach addresses the problems of scale variations and loss of detailed features while mitigating background interference, markedly enhancing small object detection performance.
(2): We embed the IDC module between the two convolutional layers within the bottleneck of the C3k2 module, resulting in the C3k2_IDC module. Since the IDC module can significantly expand the model’s receptive field while maintaining low parameter overhead, it enables the model to capture contextual environmental information around small objects, distinguish objects from backgrounds via global semantic features, and compensate for the feature degradation and detail loss of small objects during network downsampling, thereby achieving superior detection performance.
(3): We introduce a new loss function for object detection, termed MPDInterpIoU. This loss function integrates MPDIoU and InterpIoU, taking multiple loss-associated indicators into account, including overlapping and non-overlapping regions, center-point distances, as well as width and height discrepancies. Meanwhile, it effectively solves the gradient vanishing issue during the early training period. As a result, it speeds up the convergence rate in the initial training stage, greatly optimizes the overall training procedure of the model, and strengthens its detection capability.
(4): We incorporate the PPA module before the downsampling module (Conv) of the backbone network. By utilizing hierarchical feature fusion and an attention mechanism, the module maintains the vital features of small objects throughout downsampling operations, effectively alleviating feature loss in the downsampling process.

2. Related Work

Currently, a wealth of research strategies has been developed for small object detection in UAV aerial imagery. These strategies mainly focus on the following dimensions: mitigating the issues of limited data volume and imbalanced distribution of small objects via techniques like Mosaic data augmentation [9]; employing advanced neck architectures for multi-scale feature fusion (such as the Bidirectional Feature Pyramid Network, BiFPN [10]) to strengthen the feature expression ability of small objects; and integrating attention modules including Spatial-Channel Attention Mechanism (SCAM) [11] and Efficient Multi-Scale Attention (EMA) [12] to suppress background clutter and emphasize the critical feature information of small objects.

Grounded in the above-mentioned research approaches, scholars have devised targeted improvement schemes for UAV small object detection. In a study by Zhou et al. [13], BiFPN was employed as the neck of YOLO11, leading to a remarkable improvement in feature fusion performance. Furthermore, Wang et al. [14] introduced the Contextual Transformer (CoT) module, concentrating on the effective mining of texture features for small objects, thus elevating the detection precision. Wang et al. [15] incorporated the EMA module into the YOLOv8 framework, which realized pixel-wise feature integration via global information encoding and dimensional interaction, efficiently boosting the model’s small object detection ability. Feng et al. [16] improved YOLOv5 by introducing the Spatial-Channel Attention Mechanism (SCAM) module, which can more effectively aggregate the feature regions of small objects by combining channel and spatial attention.

Ahmed Gomaa and his colleagues [17] put forward a real-time approach that integrates detection and tracking characteristics. By adopting Top-hat/Bottom-hat transformations, the KLT + K-means algorithm, and an efficient association mechanism, this method is designed to solve the problems of vehicle occlusion, camera motion, and high computational consumption in the detection and tracking of moving vehicles in aerial videos. Sabina Umirzakova et al. [18] developed Cotton Multitask Learning (CMTL), a multitask learning framework driven by a transformer. This framework achieves cross-task mutual learning and feature retention through the Cross-Level Multi-Granular Encoder (CLMGE) and Multitask Self-Distilled Attention Fusion (MSDAF), thus enabling precise detection in cotton cultivation scenarios.

Xu et al. [19] combined the multi-scale feature complementary aggregation module with attention mechanisms to strengthen the localization ability of small objects, and further enhanced the multi-scale feature fusion capability by optimizing the detection head, thereby significantly improving detection performance. Ben Rouighi et al. [20] added the P2 small object detection layer, removed the large-object detection layer, and introduced the C3Ghost structure to reduce computational overhead, which significantly lowered the computational cost. Hussain et al. [21] embedded the PGI-aware Swin Fusion Block at the end of the backbone network to balance high-resolution local features and long-range contextual information. Meanwhile, they introduced the dual-path spatial-channel attention module into the detection head and reversible auxiliary branches of PGI, which optimized gradient flow and feature fidelity, and reduced missed detections and false positives. Liu et al. [22] expanded the receptive field and enriched semantic features, introduced attention mechanisms to enhance salient feature regions and suppress complex environmental interference, and achieved precise alignment and fusion of multi-scale shallow features, thereby significantly boosting the detection accuracy of small objects.

In general, these technological improvements provide valuable reference insights for the research on small object detection in UAV aerial photography.

3. The Proposed Algorithm

Introduced by the Ultralytics team, YOLO11n [23] is a lightweight model belonging to the YOLO series (its structure is shown in Figure 1), which has achieved significant progress in architectural design, operational efficiency, and multi-task compatibility. Originating from YOLOv8, YOLO11 consists of three fundamental components, namely the Backbone, Neck and Head. The backbone network is optimized on the basis of CSPDarknet-53 and integrates the C2PSA module. This module fuses the Pyramid Squeeze Attention (PSA) and the Cross Stage Partial (CSP) structure, thereby strengthening the capability of feature extraction.

Meanwhile, the C3k2 module is an optimized upgrade of the CSPNet structure. As an alternative to the C2f module, it leverages a modified cross-stage partial framework by fusing the Bottleneck design and depthwise separable convolution. It reduces the number of parameters while maintaining feature extraction capability, optimizes residual connections to enhance gradient flow, and effectively alleviates the training difficulties of deep networks. In addition, YOLO11n uses PAFPN (Path Aggregation Feature Pyramid Network) as the neck network, achieving feature aggregation through both bottom-up and top-down paths. The detection head uses a structure optimized with depthwise separable convolution and an anchor-free design, including a classification branch and a regression branch. Collectively, these optimization measures have brought about a substantial boost to the model performance.

In the COCO dataset, objects with a pixel area smaller than 32 × 32 pixels are defined as small objects. However, such tiny objects account for the vast majority in UAV aerial view object detection scenarios. Furthermore, the backbone network of YOLO11n conducts multiple downsampling operations on feature maps, which easily leads to the loss of feature information of small objects. The incorporation of the higher-resolution P2 layer and the Parallelized Patch-aware Attention (PPA) module enables the preservation of more detailed feature information. Additionally, we use FGA-AFPN as the neck network. This replacement fully integrates feature information from both adjacent and non-adjacent layers, thereby significantly enhancing the feature extraction capacity.

Regarding the loss function, the newly proposed InterpIoU and MPDIoU are combined to optimize the regression process. Lastly, the C3k2 module is upgraded with Inception Depthwise Convolution (IDC), which can expand the receptive field. Such optimization enables the module to capture richer global context information, thereby significantly enhancing the detection accuracy of small objects. SODet-YOLO’s structure is presented in Figure 1.

3.1. FGA-AFPN and P2 Small Object Detection Layer

In YOLO11n, the backbone network halves the resolution of the feature map at each downsampling stage, forming five feature layers with different resolutions: P1, P2, P3, P4, and P5. Taking an input image with a resolution of 640 × 640 as an example, the resolutions of P1, P2, P3, P4 and P5 are 320 × 320, 160 × 160, 80 × 80, 40 × 40, and 20 × 20, respectively. Objects in UAV aerial imagery scenarios are extremely small in scale, and the continuous downsampling operations further reduce the feature resolution, resulting in the almost complete loss of the objects’ detailed features. This ultimately causes severe missed and false detections. To resolve this problem, a high-resolution P2 layer is added after the second downsampling step, which can markedly boost detection accuracy. Furthermore, FGA-AFPN is utilized as the neck network of YOLO11n to prevent feature loss during multi-level feature transmission. The structural schematic is depicted in Figure 2.
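The resolution figures quoted above follow directly from halving the input at every stage. The short sketch below reproduces them and, as an illustration only, estimates how many feature-map cells a 10-pixel object spans on P2 versus P5; the helper names are hypothetical and not part of the paper's code.

```python
def pyramid_resolutions(input_size: int = 640, num_levels: int = 5) -> dict:
    """Return the side length of each pyramid level P1..Pn.

    Each downsampling stage halves the resolution, so level Pk has
    stride 2**k relative to the input image.
    """
    return {f"P{k}": input_size // (2 ** k) for k in range(1, num_levels + 1)}


def footprint_on_level(object_size_px: float, level: int) -> float:
    """Approximate side length (in feature-map cells) of an object on level Pk."""
    return object_size_px / (2 ** level)


sizes = pyramid_resolutions(640)
print(sizes)

# A 10-pixel object still covers a few cells on P2 but less than one cell on P5.
print(footprint_on_level(10, 2), footprint_on_level(10, 5))
```

Under this simple stride model, very small objects shrink below a single cell on the deepest levels, which is consistent with the motivation given for adding the P2 head.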

The architectural design of the proposed network draws inspiration from the Fine-Grained Aggregation Module (FGAM) presented in IF-YOLO [24] and the Asymptotic Feature Pyramid Network (AFPN) [25]. Although AFPN improves the integration of multi-scale features via the stepwise fusion of cross-level semantic representations, the overall improvement remains modest, primarily because the feature information across different hierarchical stages is insufficiently aligned. To address this limitation, the fusion strategy is formulated as follows: initially, the features at levels P3 and P4 are merged; subsequently, the representations from P2 and P4 are aggregated into the P3 layer, and those from P3 and P5 are incorporated into the P4 layer. This sequential procedure promotes closer semantic alignment among the features of P2, P3, and P4, culminating in a comprehensive integration of these three levels to effectively improve feature fusion capability. The complete network architecture is instantiated using the Adaptive Spatial Feature Fusion (ASFF) module [26] in conjunction with the Spatial-Channel Fusion Module (SCFM). In particular, ASFF_2 refers to feature fusion across two distinct hierarchical levels, whereas ASFF_3 entails fusion over three separate levels. Both the ASFF and SCFM take multi-level feature maps as input, and the downsampling and upsampling of all feature maps are implemented inside the modules. In ASFF, each input feature layer needs to perform feature fusion with all other feature layers. Its output is a feature map after spatial dynamic weighting, which enhances the capacity of multi-scale feature fusion and mitigates the interference of irrelevant backgrounds. Unlike ASFF, SCFM performs twofold downsampling and upsampling on each pair of adjacent input feature layers. The final output is weighted by spatial and channel attention, which can also improve the multi-scale feature fusion capability and suppress irrelevant background interference. 
As exemplified by ASFF_2, the input feature maps initially undergo adaptive weight learning through a series of trainable convolutional layers:

M_{low} = Conv(I_{low}), M_{high} = Conv(I_{high})    (1)

Here, I_{low} and I_{high} correspond to the low-level input and high-level input, respectively. M_{low} and M_{high} correspond to the convolved low-level and high-level feature maps, respectively, while Conv designates the convolution operation.

Low-level feature representations are fused into high-level ones when L = 0:

M = Concat(Upsample(M_{low}), M_{high})    (2)

Similarly, when L = 1, high-level feature representations are fused into low-level ones:

M = Concat(Downsample(M_{high}), M_{low})    (3)

Here, M represents the fused feature map, Downsample and Upsample represent the downsampling and upsampling operations, and Concat denotes the concatenation operation.

Furthermore, by applying a 1 × 1 pointwise convolution, the number of channels in M is reduced to 2, and spatial weight coefficients are then acquired through softmax:

W = Softmax(1 \times 1 Conv(M))    (4)

where Softmax refers to performing softmax normalization on elements across the channel dimension at individual spatial positions, 1 \times 1 Conv denotes the pointwise convolution operation, and W represents the output weights.

In the end, W is multiplied by M_{low} and M_{high} to generate the final feature map.

When L = 0:

Y = W_{2} \times M_{high} + W_{1} \times Upsample(M_{low})    (5)

When L = 1:

Y = W_{1} \times M_{low} + W_{2} \times Downsample(M_{high})    (6)

where W_{1} refers to the weight associated with the primary channel, W_{2} refers to the weight associated with the secondary channel, and Y represents the final fused feature information.
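The weighting step of Equations (4) to (6) can be illustrated with a minimal single-channel sketch. It assumes the two maps have already been resampled to the same size and that the two per-pixel logits are the output of the 1 × 1 convolution; the learned convolutions themselves are omitted, and the function names are hypothetical.

```python
import math


def softmax2(a: float, b: float) -> tuple:
    """Numerically stable softmax over two logits; the results sum to 1."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    total = ea + eb
    return ea / total, eb / total


def asff2_fuse(m_first, m_second, logits_first, logits_second):
    """Fuse two same-sized 2D maps with per-pixel softmax weights (W_1, W_2).

    m_first / m_second: feature maps already resampled to a common size.
    logits_*: per-pixel logits, standing in for the two output channels
    of the 1 x 1 convolution (an assumption of this sketch).
    """
    fused = []
    for r in range(len(m_first)):
        row = []
        for c in range(len(m_first[0])):
            w1, w2 = softmax2(logits_first[r][c], logits_second[r][c])
            row.append(w1 * m_first[r][c] + w2 * m_second[r][c])
        fused.append(row)
    return fused


low = [[1.0, 1.0], [1.0, 1.0]]
high = [[3.0, 3.0], [3.0, 3.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# Equal logits give equal weights, so the result is the plain average.
fused_equal = asff2_fuse(low, high, zeros, zeros)
print(fused_equal)
```

Because the weights are normalized per spatial position, each output value is a convex combination of the two inputs, so one level can dominate at one position while the other dominates elsewhere.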

The architecture of SCFM is illustrated in Figure 3, and its computational procedures are detailed below:

Suppose F_{1} indicates the low-level feature map, F_{2} corresponds to the feature of the present layer, and F_{3} signifies the high-level feature. The dimensions of F_{1} and F_{3} are adjusted to match those of F_{2} through upsampling and downsampling, and then the feature concatenation is performed:

F = Concat(Upsample(F_{1}), F_{2}, Downsample(F_{3}))    (7)

where Concat refers to the operation of feature concatenation, and Upsample and Downsample represent the upsampling operation and downsampling operation, respectively.

Subsequently, F is processed by channel attention and spatial attention independently to generate the corresponding attention-weighted feature maps:

F_{c} = CA(F), F_{s} = SA(F)    (8)

where F_{c} represents the channel attention-weighted feature map, while F_{s} stands for the feature map after spatial attention weighting. CA and SA represent channel and spatial attention mechanism weighting, respectively.

Finally, F_{c} and F_{s} are fused together:

F_{final} = F_{c} + F_{s}    (9)

3.2. C3k2_IDC

Large-kernel depthwise separable convolutions exhibit a comparatively extensive receptive field and can substantially enhance model performance, yet they are plagued by low computational efficiency. Conversely, small-kernel depthwise separable convolutions deliver a rapid inference speed, but their overly narrow receptive field results in deteriorated model performance. Thus, motivated by the Inception architecture, Inception Depthwise Convolution (IDC) disassembles large-kernel depthwise convolutions into three parallel branches: an identity mapping branch, a small square kernel branch, and an orthogonal kernel branch. Leveraging parallel branch structures and partial channel processing, IDC can markedly boost model efficiency. Simultaneously, the orthogonal strip kernel branch efficiently enlarges the receptive field, thereby achieving better performance [27].

We propose a fusion of the Inception Depthwise Convolution (IDC) module and the C3k2 network architecture to introduce a novel feature extraction component, designated C3k2_IDC. Within this design, the IDC module is adopted to replace one conventional convolution operation of the Bottleneck structure. This expansion of the model’s receptive field is achieved with only a slight increase in parameters. A moderately enlarged receptive field enables the model to capture contextual information surrounding small objects, thereby compensating for the deficiency in detailed features of small objects. During inference, the model can accurately recognize small objects by leveraging local environmental information in their surrounding areas, which further improves the overall detection performance. Accordingly, the proposed module substantially elevates the accuracy of detecting small objects in the YOLO11n framework, while incurring only a marginal reduction in overall model efficiency. A schematic illustration of the C3k2_IDC architecture is provided in Figure 4.
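To make the efficiency argument concrete, the sketch below compares the weight count of a full large-kernel depthwise convolution with that of an IDC-style split. It assumes the band branch is realized as a 1 × k and a k × 1 depthwise convolution, and it uses a branch ratio of 0.125 with kernel sizes 3 and 11; these values are taken as assumptions from the reference IDC design [27], not from this paper, and biases are ignored.

```python
def idc_channel_split(channels: int, branch_ratio: float = 0.125) -> dict:
    """Channels assigned to each branch; the remainder passes through unchanged."""
    g = int(channels * branch_ratio)
    return {"identity": channels - 3 * g, "square": g, "band_w": g, "band_h": g}


def idc_weight_count(channels: int, square_k: int = 3, band_k: int = 11,
                     branch_ratio: float = 0.125) -> int:
    """Depthwise weights in the square (k x k) and two band (1 x k, k x 1) branches."""
    g = int(channels * branch_ratio)
    return g * (square_k * square_k + 2 * band_k)


def full_depthwise_weight_count(channels: int, k: int) -> int:
    """Weights of an ordinary k x k depthwise convolution over all channels."""
    return channels * k * k


split = idc_channel_split(64)
idc_weights = idc_weight_count(64)
full_weights = full_depthwise_weight_count(64, 11)
print(split, idc_weights, full_weights)
```

With these assumed settings, most channels take the identity path, and the band kernels still reach as far as an 11 × 11 kernel along each axis with far fewer weights.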

3.3. MPDInterpIoU

MPDIoU functions as a bounding box similarity metric established on the foundation of minimum point distance, which integrates several key influencing factors, including the overlapping region, center-point distance and scale deviation [28]. The computational steps of MPDIoU are as follows.

First, compute the distance d_{1} between the top-left corners of the ground-truth (GT) bounding box and the predicted (Prd) bounding box, as well as the distance d_{2} between their bottom-right corners. Here, (x_{1}^{gt}, y_{1}^{gt}) and (x_{1}^{pred}, y_{1}^{pred}) represent the top-left corner coordinates of the ground-truth and predicted bounding boxes, respectively. Similarly, (x_{2}^{gt}, y_{2}^{gt}) and (x_{2}^{pred}, y_{2}^{pred}) denote the bottom-right corner coordinates of the ground-truth and predicted bounding boxes:

d_{1}^{2} = (x_{1}^{gt} - x_{1}^{pred})^{2} + (y_{1}^{gt} - y_{1}^{pred})^{2}    (10)

d_{2}^{2} = (x_{2}^{gt} - x_{2}^{pred})^{2} + (y_{2}^{gt} - y_{2}^{pred})^{2}    (11)

Finally, calculate MPDIoU:

MPDIoU = IoU - \frac{d_{1}^{2} + d_{2}^{2}}{w^{2} + h^{2}}    (12)

where w and h represent the width and height of the minimum enclosing box, and IoU denotes the intersection over union of the ground truth box and the prediction box.
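Equations (10) to (12) translate almost line by line into code. The sketch below is a plain-Python illustration for a single pair of non-degenerate boxes in (x1, y1, x2, y2) format, with w and h taken from the minimum enclosing box as defined above; it is not the training implementation.

```python
def iou(a, b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mpdiou(pred, gt) -> float:
    """MPDIoU per Eq. (12): IoU minus normalized squared corner distances."""
    d1_sq = (gt[0] - pred[0]) ** 2 + (gt[1] - pred[1]) ** 2  # top-left corners, Eq. (10)
    d2_sq = (gt[2] - pred[2]) ** 2 + (gt[3] - pred[3]) ** 2  # bottom-right corners, Eq. (11)
    w = max(pred[2], gt[2]) - min(pred[0], gt[0])            # enclosing box width
    h = max(pred[3], gt[3]) - min(pred[1], gt[1])            # enclosing box height
    return iou(pred, gt) - (d1_sq + d2_sq) / (w * w + h * h)


gt_box = (0.0, 0.0, 10.0, 10.0)
pred_box = (2.0, 2.0, 12.0, 12.0)
print(iou(pred_box, gt_box), mpdiou(pred_box, gt_box))
```

For identical boxes both corner distances vanish and MPDIoU equals 1; any corner misalignment lowers the score below the plain IoU.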

By selecting appropriate interpolation coefficients, InterpIoU is able to guarantee a partial overlap between the interpolated box and the ground-truth box, thus effectively solving the gradient vanishing problem caused by the non-intersection of the ground-truth box and the predicted box [29]. The calculation process is detailed below:

First, calculate the interpolation coefficient α:

α = clamp(1 - IoU(B_{pred}, B_{gt}), α_{low}, α_{high})    (13)

where clamp denotes restricting α to the range between α_{low} and α_{high}.

Then, calculate the interpolated bounding box B_{int}:

B_{int} = (1 - α) B_{pred} + α B_{gt}, 0 < α < 1    (14)

where B_{pred} and B_{gt} represent the prediction box and the ground truth box, respectively. The larger the value of α, the closer the interpolated box is to the ground-truth box; the smaller α is, the closer it is to the predicted box.

Finally, calculate the InterpIoU:

InterpIoU = IoU(B_{int}, B_{gt})    (15)

MPDIoU is combined with InterpIoU, and the calculation method is as follows:

MPDInterpIoU = InterpIoU - \frac{d_{1}^{2} + d_{2}^{2}}{w^{2} + h^{2}}    (16)

This loss function integrates the complementary advantages of the two metrics. InterpIoU fundamentally solves the gradient vanishing problem when there is no overlap between the predicted bounding box and the ground-truth box via an interpolation mechanism. It ensures that the model can still obtain effective optimization directions in non-overlapping scenarios, thereby accelerating the early regression process. By contrast, MPDIoU adopts the minimum point distance constraint to provide fine geometric alignment signals within overlapping regions, further improving the accuracy of bounding box regression. The combination of the two not only achieves performance superposition but also enables the model to acquire a stable and continuous gradient flow from the initial random state to the final fine-tuning stage. Especially in small object scenarios, the detection robustness and convergence speed are significantly enhanced.
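The behaviour described above for non-overlapping boxes can be checked with a small sketch of Equations (13) to (16). The values alpha_low = 0.5 and alpha_high = 0.98 are illustrative assumptions, since the text does not state them, and the corner distances are taken between the predicted and ground-truth boxes as in Equations (10) and (11).

```python
def iou(a, b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def interp_iou(pred, gt, alpha_low=0.5, alpha_high=0.98) -> float:
    """Eq. (13)-(15): IoU between the interpolated box and the ground truth."""
    alpha = max(alpha_low, min(alpha_high, 1.0 - iou(pred, gt)))          # Eq. (13)
    b_int = tuple((1.0 - alpha) * p + alpha * g for p, g in zip(pred, gt))  # Eq. (14)
    return iou(b_int, gt)                                                  # Eq. (15)


def mpd_interp_iou(pred, gt, alpha_low=0.5, alpha_high=0.98) -> float:
    """Eq. (16): InterpIoU minus the normalized squared corner distances."""
    d1_sq = (gt[0] - pred[0]) ** 2 + (gt[1] - pred[1]) ** 2
    d2_sq = (gt[2] - pred[2]) ** 2 + (gt[3] - pred[3]) ** 2
    w = max(pred[2], gt[2]) - min(pred[0], gt[0])
    h = max(pred[3], gt[3]) - min(pred[1], gt[1])
    return interp_iou(pred, gt, alpha_low, alpha_high) - (d1_sq + d2_sq) / (w * w + h * h)


gt = (0.0, 0.0, 10.0, 10.0)
near = (20.0, 20.0, 30.0, 30.0)   # does not overlap gt
far = (40.0, 40.0, 50.0, 50.0)    # does not overlap gt, and is farther away

for name, box in (("near", near), ("far", far)):
    print(name, iou(box, gt), round(mpd_interp_iou(box, gt), 4))
```

Both predictions have a plain IoU of zero, yet the combined metric ranks the nearer box higher, which is the property that keeps a useful optimization signal in non-overlapping cases. A loss of the form 1 - MPDInterpIoU would be one natural choice, though the text does not state the exact form used.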

3.4. PPA Module

Parallelized Patch-aware Attention (PPA) employs attention mechanisms and feature integration to maintain and strengthen the feature representation of small objects, thus guaranteeing the retention of key information during successive downsampling processes. This method utilizes a parallel multi-branch structure, with each branch performing feature extraction at different scales and hierarchical stages. Such a multi-branch strategy facilitates the capture of multi-scale object features, thereby improving the detection precision of small-scale objects. Concretely, this approach comprises three parallel branches: a serial convolution branch, a global branch, and a local branch [30]. Finally, it fuses the feature maps from the three branches and performs attention-based weighting on them. Its structure is presented in Figure 5.

Specifically, the parameter p, which is responsible for regulating patch size, combined with the integration and repositioning of non-overlapping patches across the spatial dimension, is adopted to distinguish between the global branch and the local branch. At the end of both branches, a feature selection mechanism [31] is introduced to choose task-adaptive features in both the token and channel dimensions.

3.5. Algorithm Process

The algorithm proposed in this paper adopts an anchor-free detection paradigm. It takes the grid points on multi-scale feature maps as the basic prediction units, where each grid point is responsible for predicting objects whose center points fall within the corresponding receptive field. Each grid point outputs a fixed-length prediction vector containing the bounding box coordinates and category confidence scores. The input image is then fed into the improved backbone and neck modules to generate hierarchical multi-scale feature maps. Finally, Non-Maximum Suppression (NMS) is applied to filter the candidate bounding boxes and retain the optimal detection results. The detailed implementation pipeline is illustrated in Figure 6.
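The final filtering step can be illustrated with a minimal greedy NMS in plain Python. This is a class-agnostic sketch of the standard procedure, with an IoU threshold of 0.5 chosen as an assumption for the example; it is not the pipeline's actual post-processing code.

```python
def iou(a, b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold: float = 0.5) -> list:
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In the example, the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box survives.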

4. Experimental Results and Analysis

4.1. Experimental Environment

The VisDrone2019 dataset, curated and released by the AISKYEYE team affiliated with the Laboratory of Machine Learning and Data Mining at Tianjin University, is adopted as the core experimental data source in this work. This dataset is divided into three subsets: 6471 training images, 548 validation images, and 1610 test images, with annotations covering 10 distinct object categories for detection tasks. The training subset alone contains a total of 353,550 labeled detection instances. The ultra-small scale of the objects to be detected, combined with intricate background contexts, presents substantial challenges for small object detection in UAV aerial imagery. An example from the VisDrone2019 dataset is shown in Figure 7.

Among the 353,550 labeled object instances in the training split, small-sized people and cars account for the largest proportion. The statistical data indicate that 212,630 objects occupy pixel areas of less than 32 × 32, with 34,827 objects covering fewer than 10 × 10 pixels. These characteristics pose enormous challenges for detection models and degrade their performance. The distribution of these objects is shown in Figure 8.

In addition, most objects in the VisDrone2019 dataset are of tiny sizes, and such small objects are susceptible to annotation errors caused by occlusion, blurriness and other adverse factors. To address this issue, the dataset first defines the category of ignored regions, which marks areas difficult to annotate precisely due to low resolution or dense crowds as ignored regions. Secondly, it divides the occlusion degree of individual objects into three levels, namely 0 (no occlusion), 1 (partial occlusion) and 2 (severe occlusion). During the training phase of SODet-YOLO on this dataset, we remove all ignored regions and only adopt objects without occlusion and with partial occlusion for training. Only clear and accurately annotated samples are used for model optimization. This mechanism prevents the model from fitting ambiguous, incorrect and unidentifiable annotations, and allows the model to concentrate on learning effective object features.
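The filtering rule described above can be sketched as follows. This is an illustration only, assuming the public VisDrone-DET annotation layout (`left,top,width,height,score,category,truncation,occlusion`, with category 0 denoting ignored regions); the helper names are hypothetical, and the handling of any further categories outside the 10 evaluated classes is not stated in the text and is therefore left out.

```python
from typing import List, NamedTuple

IGNORED_REGION = 0  # VisDrone category id for "ignored regions"
MAX_OCCLUSION = 1   # keep 0 (no occlusion) and 1 (partial); drop 2 (severe)


class Annotation(NamedTuple):
    left: int
    top: int
    width: int
    height: int
    category: int
    occlusion: int


def parse_line(line: str) -> Annotation:
    """Parse one annotation line of the assumed VisDrone-DET layout."""
    # Some annotation files end each line with a trailing comma.
    fields = [int(v) for v in line.strip().rstrip(",").split(",")]
    left, top, width, height, _score, category, _trunc, occlusion = fields[:8]
    return Annotation(left, top, width, height, category, occlusion)


def keep_for_training(ann: Annotation) -> bool:
    """True for objects that are neither ignored regions nor severely occluded."""
    return ann.category != IGNORED_REGION and ann.occlusion <= MAX_OCCLUSION


def filter_annotations(lines: List[str]) -> List[Annotation]:
    """Parse all non-empty lines and keep only the usable training objects."""
    parsed = [parse_line(line) for line in lines if line.strip()]
    return [ann for ann in parsed if keep_for_training(ann)]


if __name__ == "__main__":
    sample = [
        "684,8,273,116,0,0,0,0",   # ignored region -> dropped
        "255,22,119,128,1,4,0,1",  # partially occluded car -> kept
        "10,10,20,20,1,1,0,2",     # severely occluded pedestrian -> dropped
    ]
    for ann in filter_annotations(sample):
        print(ann)
```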

The experimental environment is equipped with an RTX 4090 graphics card. All experiments are implemented using Python 3.9.21 based on the PyTorch 1.12.0 framework, and the specific version details are presented in Table 1. The training parameters were set to 200 epochs, 4 workers, and a batch size of 12. Stochastic Gradient Descent (SGD) was adopted as the optimization algorithm, and training was terminated if there was no performance improvement for 50 consecutive epochs. Finally, the initial learning rate (lr0) and the final learning rate (lrf) are both set to 0.01, as detailed in Table 2.
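For readers who want to reproduce this setup, the settings of Table 2 can be collected as in the sketch below. It assumes that training is launched through the Ultralytics `model.train(...)` interface, which the text does not name explicitly; the configuration paths passed to `train` are hypothetical placeholders. Note that in that interface `lrf` is a multiplier applied to `lr0`, so whether the reported final learning rate of 0.01 is an absolute value or a factor is not settled by the text.

```python
# Hyperparameters taken from Table 2. The keyword names follow the
# Ultralytics training interface, which is an assumption of this sketch.
TRAIN_ARGS = {
    "epochs": 200,
    "workers": 4,
    "batch": 12,
    "optimizer": "SGD",
    "lr0": 0.01,
    "lrf": 0.01,
    "patience": 50,  # stop after 50 epochs without improvement
}


def train(model_cfg: str, data_cfg: str):
    """Launch training; both arguments are hypothetical file paths."""
    # Imported lazily so the settings can be inspected without the package.
    from ultralytics import YOLO

    model = YOLO(model_cfg)
    return model.train(data=data_cfg, **TRAIN_ARGS)


if __name__ == "__main__":
    for key, value in TRAIN_ARGS.items():
        print(f"{key}: {value}")
```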

4.2. Ablation Experiments

As shown in Table 3, this ablation experiment uses YOLO11n as the base model. By gradually incorporating the P2 detection head, the FGA-AFPN neck network, the C3k2_IDC module, the MPDInterpIoU loss function, and the PPA module, it systematically verifies the impact of each improved module on both mAP@0.5 and the computational cost.

First, Model ② is obtained by adding the P2 detection head to YOLO11n. This model captures the detailed features of objects more effectively, improving the mAP@0.5 by 4.349 percentage points to 36.916%. Then, Model ③ is derived by replacing the neck network of Model ② with FGA-AFPN. This modification enables the full fusion of features across different levels, and the built-in spatial and channel attention mechanisms further highlight salient features while suppressing background interference, leading to a further increase of 2.217 percentage points in mAP@0.5, which reaches 39.133%. Introducing the C3k2_IDC module into Model ③ yields Model ④, in which the receptive field is expanded to capture global information more effectively, resulting in a slight improvement of 0.335 percentage points in mAP@0.5 to 39.468%. To optimize the regression process, Model ④ is trained with the MPDInterpIoU loss function for bounding box regression to yield Model ⑤. This loss effectively alleviates the gradient vanishing problem during early-stage regression. Moreover, the incorporated factors, such as the width–height deviation and the center point distance, further optimize the bounding box regression process, improving the mAP@0.5 by 0.849 percentage points to 40.317%. Finally, the PPA module is introduced into Model ⑤ to preserve key features during downsampling, which boosts the mAP@0.5 by 1.17 percentage points to 41.487%. The specific metrics of each model are shown in Figure 9.
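As a quick consistency check, the step-wise gains quoted above can be accumulated in a few lines. This is only an arithmetic sketch using the numbers in the text; the baseline value is derived by subtraction rather than read from Table 3.

```python
# Incremental mAP@0.5 gains (in percentage points) quoted in the text.
STEPS = [
    ("P2 detection head", 4.349),
    ("FGA-AFPN", 2.217),
    ("C3k2_IDC", 0.335),
    ("MPDInterpIoU", 0.849),
    ("PPA", 1.170),
]
FINAL_MAP = 41.487  # mAP@0.5 of the full model, in percent


def cumulative_map(final_map, steps):
    """Return (stage, mAP@0.5) rows, starting from the derived baseline."""
    baseline = round(final_map - sum(gain for _, gain in steps), 3)
    rows = [("baseline (derived)", baseline)]
    current = baseline
    for name, gain in steps:
        current = round(current + gain, 3)
        rows.append((name, current))
    return rows


if __name__ == "__main__":
    for stage, value in cumulative_map(FINAL_MAP, STEPS):
        print(f"{stage:20s} {value:.3f}")
```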

To verify the effectiveness of MPDInterpIoU, Model ⑦ was obtained by training on the basis of Model ④ with MPDIoU. The results show that the mAP@0.5 reaches 39.795%, representing an increase of 0.327% compared with Model ④, which is significantly inferior to the improvement achieved by MPDInterpIoU. We also conduct ablation studies on individual modules of YOLO11, where we separately incorporate the C3k2_IDC module and the PPA module, and train the model using the MPDInterpIoU loss function. Among these modifications, the PPA module achieves the largest improvement in mAP@0.5 but also incurs the highest computational cost, while MPDInterpIoU yields the smallest improvement. The plausible reason is that although MPDInterpIoU accelerates the bounding box regression speed of YOLO11n in the early training phase, it becomes challenging to achieve further performance gains after approximately 200 iterations.
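For reference, the plain MPDIoU term used for Model ⑦ can be sketched as below. The code follows the formulation of the cited MPDIoU paper as we understand it (IoU minus the squared distances between the two pairs of matching corners, each normalized by the squared image diagonal); it is not the MPDInterpIoU loss of this paper, whose definition is given in the method section and is not reproduced here. The corner box format is an assumption.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def mpdiou(pred: Box, target: Box, img_w: float, img_h: float) -> float:
    """MPDIoU = IoU - d1^2 / (w^2 + h^2) - d2^2 / (w^2 + h^2)."""
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    union = area_p + area_t - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distances between top-left corners and between bottom-right corners.
    d1_sq = (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2
    d2_sq = (pred[2] - target[2]) ** 2 + (pred[3] - target[3]) ** 2
    norm = img_w ** 2 + img_h ** 2  # squared diagonal of the input image
    return iou - d1_sq / norm - d2_sq / norm


def mpdiou_loss(pred: Box, target: Box, img_w: float, img_h: float) -> float:
    """Regression loss: 1 - MPDIoU."""
    return 1.0 - mpdiou(pred, target, img_w, img_h)


if __name__ == "__main__":
    print(mpdiou_loss((0, 0, 2, 2), (1, 1, 3, 3), img_w=10, img_h=10))
```

Because the corner-distance terms remain informative when the boxes do not overlap, the loss still varies with the prediction even when the IoU itself is zero.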

While the mAP@0.5 of the model gradually improves with the deepening of modifications, the computational complexity and model parameters also increase accordingly. Among all the modifications, the introduction of FGA-AFPN causes the largest increase in the training cost, with the average inference time per image increasing by 10.8 ms, and adding 0.63 M model parameters and 4.2 GFLOPs. However, the incorporation of this module improves the mAP@0.5 by 2.217%, which is second only to the P2 small object detection layer, indicating a favorable overall effectiveness. Although introducing the PPA module improves the mAP@0.5 by 1.17%, it also increases the computational cost by 6.0 GFLOPs, which is even higher than the GFLOPs increment brought by the FGA-AFPN, and the average inference time per image increases by 14.1 ms. Therefore, its overall effectiveness is inferior to that of FGA-AFPN. In contrast, other modules or loss functions contribute relatively minor improvements to the model performance, with a correspondingly small increase in training cost. Compared with YOLO11n, SODet-YOLO increases the number of parameters by 1.19 M and GFLOPs by 19.7, thereby elevating the training cost of the model. The average inference time increases from 25.5 ms to 60.2 ms, representing an increment of approximately 136%.
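The trade-off argument above can be made concrete by dividing each accuracy gain by the cost it incurs. The sketch below uses only the figures quoted in the text; the two ratios are an illustrative way of expressing "overall effectiveness" and are not metrics defined by the paper.

```python
# (mAP@0.5 gain in points, added GFLOPs, added latency in ms), as quoted in the text.
MODULES = {
    "FGA-AFPN": (2.217, 4.2, 10.8),
    "PPA": (1.170, 6.0, 14.1),
}


def gain_per_cost(modules):
    """Return {name: (points per added GFLOP, points per added millisecond)}."""
    return {
        name: (gain / gflops, gain / latency_ms)
        for name, (gain, gflops, latency_ms) in modules.items()
    }


if __name__ == "__main__":
    for name, (per_gflop, per_ms) in gain_per_cost(MODULES).items():
        print(f"{name:9s} {per_gflop:.3f} points/GFLOP  {per_ms:.3f} points/ms")
```

On both ratios FGA-AFPN comes out ahead of PPA, which matches the qualitative ranking given in the paragraph.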

Overall, based on YOLO11n, SODet-YOLO progressively integrates the FGA-AFPN, C3k2_IDC and PPA modules, and adopts the novel MPDInterpIoU loss function. All these designs are specifically constructed to solve common bottlenecks in UAV aerial tiny object detection, namely the feature degradation of small objects, the loss of shallow fine-grained information and inadequate cross-level feature fusion, thereby achieving a remarkable rise in the mAP@0.5. Nevertheless, the deployment of high-resolution detection heads and multi-scale feature fusion strategies inevitably leads to a simultaneous growth in model parameters, computational complexity and inference time, although these components are indispensable structural designs for high-precision detection in UAV aerial scenarios.

The average single-frame inference time of SODet-YOLO is 60.2 ms, which is equivalent to about 16.6 FPS. Although it can realize the real-time processing of UAV aerial videos, its real-time performance is relatively constrained. As revealed by the ablation experiments, the PPA module causes the most obvious increase in the average inference time and severely limits the real-time detection performance. Accordingly, the PPA module can be discarded when computing resources are scarce to obtain a lower average single-frame inference time with only a moderate sacrifice of detection accuracy. In general, SODet-YOLO is mainly dedicated to improving the detection accuracy of tiny objects in UAV aerial scenes, and no lightweight optimization has been implemented in the current work. In future studies, we aim to greatly cut down computational costs while causing only a slight decline in detection accuracy.
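The conversion between latency and frame rate used above is simple enough to state as code. The sketch below reproduces the quoted figures; the estimate for a variant without PPA rests on the additional assumption that the per-module latency increments reported in the ablation are additive, and the baseline parameter count of 2.59 M is derived from the quoted 3.78 M minus 1.19 M.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Frames per second for a given single-frame latency in milliseconds."""
    return 1000.0 / latency_ms


def relative_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    print(f"SODet-YOLO: {fps_from_latency(60.2):.1f} FPS")
    print(f"YOLO11n:    {fps_from_latency(25.5):.1f} FPS")
    print(f"Latency increase:   {relative_increase(25.5, 60.2):.0f}%")
    print(f"Parameter increase: {relative_increase(3.78 - 1.19, 3.78):.1f}%")
    # Assumes the 14.1 ms attributed to PPA can simply be subtracted.
    print(f"Without PPA (estimate): {fps_from_latency(60.2 - 14.1):.1f} FPS")
```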

We additionally pick three images in a random manner and test them with each model from the ablation experiment; Figure 10 displays the visualization of the detection results for each model.

In the detection results of the first image, given the ultra-small size of distant objects in UAV aerial photographs, Model ① leaves a substantial number of objects undetected. With the introduction of each module, the detection performance gradually improves, significantly reducing the occurrence of missed detections. In the results of Model ① on the second image, there are not only a large number of missed detections but also a misclassification issue: an entire row of Car objects in the lower-left area is misclassified as Truck. After a series of improvements, most objects are successfully detected, yet the row of Car objects in the lower-left area still suffers from missed detections, and only the final Model ⑥ manages to detect the objects in this area. This result demonstrates the high performance of the improved model proposed in this paper. Finally, the detection results of the third image show that the number of missed detections decreases overall as more modules are added. Since the Motor objects are extremely small in aerial photography scenarios, Model ① fails to sufficiently extract their features, leading to numerous missed detections. In contrast, the detection results of the final Model ⑥ indicate that a large number of Motor objects undetected by Model ① are successfully identified. Overall, Model ⑥ achieves the best detection capability while Model ① performs the worst, which verifies the effectiveness of the improvements evaluated in the ablation experiments.

4.3. Confusion Matrix Comparison Experiment

As illustrated in Figure 11, the vertical axis indicates the ground-truth categories and the horizontal axis corresponds to the predicted class labels. Overall, the proposed SODet-YOLO model attains superior classification accuracy for the majority of object classes relative to the baseline YOLO11n, with the sole exception of the Bus category, for which performance exhibits a slight decrement. The most substantial gain in the classification rate is observed for the People category, which increases from 0.43 to 0.69. This improvement substantiates the model’s ability to comprehensively extract features associated with small objects. Overall, the proposed algorithm yields a marked enhancement in small object detection capabilities without compromising the performance for larger-scale objects.

Although SODet-YOLO achieves improved recognition accuracy for most categories compared with YOLO11n, the correct classification rates of many categories remain relatively low, including People, Bicycle, Van, Tricycle and Awning-tricycle. Specifically, the People category still has a false-negative rate of approximately 31%. This is mainly because most People objects appear on motorcycles or bicycles, which inevitably causes severe occlusion and increases the difficulty of effective feature learning. In addition, categories such as Bicycle, Van, Tricycle and Awning-tricycle have scarce samples in the training set, leading to an extremely imbalanced class distribution. Insufficient training on these categories further raises their false-negative ratios. In summary, SODet-YOLO still has considerable room for improvement in classification accuracy.
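The rates discussed here are read from a confusion matrix normalized over each ground-truth class. The sketch below shows that normalization under the axis convention stated for Figure 11 (rows are ground truth, columns are predictions). The counts are toy values chosen only to mirror the reported 0.69 for People; they are not data from the paper, and the error rate computed here lumps together misses and confusions with other classes.

```python
from typing import Dict, List, Sequence


def row_normalize(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Normalize each ground-truth row so that its entries sum to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def per_class_rates(matrix: Sequence[Sequence[float]],
                    labels: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Correct-classification rate (diagonal) and its complement per class."""
    norm = row_normalize(matrix)
    return {
        label: {"correct": norm[i][i], "error": 1.0 - norm[i][i]}
        for i, label in enumerate(labels)
    }


if __name__ == "__main__":
    labels = ["People", "Car", "background"]
    toy_counts = [      # hypothetical counts, rows = ground truth
        [69, 1, 30],    # People: 69 correct, 1 confused with Car, 30 missed
        [2, 90, 8],
        [5, 10, 0],
    ]
    for label, rates in per_class_rates(toy_counts, labels).items():
        print(label, rates)
```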

4.4. Object Detection for Various Sizes

As shown in Figure 12, we calculate the AP for objects of different sizes on the VisDrone2019 test set. The results indicate that the enhanced model delivers relatively weak detection performance for objects smaller than 10 × 10 pixels, yet it still outperforms the baseline YOLO11n model. For objects sized between 10 × 10 and 32 × 32 pixels, both the AP@[0.5:0.95] and AP@0.5 metrics achieve the most substantial improvements. This verifies that the proposed algorithm markedly strengthens the detection performance for small objects; for large objects, its detection performance is still slightly better than that of YOLO11n.

For objects within the size range of 0 × 0 to 10 × 10 pixels, SODet-YOLO only achieves marginal performance improvements, and its overall detection capability in this range is still quite limited. One primary reason is that objects of this size occupy a small proportion of the training set, with only 34,827 samples accounting for 9.9% of the total data. In addition, severe class imbalance exists among these tiny objects, such as People, Pedestrian and Motor. This makes the model mainly concentrate on objects ranging from 10 × 10 to 32 × 32 pixels and pay insufficient attention to most ultra-small objects in the above size interval. Another factor lies in the extreme scarcity of effective feature information for such miniature objects. Even though we have greatly optimized the feature extraction capability and alleviated the loss of fine-grained features, the model still fails to extract adequate discriminative features from these ultra-small objects, which restricts the detection performance of SODet-YOLO for objects in this size range.
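The size intervals used in this analysis can be expressed as a small bucketing helper. The sketch assumes that an object is assigned by its pixel area (width × height) and that everything at or above 32 × 32 falls into a single remaining bucket; the bucket names are illustrative, and the exact binning above 32 × 32 is not specified in the text.

```python
from collections import Counter
from typing import Iterable, Tuple


def size_bucket(width: float, height: float) -> str:
    """Assign an object to a size interval by its pixel area."""
    area = width * height
    if area < 10 * 10:
        return "tiny (<10x10)"
    if area < 32 * 32:
        return "small (10x10 to 32x32)"
    return "larger (>=32x32)"


def bucket_counts(sizes: Iterable[Tuple[float, float]]) -> Counter:
    """Count objects per size interval."""
    return Counter(size_bucket(w, h) for w, h in sizes)


def share(count: int, total: int) -> float:
    """Share of `count` in `total`, in percent."""
    return 100.0 * count / total


if __name__ == "__main__":
    print(bucket_counts([(6, 9), (12, 20), (40, 60), (8, 8)]))
    # Figures quoted in the text for the VisDrone2019 training split.
    print(f"{share(34_827, 353_550):.1f}% of objects are below 10x10 pixels")
    print(f"{share(212_630, 353_550):.1f}% of objects are below 32x32 pixels")
```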

The AP@0.5 values for objects across different size intervals were computed for each model in the ablation study, as illustrated in Figure 13. For objects smaller than 10 × 10 pixels, the intrinsic difficulty of detection keeps the improvements very limited, and the gain that is observed can be attributed mainly to the incorporation of the P2 detection head. For objects with dimensions between 10 × 10 and 32 × 32 pixels, the AP increases progressively as the proposed modules are successively integrated, exhibiting the most pronounced gain among all size ranges. Considering the substantial scale variation in UAV-acquired aerial images (small objects constitute the majority, yet larger instances also appear), the proposed algorithm enhances detection accuracy for small objects without degrading performance on larger ones. This balanced outcome arises because the combined modules further elevate detection performance for large- and medium-sized objects. Although this relative enhancement is less marked than that achieved for small objects, the overall result represents a significant advancement in small-object detection capability.

4.5. Stability Comparison Experiment

As illustrated in Figure 14, the present experiment evaluates the mAP@0.5 of YOLO11n and its enhanced counterpart through five independent training repetitions. The results indicate that the SODet-YOLO markedly outperforms the baseline. The mean mAP@0.5 attained by SODet-YOLO is 41.478%, which represents an increase of approximately 8.89 percentage points compared to the 32.591% achieved by YOLO11n. Furthermore, the standard deviations of both models are relatively low—0.095% for YOLO11n and 0.0883% for SODet-YOLO—attesting to their robust training stability. Additionally, SODet-YOLO obtains a higher average mAP@0.5 with a lower standard deviation and narrower 95% confidence interval, indicating superior stability. The extremely low p-value of 4.28 × 10⁻¹⁵ confirms that the performance improvement is statistically extremely significant and reliable. Overall, SODet-YOLO not only considerably elevates detection performance but also ensures consistent and reliable results throughout repeated test runs, thereby underscoring its considerable utility in practical applications.
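The summary statistics above can be related to each other with a short calculation. The sketch below works from the reported means and standard deviations only, since the individual run results are not listed here. It assumes that the reported values are sample standard deviations over five runs and uses a two-sample t statistic with equal group sizes; the text does not state which test produced the reported p-value, so this is an illustration rather than a reproduction.

```python
import math

N_RUNS = 5
T_CRIT_DF4 = 2.776  # two-sided 95% critical value of Student's t, 4 degrees of freedom


def ci_half_width(std: float, n: int = N_RUNS, t_crit: float = T_CRIT_DF4) -> float:
    """Half-width of the 95% confidence interval of a mean over n runs."""
    return t_crit * std / math.sqrt(n)


def t_statistic(mean_a: float, std_a: float,
                mean_b: float, std_b: float, n: int = N_RUNS) -> float:
    """Two-sample t statistic from summary statistics with equal group sizes."""
    standard_error = math.sqrt(std_a ** 2 / n + std_b ** 2 / n)
    return (mean_a - mean_b) / standard_error


if __name__ == "__main__":
    sodet_mean, sodet_std = 41.478, 0.0883
    base_mean, base_std = 32.591, 0.095
    print(f"difference: {sodet_mean - base_mean:.3f} percentage points")
    print(f"95% CI half-width, SODet-YOLO: {ci_half_width(sodet_std):.3f}")
    print(f"95% CI half-width, YOLO11n:    {ci_half_width(base_std):.3f}")
    print(f"t statistic: {t_statistic(sodet_mean, sodet_std, base_mean, base_std):.1f}")
```

A difference of almost nine points set against run-to-run deviations below 0.1 yields a very large t statistic, which is why the reported p-value is so small.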

4.6. Experimental Results on Other Dataset

To further validate the generalization ability of the improved model, training experiments are carried out on the Aerial traffic image dataset for road traffic detection scenarios. This public dataset is hosted on the Kaggle platform and was created by Roboflow user Shaha; it provides annotated aerial videos of traffic scenes in Almaty captured by drones. The dataset contains 1710 training images, 558 validation images and 440 test images, with object categories including PMT, articulated-bus, bus, car, freight, motorbike, small bus, and truck. In fact, the number of objects labeled as PMT in the dataset is extremely scarce, which makes it difficult to conduct adequate training for this category. As a result, the model achieves excellent training performance on other categories, while the AP@0.5 of the PMT category is almost zero. Therefore, we exclude this category from the experimental analysis. Finally, slight adjustments are made to the training parameters, with only the training epoch changed from 200 to 100.

As illustrated by the experimental data in Table 4, both YOLO11n and the proposed SODet-YOLO achieve outstanding performance for large vehicles such as Freight, Truck, and Bus, with the improved model attaining marginally higher accuracy for these categories overall. The baseline model exhibits comparatively lower detection accuracy for small-sized objects, specifically Car and Motorbike, with values of 73.6% and 20.6% respectively, suggesting a tendency toward missed detections and false detections in small-object scenarios. In contrast, the enhanced model yields accuracies of 83.5% for Car and 56.1% for Motorbike, corresponding to improvements of 9.9 and 35.5 percentage points, respectively. These findings collectively confirm that the proposed algorithm substantially enhances small-object detection performance.

In addition, we select two dedicated datasets for vehicle and pedestrian detection tasks, namely the Top-View Drone Car Detection Dataset and the Tiny Person Dataset. The Top-View Drone Car Detection Dataset is derived from the Kaggle platform and split into training and validation sets, which contain 11,586 and 794 images respectively, with only the vehicle category involved. The Tiny Person Dataset includes 1610 training images and 759 validation images, with a total of 72,651 annotations. It divides people into two categories: sea people and earth people. In the experiments, we unexpectedly observe that outstanding performance can be achieved on the Top-View Drone Car Detection Dataset by setting the training epoch to only 5. On the contrary, since the objects in the Tiny Person Dataset are extremely tiny, favorable detection metrics cannot be obtained even if we set the epoch to 200 for training. The specific experimental results are shown in Table 5 and Table 6.

As shown in Table 5 and Table 6, we compare the experimental results of YOLO11n and SODet-YOLO on the Top-View Drone Car Detection Dataset and Tiny Person Dataset. The results reveal that both models achieve excellent performance on the simple Top-View Drone Car Detection Dataset. Notably, SODet-YOLO surpasses YOLO11n in all evaluation metrics, achieving substantial improvements in mAP@0.5 and Recall in particular. Nevertheless, both methods obtain relatively limited performance on the challenging Tiny Person Dataset. This is because the objects in this dataset are extremely tiny, making it hard for detection models to extract effective detailed features. Even under such harsh conditions, SODet-YOLO still delivers better detection performance, which validates the effectiveness of the proposed model modifications.

4.7. Visualization of Feature Maps on the VisDrone2019 Dataset

This experiment extracted the feature maps from the layer preceding the highest-resolution detection head of YOLO11n and SODet-YOLO, respectively. The results are shown in Figure 15. Experimental findings reveal that SODet-YOLO delivers superior detection performance in both bright daytime and dark nighttime environments, with only a few objects missed, whereas YOLO11n suffers from severe missed detections. As displayed in Figure 15a, SODet-YOLO has higher-resolution feature maps, which effectively preserve the features of small objects. The feature regions of most individual objects are highly concentrated, and the interference caused by complex environments is also suppressed effectively. In contrast, the feature maps of YOLO11n show scattered feature regions of individual objects along with severe background interference. In addition, Figure 15b shows that YOLO11n yields a large number of missed detections, mainly because highly dense crowds make the feature regions of different objects indistinguishable. In the feature map of SODet-YOLO, however, the features of each object are clearly distinguished, demonstrating superior small object detection performance.

4.8. Comparison Experiments

Experiments on the VisDrone2019 dataset were carried out to compare the detection performance of our proposed algorithm with other mainstream approaches, and the detailed results are tabulated in Table 7.

The experimental outcomes verify that our proposed algorithm achieves an mAP@0.5 score of 41.5%, which is significantly higher than that of other classic object detection algorithms. Compared with small-scale YOLO series algorithms, it achieves superior detection performance with relatively fewer parameters (3.78 M), surpassing these algorithms by approximately 1–2 percentage points in terms of mAP@0.5. Compared with recently proposed improved algorithms tailored for UAV aerial photography scenarios, such as YOLO-FEPA and Drone-YOLO, although it has slightly larger parameters, it achieves the highest mAP@0.5 and thus the optimal detection performance. Overall, the improved algorithm proposed in this paper has substantial advantages.

5. Conclusions

Detection of small objects in UAV aerial imagery scenarios is confronted with issues such as serious missed detection and misclassification problems, which are caused by the small size of objects and complex background environments. To resolve these problems, this paper presents an improved model built on YOLO11n. To address the limited feature representation of small objects and ineffective multi-scale feature integration, a small object detection head and FGA-AFPN are proposed. In addition, integrating C3k2_IDC into the feature extraction process enlarges the receptive field, which supports better capturing of the contextual information pertaining to objects. Additionally, a combination of MPDIoU and InterpIoU is adopted to speed up the regression process in the early training stage of the model, thereby enhancing the overall performance of the model. Finally, the PPA module is added to mitigate feature loss in the downsampling process and retain key feature information.

Experimental results obtained from the VisDrone2019 dataset demonstrate that the SODet-YOLO achieves a significant enhancement in mAP@0.5 in comparison with the YOLO11n, and exceeds the performance of most classic object detection models. In terms of feature maps, those output by the SODet-YOLO are also markedly better than those generated by the YOLO11n. On the Aerial Traffic Images dataset, the model still shows a notable performance boost, particularly when dealing with tiny objects like motorbikes, where the detection accuracy achieves the greatest improvement.

Although SODet-YOLO has obtained remarkable performance improvements, it remains inferior to YOLO11n in terms of training speed, parameter count, single-frame detection speed and computational complexity. In detail, the parameter count rose by roughly 45.9%, and the detection time per image increased by around 136%. These shortcomings restrict the practical application of the model. Consequently, in future research, we will concentrate on model lightweighting, with the goal of greatly reducing the number of model parameters as well as the computational complexity while sacrificing only a small amount of the improved model’s performance.

Author Contributions

W.Y. conceived the experiment. K.Z. and X.Q. conducted the experiment. K.Z. completed the writing of this paper. S.L., Z.H., S.M. and T.L. completed the model design. W.Y., K.Z., X.Q., S.L., Z.H., S.M., T.L. and P.W. reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by The Research on Semantic Change Detection in PolSAR Images with Limited Samples by Knowledge Guidance and Data Drive (Shaanxi Provincial Science and Technology Project: grant number 2025JC-YBMS-255).

Data Availability Statement

The original contributions presented in this study are included in this article. The full code for implementation can be found at https://github.com/taisui123-code/SODet-YOLO (accessed on 1 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AFPN	Asymptotic Feature Pyramid Network
IDC	Inception Depthwise Convolution
FGA-AFPN	Fine-Grained Aggregation-Asymptotic Feature Pyramid Network
PPA	Parallelized Patch-aware Attention
CSPDarknet-53	Cross Stage Partial Network Darknet-53
C2PSA	Cross Stage Partial with Pyramid Squeeze Attention
CSP	Cross Stage Partial
BCE	Binary Cross-Entropy
CoT	Contextual Transformer
EMA	Efficient Multi-Scale Attention
SCAM	Spatial-Channel Attention Mechanism
BiFPN	Bidirectional Feature Pyramid Network
ASFF	Adaptive Spatial Feature Fusion
SCFM	Spatial-Channel Fusion Module

References

Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA; IEEE: Piscataway, NJ, USA, 2014; pp. 580–587.
Girshick, R. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile; IEEE: Piscataway, NJ, USA, 2015; pp. 1440–1448.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2017, 39, 1137–1149.
Hoshino, W.; Seo, J.; Yamazaki, Y. A study for detecting disaster victims using multi-copter drone with a thermographic camera and image object recognition by SSD. In 2021 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM); IEEE: Piscataway, NJ, USA, 2021.
Sorbelli, F.B.; Palazzetti, L.; Pinotti, C.M. YOLO based detection of halyomorpha halys in orchards using RGB cameras and drones. Comput. Electron. Agric. 2023, 213, 108228.
Bisio, I.; Haleem, H.; Garibotto, C.; Lavagetto, F.; Sciarrone, A. Performance evaluation and analysis of drone-based vehicle detection techniques from deep learning perspective. IEEE Internet Things J. 2021, 9, 10920–10935.
Yang, X.; Wang, Y.; Wang, Y.; Li, W.; Liu, B.; Liu, G. CSSA-YOLO: A Clutter-Suppressed and Scale-Aware Framework for Robust Object Detection in UAV Imagery. Remote Sens. 2026, 18, 1533.
Ma, L.; Luo, Y.; Xu, J. HMF-DEIM: High-Fidelity Multi-Domain Fusion Transformer for UAV Small Object Detection. Sensors 2026, 26, 2187.
Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
Tan, M.; Pang, R. EfficientDet: Scalable and efficient object detection. arXiv 2020, arXiv:1911.09070.
Zhang, Y.; Ye, M.; Zhu, G.; Liu, X.; Wang, J. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5602015.
Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. arXiv 2023, arXiv:2305.13563.
Zhou, X.; Liu, Q.; Huang, H. An Enhanced YOLOv11 Model for Small Object Detection in UAV Aerial Images. In Proceedings of the International Conference on Intelligent Transportation and Future Mobility (ITFM2025), Guilin, China, 11 April 2025; pp. 1–8.
Wang, X.; Hu, Y. UAV Image Small Object Detection on Complex Background. Comput. Eng. Appl. 2023, 59, 107–114.
Wang, Z.; Xu, H.; Zhu, X.; Li, C.; Liu, Z.; Wang, Z. An improved dense pedestrian detection algorithm based on YOLOv8: MER-YOLO. Comput. Eng. Sci. 2024, 46, 1050–1062.
Feng, Z.; Xie, Z.; Bao, Z.; Chen, K. Real-time dense small object detection algorithm for UAV based on improved YOLOv5. Acta Aeronaut. Astronaut. Sin. 2023, 44, 251–265.
Gomaa, A.; Abdelwahab, M.M.; Abo-Zahhad, M. Efficient vehicle detection and tracking strategy in aerial videos by employing morphological operations and feature points motion analysis. Multimed. Tools Appl. 2020, 79, 26023–26043.
Umirzakova, S.; Muksimova, S.; Shavkatovich Buriboev, A.; Primova, H.; Choi, A.J. A Unified Transformer Model for Simultaneous Cotton Boll Detection, Pest Damage Segmentation, and Phenological Stage Classification from UAV Imagery. Drones 2025, 9, 555.
Xu, Z.; Zhao, H.; Liu, P.; Wang, L.; Zhang, G.; Chai, Y. SRTSOD-YOLO: Stronger Real-Time Small Object Detection Algorithm Based on Improved YOLO11 for UAV Imageries. Remote Sens. 2025, 17, 3414.
Ben Rouighi, I.; Chtioui, H.; Jegham, I.; Alouani, I.; Ben Khalifa, A. FRD-YOLO: A faster real-time object detector for aerial imagery. J. Real-Time Image Process. 2025, 22, 169.
Hussain, S.; Mumtaz, I.; Wang, C.; Lv, P. SF-YOLOv9: PGI based hybrid backbone with dual-path attention for small object detection in aerial imagery. Egypt. Inform. J. 2026, 33, 100888.
Liu, C.; Wang, P.; Gong, Y.; Cheng, A. YOLO-WL: A Lightweight and Efficient Framework for UAV-Based Wildlife Detection. Sensors 2026, 26, 790.
Khanam, R.; Hussain, M. YOLO11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
Li, Y.; Zhu, C.; Zhang, Q.; Zhang, J.; Wang, G. IF-YOLO: An efficient and accurate detection algorithm for insulator faults in transmission lines. IEEE Access 2024, 12, 167388–167403.
Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic feature pyramid network for object detection. arXiv 2023, arXiv:2306.15988.
Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516.
Yu, W.; Zhou, P.; Yan, S.; Wang, X. InceptionNeXt: When Inception meets ConvNeXt. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA; IEEE: Piscataway, NJ, USA, 2024; pp. 5672–5683.
Ma, S.; Xu, Y. MPDIoU: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662.
Liu, H.; Watanabe, H. InterpIoU: Rethinking bounding box regression with interpolation-based IoU optimization. arXiv 2025, arXiv:2507.12420.
Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. HCF-Net: Hierarchical context fusion network for infrared small object detection. arXiv 2024, arXiv:2403.10778.
Shi, B.; Gai, S.; Darrell, T.; Wang, X. Refocusing is key to transfer learning. arXiv 2023, arXiv:2305.15542.
Wang, X.; Huang, J.; Tan, W.; Shen, Z. Object detection based on deep feature enhancement and path aggregation optimization. Comput. Sci. 2025, 52, 184–195.
Zhang, Z. Drone-YOLO: An efficient neural network method for object detection in drone images. Drones 2023, 7, 526.
Wang, Z.; Su, Y.; Kang, F.; Wang, L.; Lin, Y.; Wu, Q.; Li, H.; Cai, Z. Pc-YOLO11s: A lightweight and effective feature extraction method for small object image detection. Sensors 2025, 25, 348.
Zeng, K.; Yu, W.; Qin, X.; Han, J.; Hou, Z.; Ma, S. Improved UAV aerial vehicle detection algorithm based on YOLO11n. In Proceedings of Image and Graphics; Springer Nature Singapore: Singapore, 2025; Volume 16162.
Zeng, K.; Yu, W.; Qin, X.; Long, S. Small-target detection algorithm based on improved YOLO11n. Sensors 2026, 26, 71.

Figure 1. SODet-YOLO network architecture. Three essential modules constitute the main structure of this framework: Backbone, Neck and Head. The PPA is inserted ahead of the first three downsampling operations in the Backbone, and the Neck employs the FGA-AFPN structure, which incorporates attention mechanisms and the C3k2_IDC module. Finally, all feature layers are fed into the Head to perform detection for objects of all scales.

Figure 2. Structure of FGA-AFPN. This network architecture takes the feature maps of four feature layers at different scales (P2, P3, P4, P5) as input, which are denoted as A, B, C and D respectively. Different feature maps in the same layer are distinguished by the subscripts 1, 2 and 3.

Figure 3. SCFM Structural Diagram.

Figure 4. C3k2_IDC Structural Diagram. This module integrates the IDC with the C3k2 module. Specifically, the IDC is inserted between the two Conv modules in the Bottleneck to derive the Bottleneck_IDC module. When C3k is set to true, the original C3k module is still employed; if C3k is configured to false, the Bottleneck module is superseded by the Bottleneck_IDC module.
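The branching rule stated in the Figure 4 caption can be written down as a small selector. This is only a sketch of the rule as the caption words it, not the authors' implementation: the function name `c3k2_idc_inner_block` is hypothetical, and the returned strings are labels rather than real layers.

```python
def c3k2_idc_inner_block(c3k: bool) -> list[str]:
    """Layer sequence of one inner block of C3k2_IDC, following Figure 4.

    The names are labels only; no layers are built here.
    """
    if c3k:
        # The original C3k module is kept unchanged.
        return ["C3k"]
    # Bottleneck_IDC: the IDC sits between the Bottleneck's two Conv modules.
    return ["Conv", "IDC", "Conv"]


print(c3k2_idc_inner_block(True))
print(c3k2_idc_inner_block(False))
```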

Figure 5. Structure of PPA.

Figure 6. Algorithm process.

Figure 7. VisDrone2019 dataset example.

Figure 8. Distribution of object pixel areas in the VisDrone2019 training set: (a) depicts the statistical distribution of object quantities across different size ranges; (b) presents a heatmap of object distribution, taking height as the vertical axis and width as the horizontal axis.

Figure 9. Performance comparison of all models.

Figure 10. Visualization of detection results for each model.

Figure 11. Confusion matrix comparison.

Figure 12. AP of objects with different sizes.

Figure 13. Performance change trends with various sizes for each model.

Figure 14. Results of stability experiments.

Figure 15. Visualization of feature maps in VisDrone2019: (a) is an example of feature maps in dim scenes; (b) is an example of feature maps in bright scenes.

Table 1. Experimental Environment.

Experimental Environment	Version
Operating System	Windows 10 Pro
GPU	NVIDIA GeForce RTX 4090 (24 GB)
Programming Language	Python 3.9.21
CUDA	CUDA 11.3
Deep Learning Framework	PyTorch 1.12.0

Table 2. Learning parameters.

Parameter	Parameter Settings
Epoch	200
Workers	4
Batch size	12
lr0	0.01
lrf	0.01
Patience	50
Optimizer	SGD
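
For readers who want to reproduce the run, the settings in Table 2 can be collected in a plain dictionary. This is a minimal sketch, not the authors' training script. The key names follow the Ultralytics training-argument convention, which is an assumption because the source does not show how training was launched. Under that convention `lrf` is a fraction of `lr0`, so the final learning rate is their product.

```python
# Table 2 settings as a plain dict. Key names assume the Ultralytics
# training-argument convention; the paper does not show its training script.
TRAIN_ARGS = {
    "epochs": 200,
    "workers": 4,
    "batch": 12,
    "lr0": 0.01,      # initial learning rate
    "lrf": 0.01,      # final learning rate as a fraction of lr0 (assumed convention)
    "patience": 50,   # early-stopping patience, in epochs
    "optimizer": "SGD",
}


def final_learning_rate(args: dict) -> float:
    """Learning rate at the end of the schedule, assuming final = lr0 * lrf."""
    return args["lr0"] * args["lrf"]


print(f"final learning rate: {final_learning_rate(TRAIN_ARGS):.6f}")
```

If the Ultralytics package is used, the same dictionary could be unpacked into the training call together with a dataset configuration, which the source does not specify.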

Table 3. Results of ablation experiment.

Model	Params/M	GFLOPs	mAP@0.5	Average Inference Time per Image
①: YOLO11n	2.59	6.4	32.567%	25.5 ms
②: ① + P2	2.70	12.3	36.916%	28.4 ms
③: ② + FGA-AFPN	3.39	16.5	39.133%	39.2 ms
④: ③ + C3k2_IDC	3.67	20.1	39.468%	45.6 ms
⑤: ④ + MPDInterpIoU	3.67	20.1	40.317%	46.1 ms
⑥: ⑤ + PPA	3.78	26.1	41.487%	60.2 ms
⑦: ④ + MPDIoU	3.67	20.1	39.795%	45.9 ms
⑧: ① + C3k2_IDC	2.91	7.7	33.166%	31.6 ms
⑨: ① + MPDInterpIoU	2.59	6.4	32.866%	25.7 ms
⑩: ① + PPA	2.74	14.9	35.681%	37.1 ms
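
The cumulative chain ① to ⑥ in Table 3 is easier to read as per-step differences. The sketch below computes, for each added component, the mAP@0.5 gain in percentage points and the extra inference time per image. All numbers are copied from the table. The function name `stage_deltas` is illustrative, and configurations ⑦ to ⑩ are left out because they are side comparisons rather than steps in the chain.

```python
# mAP@0.5 (%) and per-image inference time (ms) for the cumulative
# configurations 1-6 of Table 3; values are copied from the table.
STAGES = [
    ("YOLO11n", 32.567, 25.5),
    ("+ P2", 36.916, 28.4),
    ("+ FGA-AFPN", 39.133, 39.2),
    ("+ C3k2_IDC", 39.468, 45.6),
    ("+ MPDInterpIoU", 40.317, 46.1),
    ("+ PPA", 41.487, 60.2),
]


def stage_deltas(stages):
    """Return (name, mAP gain, added latency) for each step over the previous one."""
    return [
        (name, round(m - pm, 3), round(t - pt, 1))
        for (_, pm, pt), (name, m, t) in zip(stages, stages[1:])
    ]


deltas = stage_deltas(STAGES)
for name, d_map, d_ms in deltas:
    print(f"{name:<16} {d_map:+.3f} pts  {d_ms:+.1f} ms")

total_gain = round(STAGES[-1][1] - STAGES[0][1], 3)
print(f"total gain: {total_gain:+.3f} pts")
```

Read this way, the P2 layer contributes the largest single gain, while the PPA module adds the most latency of any step.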

Table 4. Detection results for each category. Bold text denotes the better result.

Classes	AP@0.5 (SODet-YOLO)	AP@0.5 (YOLO11n)
Truck	84.2%	75.6%
Articulated-bus	99.4%	99.2%
Freight	96.5%	98.3%
Car	83.5%	73.6%
Motorbike	56.1%	20.6%
Small bus	98.1%	96.5%
Bus	97.5%	97.0%

Table 5. Experimental results of the Top-View Drone Car Detection Dataset.

Model	mAP@0.5	mAP@[0.5:0.95]	Precision	Recall
YOLO11n	88.34%	55.10%	88.29%	83.25%
SODet-YOLO	90.57%	60.41%	87.01%	79.86%
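
In Table 5, SODet-YOLO scores higher on both mAP metrics but lower on precision and recall. Combining precision and recall into an F1 score makes that trade-off explicit. F1 is not reported in the source. It is derived here from the tabulated values as a hedged illustration, and the helper name `f1_score` is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (%) copied from Table 5.
TABLE5 = {
    "YOLO11n": (88.29, 83.25),
    "SODet-YOLO": (87.01, 79.86),
}

f1 = {model: f1_score(p, r) for model, (p, r) in TABLE5.items()}
for model, value in f1.items():
    print(f"{model:<11} F1 = {value:.2f}%")
```

Precision and recall describe a single confidence threshold, whereas mAP integrates over thresholds, so a lower F1 at one operating point does not contradict a higher mAP.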

Table 6. Experimental results of the Tiny Person Dataset.

Model	mAP@0.5	mAP@[0.5:0.95]	Precision	Recall
YOLO11n	12.76%	4.47%	30.11%	15.62%
SODet-YOLO	17.03%	6.0%	35.50%	21.98%

Table 7. Comparison results with other algorithms.

Model	Params/M	GFLOPs	mAP@0.5
Faster-R-CNN	63.20	207.0	30.9%
YOLOv5s	9.10	23.8	38.8%
SSD	12.30	63.2	24.0%
YOLOv8s	11.20	28.5	39.0%
YOLOv10n	2.26	6.5	34.2%
YOLOv10s	7.22	21.4	39.0%
YOLO11s	9.40	21.3	39.0%
YOLO-FEPA [32]	2.8	7.5	36.7%
Drone-YOLO [33]	3.05	-	37.0%
PC-YOLOn [34]	2.00	-	36.1%
MI-YOLO [35]	3.29	21.9	38.3%
YOLO-AISD [36]	3.30	16.3	39.3%
Ours	3.78	26.1	41.5%
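
One way to read Table 7 is to ask which models are not beaten on both size and accuracy at once. The sketch below computes that Pareto front over parameter count and mAP@0.5, using values copied from the table. GFLOPs is left out because it is missing for two entries. The function name `pareto_front` and the dominance rule (no more parameters and no lower mAP, with at least one strictly better) are choices made for this illustration, not part of the source.

```python
# (model, parameters in millions, mAP@0.5 in %) copied from Table 7.
MODELS = [
    ("Faster-R-CNN", 63.20, 30.9),
    ("YOLOv5s", 9.10, 38.8),
    ("SSD", 12.30, 24.0),
    ("YOLOv8s", 11.20, 39.0),
    ("YOLOv10n", 2.26, 34.2),
    ("YOLOv10s", 7.22, 39.0),
    ("YOLO11s", 9.40, 39.0),
    ("YOLO-FEPA", 2.8, 36.7),
    ("Drone-YOLO", 3.05, 37.0),
    ("PC-YOLOn", 2.00, 36.1),
    ("MI-YOLO", 3.29, 38.3),
    ("YOLO-AISD", 3.30, 39.3),
    ("Ours", 3.78, 41.5),
]


def pareto_front(models):
    """Keep models that no other model matches or beats on both axes."""
    front = []
    for name, params, m_ap in models:
        dominated = any(
            p <= params and a >= m_ap and (p < params or a > m_ap)
            for _, p, a in models
        )
        if not dominated:
            front.append((name, params, m_ap))
    # Sort by parameter count so the front reads from smallest to largest.
    return sorted(front, key=lambda row: row[1])


front = pareto_front(MODELS)
for name, params, m_ap in front:
    print(f"{name:<12} {params:>5.2f} M  {m_ap:.1f}%")
```

Under this rule the larger general-purpose detectors drop out, and the proposed model sits at the high-accuracy end of the front.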


Share and Cite

MDPI and ACS Style

Zeng, K.; Yu, W.; Long, S.; Qin, X.; Wang, P.; Hou, Z.; Ma, S.; Li, T. SODet-YOLO: A Small Object Detection Algorithm for UAV Aerial Photography Perspective. Remote Sens. 2026, 18, 1714. https://doi.org/10.3390/rs18111714

