Article

Small-Target Detection Algorithm Based on Improved YOLOv11n

1 Graduate School, Air Force Engineering University, Xi’an 710051, China
2 School of Information and Navigation, Air Force Engineering University, Xi’an 710077, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(1), 71; https://doi.org/10.3390/s26010071
Submission received: 8 November 2025 / Revised: 14 December 2025 / Accepted: 17 December 2025 / Published: 22 December 2025
(This article belongs to the Section Sensor Networks)

Highlights

To address the problems of missed detections and false detections in aerial small-target detection under drone scenarios, this paper comprehensively improves the YOLOv11n model from aspects such as network structure and loss function. It not only adds a small-target detection layer but also adopts AFPN as the neck network. The newly proposed improved modules, C3k2_IDC and SCASPPF, further enhance the model performance. Finally, MPDInnerIoU is presented as the loss function to optimize the regression process.
What are the main findings?
  • First, we add a 160 × 160 resolution detection head with AFPN, replace SPPF with SCASPPF (which highlights small target features and suppresses background clutter), optimize the loss function via MPDIoU-InnerIoU fusion, and enhance C3k2 with IDC (which improves localization accuracy and receptive field). These measures collectively boost performance.
  • Second, on the Visdrone2019 dataset, we find that the improved YOLOv11n achieves 39.256% mAP@0.5, a 6.689% gain over the benchmark.
What are the implications of the main findings?
  • First, it provides a new method for small-target detection in drones, demonstrating that integrating non-adjacent feature fusion, attention mechanisms, expanded receptive fields, and improved loss functions can enhance performance.
  • Second, the algorithm can be directly applied to UAV surveillance, rescue, reconnaissance, and environmental monitoring, reducing missed/false detections.

Abstract

Target detection in UAV aerial photography scenarios faces the challenges of small targets and complex backgrounds. Thus, we propose an improved YOLOv11n small-target detection algorithm. First, a detection head is added to the 160 × 160 resolution feature layer, and non-adjacent layer features are fused via the Asymptotic Feature Pyramid Network (AFPN) to alleviate the feature loss caused by downsampling and reduce cross-level feature conflicts. Second, the Spatial Channel Attention SPPF (SCASPPF) module replaces the original Spatial Pyramid Pooling-Fast (SPPF) module to highlight key features and suppress irrelevant ones. Moreover, the loss function is enhanced by fusing MPDIoU and InnerIoU to boost detection accuracy. Finally, Inception Depthwise Convolution (IDC) is adopted to improve the C3k2 module, expanding the model’s receptive field and enhancing small-target detection performance. Experiments on the VisDrone2019 dataset show that the algorithm achieves 39.256% mAP@0.5, 6.689% higher than the 32.567% mAP@0.5 of the benchmark model (YOLOv11n).

1. Introduction

Nowadays, deep learning-based target detection technology exhibits excellent performance in feature extraction and high-accuracy detection, and it can be roughly divided into two categories: single-stage and two-stage target detection algorithms. Two-stage target detection algorithms include R-CNN [1], Fast R-CNN [2], Faster R-CNN [3], etc. Their core logic is to generate region proposals and then perform regression and classification on these proposals. Single-stage target detection algorithms include the YOLO series [4], SSD [5], RetinaNet [6], etc. Their core idea is to directly predict bounding boxes and categories without generating region proposals. Owing to their high inference speed, single-stage detectors are widely applied in real-time scenarios.
In recent years, drones have been widely adopted in military and civilian fields due to their low cost and high flexibility, yet achieving accurate target detection with drones remains a key challenge. Detection in UAV aerial images is often confronted with small target sizes, blurred features, and background clutter, which pose significant difficulties for small-target detection. To address these challenges, numerous improved approaches have been proposed. Peng et al. [7] incorporated a contextual semantic enhancement module to enhance the representation ability of multi-scale feature maps and the recognition performance of small targets, but the module still exhibits limitations in reducing the false-positive rate for similar targets. Wang et al. [8] used the global Contextual Transformer (CoT) module and max pooling operation to enhance the extraction of texture information for small targets, thereby improving detection accuracy. Wang et al. [9] introduced the Efficient Multi-Scale Attention (EMA) module into YOLOv8; this module encodes global information and aggregates pixel-level features, but it is limited by long-distance dependencies and limited flexibility in parameter sharing. Feng et al. [10] introduced the Spatial-Channel Attention Mechanism (SCAM) into the YOLOv5 model, which improves the model’s focus on small-target regions by fusing spatial and channel attention, but suffers from high computational complexity and an increased overfitting risk. Gomaa et al. [11] proposed a real-time method combining detection and tracking features, using Top-hat/Bottom-hat transformations, KLT+K-means, and an efficient association algorithm, to address vehicle occlusion, camera movement, and high computational cost in detecting and tracking moving vehicles in aerial videos; more related work can be found in [12,13,14,15,16,17,18]. Umirzakova et al. [19] proposed Cotton Multitask Learning (CMTL), a transformer-driven multitask learning framework, which achieves cross-task mutual learning and feature preservation via the Cross-Level Multi-Granular Encoder (CLMGE) and Multitask Self-Distilled Attention Fusion (MSDAF), enabling accurate detection in cotton cultivation scenarios. Overall, these improvements offer valuable insights for research on small-target detection in UAV aerial photography.
YOLOv11n [20], released by the Ultralytics team in September 2024, is a new addition to the YOLO family, achieving significant improvements in architectural design, operational efficiency, and multitask capability. Based on this latest model, this paper proposes an improved small-target detection algorithm with the following key modifications. Firstly, a detection head is added to the 160 × 160 feature layer of the backbone network, and multi-scale features are fused via the Asymptotic Feature Pyramid Network (AFPN). Secondly, the SPPF module optimized by the spatial-channel attention mechanism (SCASPPF) is adopted to replace the original SPPF module. In addition, a new IoU-based loss function, MPDInnerIoU Loss (combining MPDIoU and InnerIoU), is designed to optimize bounding box regression and improve detection accuracy. Finally, an improved C3k2 module (named C3k2_IDC) is developed to significantly expand the receptive field with only a slight increase in the number of parameters, thereby enhancing the overall performance of the model.

2. Related Work

At present, numerous research approaches for small-target detection tasks have been proposed, and these methods can be applied to the improvement of the YOLO series models, to significantly enhance small-target detection performance. Specifically, data augmentation technologies such as Mosaic data augmentation [21] are used to solve the problems of small-target data volume and uneven distribution. The latest multi-scale feature fusion neck structures such as BiFPN [22] and RCA [23] are adopted to enhance the feature expression ability of small targets. Attention mechanisms such as EMA [24] and SCAM [25] are incorporated into the model to highlight key features of small targets and mitigate background clutter.
The issues of small-target detection tasks (e.g., false detections and missed detections) can be effectively addressed by integrating the aforementioned methods into YOLOv11. Evolved from YOLOv8, YOLOv11 replaces C2f with C3k2, incorporates the C2PSA module into the backbone network, and optimizes the detection head using depthwise separable convolutions. Its architecture is divided into three parts: backbone, neck, and head. The backbone network is built upon Cross Stage Partial Darknet-53 (CSPDarknet-53) [21]. The Conv module in the backbone network first performs a convolution operation, then applies batch normalization, and finally activates the output via the SiLU activation function. The backbone network also incorporates the Spatial Pyramid Pooling-Fast (SPPF) module, which pools feature maps to a fixed size, thereby increasing the diversity of the feature representation. Finally, a Cross Stage Partial with Pyramid Squeeze Attention (C2PSA) module is incorporated at the end of the backbone network, which integrates the CSP (Cross Stage Partial) structure with the PSA (Pyramid Squeeze Attention) mechanism to enhance the feature extraction capability. In addition, the parametric module C3k2 is designed (k is an adjustable kernel size, such as 3 × 3 or 5 × 5; 2 indicates the use of two Bottleneck or C3k modules). It is an improved design based on the traditional CSP Bottleneck with three convolutions (C3) module and introduces the multi-scale convolution module C3k. C3k2 can be dynamically adjusted via the C3k parameter (it is equivalent to C2f when C3k = false; when C3k = true, the internal Bottleneck is replaced with C3k), which improves model flexibility. The neck network adopts the PAN+FPN (Path Aggregation Network + Feature Pyramid Network) strategy to achieve multi-scale feature fusion.
The detection head employs a decoupled architecture, with separate branches for predicting class probabilities and localization information, and employs task-specific loss functions: Binary Cross-Entropy (BCE) loss for classification, and Distribution Focal Loss (DFL) and Complete-IoU (CIoU) loss for bounding box regression. It also incorporates depthwise separable convolution to reduce the number of parameters and the computational complexity, and outputs feature maps at three scales, P3, P4, and P5, which are used for the detection of small, medium, and large targets, respectively. Together, these refinements contribute to a substantial improvement in the performance of YOLO models.
The model structure of YOLOv11n is illustrated in Figure 1. Compared to prior YOLO series models, it exhibits advantages in both detection accuracy and inference speed, yet it remains primarily suited for the detection of large-scale targets, while its detection accuracy for small targets remains inadequate. To address this limitation, this paper proposes an improved small-target detection algorithm based on YOLOv11n.

3. Proposed Algorithm

The feature maps of each detection head of YOLOv11n suffer from feature loss due to successive downsampling operations. Thus, small target features are enriched by utilizing the higher-resolution feature layer P2. Meanwhile, the neck network adopts the FPN+PAN architecture, which, despite its effective multi-scale feature fusion, only supports fusion between adjacent levels; AFPN addresses this limitation by progressively fusing non-adjacent layer features. Additionally, the features extracted by SPPF are prone to background clutter, so SCASPPF highlights salient features and suppresses background clutter through improved spatial and channel attention mechanisms. In terms of the loss function, the training process of small-target detection is optimized by combining the recently proposed MPDIoU and InnerIoU. Finally, the C3k2 module is enhanced using IDC, which expands the receptive field to extract more global information, thereby significantly enhancing small-target detection performance. The improved network structure is illustrated in Figure 2.

3.1. P2 Small-Target Detection Layer and AFPN

Downsampling continually reduces the resolution of the feature maps, halving it at each stage, so that only a small number of pixels of a small target remain and its features almost disappear, which easily causes false detections and missed detections. Therefore, small-target detection accuracy can be significantly improved by incorporating a high-resolution (160 × 160) small-target detection layer (P2) after the second downsampling. In addition, improving the neck network with the AFPN to fuse cross-level features avoids information loss or degradation during multi-level transmission [26], as illustrated in Figure 3.
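As a concrete illustration of this resolution argument, the following short sketch (illustrative only; the 640 × 640 input size and the 16 × 16-pixel target are assumed values, not the experimental configuration) shows how few feature-map cells a small target occupies at each detection scale.

```python
# Illustrative only: assumed 640x640 input and a 16x16-pixel small target.
input_size, target_px = 640, 16

for name, stride in [("P2", 4), ("P3", 8), ("P4", 16), ("P5", 32)]:
    fmap = input_size // stride              # feature-map side length at this scale
    cells = max(target_px // stride, 1)      # cells covered by the target
    print(f"{name}: {fmap}x{fmap} map, target covers ~{cells}x{cells} cells")
# P2 keeps a ~4x4 footprint, while at P5 the target collapses to a single cell.
```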
As can be seen from the figure, during bottom-up feature extraction in the backbone network, the AFPN gradually integrates low-level, high-level, and top-level features, and fuses feature maps of different scales via upsampling, downsampling, and element-by-element addition. During the fusion process, AFPN also incorporates adaptive spatial fusion operations to mitigate conflicts arising from cross-level feature fusion. Since P5 contains little detailed information about small targets, higher-resolution feature layers better retain small-target features. Therefore, we only progressively fuse the features from P2, P3, and P4, without significantly increasing the model’s parameters or computational complexity. We first fuse the features of P3 and P4 to bring the semantic information of P4 closer to that of P2. Then, the features from P2, P3, and P4 are fused to alleviate the problem of excessive semantic differences and realize the gradual fusion of multi-level features. This network architecture has been widely applied in target detection tasks. For instance, Gao et al. [27] incorporated AFPN into YOLOv8 for road target detection, achieving a 1.5% improvement in mAP@0.5. Xu [28] integrated AFPN into YOLOv5 for the detection and recognition of parasite eggs, and its performance was significantly enhanced.
The multi-scale feature fusion process is realized via the ASFF [29] module. ASFF_2 denotes the fusion of two different hierarchical features, while ASFF_3 denotes the fusion of three different hierarchical features. Taking ASFF_2 as an example, the input feature maps are first passed through trainable convolutional layers to enable weight learning:
F_l = Conv(Input_l), F_h = Conv(Input_h)
Here, Input_l and Input_h denote the low-level and high-level input feature maps, respectively; F_l and F_h denote the low-level and high-level feature maps after convolution, respectively; and Conv denotes the convolution operation.
Subsequently, the features of different levels are fused as follows:
If L = 0, integrating low-level features into high-level features yields
F = Concat(Upsample(F_l), F_h), L = 0
If L = 1, integrating high-level features into low-level features yields
F = Concat(Downsample(F_h), F_l), L = 1
Here, Upsample and Downsample denote the upsampling and downsampling operations, respectively, Concat denotes the concatenation operation, F denotes the fused feature map, and L determines the hierarchical position of the current ASFF module.
Furthermore, after fusing the high-level and low-level features via the Concat operation and reducing the number of channels to 2 via a 1 × 1 point convolution, the spatial weights are obtained using the softmax activation function:
W_1, W_2 = Softmax(Conv_1×1(F))
Here, Conv_1×1 denotes the 1 × 1 point convolution, Softmax denotes softmax normalization applied to the elements at each spatial position along the channel dimension, and W_1 and W_2 denote the corresponding output feature weights.
Finally, the weights W_1 and W_2 are multiplied with the corresponding features to obtain the final fused feature.
If L = 0,
y = W_1 × Upsample(F_l) + W_2 × F_h, L = 0
If L = 1,
y = W_1 × F_l + W_2 × Downsample(F_h), L = 1
Here, y denotes the final fused feature.
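A minimal PyTorch sketch of the ASFF_2 fusion described by the equations above is given below. It is illustrative only: the channel widths, the use of nearest-neighbor interpolation for both the upsampling and downsampling steps, and the module interface are assumptions, and the official ASFF/AFPN implementations differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASFF2(nn.Module):
    """Sketch of two-level adaptive spatial feature fusion (ASFF_2)."""

    def __init__(self, ch_low, ch_high, level=0):
        super().__init__()
        self.level = level
        out_ch = ch_high if level == 0 else ch_low   # assumed output width
        self.conv_l = nn.Conv2d(ch_low, out_ch, 1)   # F_l = Conv(Input_l)
        self.conv_h = nn.Conv2d(ch_high, out_ch, 1)  # F_h = Conv(Input_h)
        self.weight = nn.Conv2d(2 * out_ch, 2, 1)    # 1x1 conv -> 2 spatial weight maps

    def forward(self, x_low, x_high):
        f_l, f_h = self.conv_l(x_low), self.conv_h(x_high)
        if self.level == 0:  # fuse at the high-level resolution (resample F_l)
            f_l = F.interpolate(f_l, size=f_h.shape[-2:], mode="nearest")
        else:                # fuse at the low-level resolution (resample F_h)
            f_h = F.interpolate(f_h, size=f_l.shape[-2:], mode="nearest")
        # W_1, W_2 = Softmax(Conv_1x1(Concat(...)))
        w = torch.softmax(self.weight(torch.cat([f_l, f_h], dim=1)), dim=1)
        return w[:, 0:1] * f_l + w[:, 1:2] * f_h     # y = W_1*F_l + W_2*F_h
```

ASFF_3 extends the same idea by learning three weight maps and summing three resampled inputs.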

3.2. Improved SPPF

The continuous max pooling of the original SPPF may result in some spatial information loss, and it fails to highlight key features or suppress irrelevant ones. Therefore, the Spatial and Channel Attention SPPF (SCASPPF) is proposed, which incorporates a channel attention mechanism into the original SPPF. The channel attention mechanism is applied immediately after the initial convolution, at which stage the feature maps have not yet suffered spatial information loss from max pooling. These feature maps are then fused with those obtained after max pooling, enabling the module to capture global information while preserving channel and spatial feature details. Finally, the concatenated feature maps are processed via a spatial attention mechanism to further highlight salient features. The channel attention is similar to the Squeeze and Excitation (SE) module: it uses global pooling to reduce the spatial dimension to 1 × 1 × C, followed by a 1 × 1 convolution and ReLU activation to generate the channel attention weights, which are then expanded to match the dimension of the input. The spatial attention aggregates spatial features using average pooling and max pooling along the channel dimension, concatenates the two resulting feature maps, and finally uses a 7 × 7 convolution and a sigmoid activation function to output a feature map of size 1 × H × W, which is then applied via element-wise multiplication. Therefore, SCASPPF generates feature maps enriched with fine spatial and channel information, mitigates background clutter, and enhances the overall performance of the model. Its structure is illustrated in Figure 4, and the calculation process is as follows:
Firstly, the input feature map is processed through a 1 × 1 point convolution with trainable weights:
f = Conv_1×1(input)
Here, input denotes the input feature map and f denotes the output feature map.
Subsequently, channel attention weighting and max pooling are, respectively, performed on the output feature map f:
f_1 = CA(f), f_2 = maxpool_5×5(f)
Here, CA denotes the channel attention mechanism, maxpool_5×5 denotes a max pooling operation with a 5 × 5 kernel, a stride of 1, and a padding of 2, f_1 is the feature map after channel attention weighting, and f_2 is the feature map after max pooling.
Then, f_2 undergoes two consecutive max pooling operations to obtain f_3 and f_4:
f_3 = maxpool_5×5(f_2), f_4 = maxpool_5×5(f_3)
Thereafter, the concatenation operation is employed to fuse f, f_1, f_2, f_3, and f_4, yielding the feature map f_5:
f_5 = Concat(f, f_1, f_2, f_3, f_4)
Finally, spatial attention weighting is applied to f_5, followed by processing through another convolutional layer with trainable weights, where f_final denotes the final feature map and SA denotes the spatial attention mechanism:
f_final = Conv(SA(f_5))
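The calculation above can be summarized in the following PyTorch sketch. It follows the textual description (SE-like channel attention on the pre-pooling features, three cascaded 5 × 5 max-pooling operations, concatenation, spatial attention, and a final convolution); the hidden channel width and the exact placement of activation layers are assumptions rather than details reported in the paper.

```python
import torch
import torch.nn as nn


class SCASPPF(nn.Module):
    """Sketch of the SCASPPF module described above (hidden width assumed)."""

    def __init__(self, c_in, c_out, hidden=None):
        super().__init__()
        c_ = hidden or c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_, 1)                       # f = Conv_1x1(input)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        # Channel attention (SE-like): global pool -> 1x1 conv -> ReLU
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(c_, c_, 1), nn.ReLU())
        # Spatial attention: channel-wise avg/max -> 7x7 conv -> sigmoid
        self.sa = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        self.cv2 = nn.Conv2d(5 * c_, c_out, 1)                  # fuse the five branches

    def forward(self, x):
        f = self.cv1(x)
        f1 = f * self.ca(f)                   # channel-attention-weighted branch
        f2 = self.pool(f)
        f3 = self.pool(f2)
        f4 = self.pool(f3)
        f5 = torch.cat([f, f1, f2, f3, f4], dim=1)
        avg = f5.mean(dim=1, keepdim=True)    # spatial-attention inputs
        mx, _ = f5.max(dim=1, keepdim=True)
        f5 = f5 * self.sa(torch.cat([avg, mx], dim=1))
        return self.cv2(f5)                   # f_final = Conv(SA(f_5))
```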

3.3. MPDInnerIoU Loss Function

MPDIoU is a bounding box similarity measure based on the minimum point distance, which combines multiple factors such as the overlapping area, center point distance, and size deviation [30]. The calculation process is as follows:
Firstly, the squared distance d_1^2 between the top-left corners and the squared distance d_2^2 between the bottom-right corners of the GT box and the Prd box are calculated:
d_1^2 = (x_1^gt - x_1^prd)^2 + (y_1^gt - y_1^prd)^2
d_2^2 = (x_2^gt - x_2^prd)^2 + (y_2^gt - y_2^prd)^2
Finally, MPDIoU is calculated:
MPDIoU = IoU - (d_1^2 + d_2^2) / (w^2 + h^2)
For the ground truth (GT) box, the top-left corner coordinates are (x_1^gt, y_1^gt) and the bottom-right corner coordinates are (x_2^gt, y_2^gt); for the predicted (Prd) box, the top-left corner coordinates are (x_1^prd, y_1^prd) and the bottom-right corner coordinates are (x_2^prd, y_2^prd). The variables h and w denote the height and width of the smallest enclosing box, respectively.
InnerIoU optimizes the model training process via auxiliary boxes of varying sizes: large auxiliary boxes accelerate the convergence of samples with small IoU, while small auxiliary boxes accelerate the convergence of samples with large IoU [31].
As illustrated in Figure 5, the center point of the ground truth (GT) box is (x_c^gt, y_c^gt), and its width and height are denoted by w^gt and h^gt, respectively. Similarly, the center point of the predicted box is (x_c, y_c), its width and height are denoted by w and h, and R is the scale factor. InnerIoU is calculated as follows:
Firstly, the boundary positions of the auxiliary box belonging to the ground truth (GT) box are calculated using the scale factor R, where x_l^gt, x_r^gt, y_t^gt, and y_b^gt denote the left, right, top, and bottom boundaries of this auxiliary box, respectively.
x_l^gt = x_c^gt - (w^gt × R) / 2, x_r^gt = x_c^gt + (w^gt × R) / 2
y_t^gt = y_c^gt + (h^gt × R) / 2, y_b^gt = y_c^gt - (h^gt × R) / 2
The same method is used to calculate the boundary positions of the auxiliary box belonging to the predicted box, where x_l, x_r, y_t, and y_b denote its left, right, top, and bottom boundaries, respectively.
x_l = x_c - (w × R) / 2, x_r = x_c + (w × R) / 2
y_t = y_c + (h × R) / 2, y_b = y_c - (h × R) / 2
The intersection area inter between the two auxiliary boxes is then calculated:
inter = (min(x_r^gt, x_r) - max(x_l^gt, x_l)) × (min(y_t^gt, y_t) - max(y_b^gt, y_b))
Next, the union area union between the two auxiliary boxes is calculated:
union = (w^gt × h^gt) × R^2 + (w × h) × R^2 - inter
Finally, calculate InnerIoU.
InnerIoU = inter / union
For small-target detection, the sizes of the ground truth and predicted boxes are extremely small, making it prone to situations where the IoU is very small or even zero in the early stages of training. In such cases, the gradient provided by IoU becomes very small or even vanishes. However, InnerIoU can make two non-overlapping bounding boxes overlap by enlarging them. Additionally, even if they remain non-overlapping after enlargement, the d_1^2 and d_2^2 terms in MPDIoU do not rely on IoU; instead, they measure the distances between the top-left corners and between the bottom-right corners of the boxes. This ensures that the overall IoU loss function does not become excessively small, thereby avoiding the problem of extremely small or vanishing gradients. Therefore, the algorithm in this paper combines MPDIoU with InnerIoU in the loss function, calculated as follows:
MPDInnerIoU = InnerIoU - (d_1^2 + d_2^2) / (w^2 + h^2)
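The following function sketches the combined loss in PyTorch as 1 - MPDInnerIoU for corner-format boxes. The clamping of the intersection and the use of the smallest enclosing box of the two original boxes for the normalization term follow the description above, but the exact implementation details (e.g., eps handling) are assumptions.

```python
import torch


def mpd_inner_iou_loss(pred, gt, ratio=1.3, eps=1e-7):
    """Sketch of 1 - MPDInnerIoU for boxes in (x1, y1, x2, y2) format."""
    # Centers and sizes of the original boxes
    pw, ph = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    gw, gh = gt[..., 2] - gt[..., 0], gt[..., 3] - gt[..., 1]
    pcx, pcy = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    gcx, gcy = (gt[..., 0] + gt[..., 2]) / 2, (gt[..., 1] + gt[..., 3]) / 2

    # Inner (auxiliary) boxes scaled by `ratio`
    def aux(cx, cy, w, h):
        return cx - w * ratio / 2, cy - h * ratio / 2, cx + w * ratio / 2, cy + h * ratio / 2

    px1, py1, px2, py2 = aux(pcx, pcy, pw, ph)
    gx1, gy1, gx2, gy2 = aux(gcx, gcy, gw, gh)

    inter = (torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(min=0) * \
            (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(min=0)
    union = pw * ph * ratio ** 2 + gw * gh * ratio ** 2 - inter + eps
    inner_iou = inter / union

    # MPD term: squared corner distances normalized by the diagonal of the
    # smallest enclosing box of the two original boxes (assumption).
    d1 = (pred[..., 0] - gt[..., 0]) ** 2 + (pred[..., 1] - gt[..., 1]) ** 2
    d2 = (pred[..., 2] - gt[..., 2]) ** 2 + (pred[..., 3] - gt[..., 3]) ** 2
    cw = torch.max(pred[..., 2], gt[..., 2]) - torch.min(pred[..., 0], gt[..., 0])
    ch = torch.max(pred[..., 3], gt[..., 3]) - torch.min(pred[..., 1], gt[..., 1])
    mpd_inner_iou = inner_iou - (d1 + d2) / (cw ** 2 + ch ** 2 + eps)
    return 1.0 - mpd_inner_iou
```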

3.4. C3k2_IDC

Large-kernel depthwise separable convolutions have a large receptive field and can significantly improve model performance, but their efficiency is low. Small-kernel depthwise separable convolutions are fast, but their overly small receptive field degrades model performance. Therefore, Inception Depthwise Convolution (IDC), inspired by the Inception architecture, decomposes a large-kernel depthwise convolution into multiple parallel branches: a small square-kernel branch, an identity mapping branch, and orthogonal band-kernel branches. IDC significantly improves model efficiency through parallel branching and partial channel processing, and the orthogonal kernel branches effectively expand the receptive field, resulting in better performance [32].
This article combines IDC with C3k2 to propose a new feature extraction module, C3k2_IDC. It embeds Inception-style multi-branch depthwise convolutions (including 3 × 3, 1 × 11, 11 × 1, and identity branches) between the standard convolutions in the original sequential bottleneck structure. By retaining the residual learning framework and hierarchical feature extraction process of the original module, it effectively expands the receptive field while maintaining strong feature extraction. This not only enhances the feature representation capability but also avoids the potential structural simplification and gradient vanishing issues that could result from directly replacing the bottleneck with parallel depthwise convolution branches, thereby better balancing receptive field expansion and model efficiency while preserving YOLO’s lightweight characteristics. C3k2_IDC is illustrated in Figure 6.
The detailed algorithm process of IDC is as follows:
Firstly, the number of channels of the input feature map for each branch is calculated:
c_1 = In_channels - 3 × In_channels × 0.125, c_2 = c_3 = c_4 = In_channels × 0.125
Here, c_1, c_2, c_3, and c_4 denote the number of input channels for the identity mapping branch, the small square-kernel branch, the 1 × 11 orthogonal branch, and the 11 × 1 orthogonal branch, respectively, and In_channels refers to the total number of input channels.
Secondly, the split function is used to divide the input feature map into four parts:
f_id, f_wh, f_w, f_h = split(input, (c_1, c_2, c_3, c_4))
Here, f_id, f_wh, f_w, and f_h denote the split feature maps, and input denotes the input feature map.
Then, each split feature map is fed into the corresponding branch:
f_wh = DWConv_3×3(f_wh)
f_w = DWConv_1×11(f_w)
f_h = DWConv_11×1(f_h)
DWConv_3×3 denotes a depthwise convolution with a kernel size of 3 × 3 and a padding of 1; DWConv_1×11 denotes a depthwise convolution with a kernel size of 1 × 11 and a padding of 5 along the width dimension; DWConv_11×1 denotes a depthwise convolution with a kernel size of 11 × 1 and a padding of 5 along the height dimension. f_wh, f_w, and f_h denote the output feature maps of the respective branches.
Finally, f_id, f_wh, f_w, and f_h are concatenated via the concatenation operation, where f denotes the final output feature map of the IDC module:
f = Concat(f_id, f_wh, f_w, f_h)
The parameter count and computational complexity of IDC are calculated under fixed channel numbers and a large tensor size. Both the input and output channel numbers are set to 256, with a feature map size of 80 × 80. Only 3/8 of the channels are selected and equally allocated to the small square-kernel branch and the two orthogonal kernel branches, while the remaining 5/8 are reserved for identity mapping. The results show that IDC has a parameter count of only 992 and merely 198,400 FLOPs. In contrast, a 7 × 7 depthwise convolution has 12,544 parameters and 313,600 FLOPs. This indicates that IDC significantly reduces the parameter count while expanding the model’s receptive field.
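A compact PyTorch sketch of the IDC block, using the split ratios and kernel sizes described above, is shown below; bias-free depthwise convolutions are an assumption that reproduces the 992-parameter count for 256 input channels.

```python
import torch
import torch.nn as nn


class InceptionDWConv(nn.Module):
    """Sketch of IDC: 3/8 of the channels go to the three convolutional
    branches (3x3, 1x11, 11x1), the remaining 5/8 pass through unchanged."""

    def __init__(self, channels, branch_ratio=0.125):
        super().__init__()
        gc = int(channels * branch_ratio)      # channels per convolutional branch
        self.split = (channels - 3 * gc, gc, gc, gc)
        self.dw_hw = nn.Conv2d(gc, gc, 3, padding=1, groups=gc, bias=False)
        self.dw_w = nn.Conv2d(gc, gc, (1, 11), padding=(0, 5), groups=gc, bias=False)
        self.dw_h = nn.Conv2d(gc, gc, (11, 1), padding=(5, 0), groups=gc, bias=False)

    def forward(self, x):
        f_id, f_hw, f_w, f_h = torch.split(x, self.split, dim=1)
        return torch.cat([f_id, self.dw_hw(f_hw), self.dw_w(f_w), self.dw_h(f_h)], dim=1)


# Quick parameter check for 256 channels: 32*3*3 + 32*1*11 + 32*11*1 = 992
idc = InceptionDWConv(256)
print(sum(p.numel() for p in idc.parameters()))  # -> 992
```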

3.5. Algorithm Process

The algorithm divides the input image into several grid cells, each of which is responsible for predicting targets whose center points fall within its area, and outputs a fixed-length vector containing the bounding box coordinates and category information. After the images are fed into the improved, trained backbone and neck networks, feature maps of different scales are generated through downsampling operations such as convolution and pooling; the high-resolution feature maps are used for small-target detection, and the low-resolution feature maps are used for large-target detection. Finally, the algorithm applies non-maximum suppression (NMS) to the predicted bounding boxes, and the optimal detection results are screened out and superimposed on the original image. The specific process is illustrated in Figure 7 and Algorithm 1.
Algorithm 1: Algorithm of improved target detection
Input: Image dataset D .
1: for each image I in D do
2:      Divide the image into S × S grids.
3:      Extract the feature map m through improved YOLOv11n Network
4:      Extract feature vectors v through the detection heads
5:      for each v in I
6:          Select the best v and discard the remaining v (NMS)
7:          Generate detection results R
8:      end for
9: end for
Output: R
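A minimal sketch of the post-processing in step 6 of Algorithm 1 is given below, assuming the per-image predictions have already been decoded into corner-format boxes, confidence scores, and class labels; torchvision's NMS operator is used here as a stand-in for the detector's built-in post-processing, and the thresholds are typical defaults rather than values reported in the paper.

```python
import torch
from torchvision.ops import nms


def postprocess(boxes, scores, labels, score_thr=0.25, iou_thr=0.45):
    """Keep the best boxes per image: score filtering followed by class-wise NMS."""
    keep = scores > score_thr
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    if boxes.numel() == 0:
        return boxes, scores, labels
    # Offset boxes by class index so NMS is applied independently per class
    offsets = labels.to(boxes.dtype).unsqueeze(1) * (boxes.max() + 1)
    kept = nms(boxes + offsets, scores, iou_thr)
    return boxes[kept], scores[kept], labels[kept]
```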

4. Experimental Results and Analysis

4.1. Dataset and Experimental Environment

The dataset used in this experiment is VisDrone2019, an example of which can be seen in Figure 8. The dataset was compiled and constructed by the AISKYEYE team of the Machine Learning and Data Mining Laboratory of Tianjin University and includes 6471 training images, 548 validation images, and 1610 test images, covering 10 types of detection targets. There are 353,550 detection targets in the training set; the target sizes are small and the background environments are complex, which poses great challenges for aerial small-target detection.
The COCO dataset considers targets with an area smaller than 32 × 32 as small targets. In the VisDrone2019 training set, there are a total of 353,550 detection targets, with the highest numbers being small-sized pedestrians and vehicles. Statistics show that 212,630 targets have a pixel area smaller than 32 × 32, and 34,827 have a pixel area smaller than 10 × 10, posing significant challenges for UAV aerial target detection. This is illustrated in Figure 9.
Training lasted 200 epochs with a batch size of 16 and four dataloader workers. Stochastic gradient descent (SGD) was used for optimization, and training was stopped early if no performance improvement was observed for 50 consecutive epochs (patience = 50). The initial learning rate (lr0) and final learning rate (lrf) were both 0.01. The details are illustrated in Table 1.
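For reference, the configuration in Table 1 corresponds roughly to the following Ultralytics-style training call; the model and dataset YAML file names are placeholders rather than files released with this paper.

```python
from ultralytics import YOLO

# Hypothetical config names; the improved model YAML and dataset YAML are placeholders.
model = YOLO("yolo11n-improved.yaml")
model.train(
    data="VisDrone2019.yaml",  # dataset config (placeholder path)
    epochs=200,                # Table 1: training epochs
    batch=16,                  # batch size
    workers=4,                 # dataloader workers
    optimizer="SGD",           # stochastic gradient descent
    patience=50,               # stop after 50 epochs without improvement
    lr0=0.01, lrf=0.01,        # initial and final learning rates
)
```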

4.2. Analysis of Ablation Experiment

As illustrated in Table 2, the mAP@0.5 of model ① on the VisDrone2019 dataset reached 32.567%. By adding the small-target detection layer P2, the mAP@0.5 of model ② is increased by 4.349%. The PAN+FPN network structure is replaced with AFPN to obtain model ③, which fully integrates features of different scales, and the mAP@0.5 is improved by 1.117%. Model ④ is obtained by introducing the spatial and channel attention spatial pyramid pooling-fast module (SCASPPF), which highlights the important features, and the mAP@0.5 is further improved by 0.124% compared with model ③. Model ⑤ is obtained by replacing C3k2 with C3k2_IDC, which significantly expands the receptive field, increasing the mAP@0.5 by 0.573% compared with model ④ while only slightly increasing the number of parameters. Finally, model ⑤ is further trained with MPDInnerIoU to obtain model ⑥, and the mAP@0.5 is increased by another 0.526%, indicating that it is better than the CIoU used by the benchmark model. In addition, the experimental results of models ⑦ and ⑧ show that MPDInnerIoU performs better than InnerIoU. Through this series of improvements to YOLOv11n, the detection accuracy is significantly improved by 6.689% in mAP@0.5.
With the gradual integration of various improved modules, the corresponding number of parameters has modestly grown from 2.59 M to 3.30 M. At the same time, the GFLOPs have surged dramatically from 6.4 to 16.3, and the average inference time per image has also increased from 24.4 ms to 44.5 ms. The core reasons are as follows: the P2 feature layer significantly expands the feature map size, and the AFPN increases the computational overhead of multi-scale feature fusion. Although modules such as SCASPPF and C3k2_IDC contribute to performance improvement, they further accumulate computational costs. Specifically, the integration of the P2 feature layer results in an increase of only 0.11 M in parameters and 5.9 in GFLOPs, with the average inference time per image rising by merely 3.5 ms, yet it achieves the maximum single-module performance gain. Secondly, the addition of AFPN leads to the largest increase in parameters (0.54 M), while GFLOPs only increase by 2.8; due to the substantial growth in parameters, the average inference time per image increases by 8.3 ms, and it delivers the second-largest single-module performance gain, second only to that of the P2 feature layer. Overall, the increase in training cost is within an acceptable range, and significant overall performance improvement is achieved.
The indicators of different models in this experiment are illustrated in Figure 10. The comparison of some test sets between YOLOv11n and the improved YOLOv11n is illustrated in Figure 11.
As can be seen from the confusion matrix in Figure 12, the improved model has higher correct classification rates for most target categories compared with the baseline model. Particularly for “people” (an extremely small-sized target category), its correct classification rate has increased from 0.43 to 0.67. This demonstrates that the proposed model fully extracts the feature of small targets, thus enhancing the performance in small-target detection. Overall, this algorithm can significantly enhance the small-target detection capability without sacrificing the large-target detection capability.

4.3. Experimental Analysis of MPDInnerIoU Parameter Settings

According to the experimental results in Figure 13, after training the final improved model, the mAP@0.5 shows an up-down-up trend as the ratio increases from 1.1 to 1.9, reaching local maxima at ratios of 1.3 and 1.7. Over the entire interval, the maximum mAP@0.5 of 39.256% is obtained at ratio = 1.3. Therefore, the final model adopts the MPDInnerIoU loss function with a ratio of 1.3.

4.4. Detection Results of Different Sizes

As illustrated in Figure 14, we calculated the AP values for different size ranges of each target in the VisDrone2019 test set. The experimental results show that the improved model does not perform well when detecting targets with size < 10 × 10, although it is still better than the baseline model. For 10 × 10 < size < 32 × 32, both mAP@0.5 and mAP@[0.5:0.95] show the most significant improvements, fully demonstrating that the improved model significantly enhances small-target detection performance, while its performance for larger targets remains slightly better than that of the baseline model.
As illustrated in Figure 15, the AP@0.5 for each target size across different models under ablation experiments is calculated. Due to the small number of targets smaller than 10 × 10 pixels in the VisDrone2019 dataset and the extreme difficulty in detecting such tiny targets, there is a slight improvement in their AP performance, which is mainly attributed to the addition of the P2 feature layer. For targets in the size range of 10 × 10–32 × 32 pixels, the AP values gradually increase with the sequential integration of improved modules, representing the most significant improvement across all size intervals. Targets in UAV aerial imagery exhibit large size variations; although small targets dominate, some large targets also exist. Therefore, the proposed algorithm improves the detection performance of small targets without degrading that of large targets. Additionally, the integration of attention mechanisms and AFPN further enhances the detection performance of medium and large targets, albeit to a lesser extent compared to the improvement achieved for small targets. In summary, the proposed algorithm achieves a significant improvement in small-target detection performance.

4.5. Experimental Results of Other Datasets

To further validate the model’s generalization, training was conducted on the Aerial Traffic Image dataset, which is used for road traffic detection. This dataset comes from the Kaggle platform and contains 1710 training images, 558 validation images, and 440 test images. The training parameters were unchanged except that the number of epochs was reduced from 200 to 100.
According to Table 3, the model performs well in identifying different vehicles on the road and shows significant improvement compared to the benchmark model.
The experimental results show that both the baseline model and the improved model achieve high detection accuracy for large vehicles such as buses, freight vehicles, and trucks. The bold numbers in the table indicate the model that performs better for each category. Overall, the improved model has slightly higher detection accuracy for large vehicles than the baseline model. The benchmark model has low detection accuracy for Car and Motorbike, which have smaller target sizes, at only 73.6% and 20.6%, respectively, indicating that small targets are easily missed or falsely detected. The improved model achieves detection accuracies of 78.2% and 49.3% for Car and Motorbike, respectively, increases of 4.6% and 28.7%. In addition, for large-sized targets such as Truck and Bus, the improved model is not inferior to the baseline model; this is because we retain the P5 detection head, which is specifically designed for large-target detection. Furthermore, owing to the integration of the attention mechanism and more advanced feature fusion, certain categories even achieve performance improvements. Therefore, the experimental results demonstrate that the improved algorithm can significantly enhance small-target detection.

4.6. Comparative Experiment

The metrics for each model were obtained through experiments on the VisDrone2019 dataset. As illustrated in Table 4, the proposed algorithm achieves 39.3% mAP@0.5 with only 3.30 M parameters and 16.3 GFLOPs, which is much higher than that of other classical target detection algorithms and surpasses many s-size YOLO variants in detection performance. In terms of the number of parameters and GFLOPs, the improved model is far lower than the other comparative models and only slightly higher than YOLOv10n and the baseline YOLOv11n; however, its mAP@0.5 is much higher than those of these two models, meeting the requirements of high efficiency and high precision. Finally, we also compared our model with several state-of-the-art variants specialized in small-target detection. Compared with YOLO-FEPA and PC-YOLOn, our model achieves significantly higher mAP@0.5, albeit with a slightly larger number of parameters and higher computational complexity than YOLO-FEPA. Compared with Drone-YOLO, our algorithm performs better in both detection accuracy and number of parameters. Overall, the improved model proposed in this paper is more suitable for real-time detection in UAV aerial images.

4.7. VisDrone2019 Feature Map Visualization

This article selects some images from the VisDrone2019 test set as input and visualizes the feature maps output by the layer preceding the P3 detection head of YOLOv11n and by the layer preceding the P2 detection head of the improved YOLOv11n. The results are illustrated in Figure 16.
As illustrated in Figure 16, the improved YOLOv11n produces output feature maps with higher resolution than the original YOLOv11n, ensuring effective preservation of small-target features, whereas the small-target features in the output feature maps of YOLOv11n are represented by only a small number of pixels. Furthermore, as illustrated in Figure 16a, the original YOLOv11n mistakenly recognizes stone pillars that resemble the shape of cars as real cars, and it can also incorrectly identify vehicles reflected on glass surfaces; these phenomena are attributed to interference from complex environments. The improved YOLOv11n effectively suppresses such background clutter, eliminating these false detections. Finally, Figure 16b reveals that the original YOLOv11n suffers from numerous missed detections, and the feature regions corresponding to individual targets in its feature maps are highly scattered. In contrast, by incorporating the P2 layer and the AFPN neck structure, the improved YOLOv11n produces more concentrated feature regions for each target, thereby enhancing the overall performance of the model.

4.8. Comparative Experiment on the Stability of YOLOv11n and the Improved YOLOv11n

As illustrated in Figure 17, this experiment was conducted to compare the mAP@0.5 metric between YOLOv11n and the improved YOLOv11n through five repeated training runs. The results demonstrate that the performance of the improved model is significantly superior to that of the baseline model: the average mAP@0.5 of the improved YOLOv11n reaches 39.134%, representing an increase of approximately 6.54 percentage points compared with 32.591% of the original YOLOv11n. Meanwhile, both models exhibit low standard deviations, with 0.095% for YOLOv11n and 0.1284% for the improved YOLOv11n, indicating that both models possess excellent training stability. The improved YOLOv11n not only effectively enhances the target detection accuracy but also yields reliable results in repeated experiments, thus demonstrating substantial practical application value.

5. Conclusions

Small-target detection in UAV aerial photography scenarios faces challenges such as small target sizes and complex background environments. To solve these problems, an improved model is proposed based on YOLOv11n. To address the insufficient features of small targets and insufficient multi-scale feature fusion, a small-target detection head and the asymptotic feature pyramid network are introduced. In addition, SCASPPF replaces the original SPPF to highlight the salient features of the image and suppress background clutter. At the same time, the integration of MPDIoU and InnerIoU optimizes the training process and significantly improves detection accuracy. Finally, the C3k2_IDC module is introduced to expand the receptive field of feature extraction and better capture small-target features.
Experiments with the improved model on the VisDrone2019 dataset show that its average precision is greatly improved compared with the benchmark model and that it outperforms most classical target detection models. From the perspective of feature maps, the output feature maps of the improved YOLOv11n are also significantly better than those of the original model. On the Aerial Traffic Image dataset, the performance improvement is also significant, especially for tiny targets such as motorbikes, for which the improvement in detection accuracy is most pronounced.
In general, the improved YOLOv11n algorithm in this paper significantly improves small-target detection performance. However, while the improved model achieves significant performance gains, it is inferior to the baseline model in terms of training speed, model parameters, computational complexity, and single-frame detection speed. Specifically, the number of model parameters has increased by approximately 27%, the computational complexity by approximately 155%, and the single-frame detection time by roughly 82%. These drawbacks may hinder the model’s practical application. Therefore, in future work, we will focus on model lightweighting, aiming to significantly decrease the model parameters and computational complexity with only a slight reduction in performance.

Author Contributions

W.Y. conceived the experiment. K.Z. and X.Q. conducted the experiment. K.Z. completed the writing of this paper. S.L. completed the model design. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Shanxi Provincial Science and Technology Project “Research on Semantic Change Detection in PolSAR Images with Limited Samples by Knowledge Guidance and Data Drive” (Grant No. 2025JC-YBMS-255).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AFPN: Asymptotic Feature Pyramid Network
IDC: Inception Depthwise Convolution
SCASPPF: Spatial Channel Attention SPPF
SPPF: Spatial Pyramid Pooling-Fast
SE: Squeeze and Excitation Module
CSPDarknet-53: Cross Stage Partial Darknet-53
C2PSA: Cross Stage Partial with Pyramid Squeeze Attention
CSP: Cross Stage Partial
CIoU: Complete-IoU
BCE: Binary Cross-Entropy
DFL: Distribution Focal Loss
CoT: Contextual Transformer
EMA: Efficient Multi-Scale Attention
SCAM: Spatial-Channel Attention Mechanism
CMTL: Cotton Multitask Learning
CLMGE: Cross-Level Multi-Granular Encoder
MSDAF: Multitask Self-Distilled Attention Fusion

References

  1. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  2. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  3. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  4. Sorbelli, F.B.; Palazzetti, L.; Pinotti, C.M. YOLO based detection of halyomorpha halys in orchards using RGB cameras and drones. Comput. Electron. Agric. 2023, 213, 108228. [Google Scholar] [CrossRef]
  5. Hoshino, W.; Seo, J.; Yamazaki, Y. A study for detecting disaster victims using multi-copter drone with a thermo- graphic camera and image target recognition by SSD. In Proceedings of the 2021 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Lyon, France, 12–16 July 2021. [Google Scholar]
  6. Bisio, I.; Haleem, H.; Garibotto, C.; Lavagetto, F.; Sciarrone, A. Performance evaluation and analysis of drone-based vehicle detection techniques from deep learning perspective. IEEE Internet Things J. 2021, 9, 10920–10935. [Google Scholar] [CrossRef]
  7. Peng, Y.; Zhao, T.; Chen, Y.; Yuan, X. Small-target detection algorithm for unmanned aerial vehicles based on contextual information and feature refinement. Comput. Eng. Appl. 2024, 60, 183–190. [Google Scholar] [CrossRef]
  8. Wang, X.; Hu, Y. Small-target detection in drone images under complex backgrounds. Comput. Eng. Appl. 2023, 59, 107–114. [Google Scholar]
  9. Wang, Z.; Xu, H.; Zhu, X.; Li, T.; Liu, Z. Improved dense pedestrian detection algorithm based on YOLOv8: MER-YOLO. Comput. Eng. Sci. 2024, 46, 1050–1062. [Google Scholar]
  10. Feng, Z.; Xie, Z.; Bao, Z.; Chen, K. Real time dense small-target detection algorithm for unmanned aerial vehicles based on improved YOLOv5. Acta Aeronaut. Sin. 2023, 44, 251–265. [Google Scholar]
  11. Gomaa, A.; Abdelwahab, M.M.; Abo-Zahhad, M. Efficient vehicle detection and tracking strategy in aerial videos by employing morphological operations and feature points motion analysis. Multimed. Tools Appl. 2020, 79, 26023–26043. [Google Scholar] [CrossRef]
  12. Gomaa, A. Advanced Domain Adaptation Technique for target Detection Leveraging Semi-Automated Dataset Construction and Enhanced YOLOv8. In Proceedings of the 2024 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES), Giza, Egypt, 20–22 February 2024; pp. 211–214. [Google Scholar]
  13. Salem, M.; Gomaa, A.; Tsurusaki, N. Detection of Earthquake-Induced Building Damages Using Remote Sensing Data and Deep Learning: A Case Study of Mashiki Town, Japan. In Proceedings of the IGARSS 2023—2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 2350–2353. [Google Scholar]
  14. Gomaa, A.; Abdalrazik, A. Novel Deep Learning Domain Adaptation Approach for target Detection Using Semi-Self Building Dataset and Modified YOLOv4. World Electr. Veh. J. 2024, 15, 255. [Google Scholar] [CrossRef]
  15. Abdalrazik, A.; Gomaa, A.; Afifi, A. Multiband circularly-polarized stacked elliptical patch antenna with eye-shaped slot for GNSS applications. Int. J. Microw. Wirel. Technol. 2024, 16, 1229–1235. [Google Scholar] [CrossRef]
  16. Abdalrazik, A.; Gomaa, A.; Kishk, A.A. A wide axial-ratio beamwidth circularly-polarized oval patch antenna with sunlight-shaped slots for gnss and wimax applications. Wirel. Netw. 2022, 28, 3779–3786. [Google Scholar] [CrossRef]
  17. Hassan, O.F.; Ibrahim, A.F.; Gomaa, A.; Makhlouf, M.A.; Hafiz, B. Real-time driver drowsiness detection using transformer architectures: A novel deep learning approach. Sci. Rep. 2025, 15, 17493. [Google Scholar] [CrossRef]
  18. Gomaa, A.; Abdelwahab, M.M.; Abo-Zahhad, M. Real-Time Algorithm for Simultaneous Vehicle Detection and Tracking in Aerial View Videos. In Proceedings of the 2018 IEEE 61st International Midwest Symposium on Circuits and Systems (MWSCAS), Windsor, ON, Canada, 5–8 August 2018; pp. 222–225. [Google Scholar]
  19. Umirzakova, S.; Muksimova, S.; Shavkatovich Buriboev, A.; Primova, H.; Choi, A.J. A Unified Transformer Model for Simultaneous Cotton Boll Detection, Pest Damage Segmentation, and Phenological Stage Classification from UAV Imagery. Drones 2025, 9, 555. [Google Scholar] [CrossRef]
  20. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  21. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  22. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. arXiv 2020, arXiv:1911.09070. [Google Scholar]
  23. Gomaa, A.; Saad, O.M. Residual Channel-attention (RCA) network for remote sensing image scene classification. Multimed. Tools Appl. 2025, 84, 33837–33861. [Google Scholar] [CrossRef]
  24. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. arXiv 2023, arXiv:2305.13563. [Google Scholar]
  25. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar]
  26. Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic Feature Pyramid Network for Object Detection. arXiv 2023, arXiv:2306.15988. [Google Scholar]
  27. Gao, D.; Chen, T.; Miao, L. Road target Detection Algorithm Based on Improved YOLOv8n. Comput. Eng. Appl. 2024, 60, 186–197. [Google Scholar]
  28. Xu, X. Research and Implementation of Parasite Egg Detection and Recognition Based on Improved YOLOv5 Algorithm. Master’s Thesis, Nanchang University, Nanchang, China, 2024. [Google Scholar]
  29. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  30. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662. [Google Scholar] [CrossRef]
  31. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More Effective Intersection over Union Loss with Auxiliary Bounding Box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
  32. Yu, W.; Zhou, P.; Yan, S.; Wang, X. InceptionNeXt: When Inception Meets ConvNeXt. arXiv 2023, arXiv:2303.16900. [Google Scholar]
  33. Wang, X.; Hang, J.; Tan, W.; Shen, Z. Target detection based on deep feature enhancement and path aggregation optimization. Comput. Sci. 2025, 52, 184–195. [Google Scholar]
  34. Wang, Z.; Su, Y.; Kang, F.; Wang, L.; Lin, Y.; Wu, Q.; Li, H.; Cai, Z. Pc-yolo11s: A lightweight and effective feature extraction method for small target image detection. Sensors 2025, 25, 348. [Google Scholar] [CrossRef]
  35. Zeng, K.; Yu, W.; Qin, X.; Han, J.; Hou, Z.; Ma, S. Improved UAV Aerial Vehicle Detection Algorithm Based on YOLOv11n. In Proceedings of the Image and Graphics, Xi’an, China, 18–20 June 2025. Volume 16162. [Google Scholar]
Figure 1. YOLOv11n model structure [20].
Figure 2. Improved YOLOv11n model. This diagram illustrates the internal structure of the improved YOLOv11n. The AFPN module first fuses features from P2 and P3, then subsequently fuses multi-scale features from P2, P3, and P4. The Spatial Channel Attention SPPF (SCASPPF) module enhances salient features while mitigating background clutter. Finally, all C3k2 modules in the neck network are replaced by C3k2_IDC, thereby expanding the receptive field during feature extraction.
Figure 3. AFPN Structure Diagram.
Figure 4. SCASPPF structure diagram.
Figure 5. Schematic diagram of InnerIoU.
Figure 6. C3k2-IDC Structure Diagram.
Figure 7. Algorithm process.
Figure 8. Example of VisDrone2019 dataset.
Figure 9. Target pixel area distribution of the VisDrone2019 training set.
Figure 10. Performance Comparison Results of Different Models.
Figure 11. Comparison results of partial test images.
Figure 12. Comparison of YOLOv11n and improved YOLOv11n confusion matrices.
Figure 13. Comparison of MPDInnerIoU with different parameters.
Figure 14. Detection results of different sizes.
Figure 15. The AP@0.5 performance of the models under ablation experiments across different target sizes.
Figure 16. VisDrone2019 feature map visualization. Figure (a) illustrates the feature maps under the scenario of false detection, while Figure (b) presents those under the scenario of missed detection.
Figure 17. Stability experimental results. Figure (a) illustrates the average performance and fluctuations of the improved model and the baseline model; Figure (b) depicts the trends of the two models across 5 runs.
Table 1. Training Parameters.
Parameter | Parameter Setting
Training batch (epoch) | 200
Batch size | 16
Workers | 4
Optimizer | SGD
Patience | 50
lr0 | 0.01
lrf | 0.01
Table 2. Ablation Experiment.
Model | mAP@0.5 | Params/M | GFLOPs | Average Inference Time per Image
①: YOLOv11n | 32.567% | 2.59 | 6.4 | 24.4 ms
②: ① + P2 | 36.916% | 2.70 | 12.3 | 27.9 ms
③: ② + AFPN | 38.033% | 3.24 | 15.1 | 36.2 ms
④: ③ + SCASPPF | 38.157% | 3.28 | 15.2 | 37.7 ms
⑤: ④ + C3k2_IDC | 38.730% | 3.30 | 16.3 | 43.1 ms
⑥: ⑤ + MPDInnerIoU | 39.256% | 3.30 | 16.3 | 44.5 ms
⑦: ④ + InnerIoU | 38.393% | 3.28 | 15.2 | 37.9 ms
⑧: ④ + MPDInnerIoU | 38.537% | 3.28 | 15.2 | 38.1 ms
Table 3. Various detection results of the traffic detection dataset.
Category | AP@0.5 (YOLOv11n) | AP@0.5 (Improved YOLOv11n)
Articulated-bus | 99.2% | 98.9%
Bus | 97.0% | 97.6%
Car | 73.6% | 78.2%
Freight | 98.3% | 96.8%
Motorbike | 20.6% | 49.3%
Small bus | 96.5% | 97.7%
Truck | 75.6% | 84.6%
Table 4. Comparative experiment.
Model | Params/M | mAP@0.5 | GFLOPs
Faster R-CNN | 63.20 | 30.9% | 207.0
SSD | 12.30 | 24.0% | 63.2
YOLOv5s | 9.10 | 38.8% | 23.8
YOLOv8s | 11.20 | 39.0% | 28.5
YOLOv11s | 9.40 | 39.0% | 21.3
YOLOv10n | 2.26 | 34.2% | 6.5
YOLOv10s | 7.22 | 39.0% | 21.4
YOLO-FEPA [33] | 2.8 | 36.7% | 7.5
Drone-YOLO | 3.91 | 37.0% | -
PC-YOLOn [34] | 2.00 | 36.1% | -
[35] | 3.29 | 38.3% | 21.9
Ours | 3.30 | 39.3% | 16.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
