Article

NSC-YOLOv8: A Small Target Detection Method for UAV-Acquired Images Based on Self-Adaptive Embedding

by Dongmin Chen 1, Danyang Chen 1,*, Cheng Zhong 1,2 and Feng Zhan 1
1 School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
2 Key Laboratory of Parallel, Distributed and Intelligent Computing in Guangxi Universities and Colleges, Nanning 530004, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(8), 1548; https://doi.org/10.3390/electronics14081548
Submission received: 4 March 2025 / Revised: 2 April 2025 / Accepted: 7 April 2025 / Published: 11 April 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

Existing drone image processing algorithms for small target detection in Unmanned Aerial Vehicle (UAV) aerial images struggle with challenges like missed detection of small objects, information loss from downsampling, loss of low-dimensional features, and weak correlation of contextual features. To alleviate these four problems, we propose a self-adaptive small target detection method, NSC-YOLOv8, based on the YOLOv8 model. First, we introduce a small target detection head that enhances the model’s ability to fuse shallow and deep features, effectively handling low-pixel targets. Second, a Non-lossy Downsampling Block (NDB) is introduced into the backbone, which optimizes the detection accuracy of small targets in large scenes through dimensional transformation. In addition, we introduce a Self-Adaptive Embedding Block (SAEB) based on low-dimensional information, which enhances the comprehensive performance of the model by expanding the local receptive field to strengthen the focus on important contextual information. Finally, we design a Content-Aware Resampling Block (CARB), which enhances the model’s ability to recognize small targets by resampling low-dimensional features. Experiments on the VisDrone2019-DET dataset show that NSC-YOLOv8s improves target detection accuracy over YOLOv8s, with an 11.7% increase in mAP@0.5. Additionally, removing a large detection head and adjusting the bottom-up layers reduces NSC-YOLOv8s’ parameters by 1.2 M compared to YOLOv8s. Therefore, NSC-YOLOv8 shows better performance in small target detection for UAV imagery.

1. Introduction

As UAV control technology continues to progress [1], drones are being utilized extensively in a variety of industries because of their adaptability and versatility, including military [2], agriculture [3], emergency rescue [4], geological exploration [5], intelligent safety [6,7], autonomous driving [8,9], and industrial inspection [10]. Due to the wide application of drones, the UAV’s detection of small targets has become particularly important. For example, at sea, drones can monitor distant ships, buoys, debris, and other small targets, which are crucial for maritime rescue, ship monitoring, and environmental protection. In urban environments, the application of drones focuses on monitoring pedestrians, bicycles, vehicles, and other small objects. This is of great importance for urban traffic management, public safety monitoring, and the assistance of autonomous driving systems. In the field of agriculture, monitoring farmlands with drones can detect diseases, weeds, or abnormal growth of crops in small areas. Additionally, in the field of post-disaster search and rescue, rescuers can efficiently and quickly survey disaster areas with drones to find trapped individuals or abnormal situations in disasters. However, due to the height and perspective of drones, small targets usually occupy a smaller area in images and often lack unique features or strong contrast, which makes detection more difficult. Furthermore, the challenge of detecting small targets is particularly noticeable in complex surroundings. For example, in urban areas, the low contrast between targets and backgrounds often leads to interference from background noise, while in vast natural landscapes, the size of small targets, combined with constantly changing climatic and lighting conditions, makes detection more complex. Therefore, current drone target detection research is concentrating on increasing the precision and resilience of recognizing small targets.
Moreover, most deep neural network models (e.g., PSPNet [11], VGG [12], and U-Net [13]) rely on human-collected datasets, typically from ground-based photography [14]. However, UAV images often involve small targets with fewer pixels and simpler features, complex backgrounds, and less relevant contextual information. These issues contribute to poor performance in existing algorithms [15]. For example, firstly, many target detection models like SSD [16], which are usually optimized for medium to large targets, result in poor detection of small targets, leading to missed detection issues. Secondly, the downsampling procedure causes some models, including ResNet [17] and FPN [18], to lose low-level features and detailed information about microscopic targets, which reduces the model’s accuracy in target location and classification. Thirdly, due to the lower resolution, some models like YOLOv1 [19] make small targets appear blurred or pixel-sparse, further exacerbating the problem of information loss. Finally, in complex environments where the relationship between the target and surrounding objects exceeds the model’s perception range, some models, such as Faster R-CNN [20], lose the contextual association information between the background and the target, making it more difficult for the model to accurately identify the target.
We propose the NSC-YOLOv8 tiny object detection model. The model extends the YOLOv8 architecture with an extra detection head designed specifically for small objects. To further improve detection performance, we present a framework with three main elements: NDB, SAEB, and CARB. The NDB module performs dimensional operations, allowing the model to achieve non-lossy downsampling of information and retain key information. The SAEB module uses a residual structure to embed low-dimensional features into high-dimensional ones, increasing the receptive field in the process. By resampling features and creating sample sets, the CARB module facilitates effective feature upsampling and improves the correlation of contextual features. With these techniques, our model exhibits excellent performance in low-pixel target recognition.
The following are this paper’s main contributions:
  • We incorporate an NDB based on dimensional transformations to the backbone. It lessens the information loss from downsampling in small object detection by better preserving important information in feature maps.
  • We integrate an SAEB into the model, which self-adaptively embeds features that are prone to loss into the output features and minimizes the loss of low-dimensional features through a residual structure.
  • We add a CARB to the model, which selectively extracts key features during the feature extraction process while effectively preserving important information in low-resolution data, thus enhancing the correlation of contextual features.
  • We introduce a small-target detection head with extended receptive fields, enabling our model to effectively integrate deep and surface features. Because the compact detection head has fewer input channels, the parameters of the full detection head are reduced.
The remainder of the paper is structured as follows: we present the architecture of the YOLOv8 model and discuss relevant work that has already been carried out in Section 2. We present the NSC-YOLOv8 model’s design concept and describe how each component is implemented in Section 3. Section 4 describes the NSC-YOLOv8 experiments, quantitatively and qualitatively assesses our suggested small object detection model, validates each of our suggested components using comparison and ablation experiments, and explains the reasons why NSC-YOLOv8’s detection performance is better than YOLOv8’s, as well as the algorithm’s current limitations. We provide a summary of the design process, findings, and future work in Section 5.

2. Related Work

This section starts with an overview of the evolution of target detection models, then goes into great detail about the evolution of the YOLO model [21], and concludes with a focus on advancements in small object detection models [22].

2.1. Target Detection Models

The two primary categories of target detection models are deep learning-based detection models [23] and traditional detection models [24], which differ significantly in their theoretical foundations, performance, and applicable scenarios.
Traditional detection models often rely on hand-designed features and machine learning methods. Different from traditional detection models, deep learning detection models automatically learn features from data through neural networks, and they fall mostly into two categories: single-stage detection models [25] and two-stage detection models [26]. The R-CNN family, which includes Faster R-CNN [20], Fast R-CNN [27], and R-CNN [28], among others, is the traditional representative of two-stage detection models. Prior to performing classification and regression operations on each region, these models first produce candidate regions. In contrast, the single-stage detection model turns the detection problem into a regression task, typically represented by the YOLO series. Because of its rapid detection speed, this model is ideal for real-time detection applications. Furthermore, the SSD [16] performs target prediction on feature maps of various scales, demonstrating an exceptional balance between speed and accuracy. Deep learning detection models have a high time overhead for training and inference, and they demand a significant quantity of training data and processing resources despite their potent feature learning capabilities.

2.2. YOLO Network

The continuous evolution of the YOLO family [29,30,31,32,33] has accelerated the development of target detection. Among them, we choose the YOLOv8 model as the baseline for improvement. As shown in Figure 1, YOLOv8 consists mainly of a backbone, neck, and head. YOLOv8 employs the anchor-free detection method [34] and introduces the CIoU [35] loss function. Furthermore, YOLOv8 combines several data augmentation techniques, such as Mosaic and MixUp [36].

2.3. Small Object Detection

Tiny objects in images usually appear at different scales, which poses an additional challenge for target detection. To address this problem, some studies [37] used specially designed small object detectors. For example, the ClusDet model [38] proposes an approach that combines semantic and spatial information among objects, thereby reducing redundant computation and improving detection efficiency. From an information-theoretic perspective, capturing more features usually helps to improve detection accuracy. As a result, numerous research studies have begun incorporating contextual information to improve small object detection. For instance, by optimizing the instance distances inside and between classes, the FS-SSD model [39] greatly enhances the detection performance of low-confidence targets. In order to improve the model’s ability to detect targets, SSPNet [40] proposes a scale enhancement and selection module in conjunction with the Extended Feature Pyramid Network (EFPN) [41]. These techniques greatly increase the detection accuracy by enabling the target detection system to identify and locate small objects more precisely. However, for current tiny target detection challenges, the real-time capability and effectiveness of the aforementioned approaches have not yet reached an acceptable level.

3. Methods

As illustrated in Figure 2, the general design of our NSC-YOLOv8 is as follows: in the backbone, we adopt an NDB structure as the new downsampling layer design, which retains the discriminative feature information during the downsampling process by reparameterizing the convolution module. Through internal information interaction, we add an SAEB to the neck to expand the convolutional layer’s receptive field while successfully preserving the image’s essential low-dimensional properties. Additionally, we adopt a CARB to improve the correlation of important contextual features through feature resampling operations. In the detection head portion, we incorporate a head specially designed for small targets to further lessen the likelihood of missing tiny objects. Finally, a WIoU loss function [42] is used in the loss function section to optimize the detection of small targets.

3.1. Backbone

As seen in Figure 2, the backbone of our proposed model is made up of five stage blocks. We employ the NDB structure with a step size of one as the downsampling unit in NSC-YOLOv8 for every stage block.

NDB Module

Inspired by the SPDConv structure [43], we designed the NDB structure, which, unlike existing algorithms such as CMS-YOLOv7 [44], places convolution before the rearranging operation. The advantage of this is that it avoids the loss of information concerning very small objects due to convolution and enhances the accurate extraction of structural small object information.
As seen in Figure 3, the NDB mechanism divides the feature map into several sub-feature maps, which are then spliced at various scales to finish the downsampling process. In order to do this, the feature map’s channels are expanded using 2D convolution, which continually extracts detailed information while managing the number of channels, thus minimizing the loss of small target features.
By converting the input image’s spatial dimension into a depth dimension, the NDB structure deepens the feature map without sacrificing any information, which is essential for maintaining spatial information when working with small objects or low-resolution images. By converting the spatial dimension to the depth dimension, the NDB layer preserves the integrity of the details while successfully avoiding the information loss that is typical of conventional stepwise convolution or pooling processes.
Assuming the original input feature map is $X_{Input} \in \mathbb{R}^{B \times C_1 \times D \times D}$, we first reduce the number of input channels to an appropriate value through a convolutional layer to avoid generating too many parameters when performing channel concatenation, thus obtaining an intermediate feature map $X_{Interm} \in \mathbb{R}^{B \times C_2 \times D \times D}$. As shown in Equation (1), we rearrange the feature map $X_{Interm}$ and slice $X_{Interm}$ into a set $F_{Rearr}$, which contains $scale^2$ small blocks $f_{i,j} \in \mathbb{R}^{B \times C_2 \times \frac{D}{scale} \times \frac{D}{scale}}$, where $scale$ is the scale factor. Subsequently, we concatenate these blocks in $F_{Rearr}$ together in the order of the channel dimension to obtain a new feature map $X_{Out} \in \mathbb{R}^{B \times scale^2 \cdot C_2 \times \frac{D}{scale} \times \frac{D}{scale}}$. The entire mathematical expression of this module is shown in Equation (2).
$f_{0,0} = X[B, C_2, 0{::}scale, 0{::}scale], \quad f_{1,0} = X[B, C_2, 1{::}scale, 0{::}scale], \quad f_{0,1} = X[B, C_2, 0{::}scale, 1{::}scale], \quad f_{1,1} = X[B, C_2, 1{::}scale, 1{::}scale],$ (1)
$X_{Out} = \mathrm{Concat}\big(\mathrm{Slice}(\mathrm{Conv}_{stride=1}(X_{In}), \, groups=4)\big).$ (2)
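To make the dimensional transformation concrete, the following is a minimal PyTorch sketch of an NDB-style block under our reading of Equations (1) and (2); the channel width, kernel size, and scale factor shown are illustrative assumptions rather than the exact NSC-YOLOv8 settings.

```python
import torch
import torch.nn as nn

class NDBSketch(nn.Module):
    """Non-lossy downsampling sketch: stride-1 convolution, then space-to-depth rearrangement."""

    def __init__(self, in_channels, mid_channels, scale=2):
        super().__init__()
        self.scale = scale
        # A stride-1 convolution adjusts the channel count before rearranging,
        # so the convolution itself discards no spatial positions.
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):                                   # x: (B, C1, D, D)
        x = self.conv(x)                                    # (B, C2, D, D)
        s = self.scale
        # Slice into scale^2 sub-maps (Equation (1)) and concatenate them
        # along the channel dimension (Equation (2)).
        blocks = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        return torch.cat(blocks, dim=1)                     # (B, scale^2 * C2, D/scale, D/scale)

# With scale = 2, a (1, 64, 80, 80) input becomes a (1, 128, 40, 40) output, and every
# input position survives in some channel of the output.
# out = NDBSketch(64, 32)(torch.randn(1, 64, 80, 80))
```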
In contrast to the conventional downsampling technique, the NDB layer improves the model’s performance and generalization capacity by minimizing downsampling-related information loss while processing small target features. In YOLOv8, the downsampling Conv component applies strided convolutions in early layers, which can discard small target features. By avoiding this loss, the NDB module enhances the retention of tiny target features at different scales. Unlike the conventional YOLOv8 Conv downsampling operation, the NDB module splices feature slices, improving the recognition and localization of small objects and thereby the detection performance of the network.
SCDown, a module in YOLOv10 [45], combines spatial downsampling and channel downsampling, in contrast to the NDB. It aims to optimize network performance by reducing computational resource consumption and enhancing the ability of feature representation. However, SCDown has some problems when detecting small targets. First, spatial downsampling may result in the loss of detailed information about smaller objects, which can impact the accuracy of small target recognition. Second, excessive information compression caused by channel downsampling may result in some important channel information being compressed or discarded, which is very unfavorable for the detection of small targets, as small targets often require richer feature information for accurate identification.
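For contrast, the block below sketches our reading of SCDown from [45], which pairs a pointwise convolution for channel reduction with a stride-2 depthwise convolution for spatial reduction; the kernel size, padding, and omission of normalization/activation layers are assumptions made for brevity.

```python
import torch.nn as nn

class SCDownSketch(nn.Module):
    """Our reading of SCDown [45]: pointwise channel reduction, then stride-2 depthwise downsampling."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.pw = nn.Conv2d(in_channels, out_channels, kernel_size=1)          # channel downsampling
        self.dw = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=2,
                            padding=1, groups=out_channels)                    # spatial downsampling

    def forward(self, x):
        # Channels are compressed first, then the strided depthwise convolution halves the
        # spatial resolution; both steps can squeeze out the fine detail small targets rely on.
        return self.dw(self.pw(x))
```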

3.2. Neck

3.2.1. SAEB Module

We propose the SAEB module, which is utilized in the neck and backbone’s top-down branching layer, as illustrated in Figure 4. SAEB’s main objective is to balance spatial information from low- and high-dimensional features by applying convolutions in which part of the filters operate on low-dimensional embeddings of the corresponding filter transformations. Through this approach and the interaction between the filters, the receptive field of each spatial location can be efficiently extended.
The primary distinction between SAEB and the conventional feature extraction module is the addition of a residual structure that maps intermediate features from the small-scale feature space back to the original feature space while maintaining the precise positional information of the smaller target. This approach preserves the target’s contextual information by embedding often-overlooked low-dimensional data within the high-dimensional information. The residual structure in the SAEB module is inspired by the basic principles of ResNet, but there are key differences. ResNet alleviates the problem of gradient vanishing through simple addition operations, helping to train deep networks. SAEB focuses on small target localization and context retention, preserving target details through low-dimensional to high-dimensional mapping, which is especially effective in small target detection.
Assuming the input features are $X_{Input} \in \mathbb{R}^{B \times C \times H \times W}$, first downsample $X$ using an average pooling layer and a convolutional layer with a stride of $r$. Then, fuse the original feature $X$ with the pooled output features. Next, apply the $\sigma$ function to the fused features to obtain $X_{Interm1} \in \mathbb{R}^{B \times C \times \frac{H}{r} \times \frac{W}{r}}$, thereby enhancing its expressive ability and information fusion effect. Subsequently, perform feature transformation on the original features $X$ to obtain $X_{Interm2} \in \mathbb{R}^{B \times C \times H \times W}$. Finally, perform feature transformation on the dot product result of $X_{Interm1}$ and $X_{Interm2}$ to obtain the output feature $X_{Out} \in \mathbb{R}^{B \times C \times H \times W}$. The specific process is shown in Equations (3)–(5):
$X_{Interm1} = \sigma\big(X + f^{1}_{Conv,k=3}(\mathrm{AvgPool}_{r=4}(X_{Input}))\big),$ (3)
$X_{Interm2} = f^{2}_{Conv,k=3}(X_{Input}),$ (4)
$X_{Out} = f^{3}_{Conv,k=3}(X_{Interm1} \cdot X_{Interm2}),$ (5)
where $f^{i}_{Conv,k=3}$ represents a 3 × 3 filter and $\sigma$ is the sigmoid function. Lastly, the input feature and the output feature $X_{Out}$ have the same size. The SAEB has two main advantages: it models inter-channel dependencies and encodes larger, more accurate discriminative regions through an adaptation operation, and it focuses on local low-dimensional features around each spatial location with the self-adaptive operation.
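The following is a minimal PyTorch sketch of the SAEB computation under our reading of Equations (3)–(5); the bilinear upsampling that returns the pooled branch to the input resolution before the fusion in Equation (3) is our assumption, since the element-wise addition and product require matching sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAEBSketch(nn.Module):
    """Self-adaptive embedding sketch: fuse a pooled low-resolution branch into the full-resolution features."""

    def __init__(self, channels, r=4):
        super().__init__()
        self.r = r
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)  # f1 on the pooled branch
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)  # f2 on the original features
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)  # f3 on the fused features

    def forward(self, x):                                    # x: (B, C, H, W)
        pooled = F.avg_pool2d(x, kernel_size=self.r)         # (B, C, H/r, W/r) low-dimensional branch
        low = self.conv1(pooled)
        # Assumption: bring the low-dimensional branch back to (H, W) so Equation (3) is well defined.
        low = F.interpolate(low, size=x.shape[-2:], mode="bilinear", align_corners=False)
        interm1 = torch.sigmoid(x + low)                      # Equation (3): self-adaptive attention map
        interm2 = self.conv2(x)                               # Equation (4)
        return self.conv3(interm1 * interm2)                  # Equation (5): same size as the input
```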
CIB is an efficient module in YOLOv10 [45] that, unlike SAEB, extends and compresses channels through inverted residual structures to reduce computational complexity and adopts depthwise separable convolutions and skip connections to retain key information and alleviate gradient vanishing. However, although the inverted residual structure of CIB improves computational efficiency, detailed information may be compressed during the process of extending and compressing channels. Small targets are harder to detect because they take up less space in the image and have weaker features, and compression procedures may cause these specific details to be lost.

3.2.2. CARB Module

As seen in Figure 5, we use the CARB structure as the upsampling module in between the neck network’s stages. Here, input features are passed through bilinear interpolation to generate content-aware sampling points for resampling the continuous mapping, and the grid sample operation is then used to generate features with contextual correlation.
Assuming the input feature image is $X_{Input} \in \mathbb{R}^{B \times C \times H \times W}$, we use the scaling factor $s$ to define the resampling rate. First, apply a convolution transformation to the input feature image $X_{Input}$ and adjust its size according to the factor $s$ to obtain the output feature $X_{Interm} \in \mathbb{R}^{B \times 2 \cdot g \cdot s^{2} \times H \times W}$. Next, reshape the feature map to achieve the desired upsampling effect, thus obtaining the output feature $X_{Offs} \in \mathbb{R}^{B \times 2 \cdot g \times s \cdot H \times s \cdot W}$. In the following steps, feature fusion is performed, combining the original grid information $X_{Grid} \in \mathbb{R}^{B \times 2 \cdot g \times s \cdot H \times s \cdot W}$ and offset information $X_{Offs}$. To further improve the quality of the features, a pixel shuffling operation is introduced, which disrupts the distribution of the upsampled set $X_{Resample} \in \mathbb{R}^{B \times 2 \cdot g \times s \cdot H \times s \cdot W}$. Finally, the input feature map $X_{Input}$ is merged with the resampled set $X_{Resample}$ using the grid sampling operation, resulting in an output feature map $X_{Out} \in \mathbb{R}^{B \times C \times s \cdot H \times s \cdot W}$. The specific process is shown in Equations (6)–(8):
$X_{Offs} = \mathrm{Reshape}\big(\mathrm{Conv}_{Out=2 \cdot g \cdot s^{2}}(X_{Input})\big),$ (6)
$X_{Resample} = \mathrm{PixelShuffle}_{upscale=s}(X_{Offs} + X_{Grid}),$ (7)
$X_{Out} = \mathrm{GridSample}(X_{Input}, X_{Resample}).$ (8)
where $X_{Offs}$ is the offset generated by the convolution layer, $X_{Resample}$ represents the set of resampled features, and $X_{Out}$ is the output feature. Compared to other dynamic upsamplers, the CARB module retains important information about the target data, enhancing the correlation of contextual features of the target.
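A minimal sketch of a CARB-style content-aware resampler, under our reading of Equations (6)–(8), is given below; the single offset group (g = 1), the ordering of the pixel-shuffle and grid-addition steps, and the coordinate normalization required by grid_sample are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARBSketch(nn.Module):
    """Content-aware resampling sketch: predict per-pixel sampling offsets, then grid-sample the input."""

    def __init__(self, channels, scale=2):
        super().__init__()
        self.s = scale
        # Predict 2 * s^2 offset values per input location (a single group, g = 1, is assumed).
        self.offset_conv = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x):                                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        s = self.s
        offs = self.offset_conv(x)                             # Equation (6): (B, 2*s^2, H, W)
        offs = F.pixel_shuffle(offs, s)                        # rearranged offsets: (B, 2, s*H, s*W)
        # Base sampling grid at the upsampled resolution, expressed in input-pixel coordinates.
        ys = torch.linspace(0, H - 1, s * H, device=x.device)
        xs = torch.linspace(0, W - 1, s * W, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=0).unsqueeze(0)       # (1, 2, s*H, s*W)
        samples = grid + offs                                  # Equation (7): offsets shift the base grid
        # Normalize to [-1, 1] as grid_sample expects, then resample the input (Equation (8)).
        norm = torch.stack((samples[:, 0] / max(W - 1, 1),
                            samples[:, 1] / max(H - 1, 1)), dim=1) * 2 - 1
        return F.grid_sample(x, norm.permute(0, 2, 3, 1),
                             mode="bilinear", align_corners=True)  # (B, C, s*H, s*W)
```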
YOLOv8’s upsampling component uses traditional upsampling methods, usually restoring image size through simple bilinear interpolation operations, but this may lead to detail loss, especially when dealing with small targets. Unlike the upsampling module used in YOLOv8, our CARB module fully focuses on the contextual relationship of tiny targets. By generating a resampling set, it integrates information rich in tiny targets into the original features, helping to distinguish tiny targets from the surrounding background or noise. To address the upsampling issue, the CARB upsampling component introduces refined feature reconstruction methods and improved resampling interpolation algorithms, thus more accurately preserving the detailed information of small objects. By reconstructing the semantic relationship between small objects and their surrounding areas, CARB enhances the relevance of small target features, especially in complex backgrounds, which helps improve the recognition accuracy of object categories and properties and also helps the model better distinguish different types of objects in dense environments.

3.3. Head

The existing algorithms have various designs for small object detection heads, but unlike UAV-YOLOv8 [46], our detection head receives P/8, P/4, and P/8 inputs, where P is the size of the feature. By enabling larger feature maps to be fed into the detection head, this method lowers the missed detection rate of small objects and reduces the loss of fine details during network sampling, thereby improving performance. Additionally, we removed the detection heads for large objects, P/16 and P/32, as we found they have minimal impact on performance for small object detection tasks. The NSC-YOLOv8 adjusts the neck layers to control different inputs to the detection head. By adding a top-down T module and removing a bottom-up B module, we reduce network parameters and optimize the convolution layers for large features. Although YOLOv8 and NSC-YOLOv8 have the same number of detection heads, our network performs better in small object detection.
After making the above-mentioned modifications to the baseline YOLOv8 model, we used the outputs of the T1, T2, and B1 layers from Figure 2 as the input to the detection head. As a result, the input for the new detection head featured fewer channels and contained more information on the original target features. Compared to the original detection head, our model effectively lowered the missed identification of small objects in circumstances requiring small target detection tasks. Additionally, our model’s parameter count was significantly decreased in comparison to the baseline model.

4. Experiments

In order to confirm the NSC-YOLOv8 framework’s detection accuracy and lightweight performance in various complex scenarios, we quantitatively assess it in this section. Using the test and validation sets of VisDrone2019-DET [37], we compare NSC-YOLOv8 with currently available YOLO networks and lightweight UAV target identification models. The experimental findings demonstrate that NSC-YOLOv8 is useful in a variety of detection tasks by greatly increasing detection accuracy and performing exceptionally well in terms of lightweight performance.

4.1. Dataset and Settings

4.1.1. Dataset

The VisDrone2019-DET dataset contains 1610 test images, 548 validation images, and 6471 training images. Ten categories are covered by the dataset: tricycle, awning tricycle, bus, motor, car, van, truck, bicycle, pedestrian, and people. Figure 6a shows the number of items in each category, indicating that cars and pedestrians comprise the majority of the dataset. Figure 6b displays the size of the object’s bounding boxes, indicating that the dataset contains a sizable number of small targets. Figure 6c shows the distribution of the centroids of the object bounding boxes and shows that most of the targets’ centroids are clustered in the lower and center areas of the image. The dataset is primarily composed of small targets, as seen by the darker region in the lower-left corner of Figure 6d, which plots the width of the bounding box against height.
Based on the examination of the VisDrone2019-DET dataset, we conclude that the majority of the dataset consists of tiny-sized targets. UAV photos from various locations and viewpoints make up this dataset, which is more diverse in scene complexity, viewpoint variation, and target scale compared to general computer vision tasks, making the processing and target detection of the task more difficult.

4.1.2. Experimental Settings

Table 1 lists the hardware platform and ambient settings used during the experimental training phase.

4.1.3. Training Strategies

The YOLOv8 model is altered to produce five distinct model sizes by varying the width and depth parameters, allowing it to be deployed flexibly in various hardware devices and application contexts. The number of parameters and resource usage progressively rise with model size, while detection performance also improves in tandem.
To balance the real-time character and performance of the experimental model, we choose the s size. Table 2 contains a list of important parameter settings utilized during model training. To speed up the model’s convergence, we turned off mosaic data augmentation during the final ten training cycles.
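As an illustration of this training strategy, a hedged sketch using the Ultralytics training interface is shown below; the numeric values are placeholders rather than the exact contents of Table 2, and close_mosaic is the option we assume was used to disable mosaic augmentation for the final ten epochs.

```python
# Hedged sketch of the training setup using the Ultralytics interface; the numeric values
# are placeholders, not the exact settings of Table 2.
from ultralytics import YOLO

model = YOLO("yolov8s.yaml")        # size "s" model definition
model.train(
    data="VisDrone.yaml",           # hypothetical dataset configuration for VisDrone2019-DET
    epochs=200,                     # placeholder
    imgsz=640,                      # placeholder input resolution
    batch=16,                       # placeholder batch size
    close_mosaic=10,                # disable mosaic augmentation for the final ten epochs
)
```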

4.1.4. Evaluation Indicators

We used a number of evaluation criteria, including mAP@0.5, Precision, Recall, mAP@0.5:0.95, mAP$_S$, Param, FPS, and GFLOPs, to assess how well our upgraded model performed in the detection task. To compute these measurements, we employed the parameters referred to as True Positive (TP), False Positive (FP), and False Negative (FN). The total number of learnable weights and biases in the model, which reflects the model’s complexity, is referred to as Parameters (Param). The model’s processing speed, which is relevant for real-time tasks, is indicated by Frames Per Second (FPS), which counts the number of frames it processes in a second. Giga floating-point operations (GFLOPs) measures the number of floating-point operations the model requires for one forward pass, reflecting its computational cost. The mean Average Precision of Small objects (mAP$_S$) measures the accuracy on objects smaller than 32 × 32 pixels; the mean Average Precision of Medium objects (mAP$_M$) measures the accuracy on objects larger than or equal to 32 × 32 pixels but smaller than 96 × 96 pixels; the mean Average Precision of Large objects (mAP$_L$) measures the accuracy on objects larger than or equal to 96 × 96 pixels. Additionally, the degree of overlap between the ground-truth and predicted boxes is measured using the Intersection over Union (IoU) ratio. Together, these metrics assess model performance by accounting for speed, accuracy, and complexity.
The ratio of correctly predicted positive samples to actual positive samples is called Recall, whereas the ratio of correctly predicted positive samples to all predicted positive samples is called Precision. Average Precision (AP) is the area under the Precision–Recall curve, which shows the trade-off between Precision and Recall. The model’s overall detection performance across all categories is evaluated using the mean Average Precision (mAP), which is the mean of the AP values over all sample categories. mAP@0.5 measures the average accuracy of the target detection model at an IoU threshold of 0.5, whereas mAP@0.5:0.95 is the average of the model’s accuracies at multiple IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05. Equations (9)–(12) show the specific formulas:
$\mathrm{Precision} = \dfrac{TP}{TP + FP},$ (9)
$\mathrm{Recall} = \dfrac{TP}{TP + FN},$ (10)
$AP = \displaystyle\int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d(\mathrm{Recall}),$ (11)
$mAP = \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} AP_{i},$ (12)
where $N$ indicates the number of sample categories in the training dataset, and $AP_i$ indicates the AP value for the category with index $i$.
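The sketch below shows one way Equations (9)–(12) can be computed from confidence-ranked detections; the all-point interpolation of the Precision–Recall curve is a common convention and an assumption on our part, not necessarily the exact evaluation code used here.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class (Equations (9)-(11)); is_tp flags detections already matched to ground truth."""
    order = np.argsort(-np.asarray(scores, dtype=float))      # rank detections by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)                           # TP / (TP + FN)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)    # TP / (TP + FP)
    # All-point interpolation: make precision monotonically decreasing, then integrate over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall, [1.0]))
    precision = np.concatenate(([0.0], precision, [0.0]))
    return float(np.sum(np.diff(recall) * precision[1:]))

def mean_average_precision(ap_per_class):
    """mAP is the unweighted mean of the per-class AP values (Equation (12))."""
    return float(np.mean(ap_per_class))
```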

4.2. Ablation Experiment

4.2.1. Small Target Detection Head

We included a dedicated small target detection head to satisfy the requirements of small target identification tasks. We carried out the trials listed in Table 3 to assess the small target head’s performance in NSC-YOLOv8s.
The experimental results in Table 3 demonstrate that our model enhances the identification of small objects and reduces the missed detection of small items, with a 5.4% gain in mAP@0.5 over the baseline YOLOv8 model. By comparing the mAP$_S$, mAP$_M$, and mAP$_L$ of various models, we can determine that the model’s detection accuracy for medium and large objects remains unchanged after the large detection head is removed. Additionally, the overall mAP has improved, along with a slight decrease in the number of parameters. This demonstrates that, in this instance, eliminating the large detection head is a fair trade-off between performance and parameters.

4.2.2. Comparison of Loss Functions

Unlike CIoU, WIoU uses a gradient gain allocation strategy that weights the IoU values of small objects, increasing their contribution to the loss and alleviating their minimal impact in conventional IoU loss functions. WIoU performs differently across network structures and feature extraction strategies. Thus, we need to conduct an ablation experiment to provide a baseline for subsequent models.
We performed studies comparing WIoU v3 with various popular loss functions on YOLOv8 to confirm its benefits while maintaining consistency in the other training circumstances.
The findings are shown in Table 4: when WIoU v3 is used as the bounding box regression loss function, the model’s detection performance peaks, with a 0.82% improvement in mAP@0.5.
Furthermore, even while SIoU smoothes IoU to increase training stability, its global IoU loss-based architecture makes small target detection difficult. In particular, the regression error in the IoU computation is high because small targets are small, which results in inadequate positioning precision. A weighting mechanism, on the other hand, is introduced by WIoU v3 to help balance the identification of large and tiny targets. In situations involving multiple scales, this method works particularly well. Additionally, the weighting approach of WIoU v3 greatly increases the regression accuracy of small targets while lowering background noise interference. By assigning different weights to targets of varying scales, WIoU v3 optimizes both the regression accuracy and recall rate of small targets. Therefore, compared to SIoU, WIoU v3 has distinct advantages in small target detection.
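For reference, the block below sketches the structure of a Wise-IoU-style loss as we understand it from [42]: the v1 term re-weights the IoU loss with a distance-based attention factor computed from the smallest enclosing box, and v3 adds a non-monotonic focusing coefficient driven by an outlier degree. The hyperparameter values and the externally maintained running mean of the IoU loss are illustrative assumptions, not the settings used in NSC-YOLOv8.

```python
import torch

def wiou_v3_sketch(pred, target, loss_iou_running_mean, alpha=1.9, delta=3.0):
    """Sketch of a WIoU v3-style loss for (x1, y1, x2, y2) boxes of shape (N, 4); see [42] for the exact form."""
    # Plain IoU loss.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).clamp(min=0).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).clamp(min=0).prod(dim=1)
    iou = inter / (area_p + area_t - inter + 1e-9)
    loss_iou = 1.0 - iou

    # v1: distance-based attention using the smallest enclosing box (denominator detached).
    c_lt = torch.min(pred[:, :2], target[:, :2])
    c_rb = torch.max(pred[:, 2:], target[:, 2:])
    wg, hg = (c_rb - c_lt).unbind(dim=1)
    center_dist = ((pred[:, :2] + pred[:, 2:]) / 2 - (target[:, :2] + target[:, 2:]) / 2).pow(2).sum(dim=1)
    r_wiou = torch.exp(center_dist / (wg.pow(2) + hg.pow(2) + 1e-9).detach())

    # v3: non-monotonic focusing driven by the outlier degree beta = L_IoU* / mean(L_IoU).
    beta = loss_iou.detach() / (loss_iou_running_mean + 1e-9)
    r_focus = beta / (delta * alpha ** (beta - delta))
    return (r_focus * r_wiou * loss_iou).mean()
```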
The baseline model for the subsequent ablation tests will be the improved YOLOv8 model, which includes the WIoU v3 loss function and a small target detecting head.

4.2.3. Ablation of the NDB Module

The baseline model substitutes the NDB module for the Conv module at various backbone and neck network levels in order to confirm the performance improvement following the introduction of the NDB module. The experiment’s results are shown in Table 5. The “+” symbol indicates that the NDB module has been inserted into the relevant network layer of the baseline model, whereas “+S1-NDB” denotes the substitution of the NDB module for the Conv module in the S1 layer and “+ALL-NDB” denotes the substitution of the NDB modules for the Conv modules in all pertinent levels.
According to the experimental findings, the +ALL-NDB model has the best detection performance. The model’s mAP@0.5 is 1.1% higher than the baseline model’s. This implies that the NDB can more efficiently extract information that is accessible from the input characteristics and reduce the amount of information lost due to downsampling.

4.2.4. Ablation of the SAEB Module

We connect the SAEB module to several backbone and neck network levels in order to confirm the performance boost that was attained. The experimental findings are shown in Table 6. The following are the specifications for the operations: the SAEB module is located in the neck layer between the C2f layer and the Concat operation in the S1–S3 and B1 layers of the backbone network, and in the layer preceding the downsampling layer Conv.
According to the experimental results, the +ALL-SAEB model achieved optimal detection performance. The model’s mAP@0.5 increased by 2.8% when compared to the baseline. Consequently, the SAEB’s ability to extract information from targets is enhanced. Through the SAEB module, the loss of low-dimensional features is reduced.

4.2.5. Ablation of the CARB Module

To verify the performance advantage of the CARB module, we conducted the following comparison experiment: Table 7 displays the results of the experiment. In the T1–T3 layer of the neck network, the CARB module takes the place of the original upsample module.
According to the experiment, the +ALL-CARB model achieves the optimal detection performance. In comparison to the baseline model, the model’s mAP@0.5 is 1.33% better. As a result, CARB may significantly improve the correlation of contextual features and improve detection performance in networks.

4.2.6. Overall Performance

To assess the efficacy of each improvement method suggested in this study, we conducted comparative experiments after integrating the previously indicated optimal improvement techniques into the network using the VisDrone2019-DET dataset. Table 8 presents the results of these experiments.
With an 11.1% increase in mAP@0.5, NSC-YOLOv8 successfully enhanced small object identification performance, as the table illustrates. Although our model has high computational resource requirements, it is still within a reasonable range and can provide significantly higher accuracy. In real-world applications, we can freely select the right model size to suit the demands of various scenarios, depending on the task’s precision and real-time requirements. We believe that improving the accuracy of the model should be considered a more important priority while ensuring that the computational resource requirements are within an acceptable range. Through appropriate model adjustments and optimizations, we can maintain reasonable use of computational resources while ensuring high accuracy, thereby achieving better results in various complex applications.

4.3. Comparison Experiment

4.3.1. Comparison with YOLOv8

To evaluate the improvement effect of the updated model on the detection performance, we conducted comparative experiments with YOLOv8 models of different sizes. Table 9 presents the results of the experiment.
The experimental findings reveal that the enhanced model exhibits differing levels of improvement in every detection performance indicator. Figure 7 shows that when the true label is Car, there are possible error cases such as FP and FN due to background noise, target occlusion, overlap, and complex background. Compared to YOLOv8s, our model improved precision by 7.3%, indicating that our model has a relatively lower FP. In addition, our model’s recall increased by 10.8%, further indicating that the FN was also relatively reduced on this drone dataset. Our model had an 8.9% increase in mAP@0.5 when compared to YOLOv8x. This implies that the enhanced model successfully raises the accuracy of detection for tiny targets. Additionally, the structural optimizations, including a smaller detection head and hierarchical adjustments to the bottom-up layers, helped to reduce the overall parameter count of the model.
NSC-YOLOv8 provides models of different sizes to meet various computational needs. The n size is lightweight, which is suitable for low-cost drones for basic tasks; the s size balances computation and task complexity, supporting high-precision detection, and it is suitable for patrol, agricultural monitoring, etc.; the m size is suitable for computationally intensive tasks, which supports high-resolution video analysis and path planning; the l and x sizes have powerful computational capabilities, which makes them suitable for large-scale data analysis and multiple drone collaboration, commonly used for urban surveillance and unmanned driving simulation.

4.3.2. Comparison with the YOLO Series

Because of their many parameters, the previous YOLO series algorithms are difficult to implement effectively on unmanned platforms. In contrast, YOLOv5, YOLOv7, and YOLOv8 have improvements in a number of parameters and detection performance, but these models still have limitations in dealing with the task of detecting a high percentage of small targets. We conducted the following tests to confirm that our approach is better than the YOLO series. The first six rows of Table 10, which presents the results of the comparison experiment, contain information from the references [46,50].
NSC-YOLOv8x has a greater mAP@0.5 of 58.3%, according to the testing data, but NSC-YOLOv8s has a better overall trade-off between real-time and accurate detection performance.

4.3.3. Comparison with Existing Models

We carried out comparative experiments to assess NSC-YOLOv8’s performance in comparison to other popular algorithms. Table 11 displays the experimental results. The data in the table’s first five rows come from the references [51].
Table 11. Detection outcomes of NSC-YOLOv8 and existing algorithms, with the bold values in the table representing the best results.

Model               Precision (%)   Recall (%)   mAP@0.5 (%)   mAP@0.5–0.95 (%)
RetinaNet           -               -            21.4          11.8
Faster R-CNN        -               -            41.8          21.8
Cascade R-CNN [52]  -               -            39.1          24.3
ClusDet             -               -            56.2          32.4
HRDNet [53]         -               -            62.0          35.5
SAIC-FPN [54]       -               -            62.3          35.7
UAV-YOLOv8(s)       54.4            45.6         47.0          29.2
BGF-YOLOv10(n)      -               -            32.0          -
CMS-YOLOv7 [44]     -               -            52.3          30.7
DC-YOLOv8 [55]      52.7            40.1         41.5          24.7
Drone-YOLO [56]     -               -            44.3          27.0
Ours(n)             56.0            44.4         46.6          27.9
Ours(s)             59.7            50.5         53.0          33.1
Ours(m)             61.8            52.9         55.5          34.7
Ours(l)             63.1            54.5         57.1          36.3
Ours(x)             65.2            55.0         58.3          37.4
As shown in Table 12, we perform better on mAP@0.5–0.95 compared to SAIC-FPN, but our mAP@0.5 is relatively lower. The reason is that information loss from downsampling may not be completely eliminated. At higher IoU thresholds, however, our mAP@0.5–0.95 measure improves because our SAEB module keeps low-dimensional characteristics from being lost.
To ensure a fair comparison between NSC-YOLOv8 and the algorithms taken from the cited data sources, we elaborate on their implementation details. Although there are slight differences in the momentum and final learning rate, they have minimal impact. Our settings for the learning rate, drop-out rate, weight decay, optimizer, dataset, and metrics are consistent. This ensures that the performance evaluation of NSC-YOLOv8 is as similar as possible to that of the data source. Through this setup, the training and evaluation of models are conducted under the same experimental conditions, effectively ensuring the fairness of the comparison.
The Region Proposal Network (RPN)-based method is a two-stage detection model. It is designed to decompose the target detection problem into two sub-problems, regression and classification, thereby improving the multi-scale detection capability. Its main function is to create candidate boxes using RPN, then classify and regress the candidate boxes that are generated at each point. After being aligned using ROI pooling, the candidate regions are sent to the fully connected layer for bounding box regression and classification. On the other hand, the YOLO-based approach effectively balances accuracy and processing speed as a single-stage detection model. Converting the target detection task into a regression problem is the idea behind its design. The main concept is to use a basic convolution head to forecast the target’s category, position, and confidence all at once. The goal is to maximize inference speed and computing efficiency while implementing a lightweight, more effective convolutional architecture. Our conclusion is that the RPN-based method has a high computational complexity, especially when dealing with small target tasks. The YOLO-based method further optimizes the performance by improving the multi-scale detection capability while taking into account the inference speed and computational efficiency. YOLO can better balance inference speed and model correctness by introducing some high-complexity modules that are suited to the features of tiny target datasets.

4.3.4. Comparison with the Downsampling Module

To compare with other downsampling strategies, such as adaptive pooling methods, we conducted the following experiments based on the benchmark model. Adaptive pooling automatically adjusts the size of the pooling window according to the size of the input features, thus enabling the processing of input images of different sizes. Table 13 displays the experimental data, and our comparative model obtained a 48.7% mAP, outperforming the comparative methods. In addition, our comparative model also achieved a 35.7% mAP$_S$ in detecting small targets, which is better than the comparative methods. Although adaptive pooling can flexibly adapt to different input sizes and avoid extra computational overhead, information loss is still an issue, particularly when it comes to maintaining the specifics of small target areas.
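For clarity, adaptive pooling fixes the output resolution and derives the window size from the input, as in the short hedged example below (the 2× reduction target is illustrative).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 80, 80)
# Adaptive pooling derives its window size from the input so the output is always 40x40,
# whereas the NDB reaches the same resolution by rearranging pixels into channels.
pool = nn.AdaptiveAvgPool2d((40, 40))
y = pool(x)   # (1, 64, 40, 40): detail inside each pooling window is averaged away
```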

4.3.5. Comparison with CBAM and SAEB

We used a comparison approach, incorporating the attention-based CBAM and the SAEB into the baseline model, as indicated in Table 14. According to the experimental data, our comparison model outperformed the comparison approaches with an overall detection performance of 50.4% mAP. At the same time, our comparison model also achieved a 37.9% mAP$_S$ in small target detection, which is also superior to other methods. CBAM avoids complex network layers and extra processing operations by using a few fully connected layers in conjunction with basic global average pooling and global max pooling operations, and by highlighting crucial channels and areas it can greatly increase feature expressiveness. However, it might not capture enough detail in small target areas, particularly when the target size is close to that of the background noise.
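As a point of reference for this comparison, the block below sketches the channel attention branch of CBAM as commonly described (a shared MLP over global average- and max-pooled descriptors); the reduction ratio is an illustrative assumption.

```python
import torch
import torch.nn as nn

class CBAMChannelAttention(nn.Module):
    """Channel attention branch of CBAM: a shared MLP over average- and max-pooled descriptors."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                   # global average pooling -> shared MLP
        mx = self.mlp(x.amax(dim=(2, 3)))                    # global max pooling -> shared MLP
        weights = torch.sigmoid(avg + mx).view(b, c, 1, 1)   # per-channel weights
        return x * weights                                   # channels are reweighted; spatial detail is untouched
```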

4.3.6. Comparison with the UAVDT Dataset

We tested the model on the difficult UAVDT-benchmark-M dataset to confirm its capacity for generalization. In total, 40,735 image label pairs make up this dataset, which we split into training, validation, and test sets in a 5:2:3 ratio. The UAVDT-benchmark-M dataset is a common test set in the field of drone target identification and includes complex background, multi-object, and multi-scale detection tasks. As a result, using this dataset for testing can more accurately represent how well the model performs in challenging settings.
The experimental results illustrate that our model has exhibited the ability to generalize and has produced improved outcomes on various datasets, as indicated in Table 15. In particular, the model outperformed the baseline YOLOv8n by 99.2% mAP@0.5 on the UAVDT dataset. This outcome not only shows how well the model performed on the common test set but also demonstrates how flexible it is in a variety of situations.
Particularly remarkable is that the model performs well in recognizing tiny objects, attaining 40.1% mAP$_S$, which is also better than the baseline YOLOv8n. Especially in complex background environments, the model shows robustness and can deal with the interference of cluttered backgrounds.

4.4. Visualization and Analysis

4.4.1. Confusion Matrix

As seen in Figure 8, we visualized the confusion matrices for NSC-YOLOv8 and YOLOv8 to illustrate the efficacy of our approach. The anticipated categories are represented by the columns in these matrices, whereas the true categories are represented by the rows. The diagonal shows the correctly anticipated categories, whereas the off-diagonal shows the wrongly predicted categories.
As can be observed, the diagonal portions of the confusion matrix for NSC-YOLOv8s are darker than those for YOLOv8s, suggesting that NSC-YOLOv8s has reduced the frequency of tiny target missed detections and increased the precision of object category prediction.
Bicycles, tricycles, and awning-tricycles, among other small targets, are often misjudged as background, resulting in a high miss-detection rate. Although the model has been improved, the proportion of correct predictions is still low. Small vehicles have small volumes and are often obscured in complex backgrounds, increasing the difficulty of detection. Therefore, improving the detection accuracy of these small targets is one of the challenges in low-pixel target detection. To solve the miss-detection problem, we can introduce an occlusion perception mechanism in our future work, using the relative position or temporal information of the target to identify occluded small targets. In addition, using Generative Adversarial Networks (GANs) to generate real occlusion scenarios can enhance the model’s adaptability to occluded targets, thereby improving the detection effect.

4.4.2. Inference Results

To visualize the detection performance of our method, this study conducted inference experiments with NSC-YOLOv8x, YOLOv8s, and YOLOv8x. The experimental data were chosen from four common situations, including market junctions, public buildings, urban roadways, and traffic intersections, each containing a high number of diverse small targets, making them ideal for the inference experiments.
As seen in Figure 9, the missed detection rate of the inferred images is lowered at the points represented by red boxes in the figure. NSC-YOLOv8x exhibits greater precision in identifying objects at the far end of the field of view when compared to YOLOv8s and YOLOv8x. Additionally, fewer small targets are missed, increasing the overall effectiveness of detection.

5. Conclusions

There are difficulties in identifying small objects in UAV aerial photography, including high missed detection rates for small objects, information loss from downsampling, loss of low-dimensional features, and weak correlation of contextual features. We propose the NSC-YOLOv8 as a solution to these problems. By adding a small target detection head, we increased detection precision and decreased missed detections. Model parameters are also decreased by changing the bottom-up layer structure and removing the large detection head. Secondly, we incorporated an NDB to reduce information loss from downsampling. Thirdly, we integrated an SAEB to mitigate the loss of low-dimensional features. Finally, we added a CARB to enhance contextual feature correlation. Compared to YOLOv8s, the NSC-YOLOv8s model improves average detection accuracy by 11.7%, outperforming other existing algorithms.
Despite making some progress, the NSC-YOLOv8 still struggles with missed detection for very small targets like bicycles, tricycles, and awning tricycles, highlighting the need for further improvements in these challenging categories. To address the issue of missed detection, in future work, we could first incorporate an occlusion awareness mechanism that leverages the target’s relative position or temporal data to identify occluded small targets. Second, we could employ GANs to simulate real-world occlusion scenarios, enhancing the model’s ability to detect occluded targets and ultimately improving its overall performance. Additionally, regarding dataset diversity, we could use an infrared dataset, as infrared datasets can capture the thermal radiation of objects, helping to identify small targets at night or in smoke. We could also use multispectral datasets, as they enhance target discrimination by providing information at different wavelengths. It is feasible to combine multidimensional information, strengthen the model’s resilience in complicated settings, and increase the detection accuracy of dynamic changes and obscured small targets by merging RGB, infrared imaging, and multispectral datasets. Future research will examine the potential and real-world impacts of this approach to enhance the model’s functionality.

Author Contributions

Conceptualization: D.C. (Dongmin Chen); methodology: D.C. (Dongmin Chen); validation: D.C. (Dongmin Chen), D.C. (Danyang Chen) and F.Z.; formal analysis: D.C. (Dongmin Chen), D.C. (Danyang Chen) and C.Z.; data curation: D.C. (Dongmin Chen); writing—review and editing: D.C. (Dongmin Chen), D.C. (Danyang Chen) and C.Z.; visualization: D.C. (Dongmin Chen) and F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the Natural Science Foundation of Guangxi under Grant No. 2025GXNSFAA069540 and the Research Capacity Improvement Project of Young Researcher under Grant No. 2024KY0017.

Data Availability Statement

The Visdrone2019-DET dataset is available at https://github.com/VisDrone/VisDrone-Dataset (accessed on 6 April 2025). The UAVDT-benchmark-M dataset is available at https://sites.google.com/view/grli-uavdt (accessed on 6 April 2025).

Acknowledgments

We would like to express our gratitude to the Intelligent Computing and the High-Performance Computing Platform of Guangxi University for their resources and support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dong, J.; Zhang, Y. Optimization of Autonomous UAV Control Technology based on Computer Algorithms. In Proceedings of the 2022 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA), Dalian, China, 20–21 August 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 194–197. [Google Scholar]
  2. Lee, M.; Choi, M.; Yang, T.; Kim, J.; Kim, J.; Kwon, O.; Cho, N. A study on the advancement of intelligent military drones: Focusing on reconnaissance operations. IEEE Access 2024, 12, 55964–55975. [Google Scholar] [CrossRef]
  3. Puri, V.; Nayyar, A.; Raja, L. Agriculture drones: A modern breakthrough in precision agriculture. J. Stat. Manag. Syst. 2017, 20, 507–518. [Google Scholar] [CrossRef]
  4. Sowmya, V.; Janani, A.S.; Hussain, S.M.; Aashica, A.; Arvindh, S. Creating a resilient solution: Innovating an emergency response drone for natural disasters. In Proceedings of the 2024 10th International Conference on Communication and Signal Processing (ICCSP), Melmaruvathur, India, 12–14 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 344–348. [Google Scholar]
  5. Zhang, R.; Galvin, R.; Li, Z. Drone-Based Geological Heritage Conservation and Exploration: Insights from Copper UNESCO Geopark. Geoheritage 2024, 16, 98. [Google Scholar] [CrossRef]
  6. Wang, H.; Wu, J.; Zhang, C.; Lu, W.; Ni, C. Intelligent security detection and defense in operating systems based on deep learning. Int. J. Comput. Sci. Inf. Technol. 2024, 2, 359–367. [Google Scholar] [CrossRef]
  7. Ahmari, R.; Hemmati, V.; Mohammadi, A.; Mynuddin, M.; Kebria, P.; Mahmoud, M.; Homaifar, A. Evaluating Trojan Attack Vulnerabilities in Autonomous Landing Systems for Urban Air Mobility. In Proceedings of the Automation, Robotics & Communications for Industry 4.0/5.0, Granada, Spain, 19–21 February 2025; p. 80. [Google Scholar]
  8. Xu, C.; Zhang, Y.; Chen, S. A Multi-Strategy Integrated Improved Yolov8n Algorithm and Its Application to Automatic Driving Detection. In Proceedings of the 2024 IEEE 7th International Conference on Information Systems and Computer Aided Education (ICISCAE), Dalian, China, 27–29 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 330–336. [Google Scholar]
  9. Ahmari, R.; Hemmati, V.; Mohammadi, A.; Kebria, P.; Mahmoud, M.; Homaifar, A. A Data-Driven Approach for UAV-UGV Integration. In Proceedings of the Automation, Robotics & Communications for Industry 4.0/5.0, Granada, Spain, 19–21 February 2025; p. 77. [Google Scholar]
  10. Zhang, Y.; Lv, C.; Wang, D.; Mao, W.; Li, J. A novel image detection method for internal cracks in corn seeds in an industrial inspection line. Comput. Electron. Agric. 2022, 197, 106930. [Google Scholar] [CrossRef]
  11. Zhou, J.; Hao, M.; Zhang, D.; Zou, P.; Zhang, W. Fusion PSPnet image segmentation based method for multi-focus image fusion. IEEE Photonics J. 2019, 11, 6501412. [Google Scholar] [CrossRef]
  12. Vedaldi, A.; Zisserman, A. Vgg Convolutional Neural Networks Practical; Department of Engineering Science, University of Oxford: Oxford, UK, 2016; Volume 66. [Google Scholar]
  13. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  14. Akbari, Y.; Almaadeed, N.; Al-Maadeed, S.; Elharrouss, O. Applications, databases and open computer vision research from drone videos and images: A survey. Artif. Intell. Rev. 2021, 54, 3887–3938. [Google Scholar] [CrossRef]
  15. Wang, X.; He, N.; Hong, C.; Wang, Q.; Chen, M. Improved YOLOX-X based UAV aerial photography object detection algorithm. Image Vis. Comput. 2023, 135, 104697. [Google Scholar] [CrossRef]
  16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single Shot Multibox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  18. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  19. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef]
  21. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
  22. Liu, Y.; Sun, P.; Wergeles, N.; Shang, Y. A survey and performance evaluation of deep learning methods for small object detection. Expert Syst. Appl. 2021, 172, 114602. [Google Scholar] [CrossRef]
  23. Kaur, R.; Singh, S. A comprehensive review of object detection with deep learning. Digit. Signal Process. 2023, 132, 103812. [Google Scholar] [CrossRef]
  24. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  25. Zhang, H.; Cloutier, R.S. Review on One-Stage Object Detection Based on Deep Learning. EAI Endorsed Trans. e-Learn. 2022, 7, 1–10. [Google Scholar] [CrossRef]
  26. Du, L.; Zhang, R.; Wang, X. Overview of two-stage object detection algorithms. J. Phys. Conf. Ser. 2020, 1544, 012033. [Google Scholar] [CrossRef]
  27. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  28. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 142–158. [Google Scholar] [CrossRef]
  29. Farhadi, A.; Redmon, J. Yolov3: An incremental improvement. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 1804, pp. 1–6. [Google Scholar]
  30. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  31. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  32. Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  33. Zendehdel, N.; Chen, H.; Leu, M.C. Real-time tool detection in smart manufacturing using You-Only-Look-Once (YOLO) v5. Manuf. Lett. 2023, 35, 1052–1059. [Google Scholar] [CrossRef]
  34. He, P.; Chen, W.; Pang, L.; Zhang, W.; Wang, Y.; Huang, W.; Han, Q.; Xu, X.; Qi, Y. The survey of one-stage anchor-free real-time object detection algorithms. In Proceedings of the Sixth Conference on Frontiers in Optical Imaging and Technology: Imaging Detection and Target Recognition, Nanjing, China, 22–24 October 2023; SPIE: Bellingham, WA, USA, 2024; Volume 13156, p. 1315602. [Google Scholar]
  35. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  36. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  37. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 0–0. [Google Scholar]
  38. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8311–8320. [Google Scholar]
  39. Liang, X.; Zhang, J.; Zhuo, L.; Li, Y.; Tian, Q. Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1758–1770. [Google Scholar] [CrossRef]
  40. Hong, M.; Li, S.; Yang, Y.; Zhu, F.; Zhao, Q.; Lu, L. SSPNet: Scale selection pyramid network for tiny person detection from UAV images. IEEE Geosci. Remote. Sens. Lett. 2021, 19, 8018505. [Google Scholar] [CrossRef]
  41. Deng, C.; Wang, M.; Liu, L.; Liu, Y.; Jiang, Y. Extended feature pyramid network for small object detection. IEEE Trans. Multimed. 2021, 24, 1968–1979. [Google Scholar] [CrossRef]
  42. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  43. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; Springer: Cham, Switzerland, 2022; pp. 443–459. [Google Scholar]
  44. Qin, J.; Yu, W.; Feng, X.; Meng, Z.; Tan, C. A UAV Aerial Image Target Detection Algorithm Based on YOLOv7 Improved Model. Electronics 2024, 13, 3277. [Google Scholar] [CrossRef]
  45. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  46. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef]
  47. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 658–666. [Google Scholar]
  48. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  49. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
  50. Mei, J.; Zhu, W. BGF-YOLOv10: Small object detection algorithm from unmanned aerial vehicle perspective based on improved YOLOv10. Sensors 2024, 24, 6911. [Google Scholar] [CrossRef]
  51. Zhang, J.; Yang, X.; He, W.; Ren, J.; Zhang, Q.; Zhao, Y.; Bai, R.; He, X.; Liu, J. Scale optimization using evolutionary reinforcement learning for object detection on drone imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 410–418. [Google Scholar]
  52. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  53. Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-resolution detection network for small objects. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
  54. Zhou, J.; Vong, C.M.; Liu, Q.; Wang, Z. Scale adaptive image cropping for UAV object detection. Neurocomputing 2019, 366, 305–313. [Google Scholar] [CrossRef]
  55. Lou, H.; Duan, X.; Guo, J.; Liu, H.; Gu, J.; Bi, L.; Chen, H. DC-YOLOv8: Small-size object detection algorithm based on camera sensor. Electronics 2023, 12, 2323. [Google Scholar] [CrossRef]
  56. Zhang, Z. Drone-YOLO: An efficient neural network method for target detection in drone images. Drones 2023, 7, 526. [Google Scholar] [CrossRef]
Figure 1. YOLOv8 architecture.
Figure 2. NSC-YOLOv8 architecture.
Figure 3. NDB structure: (a) input features, (b) labeling each channel, (c) rearranging the labeled channels, and (d) concatenating the rearranged features along the channel dimension.
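The caption describes a rearrangement of spatial positions into channels followed by channel-wise concatenation, i.e., a lossless alternative to strided convolution or pooling. As an illustration only, the following is a minimal PyTorch sketch of such a space-to-depth downsampling step in the spirit of SPD-Conv [43]; the fusion convolution and channel sizes are assumptions, and the actual NDB design in this paper may differ.

```python
import torch
import torch.nn as nn

class SpaceToDepthDown(nn.Module):
    """Illustrative non-lossy 2x downsampling: rearrange spatial positions into
    channels (space-to-depth), then fuse them with a 1x1 convolution."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Four spatially sub-sampled copies are stacked, so the 1x1 conv sees 4*C channels.
        self.fuse = nn.Conv2d(4 * in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, 4C, H/2, W/2): no pixel is discarded, unlike strided conv.
        x = torch.cat(
            [x[..., 0::2, 0::2], x[..., 1::2, 0::2],
             x[..., 0::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )
        return self.fuse(x)

# Example: SpaceToDepthDown(64, 128)(torch.randn(1, 64, 80, 80)).shape == (1, 128, 40, 40)
```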
Figure 4. SAEB architecture.
Figure 5. CARB structure, where the generated offsets, original grid, and resampled features are denoted by X_Offs, X_Grid, and X_Res, respectively.
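To illustrate the resampling idea named in the caption, the sketch below predicts content-dependent offsets (X_Offs), adds them to a regular sampling grid (X_Grid), and bilinearly resamples the input features (X_Res) with F.grid_sample. This is a generic sketch under those assumptions, not the exact CARB implementation; the offset head, its normalization, and the kernel size are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentAwareResample(nn.Module):
    """Illustrative content-aware resampling: offsets predicted from the feature
    content deform a regular grid before bilinear resampling."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict a 2D offset (X_Offs) for every output location.
        self.offset = nn.Conv2d(channels, 2, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        # Regular sampling grid (X_Grid) in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # Content-dependent offsets, kept small relative to the grid spacing.
        offs = self.offset(x).permute(0, 2, 3, 1).tanh() / max(h, w)
        # Resampled features (X_Res).
        return F.grid_sample(x, grid + offs, mode="bilinear", align_corners=True)
```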
Figure 6. Information about the labeling of objects in the VisDrone2019-DET dataset. In the x-axis scale of (a), “ped” stands for “pedestrian”; “bic” for “bicycle”; “tri” for “tricycle”; “a.tri” for “awning-tricycle”.
Figure 7. Error cases of FP and FN with true label Car.
Figure 8. Confusion matrix for different models: (a) YOLOv8s and (b) NSC-YOLOv8s. On the axis scale, “ped” stands for “pedestrian”, “bic” for “bicycle”, “tri” for “tricycle”, “a.tri” for “awning-tricycle”, and “bacg” for “background”.
Figure 9. Inference results of different models on the VisDrone2019-DET-val dataset. The red boxes mark regions for comparison: the more objects detected within a red box, the stronger the model’s ability to detect small objects.
Table 1. Deep learning framework and hardware platform parameters.
Environment | Configuration
CPU | i5-12400F
Memory | 16 GB
Operating System | Windows 10
GPU | NVIDIA GeForce RTX 3090
GPU Memory | 24 GB
Python IDE | PyCharm
Python Version | 3.10.15
Deep Learning Framework | PyTorch 1.12.1 + CUDA 11.3
Table 2. Important parameters for training the model.
Hyperparameter | Allocation
Epochs | 200
Batch Size | 6
Input Image Size | 640 × 640
Momentum | 0.932
Learning Rate | 0.01
Final Learning Rate | 0.005
Weight Decay | 0.0005
Optimizer | SGD
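For readers reproducing the setup, the following is a minimal sketch of how such a configuration could be passed to the Ultralytics YOLOv8 training API. The model and dataset YAML names are placeholders (the NSC-YOLOv8 architecture itself is not shipped with Ultralytics), and mapping the final learning rate of 0.005 to lrf=0.5 assumes Ultralytics’ convention that the final learning rate equals lr0 × lrf.

```python
from ultralytics import YOLO

# Placeholder model/dataset configs for illustration only.
model = YOLO("yolov8s.yaml")

model.train(
    data="VisDrone.yaml",   # dataset config (placeholder path)
    epochs=200,
    batch=6,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,               # initial learning rate
    lrf=0.5,                # assumed: final LR = lr0 * lrf = 0.005
    momentum=0.932,
    weight_decay=0.0005,
)
```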
Table 3. Comparison of detection outcomes between standard detection heads and tiny target detection heads. The best outcomes are denoted in bold in the table.
Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) | mAP_S (%) | mAP_M (%) | mAP_L (%) | Param (M)
YOLOv8s | 52.4 | 39.7 | 41.3 | 24.7 | 32.0 | 46.11 | 2.03 | 11.1
+Head_small | 56.7 | 44.5 | 46.7 | 28.2 | 38.0 | 46.10 | 2.08 | 7.2
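The metric columns in this and the following tables are assumed to follow the standard COCO-style conventions, summarized below for reference.

```latex
\text{mAP@}0.5 \;=\; \frac{1}{K}\sum_{c=1}^{K}\text{AP}_c\Big|_{\text{IoU}=0.5},
\qquad
\text{mAP@}0.5\text{--}0.95 \;=\; \frac{1}{10}\sum_{t\in\{0.50,\,0.55,\,\dots,\,0.95\}}
\frac{1}{K}\sum_{c=1}^{K}\text{AP}_c\Big|_{\text{IoU}=t}
```

Here AP_c is the area under the precision–recall curve for class c and K is the number of classes. mAP_S, mAP_M, and mAP_L restrict the evaluation to small, medium, and large objects, which under the COCO convention correspond to box areas below 32², between 32² and 96², and above 96² pixels, respectively; Param (M) denotes the parameter count in millions.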
Table 4. Comparison of detection outcomes using different bounding box loss functions, with the bold values highlighting the best results.
Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%)
+CIoU | 56.68 | 44.47 | 46.71 | 28.19
+DIoU | 56.44 | 45.13 | 47.29 | 28.45
+GIoU [47] | 56.82 | 44.63 | 47.32 | 28.48
+EIoU [48] | 56.54 | 44.79 | 46.97 | 28.22
+SIoU [49] | 56.46 | 44.71 | 47.30 | 28.48
+WIoU v1 | 56.24 | 44.71 | 46.99 | 28.22
+WIoU v2 | 56.79 | 45.10 | 47.36 | 28.38
+WIoU v3 | 57.08 | 45.15 | 47.59 | 28.71
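As a concrete example of one of the compared regression losses, a minimal PyTorch implementation of the GIoU loss [47] for boxes in (x1, y1, x2, y2) format is sketched below. It is an illustrative re-implementation, not the code used in the experiments; the WIoU variants adopted here additionally apply a dynamic focusing mechanism [42] and are not reproduced.

```python
import torch

def giou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Per-box GIoU loss (1 - GIoU) for (N, 4) tensors in (x1, y1, x2, y2) format."""
    # Intersection area
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    # Union area and IoU
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter + eps
    iou = inter / union

    # Smallest enclosing box
    cx1 = torch.min(pred[:, 0], target[:, 0])
    cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2])
    cy2 = torch.max(pred[:, 3], target[:, 3])
    c_area = (cx2 - cx1) * (cy2 - cy1) + eps

    giou = iou - (c_area - union) / c_area
    return 1.0 - giou
```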
Table 5. Experimental results of introducing the NDB module, with the bold values in the table highlighting the best outcomes.
Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) | GFLOPs
Baseline | 57.1 | 45.1 | 47.6 | 28.7 | 46.6
+S1-NDB | 56.4 | 44.9 | 47.4 | 28.8 | 50.0
+S2-NDB | 57.6 | 45.4 | 48.1 | 29.3 | 50.1
+S3-NDB | 57.8 | 44.7 | 47.7 | 29.0 | 50.1
+S4-NDB | 56.2 | 44.8 | 47.2 | 28.6 | 50.1
+S5-NDB | 56.1 | 44.1 | 46.7 | 28.2 | 50.1
+B1-NDB | 56.8 | 45.5 | 47.9 | 29.1 | 50.1
+ALL-NDB | 57.5 | 46.1 | 48.7 | 29.9 | 75.8
Table 6. Experimental results of introducing the SAEB module, with the bold values representing the best results.
Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) | GFLOPs
Baseline | 57.1 | 45.1 | 47.6 | 28.7 | 46.6
+S1-SAEB | 57.0 | 45.7 | 47.9 | 28.9 | 50.6
+S2-SAEB | 56.3 | 45.5 | 48.1 | 29.0 | 50.6
+S3-SAEB | 56.8 | 46.2 | 48.1 | 29.2 | 50.5
+T1-SAEB | 58.9 | 46.4 | 49.8 | 30.5 | 81.8
+B1-SAEB | 56.1 | 46.3 | 48.0 | 28.8 | 50.6
+ALL-SAEB | 58.7 | 47.5 | 50.4 | 30.9 | 97.4
Table 7. Experimental outcomes of introducing the CARB module, with the bold values indicating the best results.
Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) | GFLOPs
Baseline | 57.1 | 45.1 | 47.6 | 28.7 | 46.6
+T1-CARB | 58.58 | 45.94 | 47.28 | 28.79 | 46.7
+T2-CARB | 57.13 | 46.43 | 48.61 | 29.07 | 46.7
+T3-CARB | 57.59 | 46.27 | 48.45 | 28.75 | 46.7
+ALL-CARB | 57.78 | 46.81 | 48.93 | 29.18 | 46.7
Table 8. Results of the detection after introducing tactics, with the bold values in the table highlighting the best outcomes.
Model | DetectHead | WIoUv3 | NDB | SAEB | CARB | mAP@0.5 (%) | mAP@0.5–0.95 (%) | GFLOPs
YOLOv8s | – | – | – | – | – | 41.3 | 24.7 | 28.5
YOLOv8s | ✓ | – | – | – | – | 46.7 | 28.2 | 46.6
YOLOv8s | ✓ | ✓ | – | – | – | 47.6 | 28.7 | 46.6
YOLOv8s | ✓ | ✓ | ✓ | – | – | 48.7 | 29.9 | 75.8
YOLOv8s | ✓ | ✓ | ✓ | ✓ | – | 52.9 | 32.7 | 157.7
YOLOv8s | ✓ | ✓ | ✓ | ✓ | ✓ | 53.0 | 33.1 | 157.8
Table 9. NSC-YOLOv8 and YOLOv8 experimental results on the VisDrone2019-DET-val dataset. The best outcomes are highlighted in bold in the table below.
Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) | Param (M) | mAP_S (%) | FPS | GFLOPs
YOLOv8n | 45.2 | 33.6 | 34.3 | 19.8 | 3.0 | 28.2 | 127.0 | 8.1
Ours(n) | 56.0 | 44.4 | 46.6 | 27.9 | 2.6 | 36.6 | 69.3 | 43.1
YOLOv8s | 52.4 | 39.7 | 41.3 | 24.7 | 11.1 | 32.0 | 133.6 | 28.6
Ours(s) | 59.7 | 50.5 | 53.0 | 33.1 | 9.9 | 38.7 | 53.3 | 157.8
YOLOv8m | 55.7 | 43.9 | 45.7 | 28.0 | 25.8 | 33.3 | 91.3 | 78.7
Ours(m) | 61.8 | 52.9 | 55.5 | 34.7 | 22.8 | 39.0 | 37.6 | 364.9
YOLOv8l | 58.3 | 45.5 | 47.7 | 29.9 | 43.6 | 33.8 | 72.3 | 164.9
Ours(l) | 63.1 | 54.5 | 57.1 | 36.3 | 39.9 | 39.8 | 29.9 | 672.8
YOLOv8x | 59.8 | 46.8 | 49.4 | 30.7 | 68.1 | 34.8 | 58.4 | 257.4
Ours(x) | 65.2 | 55.0 | 58.3 | 37.4 | 62.3 | 40.7 | 21.6 | 1050.3
Table 10. Detection outcomes of NSC-YOLOv8 and the YOLO series models, with the bold values in the table highlighting the best results.
Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%)
YOLOv3 | 54.0 | 43.6 | 41.9 | 23.3
YOLOv4 | 36.0 | 48.6 | 42.1 | 25.7
YOLOv5 | 46.4 | 34.6 | 34.4 | 19.0
YOLOv7 | 51.4 | 42.1 | 39.9 | 21.6
YOLOv8(s) | 52.4 | 39.7 | 41.3 | 24.7
YOLOv10(n) | – | – | 29.0 | –
Ours(n) | 56.0 | 44.4 | 46.6 | 27.9
Ours(s) | 59.7 | 50.5 | 53.0 | 33.1
Ours(m) | 61.8 | 52.9 | 55.5 | 34.7
Ours(l) | 63.1 | 54.5 | 57.1 | 36.3
Ours(x) | 65.2 | 55.0 | 58.3 | 37.4
Table 12. Comparison of implementation details with Reference [51].
Hyperparameter | Ours | Reference [51]
Momentum | 0.932 | 0.9
Learning Rate | 0.01 | 0.01
Final Learning Rate | 0.005 | 0.01
Drop-Out Rate | 0.5 | 0.5
Weight Decay | 0.0005 | 0.0005
Optimizer | SGD | SGD
Dataset | VisDrone2019-DET | VisDrone2019-DET
Metrics | mAP & mAP@50 | mAP & mAP@50
Table 13. Detection outcomes of different downsampling modules, with the bold values in the table representing the best results.
Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) | mAP_S (%) | Param (M) | FPS
+ALL-AdPool | 55.2 | 45.3 | 46.8 | 26.2 | 34.9 | 8.1 | 78.4
+ALL-NDB | 57.5 | 46.1 | 48.7 | 29.9 | 35.7 | 8.3 | 75.5
Table 14. Detection outcomes of the transformer-based module and SAEB, with the bold values in the table representing the best results.
Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) | mAP_S (%) | Param (M) | FPS
+ALL-CBAM | 52.9 | 41.9 | 43.0 | 25.5 | 36.0 | 8.2 | 72.8
+ALL-SAEB | 58.7 | 47.5 | 50.4 | 30.9 | 37.9 | 8.9 | 66.9
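For context on this comparison, the sketch below is a minimal, generic re-implementation of the standard CBAM attention block (channel attention followed by spatial attention) used here as a baseline; the reduction ratio and kernel size are assumed typical defaults, not necessarily the configuration used in the experiment.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal CBAM: channel attention from pooled descriptors, then spatial attention."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Shared MLP applied to the average- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, max(channels // reduction, 1), 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(max(channels // reduction, 1), channels, 1, bias=False),
        )
        # Convolution over the channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ca = torch.sigmoid(
            self.mlp(x.mean((2, 3), keepdim=True)) + self.mlp(x.amax((2, 3), keepdim=True))
        )
        x = x * ca
        sa = torch.sigmoid(
            self.spatial(torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1))
        )
        return x * sa
```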
Table 15. Detection outcomes on UAVDT-benchmark-M-val, with the bold values in the table representing the best results.
Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) | mAP_S (%) | Param (M) | FPS
YOLOv8n | 97.4 | 96.5 | 98.9 | 78.6 | 38.9 | 3.0 | 195.6
Ours(n) | 98.4 | 96.9 | 99.2 | 83.1 | 40.1 | 2.6 | 112.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
