An Improved Model Based on YOLOv8 for Small Object Detection and Recognition

He, Jia; Luo, Suyun

doi:10.3390/info17020173


by Jia He and Suyun Luo *

School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China

* Author to whom correspondence should be addressed.

Information 2026, 17(2), 173; https://doi.org/10.3390/info17020173

Submission received: 23 December 2025 / Revised: 20 January 2026 / Accepted: 26 January 2026 / Published: 9 February 2026

(This article belongs to the Section Artificial Intelligence)


Abstract

With the rapid advancement of remote sensing technology, remote sensing images are increasingly being used in applications such as geographical monitoring, disaster warning, and urban planning. However, detecting small objects—such as vehicles and small buildings—in such imagery remains challenging due to complex backgrounds, weak features, and interference from factors like terrain, clouds, and lighting, leading to high rates of missed detections and false alarms. To tackle these issues, this paper proposes an improved YOLOv8-based framework for small object detection in remote sensing images. The enhancements include a multi-scale feature fusion mechanism, optimized data augmentation strategies incorporating super-resolution techniques, and a redesigned loss function that emphasizes small objects. These refinements significantly improve the model’s ability to extract discriminative features and detect small targets against cluttered backgrounds. Experimental results demonstrate superior performance across multiple metrics, including precision, recall, mAP50, and mAP50-95, particularly for challenging categories like small vehicles and buildings. This research not only provides an effective solution to the key technical bottleneck in small object detection, advancing the progress of related algorithms, but also offers important theoretical and practical experience for subsequent work.

Keywords:

small object detection; YOLOv8; object recognition; remote sensing image processing

1. Introduction

As a core task in computer vision, object detection has evolved in ways that have profoundly affected the accuracy and efficiency of image analysis and understanding. Early methods, such as the Haar feature-based Viola–Jones detector described by Huang and Shimizu [1] and the Histogram of Oriented Gradients (HOG) descriptor, laid a crucial foundation through their simple design and effectiveness in specific scenarios (e.g., face detection, pedestrian detection). However, these traditional methods rely heavily on handcrafted features, which limits their generalization capability. In scenarios with complex and variable backgrounds, object deformation, or diverse categories, their detection accuracy and speed struggle to meet practical application demands. Small object detection and recognition, meanwhile, are widely applied in fields including security surveillance, remote sensing monitoring, equipment control, and military reconnaissance.

The rise of deep learning has led to revolutionary breakthroughs in object detection. Two-stage algorithms, represented by the R-CNN series and described by Liu et al. [2], pioneered the use of Convolutional Neural Networks (CNNs). These methods generate candidate regions via selective search and perform feature extraction and classification on each region, significantly improving accuracy. The subsequent Fast R-CNN enhanced efficiency by sharing convolutional features and introducing Region of Interest Pooling (RoI Pooling). Faster R-CNN, as in Sheng et al. [3], further proposed the Region Proposal Network (RPN), achieving end-to-end training and substantially boosting both speed and accuracy, with outstanding performance on benchmarks like PASCAL VOC. Single-stage algorithms, exemplified by the YOLO series and SSD, as described by Zhai et al. [4], instead eliminate the region proposal step and directly predict object categories and locations in a single forward pass. This approach has significant advantages in scenarios demanding high real-time performance. SSD handles objects of varying sizes through detection on multi-scale feature maps. The YOLO series has undergone continuous iterative upgrades owing to its concise and efficient design philosophy. As a recent version, YOLOv8 (Ultralytics [5]; Jocher et al. [6]) incorporates several optimizations that build upon its predecessors: it employs the more efficient C2f module in place of the earlier C3 module; enhances the feature fusion mechanism within the backbone network to improve cross-scale information flow, thereby boosting multi-scale object detection capability; applies an Exponential Moving Average (EMA) of the model weights to enhance training stability and convergence speed; and refines data augmentation strategies, including variants of Mosaic and MixUp, to strengthen model generalization. On public benchmarks like COCO, YOLOv8 demonstrates remarkable advantages in both mean Average Precision (mAP) and inference speed. It is consequently widely applied in areas such as intelligent security systems for detecting humans, vehicles, and anomalous objects, and autonomous driving for recognizing vehicles, pedestrians, and traffic signs.

Nevertheless, YOLOv8 still faces challenges in specific complex scenarios. Small object detection is particularly problematic: under low-light conditions, degraded image quality and reduced target-background contrast can easily lead to missed detections or false alarms; in scenes containing small or densely packed objects, the limited pixel representation of targets results in sparse feature information and high susceptibility to interference, creating a significant bottleneck in the detection accuracy of existing models, as noted by Zhang et al. [7]. Current research on YOLOv8 primarily focuses on optimizing the model architecture, for example through efficient feature fusion and attention mechanisms, to enhance robustness and accuracy, and on making the model lightweight to facilitate deployment on resource-constrained devices.

Focusing on the challenge of small object detection, this paper introduces targeted enhancements to the YOLOv8 framework. The core challenge lies in the weak feature response of small objects: shallow feature maps suffer from insufficient resolution, leading to detail loss, while deep feature maps, though rich in semantic information, exhibit ambiguous spatial localization. To address these limitations, we propose an improved YOLOv8 detection and recognition algorithm featuring the following three key innovations.

Network Architecture: An enhanced multi-scale feature fusion mechanism incorporates a modified BiFPN structure, strengthening small object feature aggregation through cross-scale connections and adaptive weight allocation. We further explore dilated convolutions to expand the receptive field without resolution loss, capturing richer small object details.
Data Processing: Optimized augmentation strategies specifically integrate small object zooming and cropping, improving target diversity and representation in training data. Image preprocessing combined with Super-Resolution (SR) techniques enhances small object clarity and feature discriminability.
Training Optimization: Loss function adjustments significantly increase small object detection weights, while variants of focal loss suppress interference from simple background samples, concentrating optimization on challenging small object instances.

The proposed algorithm in this paper is designed to significantly enhance the detection accuracy and robustness of YOLOv8 for small objects such as small vehicles and buildings in remote sensing images within complex backgrounds while establishing a comprehensive recognition system capable of outputting detailed analytical reports. The anticipated outcomes of this research are expected to provide more reliable technical support for small object detection in critical domains, including remote sensing interpretation, intelligent security systems, and medical image analysis.

It should be emphasized that while newer versions such as YOLOv11 have since been released, YOLOv8 remains a mature and widely adopted framework that allows for clear ablation and comparison. The improvements proposed here, such as multi-scale feature fusion, enhanced convolution designs, and refined loss functions, are architectural in nature and can be transferred to newer baseline models in future work.

The rest of the paper is organized as follows. Section 2 provides a critical analysis of the existing literature. Section 3 describes the improved YOLOv8 algorithm. Section 4 presents the experimental results and analyzes the advantages of the improved YOLOv8. Section 5 concludes the paper and outlines future work.

2. Related Work

In this section, we present a comprehensive literature review on small object detection and recognition using YOLOv8 and prior models.

Small Object Detection (SOD) remains a significant challenge in computer vision due to the inherent limitations of low resolution, weak feature representation, and susceptibility to background clutter. The YOLO (You Only Look Once) series, renowned for its speed and accuracy, has evolved significantly, yet SOD presents persistent difficulties. This review synthesizes key research advancements in SOD using YOLO-family algorithms and relevant foundational work, highlighting innovations, strengths, findings, and limitations.

2.1. Foundational Object Detection Architectures Using Pre-YOLO and Early YOLO

The evolution of modern object detectors began with R-CNN, as described by Girshick et al. [8], which pioneered the use of Convolutional Neural Networks (CNNs) for detection by generating region proposals via selective search followed by per-region CNN feature extraction and classification, significantly improving accuracy over traditional methods like HOG but suffering from extreme computational inefficiency. Building upon this, Faster R-CNN, as in Ren et al. [9], introduced the Region Proposal Network (RPN), enabling end-to-end training and shared convolutional features, which substantially improved speed while maintaining high accuracy, though its two-stage nature and RoI pooling still hindered small object detection (SOD) performance. Concurrently, the first single-stage detector YOLOv1, as described by Redmon et al. [10], framed detection as a unified regression problem, achieving real-time speeds through direct prediction from full images but struggling with localization accuracy, especially for small objects. SSD, as described by Liu et al. [11], addressed multi-scale detection by leveraging feature maps from different backbone layers for predictions, outperforming YOLOv1 in accuracy and small object detection but still facing degradation for very small targets and requiring complex hard negative mining strategies.

2.2. Evolution of YOLO for Enhanced Performance from YOLOv2 to YOLOv7

YOLOv2 (YOLO9000), as described by Redmon and Farhadi [12], introduced critical enhancements, including anchor boxes, batch normalization, and a passthrough layer for fine-grained features, improving recall and speed while partially mitigating small object limitations. YOLOv3, as detailed in Redmon and Farhadi [13], marked a significant leap by adopting a Feature Pyramid Network (FPN)-inspired structure with three detection scales, a more powerful Darknet-53 backbone, and better classifiers, achieving a more robust balance between speed and accuracy across object sizes, though computational costs increased. YOLOv4, as described by Bochkovskiy et al. [14], systematized advancements through its bag of freebies (such as mosaic augmentation) and bag of specials (such as modified PANet, SPP, and Mish activation), boosting overall performance and SOD via improved feature fusion and data diversity. Scaled-YOLOv4, as in Wang et al. [15], further refined this approach through principled model scaling, offering variants from lightweight to high-precision models, but without fundamentally resolving core SOD challenges like feature sparsity.

2.3. YOLOv8 and Contemporary SOD-Specific YOLO Variants

YOLOv8—see Ultralytics [5] and Jocher et al. [6]—representing the latest iteration, adopts an anchor-free split head, the efficient C2f module, and enhanced mosaic augmentation, achieving state-of-the-art speed-accuracy trade-offs on benchmarks like COCO while improving usability. However, its performance on dedicated SOD tasks remains suboptimal compared to specialized approaches. Complementing architectural advances, Wang et al. [16] proposed the Normalized Wasserstein Distance (NWD) metric to replace unstable IoU calculations for tiny objects, dramatically improving SOD accuracy when integrated into detection heads. Yang et al. [17] introduced the Asymptotic Feature Pyramid Network (AFPN) to enable direct feature interaction between non-adjacent levels and adaptive spatial fusion, enriching the context for small objects. YOLOF, as described by Chen et al. [18], challenged conventional wisdom by demonstrating competitive SOD performance using a single-level feature map with dilated encoders, offering a simpler alternative to FPN. PP-YOLOE, as discussed by Xu et al. [19], combined an anchor-free design, task-aligned heads, and a strong backbone to achieve high COCO APs, including competitive small object results.

2.4. Techniques Specifically Addressing Small Object Challenges

Fundamental techniques for SOD include focal loss according to Lin et al. [20], which tackles extreme foreground–background imbalance by down-weighting easy samples, proving essential for detecting rare small instances in dense scenes. Super-resolution techniques like Deep Back-Projection Networks (DBPNs), as in Haris et al. [21], enhance input image quality, potentially recovering small object details, though computational costs and integration complexity remain barriers. Dilated convolutions according to Yu et al. [22] expand receptive fields without sacrificing resolution, preserving spatial details critical for SOD but risking grid artifacts and increased resource demands. EfficientDet’s BiFPN, as described by Tan et al. [23], optimizes multi-scale feature fusion via learnable bidirectional cross-scale connections, effectively boosting small object detection efficiency compared to earlier FPN/PANet structures.
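As a rough illustration of how focal loss down-weights easy samples, the following sketch compares it with plain cross-entropy for a single binary prediction. The values alpha = 0.25 and gamma = 2 are commonly quoted defaults and are assumptions here, not settings reported in this paper; the function names are illustrative only.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for one prediction with probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma.

    alpha and gamma are assumed defaults, not values taken from this paper.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return alpha_t * (1.0 - p_t) ** gamma * cross_entropy(p, y)

# Ratio of focal loss to cross-entropy for an easy and a hard positive sample.
easy_ratio = focal_loss(0.95, 1) / cross_entropy(0.95, 1)  # well-classified sample
hard_ratio = focal_loss(0.30, 1) / cross_entropy(0.30, 1)  # poorly classified sample
print(f"easy: {easy_ratio:.6f}, hard: {hard_ratio:.6f}")
```

Under these assumed settings, the well-classified sample retains a far smaller share of its cross-entropy loss than the poorly classified one, which is the behavior that concentrates training on hard instances.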

2.5. Application-Oriented SOD in Remote Sensing and Other Domains

Domain-specific challenges are particularly acute in remote sensing (RS), where Li et al. [24] comprehensively surveyed SOD difficulties, such as extreme scale variation and complex backgrounds, and introduced the DIOR-R benchmark dataset to drive progress. Cross-domain adaptation techniques, exemplified by H²FA R-CNN, as in Zhang et al. [25], address critical deployment issues like domain shift (synthetic-to-real) and weak supervision, enhancing model robustness in real-world scenarios such as aerial imagery analysis. Nie et al. [26] proposed a lightweight remote sensing small target detection model that adds a dedicated small target detection layer to the YOLOv8 feature fusion network, which significantly improves the aggregation of small target features. Zhang et al. [27] proposed the YOLOv8-Extend model for agricultural pest detection, integrating GSConv to expand the receptive field, BiFPN to enhance feature fusion, and the CBAM attention mechanism. Zhou et al. [28] embedded the ECA channel attention mechanism in YOLOv8 for radar small target detection, capturing channel dependencies through one-dimensional convolution, and added a high-resolution small target detection layer to process low signal-to-noise ratio RD images.

Furthermore, Xu et al. [29] developed the YOLOv8-MPEB algorithm, which replaces the backbone network with MobileNetV3 to reduce the number of parameters and integrates EMA multi-scale attention into the C2f module, in combination with BiFPN, to improve the accuracy of small target detection in drone images. Wang et al. [30] put forward the LSOD-YOLO model, which uses cross-layer output reconstruction modules to integrate deep and shallow features and incorporates Large Separable Kernel Attention (LSKA) to improve spatial perception. Cao et al. [31] improved the model by inserting upsampling and downsampling branches at the 10th layer of the YOLOv8 backbone network and reduced background interference by aggregating directional features through vertical or horizontal convolutions.

2.6. Summary and Research Gap

The trajectory of SOD research reveals consistent innovation from foundational region-based and single-stage detectors, through iterative YOLO improvements, to specialized techniques like enhanced feature fusion (AFPN, BiFPN), advanced loss functions (Focal Loss, NWD), and resolution-preserving mechanisms (dilated convs, SR). Despite these advances, persistent challenges include inherent feature sparsity of small objects, resolution-receptive field trade-offs, sensitivity to background clutter, and computational overhead of enhancement modules. While YOLOv8 provides a robust baseline, its standard implementation lacks dedicated SOD optimizations, creating a compelling niche for our proposed YOLOv8 enhancements targeting multi-scale feature refinement, adaptive data augmentation, and specialized loss functions for complex real-world applications.

3. The Improved YOLOv8

This section presents the motivation, the YOLOv8 network structure, and the network structure of the improved YOLOv8.

3.1. Motivation

In the task of small object detection, conventional detection algorithms face significant challenges due to the low pixel proportion and lack of detailed information about the targets. The issue of feature loss makes it difficult for models to capture critical semantic information of small objects, resulting in consistently high rates of missed detections. Excessive computational costs not only restrict the deployment of algorithms on resource-constrained devices but also compromise real-time detection capabilities. Furthermore, insufficient localization accuracy leads to considerable deviations between the detected bounding boxes and the actual object boundaries, undermining the reliability of detection outcomes. In addition, the standard architecture of YOLOv8 suffers from several technical limitations for small object detection. The high stride and downsampling rates cause shallow feature maps to lose fine-grained details critical for small targets. The limited high-resolution detection heads reduce spatial precision for objects covering few pixels. Moreover, the fixed receptive fields in convolutional layers may not adapt to extremely small target sizes. To address these issues, this paper optimizes the YOLOv8 algorithm from four aspects: detection layer design, feature extraction modules, loss function, and multi-scale feature fusion.

3.2. YOLOv8 Network Structure

Introduced by Ultralytics in 2023, as described by Ultralytics [5] and Jocher et al. [6], YOLOv8 represents an improvement over previous algorithms in the YOLO series, with its network architecture illustrated in Figure 1. The YOLOv8 model consists of three main components: a backbone for feature extraction, a neck for feature fusion, and a head for providing detection outcomes. It offers five variants including YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x to accommodate a range of task requirements.

As shown in Figure 1, YOLOv8 is a single-stage object detection framework that offers notable advantages over two-stage detectors such as R-CNN and Fast R-CNN. Traditional two-stage detectors involve two independent steps: region proposal extraction and object classification. This sequential processing approach not only results in slower detection speeds but also demands substantial computational resources. In contrast, single-stage detectors integrate both region extraction and classification into a unified process, thereby simplifying the detection pipeline. By performing these tasks simultaneously, YOLOv8 achieves greater efficiency and faster inference, while maintaining competitive detection accuracy. This efficiency has led to the widespread adoption of YOLO-based algorithms in real-time object detection applications.

However, when applied to small object detection, YOLOv8 encounters specific challenges. Small objects are often difficult to detect due to their limited size, low resolution, and susceptibility to environmental factors.

3.3. The Improved YOLOv8 Network Structure

To enhance the detection completeness and accuracy of small objects in YOLOv8, we introduce optimizations across four key aspects: detection layer, feature extraction modules, loss function, and multi-scale feature fusion. The improvements are integrated into YOLOv8 as follows. First, a small object detection layer adds a fourth detection head operating on higher-resolution feature maps. Second, GSConv modules replace standard convolutions in the backbone to enhance feature sharing. Third, the CIoU loss is replaced by the SIoU loss to better model angular alignment for small objects. Finally, the SPPF module is replaced by the SPPCSPC module to provide richer multi-scale context without a significant increase in computation. The architecture of the improved YOLOv8 network is illustrated in Figure 2.

As illustrated in Figure 2, the proposed YOLOv8 network incorporates improvements across four key aspects to boost detection accuracy for small objects. In terms of detection layer design, the network structure and parameter configuration have been optimized to strengthen the model’s perception capability for minute targets. Regarding the feature extraction module, a novel network architecture and optimization strategies are introduced to improve the completeness and effectiveness of feature extraction, thereby minimizing the loss of fine-grained features of small objects. For loss function refinement, a more tailored function has been developed according to the characteristics of small object detection, leading to improved training performance and localization accuracy. In the multi-scale feature fusion strategy, a more efficient fusion mechanism is explored to leverage the advantages of features at different scales, enhancing the representational capacity for detecting small targets. Through these coordinated enhancements, the overall detection precision for small objects is significantly improved.

3.3.1. Small Object Detection Layer

In standard object detection tasks, when small objects are present in a dataset, their characteristic features may be overlooked by conventional detection layers, often resulting in missed detections or unsatisfactory performance. The original YOLOv8 architecture includes three detection heads, which may exhibit limited effectiveness in detecting very small objects. To address this issue, an additional detection head dedicated to small objects is incorporated in the improved YOLOv8 that is proposed in this paper. The added small object detection layer operates on a feature map with a stride of 8, corresponding to an output resolution of 80 × 80 for an input of 640 × 640. This higher-resolution feature map retains finer spatial information, which is essential for detecting targets spanning fewer than 20 × 20 pixels. This enhancement aims to improve detection accuracy for minute targets, strengthen the model’s adaptability to objects at different scales, and increase overall robustness.
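The relationship between stride, feature-map resolution, and object footprint described above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper names are hypothetical, and the strides 16 and 32 are included only as assumed coarser levels for comparison.

```python
def grid_size(input_size, stride):
    """Side length of the feature map produced at a given stride."""
    if input_size % stride != 0:
        raise ValueError("input size must be divisible by the stride")
    return input_size // stride

def cells_spanned(object_px, stride):
    """Approximate number of feature-map cells covered by one side of an object."""
    return object_px / stride

# Stride 8 is the value stated in the text; 16 and 32 are assumed for comparison.
for stride in (8, 16, 32):
    side = grid_size(640, stride)
    span = cells_spanned(20, stride)
    print(f"stride {stride:2d}: {side} x {side} map, a 20 px object spans {span} cells")
```

A 20-pixel object spans 2.5 cells at stride 8 but less than one cell at stride 32, which illustrates why the higher-resolution map retains more spatial information for small targets.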

3.3.2. GSConv Module

The GSConv (Global-Shared Convolution) module, described by Han et al. [32] and illustrated in Figure 3, is an efficient convolutional structure specifically designed for lightweight neural networks. It optimizes performance by integrating global contextual information with a channel feature-sharing mechanism. The core concept involves splitting the input channels into two groups: one processed through standard convolution to preserve local details and the other utilizing global average pooling to generate channel attention weights. These weights modulate a shared convolutional kernel to facilitate cross-channel information interaction.

Specifically, GSConv first divides the input feature maps into groups. The standard convolution branch performs conventional convolution, while the global-shared convolution branch extracts global context through global pooling, producing a channel-wise weight vector. This vector is then combined with the shared convolutional kernel and applied to the corresponding channels. The outputs of both branches are concatenated and subjected to channel shuffling, followed by a 1 × 1 convolution for dimension adjustment.

This design maintains computational efficiency comparable to depthwise separable convolution (DWConv) while significantly enhancing inter-channel information flow and global feature capture capability. It is particularly suitable for applications requiring high precision and real-time performance, such as small object detection, and demonstrates superior feature representation ability in lightweight models.
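The concatenation and channel shuffling step mentioned above can be sketched on channel labels alone. The example below is only an illustration of that one step, not of the full GSConv module; the function name and the choice of two groups are assumptions.

```python
def channel_shuffle(channels, groups=2):
    """Interleave channels from `groups` equally sized blocks.

    Equivalent to reshaping to (groups, n // groups), transposing,
    and flattening, so that neighbouring outputs come from different branches.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

# Hypothetical outputs of the two branches, represented by channel labels.
branch_a = ["a0", "a1", "a2"]
branch_b = ["b0", "b1", "b2"]

mixed = channel_shuffle(branch_a + branch_b, groups=2)
print(mixed)  # ['a0', 'b0', 'a1', 'b1', 'a2', 'b2']
```

After shuffling, a subsequent 1 × 1 convolution or grouped operation sees channels from both branches side by side, which is the mechanism by which information is exchanged between them.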

3.3.3. SIoU Loss Function

Object detection is a core problem in computer vision, and its effectiveness heavily relies on the definition of the loss function, which essentially measures the discrepancy between predicted and ground-truth bounding boxes. Traditional loss functions such as GIoU and CIoU primarily consider distance, overlap area, and aspect ratio. However, they incorporate angle information only indirectly through the difference in width–height ratios rather than explicitly modeling angular deviation. In contrast, SIoU loss, according to Gevorgyan [33], directly introduces an angle cost term, enabling more efficient convergence of predicted boxes toward ground-truth boxes by reducing degrees of freedom during training.

Compared to the CIoU loss function used by default in YOLOv8, SIoU more effectively captures geometric discrepancies in bounding box shape. By explicitly incorporating both shape and angle considerations, it facilitates faster and more accurate localization of small objects. This reduces misalignment caused by angular deviations and shape mismatches, thereby improving detection accuracy for small targets.

In small object detection, where targets often exhibit significant scale variations, robust perception across scales is essential. The SIoU loss function adapts better to such scale changes through scale-aware computations. By enabling more precise adjustment of the position and size of predicted boxes, SIoU demonstrates superior performance in detecting small objects.

The SIoU loss is defined using three cost components: distance cost, shape cost, and IoU cost, which can be described by Equation (1) as follows:

L_{box} = 1 - \mathrm{IoU} + \frac{Δ + Ω}{2}

(1)

where the angle cost is incorporated within the distance cost term. The angle cost contributes to improved training convergence and accuracy. Let α denote the angle between the horizontal axis and the line connecting the centers of the two bounding boxes, and β the angle with respect to the vertical axis. Let σ denote the Euclidean distance between the two centers and C_h the vertical distance between them. The angle cost is formulated in Equation (2):

Λ = 1 - 2 \sin^{2}\left(\arcsin(x) - \frac{π}{4}\right)

(2)

where

x = \frac{C_{h}}{σ} = \sin(α)

(3)

σ = \sqrt{(b_{c_{x}}^{gt} - b_{c_{x}})^{2} + (b_{c_{y}}^{gt} - b_{c_{y}})^{2}}

(4)

C_{h} = \max(b_{c_{y}}^{gt}, b_{c_{y}}) - \min(b_{c_{y}}^{gt}, b_{c_{y}})

(5)

Here, (b_{c_{x}}, b_{c_{y}}) and (b_{c_{x}}^{gt}, b_{c_{y}}^{gt}) represent the coordinates of the center points of the predicted box B and the ground-truth box B^{gt}, respectively.

For easy understanding, the angle cost intuition diagram is illustrated in Figure 4. The angle cost has the form of the double-angle identity in Equation (6):

\cos(2x) = 1 - 2 \sin^{2}(x)

(6)

The angle cost function uses α and β: it aims to minimize α when α is less than π/4; otherwise, it minimizes β, where β = π/2 − α.

The distance cost is designed to build upon the angle cost. Its core idea is that as the angular discrepancy between the predicted and ground-truth boxes increases, the contribution of the distance error to the total loss should decrease significantly. This design encourages the predicted box to align more closely with the ground-truth box in spatial location, as defined in Equation (7).

Δ = \sum_{t = x, y} \left(1 - e^{-γ ρ_{t}}\right)

(7)

where

ρ_{x} = \left(\frac{b_{c_{x}}^{gt} - b_{c_{x}}}{C_{w}}\right)^{2}, \quad ρ_{y} = \left(\frac{b_{c_{y}}^{gt} - b_{c_{y}}}{C_{h}}\right)^{2}, \quad γ = 2 - Λ

(8)

The shape cost accounts for aspect ratio mismatches and is formulated as shown in Equation (9):

Ω = \sum_{t = w, h} \left(1 - e^{-ω_{t}}\right)^{θ}

(9)

where

ω_{w} = \frac{|w - w^{gt}|}{\max(w, w^{gt})}, \quad ω_{h} = \frac{|h - h^{gt}|}{\max(h, h^{gt})}

(10)

Here, w and w^{gt} denote the widths of the predicted and ground-truth boxes, respectively, while h and h^{gt} represent their heights. The terms ω_{w} and ω_{h} correspond to the relative differences in width and height between the two boxes, and θ is a hyperparameter.

The IoU cost is defined as 1 minus the conventional Intersection over Union (IoU) value as shown in Equation (11):

L_{IoU} = 1 - \mathrm{IoU}

(11)

where

\mathrm{IoU} = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|}

(12)

This formulation emphasizes the non-overlapping regions between the predicted and ground-truth bounding boxes.
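To make the interaction of the cost terms concrete, the following is a minimal sketch of the loss in Equations (1) to (12) for a single pair of axis-aligned boxes given as (center x, center y, width, height). Several choices are assumptions rather than details stated in the text: σ is taken as the Euclidean distance between the box centers, C_w and C_h in Equation (8) are taken as the width and height of the smallest enclosing box (as in common SIoU implementations), θ = 4 is an illustrative value, and a small eps guards against division by zero.

```python
import math

def angle_cost(dx, dy, eps=1e-9):
    """Angle cost of Equation (2) for a center offset (dx, dy)."""
    sigma = math.hypot(dx, dy) + eps      # assumed: Euclidean center distance
    x = min(1.0, abs(dy) / sigma)         # sin(alpha), Equation (3)
    return 1.0 - 2.0 * math.sin(math.asin(x) - math.pi / 4) ** 2

def siou_loss(pred, gt, theta=4.0, eps=1e-9):
    """Box loss of Equation (1); pred and gt are (cx, cy, w, h) tuples."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # Corner coordinates of both boxes.
    p_x1, p_x2, p_y1, p_y2 = px - pw / 2, px + pw / 2, py - ph / 2, py + ph / 2
    g_x1, g_x2, g_y1, g_y2 = gx - gw / 2, gx + gw / 2, gy - gh / 2, gy + gh / 2

    # IoU, Equation (12).
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Distance cost, Equations (7) and (8), modulated by the angle cost.
    enc_w = max(p_x2, g_x2) - min(p_x1, g_x1) + eps   # assumed meaning of C_w
    enc_h = max(p_y2, g_y2) - min(p_y1, g_y1) + eps   # assumed meaning of C_h
    gamma = 2.0 - angle_cost(gx - px, gy - py, eps)
    rho_x = ((gx - px) / enc_w) ** 2
    rho_y = ((gy - py) / enc_h) ** 2
    distance = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape cost, Equations (9) and (10).
    omega_w = abs(pw - gw) / max(pw, gw)
    omega_h = abs(ph - gh) / max(ph, gh)
    shape = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    return 1.0 - iou + (distance + shape) / 2.0

gt_box = (10.0, 10.0, 4.0, 4.0)
print(siou_loss(gt_box, gt_box))                 # close to 0 for a perfect match
print(siou_loss((12.0, 10.0, 4.0, 4.0), gt_box)) # larger for a shifted box
```

In this sketch the angle cost is zero when the centers are aligned horizontally or vertically and reaches one at a 45° offset, so γ = 2 − Λ varies between 2 and 1 across that range.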

3.3.4. SPPCSPC Module

The SPPCSPC (Spatial Pyramid Pooling Cross-Stage Partial Channel) module illustrated in Figure 5, as described by He et al. [34] and Wang et al. [35], integrates spatial pyramid pooling with a cross-stage partial connection structure. Building upon the SPP architecture, it employs multiple pooling kernels of different sizes to perform feature pooling, thereby capturing multi-scale contextual information. Additionally, the CSP structure is introduced to split the feature maps into two parts: one is transmitted directly, while the other undergoes cross-stage processing before being fused back. This design reduces parameter redundancy, improves computational efficiency, and mitigates the loss of information during deep network propagation.

Compared to the default SPPF module in YOLOv8, the SPPCSPC module is more effective in extracting multi-scale features, thereby enhancing the model’s ability to comprehend both target objects and complex backgrounds. The incorporation of the CSP structure not only decreases parameter redundancy and increases computational efficiency but also facilitates better gradient flow, contributing to more stable model training.
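The multi-kernel pooling at the heart of spatial pyramid pooling can be illustrated without any deep learning library. The sketch below covers only the pooling branches; it omits the convolutions and the CSP split, and the kernel sizes (5, 9, 13) are an assumption commonly seen in SPP-style modules rather than values given in this paper.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling with a k x k window; windows are clipped at the border."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    pooled = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [fmap[a][b]
                      for a in range(max(0, i - r), min(h, i + r + 1))
                      for b in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        pooled.append(row)
    return pooled

def spp_branches(fmap, kernel_sizes=(5, 9, 13)):
    """Return the input map plus one pooled map per kernel size (assumed sizes)."""
    return [fmap] + [max_pool_same(fmap, k) for k in kernel_sizes]

# A 7 x 7 map with a single activated cell, standing in for a small object.
fmap = [[0] * 7 for _ in range(7)]
fmap[3][3] = 1

branches = spp_branches(fmap)
for k, branch in zip(("id", 5, 9, 13), branches):
    print(k, sum(map(sum, branch)))  # number of cells that see the activation
```

Each larger kernel spreads the single activation over a wider neighbourhood, so concatenating the branches gives every location a view of context at several scales while keeping the spatial size unchanged.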

4. Experimental Results and Analysis

This section presents the experimental results and analyzes the superiority of the improved YOLOv8.

4.1. Dataset and Experimental Environment

In this study, the NWPU-VHR-10 dataset of Cheng et al. [36] was selected for training and validation. It is a widely used high-resolution remote sensing image dataset containing ten object categories, such as airplanes, ships, and vehicles. Objects in high-resolution remote sensing images exhibit varying sizes, shapes, and complex backgrounds, providing a rich data source for model training. The dataset was chosen for its established relevance as a benchmark for small object detection in remote sensing: its categories show significant scale variation and include challenging small targets such as vehicles and ships in complex backgrounds. Its high spatial resolution (0.5–2 m) allows for meaningful evaluation of feature retention and localization accuracy for small objects. The original dataset was converted into the YOLO format through custom code to facilitate subsequent training and evaluation.
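The conversion code itself is not given here, so the following is only a sketch of what such a step might look like. It assumes that each source annotation provides pixel corner coordinates (x1, y1, x2, y2) and a class index, and it writes the usual YOLO label line of a zero-based class index followed by the normalized center and size. The function names are hypothetical.

```python
def corners_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert pixel corner coordinates to normalized (cx, cy, w, h)."""
    cx = (x1 + x2) / 2.0 / img_w
    cy = (y1 + y2) / 2.0 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return cx, cy, w, h

def yolo_label_line(class_id, box, img_w, img_h):
    """Format one object as a YOLO label line: 'class cx cy w h'."""
    cx, cy, w, h = corners_to_yolo(*box, img_w, img_h)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Hypothetical annotation: class 0, box corners in a 1000 x 800 pixel image.
line = yolo_label_line(0, (100, 200, 300, 400), img_w=1000, img_h=800)
print(line)  # 0 0.200000 0.375000 0.200000 0.250000
```

If the source class indices start at 1, they would need to be shifted by one before writing, since YOLO labels use zero-based class indices.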

The experiments were conducted on a Windows 11 operating system, utilizing an AMD Ryzen 7 5800H with Radeon Graphics CPU and an NVIDIA GeForce RTX 4090 GPU. The code was developed in PyCharm 2023.x with Python 3.8+. To ensure the fairness and comparability of the experiments, no pre-trained weights were used, and the same parameters were applied throughout. Some parameter settings are shown in Table 1.

The hyperparameters in Table 1 were selected based on standard YOLOv8 training protocols and validated through a limited grid search on 10% of the training set. For instance, AdamW was chosen for its adaptive learning rates and its handling of weight decay, and a learning rate of 0.01 gave stable convergence without overfitting.
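For readers who wish to reproduce a comparable setup, the settings in Table 1 can be written as a dictionary of training arguments. The sketch below uses argument names from the Ultralytics training interface; that the authors used these exact arguments is an assumption, and the two file names in the commented usage are placeholders that do not come from the paper.

```python
# Table 1 expressed with Ultralytics-style argument names (mapping is assumed).
train_args = {
    "epochs": 300,          # Epochs
    "batch": 32,            # Batch size
    "imgsz": 640,           # Image size 640 x 640
    "optimizer": "AdamW",   # Optimizer
    "amp": True,            # Automatic mixed precision
    "lr0": 0.01,            # Initial learning rate
    "momentum": 0.937,      # Momentum
    "weight_decay": 0.001,  # Weight decay
    "pretrained": False,    # No pre-trained weights, as stated in Section 4.1
}

# Hypothetical usage; both YAML file names are placeholders:
# from ultralytics import YOLO
# YOLO("improved-yolov8.yaml").train(data="nwpu-vhr10.yaml", **train_args)
```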

4.2. Evaluation Metrics

To evaluate the performance of the improved YOLOv8 for small object detection, this study employs precision, recall, F1 score, mAP50, and mAP50-95 as evaluation metrics. Precision refers to the proportion of correctly identified positive predictions among all detections classified as positive, that is, the ratio of true positives to the sum of true positives and false positives. Recall measures the proportion of actual positive instances that are correctly detected, defined as the ratio of true positives to the sum of true positives and false negatives. The F1 score is the harmonic mean of precision and recall. The metric mAP50 denotes the mean average precision across all classes at an Intersection over Union (IoU) threshold of 0.5. Meanwhile, mAP50-95 represents the average mean average precision computed over IoU thresholds from 0.5 to 0.95 with a step size of 0.05. The formal definitions of these metrics are provided in the following Equations:

\mathrm{Precision} = \frac{TP}{TP + FP}

(13)

\mathrm{Recall} = \frac{TP}{TP + FN}

(14)

F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(15)

AP = \int_{0}^{1} p(r) \, dr

(16)

mAP = \frac{1}{n} \sum_{i=1}^{n} AP_{i}

(17)

where TP, FP, and FN denote true positives, false positives, and false negatives, which correspond to correct detections, incorrect detections, and missed detections. AP, mAP, and n represent the average precision, the mean average precision for all categories, and the number of categories in object detection, respectively.
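Equations (13)–(17) translate directly into code. The sketch below is a simplified illustration: the function names are our own, and the integral in Equation (16) is approximated with the trapezoidal rule over sampled points of the precision–recall curve, whereas detection toolkits typically integrate an interpolated precision envelope.

```python
from typing import Sequence


def precision(tp: int, fp: int) -> float:
    """Equation (13): TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Equation (14): TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p: float, r: float) -> float:
    """Equation (15): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Equation (16), approximated by the trapezoidal rule.
    Points must be sorted by increasing recall."""
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area


def mean_average_precision(aps: Sequence[float]) -> float:
    """Equation (17): mean of the per-class AP values."""
    return sum(aps) / len(aps)


# Toy example: 8 correct detections, 2 false alarms, 2 missed objects.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=2)
f1 = f1_score(p, r)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
```

For mAP50-95, the same AP computation would be repeated at each IoU threshold from 0.5 to 0.95 and the results averaged.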

4.3. Training Results

An analysis of the experimental data over 300 training epochs indicates that the deep learning model achieved favorable training outcomes. Both the training and validation sets exhibited a consistent decline in bounding box loss and classification loss. Key performance metrics, including precision, recall, mAP50, and mAP50-95, showed steady improvement throughout the training process, with a noticeable performance leap in the mid- to late stages. The learning rate decay strategy employed in the training contributed positively to model convergence. Although minor fluctuations in validation loss were observed in certain epochs, and the final mAP50-95 value of 0.50 indicates room for further improvement, the current model has established a foundational framework for object detection. Future enhancements could involve improving data augmentation strategies, optimizing the neural network architecture, or extending the training duration to achieve higher performance.

4.4. F1 Score Curves

Figure 6 illustrates the F1 score versus confidence threshold curves for all classes. The dark blue curve, representing all classes combined, reaches an F1 score of 0.83 at a confidence threshold of 0.420, indicating robust overall performance of the model at this level. Considerable variation is observed across different object categories. Curves corresponding to classes such as “airplane” and “storage tank” maintain high F1 scores over a wide range of confidence thresholds, reflecting strong detection performance. In contrast, the curve for the “basketball court” class remains consistently low, while categories like “bridge” and “vehicle” also exhibit relatively lower curves, suggesting suboptimal detection accuracy. Furthermore, as the confidence threshold increases, most curves exhibit a general decline in F1 score. Therefore, selecting an appropriate confidence threshold is essential in practical applications to balance precision and recall effectively.
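The threshold selection described above amounts to sweeping candidate confidence thresholds and keeping the one with the highest F1 score. The following sketch illustrates the idea on toy data. It assumes that each detection has already been matched to the ground truth, so that its true-positive flag does not change with the threshold; the detections and the function name are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def best_f1_threshold(
    detections: List[Tuple[float, bool]],
    num_gt: int,
    thresholds: Sequence[float],
) -> Tuple[float, Optional[float]]:
    """Return (best F1, threshold) over the candidate thresholds.
    detections: (confidence, is_true_positive) pairs."""
    best_f1, best_t = 0.0, None
    for t in thresholds:
        kept = [is_tp for conf, is_tp in detections if conf >= t]
        tp = sum(kept)
        p = tp / len(kept) if kept else 0.0   # precision at this threshold
        r = tp / num_gt if num_gt else 0.0    # recall at this threshold
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_f1, best_t


# Hypothetical detections for an image set with 4 ground-truth objects.
dets = [(0.9, True), (0.8, True), (0.7, False),
        (0.6, True), (0.3, False), (0.2, False)]
f1, thr = best_f1_threshold(dets, num_gt=4, thresholds=[0.1, 0.5, 0.75])
```

A low threshold keeps many false alarms and a high threshold discards correct detections, so the best F1 score is found at an intermediate value, as in Figure 6.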

4.5. Precision–Recall Curves

As shown in the precision–recall curve in Figure 7, the overall mAP@0.5 of the model reaches 0.864. Considerable variation in performance is observed across different object categories. Classes such as “airplane” (0.985), “storage tank” (0.960), “baseball diamond” (0.954), “ground track field” (0.977), and “harbor” (0.953) exhibit outstanding precision–recall curves, indicating high levels of both detection accuracy and completeness. The categories “ship” (0.881) and “vehicle” (0.899) also demonstrate reasonably good performance. In contrast, the results for “tennis court” (0.838) and “bridge” (0.818) are comparatively lower. Notably, the “basketball court” category (0.371) shows a significantly inferior curve relative to other classes, reflecting unsatisfactory performance in both precision and recall.
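The overall value follows from the per-class figures via Equation (17). As a quick arithmetic check, the snippet below averages the ten per-class AP values quoted in this subsection; the variable names are our own.

```python
# Per-class AP at IoU 0.5, as quoted in the text for Figure 7.
per_class_ap50 = {
    "airplane": 0.985,
    "ship": 0.881,
    "storage tank": 0.960,
    "baseball diamond": 0.954,
    "tennis court": 0.838,
    "basketball court": 0.371,
    "ground track field": 0.977,
    "harbor": 0.953,
    "bridge": 0.818,
    "vehicle": 0.899,
}

# Mean over classes (Equation (17)).
map50 = sum(per_class_ap50.values()) / len(per_class_ap50)

# The weakest class and the mean over the remaining nine classes.
worst = min(per_class_ap50, key=per_class_ap50.get)
map50_without_worst = (
    sum(v for k, v in per_class_ap50.items() if k != worst)
    / (len(per_class_ap50) - 1)
)
```

The mean rounds to 0.864, matching the reported overall mAP@0.5. Excluding “basketball court” raises the mean of the other nine classes to about 0.918, which shows how strongly this single category pulls the overall figure down.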

4.6. Comparative Experiments

As illustrated by the training and validation loss and evaluation metric curves in Figure 8, all training losses—including train/box loss, train/cls loss, and train/dfl loss—exhibited a consistent decline throughout the training process before eventually stabilizing. This trend indicates continuous optimization in the model’s capabilities for object localization, category classification, and distribution focal loss-related learning. Similarly, the validation losses, namely val/box loss, val/cls loss, and val/dfl loss, also decreased and converged to stable values, demonstrating steady performance on the validation set. Concurrently, metrics/precision(B) and metrics/recall(B) showed a steady increase, while metrics/mAP50(B) and metrics/mAP50-95(B) improved progressively. These results reflect enhanced precision, recall, and detection performance across varying IoU thresholds, confirming the effectiveness of the training process and indicating that the model’s overall performance improved consistently before reaching a stable state.

In addition, to facilitate a clearer comparison, the evaluation metrics of the baseline YOLOv8 model are compared with those of the proposed improved YOLOv8 model; the results are presented in Figure 9. The enhanced YOLOv8 model achieves higher performance across all evaluated metrics. It is worth noting that recent small-object-focused detectors such as YOLOv8-Extend and EfficientDet have incorporated mechanisms like GSConv, attention modules, and enhanced feature fusion. While these methods show promising results in their respective domains, our approach integrates a complementary set of improvements that collectively address feature sparsity, multi-scale representation, and localization accuracy in remote sensing imagery. Our empirical gains over baseline YOLOv8 suggest that such an integrated design is effective, though direct comparative training on the same dataset remains a future step.

As shown in Figure 9, the improved YOLOv8 achieves higher overall precision than the baseline YOLOv8, with reduced fluctuation and stable performance at an elevated level. In terms of recall, the proposed model outperforms the original version in most training epochs, demonstrating better completeness in detecting objects. Furthermore, it exhibits superior performance in both mAP50 and mAP50-95 across the majority of epochs, along with greater stability in these metrics. In conclusion, the improved YOLOv8 shows consistent improvements across all evaluation metrics compared to the baseline model.

More specifically, baseline YOLOv8 achieved an mAP50-95 of only 0.432, indicating notable degradation under higher IoU thresholds, which is a known challenge for small objects where precise localization is critical. Categories with the smallest average pixel area (e.g., “vehicle” AP@0.5 = 0.82, “basketball court” AP@0.5 = 0.37) were particularly underserved, confirming the need for architectural enhancements. Relative to baseline YOLOv8, the improved model raises mAP50 from 0.77 to 0.864 (a relative increase of 12.3%) and mAP50-95 from 0.432 to 0.50 (a relative increase of 15.8%). Precision and recall improved by 9.5% and 11.2%, respectively. The standard deviation of mAP50 over the final 10 epochs was 0.008, indicating stable convergence.
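The percentage gains are expressed relative to the baseline value, not as absolute differences in mAP. The short sketch below recomputes them from the rounded figures quoted in the text; the function name is our own. With these rounded inputs the results are approximately 12.2% and 15.7%, marginally below the reported 12.3% and 15.8%, a gap that presumably arises because the published baseline and final values are themselves rounded.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# Rounded values quoted in the text.
map50_gain = relative_gain(0.77, 0.864)
map5095_gain = relative_gain(0.432, 0.50)

# Absolute differences, for contrast (in mAP points).
map50_abs = 0.864 - 0.77
map5095_abs = 0.50 - 0.432
```

In absolute terms the gains are about 0.094 mAP50 and 0.068 mAP50-95.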

4.7. Visualization of Detection Results

To illustrate the detection performance of the improved YOLOv8, this section presents confusion matrices and sample detection results.

Figure 10 presents the confusion matrices generated by YOLOv8 and the improved YOLOv8. In these matrices, the rows correspond to the actual classes, and the columns correspond to the predicted classes. The diagonal entries indicate the proportion of correct classifications, whereas the off-diagonal entries reflect misclassification rates. As shown in Figure 10, the improved YOLOv8 demonstrates strong performance in detecting objects such as “airplane”, “storage tank”, “ground track field”, and “harbor”, achieving high accuracy rates of 0.97 or above with minimal misclassifications. Categories including “ship”, “baseball diamond”, “bridge”, and “vehicle” also attain relatively good recognition accuracy, ranging between 0.76 and 0.83, though some misclassifications remain. However, the model exhibits difficulties in distinguishing between “tennis court” and “basketball court”, where mutual misclassification occurs frequently. Additionally, the identification of “background” remains challenging, with a high rate of confusion with other object categories. Overall, while the model shows competent detection capability for certain classes, further improvements are needed to enhance discrimination between visually similar categories and background detection.
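The proportions read off such a matrix are obtained by normalizing raw counts. The sketch below follows the convention described above (rows are actual classes, columns are predicted classes) and divides each row by its total, so that each diagonal entry is the fraction of that class that was correctly classified. The three-class counts are hypothetical and serve only to illustrate the computation; they are not taken from Figure 10.

```python
from typing import List


def normalize_rows(counts: List[List[int]]) -> List[List[float]]:
    """Divide each row by its sum; rows are actual classes."""
    rates = []
    for row in counts:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return rates


# Hypothetical counts (rows: actual, columns: predicted).
classes = ["tennis court", "basketball court", "background"]
counts = [
    [40, 8, 2],   # actual tennis court
    [9, 12, 3],   # actual basketball court
    [5, 4, 0],    # actual background (false alarms on empty regions)
]
rates = normalize_rows(counts)
```

In this toy case the diagonal entries are 0.8 and 0.5 for the two court classes, and the off-diagonal entries between them quantify the mutual misclassification discussed above.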

To visually demonstrate the detection performance, images with different labels under complex backgrounds were selected for display, as shown in Figure 11.

For the airplane class, as seen in images like “020.jpg” and “041.jpg”, the model achieves relatively high confidence scores (e.g., 0.8, 0.9). This indicates that the model performs well in detecting airplanes, which are typically characterized by distinct shapes and relatively large sizes among the small objects in this dataset, making them easier to identify with sufficient feature information.

Regarding the ship class, in images such as “503.jpg”, “517.jpg”, and “296.jpg”, the confidence scores vary. Some ships are detected with high confidence (e.g., 0.9), while others have lower scores (e.g., 0.3). Ships often appear in complex maritime backgrounds, and variations in their partial occlusions, sizes, and the clarity of the water-surface background can lead to differences in detection performance. Smaller ships or those with less distinct outlines are more challenging for the model to detect accurately.

For the bridge class, shown in images like “539.jpg”, “562.jpg”, and “559.jpg”, the confidence scores are moderate (e.g., 0.3–0.8). Bridges have elongated and sometimes irregular structures, and their partial overlaps with the surrounding environment (such as water and land) can make feature extraction difficult, resulting in less consistent detection confidence compared to airplanes.

The sports field-related classes, including ground track field, tennis court, baseball diamond, and basketball court, exhibit varying detection performances. Ground track field and tennis court (e.g., “265.jpg”, “477.jpg”, “351.jpg”) are detected with high confidence (0.9), likely because they have relatively regular shapes and distinct color contrasts with the surrounding areas. Baseball diamond and basketball court also have good detection results in many cases (e.g., confidence scores of 0.8–1.0 in “116.jpg”, “096.jpg”), but there are instances with lower confidence (e.g., “129.jpg” for baseball diamond with 0.6). The differences arise from the diversity in the size, shape, regularity, and background complexity of these fields. Smaller or less clearly defined fields pose greater challenges for accurate detection.

For the vehicle class in “379.jpg”, the confidence scores range from 0.4 to 0.9. Vehicles are small objects with diverse appearances and are often located in complex traffic scenes, which leads to variations in detection accuracy.

In summary, according to the detection results illustrated in Figure 11, the improved YOLOv8 model demonstrates favorable detection performance for most small object classes. However, there are still differences in detection accuracy among different classes, which are mainly attributed to the inherent characteristics of the objects (such as shape, size, and texture) and the complexity of their surrounding environments. Objects with more regular shapes, larger sizes, and distinct feature contrasts are detected more accurately, while those with irregular shapes, smaller sizes, or complex backgrounds present greater challenges for the model. This analysis provides valuable insights for further refining the model, such as incorporating class-specific enhancement strategies or improving feature extraction for challenging object classes.

5. Conclusions

This study has addressed the critical challenge of small object detection in complex remote sensing imagery by proposing an improved YOLOv8-based framework. Through targeted refinements, which include a strengthened multi-scale feature fusion mechanism, optimized data augmentation strategies integrating super-resolution techniques, and a redesigned loss function emphasizing small objects, the model demonstrates marked improvements in detection accuracy, robustness, and generalization capability. Experimental validation confirms that the proposed model achieves superior performance across multiple metrics, including precision, recall, mAP50, and mAP50-95, particularly in challenging categories such as small vehicles and buildings. These outcomes underscore the efficacy of our approach in mitigating background interference and enhancing feature representation for small targets. In the future, we will focus on enhancing the detection capability for currently underperforming categories through targeted data balancing and adversarial training techniques. Additionally, we will carry out cross-dataset validation to strengthen the generalizability of the improved YOLOv8, and we plan to explore end-to-end super-resolution detection networks to better preserve critical fine-grained information.

Author Contributions

S.L. conducted the organization of the content. J.H. wrote the main manuscript text. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The analysis datasets for the current study are available from the first author on reasonable request (hj981780@163.com).

Acknowledgments

The authors thank the reviewers and editors for their constructive comments on this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Huang, L.L.; Shimizu, A. A Multi-Expert Approach for Robust Face Detection. Pattern Recognit. 2006, 39, 1695–1703.
2. Liu, H.I.; Tseng, Y.W.; Chang, K.C.; Wang, P.J.; Shuai, H.H.; Cheng, W.H. A Denoising Fpn with Transformer R-Cnn for Tiny Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4704415.
3. Sheng, W.; Yu, S.; Lin, J.; Chen, X. Faster Rcnn Target Detection Algorithm Integrating CBAM and FPN. Appl. Sci. 2023, 13, 6913.
4. Zhai, S.; Shang, D.; Wang, S.; Dong, S. Df-Ssd: An Improved SSD Object Detection Algorithm Based on Densenet and Feature Fusion. IEEE Access 2020, 8, 24344–24357.
5. Ultralytics: Yolov8 - Ultralytics Yolov8 Documentation. 2023. Available online: https://docs.ultralytics.com/models/yolov8/ (accessed on 19 May 2025).
6. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics Yolov8. 2023. Available online: https://github.com/ultralytics/ultralytics/ (accessed on 19 May 2025).
7. Zhang, Y.; Gao, G.; Chen, Y.; Yang, Z. Odd-Yolov8: An Algorithm for Small Object Detection in UAV Imagery. J. Supercomput. 2025, 81, 202.
8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-Cnn: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems (NeurIPS); IEEE: New York, NY, USA, 2015; p. 28.
10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single Shot Multibox Detector. In European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2016; pp. 21–37.
12. Redmon, J.; Farhadi, A. Yolo9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 7263–7271.
13. Redmon, J.; Farhadi, A. Yolov3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
14. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
15. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. Scaled-Yolov4: Scaling Cross Stage Partial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13029–13038.
16. Wang, J.; Xu, C.; Yang, W.; Yu, L. A Normalized Gaussian Wasserstein Distance for Tiny Object Detection. IEEE Trans. Image Process. 2022, 31, 7325–7338.
17. Yang, F.; Wu, Y.; Zhang, S.; Li, G.; Zhang, W. Afpn: Asymptotic Feature Pyramid Network for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 1–4 October 2023.
18. Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You Only Look One-Level Feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13039–13048.
19. Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y. Pp-Yoloe: An Evolved Version of Yolo. arXiv 2023, arXiv:2203.16250.
20. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
21. Haris, M.; Shakhnarovich, G.; Ukita, N. Deep Back-Projection Networks for Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2019; pp. 1664–1673.
22. Yu, F.; Koltun, V.; Funkhouser, T. Dilated Residual Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 472–480.
23. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
24. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
25. Zhang, H.; Zhang, Y.; Jia, K.; Zhang, L. H2FA R-Cnn: Holistic and Hierarchical Feature Alignment for Cross-Domain Weakly Supervised Object Detection. IEEE Trans. Multimed. 2022, 24, 374–385.
26. Nie, H.; Pang, H.; Ma, M.; Zheng, R. A Lightweight Remote Sensing Small Target Image Detection Algorithm Based on Improved Yolov8. Sensors 2024, 24, 2952.
27. Zhang, R.; Bai, X.; Fan, J. Crop Pest Target Detection Algorithm in Complex Scenes: Yolov8-Extend. Smart Agric. 2024, 6, 49–61.
28. Zhou, C.; Song, Q.; Zhang, Y. Small Target Detection Algorithm Based on Improved Yolov8 for Staring Radar. J. Signal Process. 2025, 41, 853–866.
29. Xu, W.; Cui, C.; Ji, Y.; Li, X.; Li, S. Yolov8-Mpeb Small Target Detection Algorithm Based on Uav Images. Heliyon 2024, 10, e29501.
30. Wang, H.; Zhao, J.; Zhao, D. Precision and Speed: Lsod-Yolo for Lightweight Small Object Detection. Expert Syst. Appl. 2025, 238, 122–135.
31. Cao, L.; Ma, Z.; Hu, Q.; Xia, Z.; Zhao, M. DCE-Net: An Improved Method for Sonar Small-Target Detection Based on YOLOv8. J. Mar. Sci. Eng. 2025, 13, 1478.
32. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589.
33. Gevorgyan, Z. Siou Loss: More Powerful Learning for Bounding Box Regression. Expert Syst. Appl. 2024, 250, 124539.
34. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
35. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. Cspnet: A New Backbone That Can Enhance Learning Capability of Cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
36. Cheng, G.; Han, J.; Guo, L.; Liu, Z.; Bu, S.; Ren, J. Object Detection in Remote Sensing Imagery Using a Discriminatively Trained Mixture Model. ISPRS J. Photogramm. Remote Sens. 2014, 85, 32–43.

Figure 1. The YOLOv8 network structure diagram.

Figure 2. The improved YOLOv8 network structure diagram.

Figure 3. The GSConv structure diagram.

Figure 4. The angle cost intuition diagram.

Figure 5. The SPPCSPC structure diagram.

Figure 6. The F1 Score curves.

Figure 7. The precision–recall curves.

Figure 8. The training and validation loss and evaluation metric curves.

Figure 9. The comparison results between YOLOv8 and the improved version.

Figure 10. The confusion matrices for YOLOv8 and the improved version.

Figure 11. The detection results of the improved YOLOv8.

Table 1. Key parameters.

Parameters	Setup
Epochs	300
Batch size	32
Image size	640 × 640
Optimizer	AdamW
Automatic mixed precision	True
Learning rate	0.01
Momentum	0.937
Weight decay	0.001

