4.1. Datasets
The VisDrone2019 dataset [47] is a widely recognized, large-scale benchmark designed for UAV vision research and algorithm evaluation. It comprises over 6000 video clips and 25,000 images covering diverse environments, including urban, rural, highway, and construction sites, under varying weather conditions and target sizes. This diversity enables a comprehensive assessment of model robustness across different scenarios. The dataset covers key object categories, such as pedestrians, vehicles, bicycles, and motorcycles, with detailed annotations encompassing bounding boxes, object categories, motion states, and occlusion information, all of which are essential for training, evaluation, and validation.
SRM-YOLO is evaluated on the VisDrone2019 dataset, which offers a broad and representative set of samples.
Figure 7 illustrates the dataset’s image distribution, label counts, and object size variations, highlighting several inherent challenges. First, object detection is complicated by factors such as high density, motion blur, small object size, and class ambiguity. For instance, distinguishing between pedestrians and people can be difficult in certain scenarios. Moreover, UAV imaging introduces significant variations in object scale and increased occlusion, further complicating detection. Second, class imbalance is evident in the dataset, with the Car category having 144,867 annotations, whereas Awning-tricycle has only 3246. This imbalance necessitates robust detection models capable of handling underrepresented classes. Third, the anchor size distribution reveals that while medium- and large-sized anchors dominate categories like Van, Truck, Car, and Bus, most objects are smaller than 100 × 100 pixels, with a significant proportion under 50 × 50 pixels, posing additional challenges for small object detection.
For this study, the dataset is divided into training, validation, and testing sets of 6471, 548, and 1610 images, respectively, following the partitioning used in the VisDrone2019 Challenge. All images are resized to 640 × 640 pixels for both training and testing. Within the YOLOv8 framework, two sets of scaling parameters, [0.33, 0.25, 1024] and [0.33, 0.50, 1024], are applied to adjust network depth, width, and maximum channel capacity, corresponding to the n- and s-scale model configurations.
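For illustration, the following Python sketch shows how such a [depth, width, max_channels] triple is typically applied in YOLOv8-style model definitions; the rounding rules follow the common Ultralytics convention, and the layer values in the example are hypothetical rather than taken from the SRM-YOLO configuration.

```python
import math

def scale_layer(base_repeats: int, base_channels: int,
                depth: float, width: float, max_channels: int) -> tuple[int, int]:
    """Apply YOLOv8-style compound scaling to one layer definition.

    depth scales the number of repeated blocks, width scales the channel
    count (capped at max_channels and rounded to a multiple of 8).
    """
    repeats = max(round(base_repeats * depth), 1) if base_repeats > 1 else base_repeats
    channels = min(base_channels, max_channels) * width
    channels = math.ceil(channels / 8) * 8  # keep channel counts divisible by 8
    return repeats, int(channels)

# Hypothetical block defined with 6 repeats and 512 channels
print(scale_layer(6, 512, depth=0.33, width=0.25, max_channels=1024))  # -> (2, 128)
print(scale_layer(6, 512, depth=0.33, width=0.50, max_channels=1024))  # -> (2, 256)
```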
4.3. Evaluation Metrics
In this experiment, the performance of the proposed network was evaluated using three standard object detection metrics—precision (P), recall (R), and mean average precision (mAP).
Precision (P) denotes the proportion of detected targets that are correctly predicted among all detected targets. It is computed with the following formula, where TP (true positives) denotes the model’s correct predictions and FP (false positives) denotes its incorrect predictions:

\[
P = \frac{TP}{TP + FP}
\]

Recall (R) denotes the proportion of targets that are correctly predicted among all ground-truth targets. It is computed with the following formula, where FN (false negatives) denotes targets that should have been detected but were missed by the model:

\[
R = \frac{TP}{TP + FN}
\]

Average precision (AP) is the area under the curve formed by precision and recall, and mean average precision (mAP) is the mean of the AP values over all object categories. Both can be computed with the following formulas, where N is the number of categories:

\[
AP = \int_{0}^{1} P(R)\,dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i
\]

In this paper, mAP50 refers to the mAP computed at an IoU threshold of 0.5.
In addition, the number of parameters (Param) and GFLOPs are reported to quantify the computational cost introduced by the proposed modules.
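For illustration, the following Python sketch computes these metrics from detection counts and a precision–recall curve. It follows the standard all-point interpolation used by common detection toolkits; the per-class AP values in the example are hypothetical and serve only to show how mAP is aggregated.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve (all-point interpolation)."""
    # Append sentinel values and make the precision envelope monotonically decreasing
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.flip(np.maximum.accumulate(np.flip(p)))
    # Integrate P(R) over the points where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is the mean AP over all classes; mAP50 uses an IoU threshold of 0.5
# when deciding whether a detection counts as a TP.
ap_per_class = [0.52, 0.41, 0.38]  # hypothetical per-class AP values
print(f"mAP: {np.mean(ap_per_class):.3f}")
```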
4.4. Ablation Experiment
To evaluate the effectiveness of each proposed component, we conducted a series of ablation studies using the YOLOv8n and YOLOv8s models as baselines. Our proposed model, SRM-YOLO, integrates the following four key components to enhance the detection performance of small objects: the SPD-Conv block, the Reuse Fusion Structure (RFS) for feature reutilization, a dedicated tiny-object detection head, and the MPDIoU loss function for improved localization accuracy. Specifically, the RFS module was designed to efficiently reuse multi-scale features extracted from the backbone, alleviating feature degradation caused by repeated upsampling and downsampling. This reinforced the feature representation of small-scale targets.
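Of these components, SPD-Conv has a compact, well-defined structure. For illustration, a minimal PyTorch sketch of the SPD-Conv idea (space-to-depth rearrangement followed by a non-strided convolution) is given below; the kernel size, normalization, and activation are illustrative assumptions rather than the exact configuration used in SRM-YOLO.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided convolution.

    A 2x space-to-depth step moves spatial detail into the channel
    dimension (C -> 4C, H/2 x W/2) so that downsampling does not discard
    fine-grained information, which is then mixed by a stride-1 conv.
    """
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_channels, out_channels, kernel_size=3,
                              stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Space-to-depth: sample every other pixel in four phases and
        # stack the phases along the channel axis.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.conv(x)))

# Example: a 160x160 feature map with 64 channels is downsampled to 80x80
out = SPDConv(64, 128)(torch.randn(1, 64, 160, 160))
print(out.shape)  # torch.Size([1, 128, 80, 80])
```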
The impact of each individual module was evaluated using standard detection metrics, including precision (P), recall (R), and mAP50, together with the number of parameters (Param) and GFLOPs. The results are presented in Table 3 and Table 4.
Table 3 and Table 4 present the detailed results of our ablation studies, where √ indicates that a module is activated and -- denotes that it is deactivated. This experimental design enables a clear assessment of each component’s contribution to the overall performance.
In YOLOv8s-based experiments, the baseline achieved a precision of 48.8%, a recall of 38.7%, and an mAP50 of 38.9%. Introducing the SPD-Conv module alone increased precision to 52.7% and mAP50 to 40.8%, while the Tiny Head alone raised mAP50 to 42.6%. When all modules were integrated, SRM-YOLO (based on YOLOv8s) achieved a precision of 53.6%, a recall of 44.6%, and an mAP50 of 45.6%.
These findings not only validate the individual effectiveness of SPD-Conv and the Tiny Head but also emphasize the synergistic effect achieved when combined with the RFS module. Notably, the RFS module benefits from the fine-grained representations provided by SPD-Conv during the feature fusion process, resulting in marked improvements in detection performance.
To further evaluate the effectiveness of the proposed MPDIoU loss, we conducted comparative experiments against several widely adopted IoU-based loss functions, including GIoU, DIoU, CIoU, and the Normalized Wasserstein Distance (NWD), using YOLOv8n as the baseline. As shown in Table 5, while GIoU, DIoU, and NWD offered only marginal improvements in mAP50, MPDIoU achieved a notable 1.0% increase, demonstrating its superior capacity to improve localization accuracy, particularly for small objects, and overall detection performance.
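For illustration, the following PyTorch sketch implements the published MPDIoU formulation, which augments IoU with the squared distances between corresponding box corners normalized by the squared image diagonal; the tensor layout and numerical-stability constant are assumptions of this sketch rather than details of the SRM-YOLO implementation.

```python
import torch

def mpdiou_loss(pred: torch.Tensor, target: torch.Tensor,
                img_w: float, img_h: float, eps: float = 1e-7) -> torch.Tensor:
    """MPDIoU loss for boxes in (x1, y1, x2, y2) format.

    MPDIoU penalizes the squared distances between the top-left and
    bottom-right corners of the predicted and ground-truth boxes,
    normalized by the squared image diagonal.
    """
    # Intersection area
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)

    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared corner distances, normalized by the squared image diagonal
    d1 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    d2 = (pred[..., 2] - target[..., 2]) ** 2 + (pred[..., 3] - target[..., 3]) ** 2
    diag = img_w ** 2 + img_h ** 2

    mpdiou = iou - d1 / diag - d2 / diag
    return 1.0 - mpdiou

# Hypothetical usage with one predicted and one ground-truth box
pred = torch.tensor([[50., 50., 150., 150.]])
gt = torch.tensor([[60., 60., 160., 160.]])
print(mpdiou_loss(pred, gt, img_w=640, img_h=640))
```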
We further conducted a comparative analysis of SRM-YOLO and the baseline YOLOv8 models using diverse images from the VisDrone2019 dataset. These samples cover various lighting conditions, object densities, camera perspectives, and scene complexities. Due to the high density of detected objects, category labels and confidence scores are omitted in the visualizations for clarity. Instead, different colors are used to distinguish object categories.
Figure 8 presents an urban traffic scene captured by a UAV during daylight from a moderate altitude and an oblique perspective, illustrating the detection performance of the four models. While all models successfully identified most vehicles and pedestrians, notable differences were observed. YOLOv8n primarily detected larger objects but struggled with smaller and more distant ones, such as vehicles and pedestrians further from the camera. Additionally, it exhibited inconsistencies in recognizing occluded objects and those blending into the background.
Figure 9 depicts an oblique aerial image of a major roadway at nighttime, captured by a drone flying at a moderate altitude. The upper portion of the image is characterized by strong light interference around the bridge, with some vehicles partially obscured by the bridge structure. Due to drone movement during image acquisition, the image exhibits slight blurriness, presenting challenges for accurate object detection. Compared to YOLOv8, the proposed SRM-YOLO model exhibits improved capability in detecting occluded and relatively blurred vehicles, indicating better adaptability to motion blur and challenging lighting conditions.
Figure 10 illustrates a surveillance scenario in which a UAV captures a top-down nighttime view of a residential community entrance. Due to insufficient illumination, large portions of the image exhibit extremely low brightness, significantly hindering object localization and classification. This poses substantial challenges for detection algorithms operating in low-light conditions. The experimental results indicate that YOLOv8n and YOLOv8s not only struggle to detect objects in severely underexposed areas but also suffer from false detections and missed detections. In contrast, the proposed SRM-YOLO model successfully detects some objects under these challenging conditions, demonstrating improved robustness and practical applicability for object detection in dark environments.
Overall, SRM-YOLO exhibits greater sensitivity and robustness, effectively detecting a higher number of objects, particularly small and distant vehicles and pedestrians. This improvement is evident in the increased number of bounding boxes in the images, underscoring the model’s enhanced performance in dense traffic environments. These results suggest that SRM-YOLO is better suited for accurately detecting small objects in complex scenes.
4.5. Comparisons with Other Object Detection Methods
To comprehensively evaluate the performance of SRM-YOLO, we compared it against a broad spectrum of state-of-the-art object detection models. The benchmark set includes anchor-free detectors such as TOOD [48] and VFNet [49], various YOLO-based architectures, and several enhanced variants, including SOD-YOLO [50], LUD-YOLO [51], and a customized version named Drone-YOLO. Additionally, transformer-based detectors such as RT-DETR-R18 [52] and D-FINE-S [53] were incorporated to ensure a thorough and balanced assessment.
Table 6 presents a comprehensive summary of the experimental results. Except for TOOD and VFNet, which were trained on the official VisDrone2021 dataset, all other models were implemented within the PyTorch framework and optimized using the Stochastic Gradient Descent (SGD) algorithm. The initial learning rate was set to 0.01 and gradually decayed to 12% of its original value. In this study, we primarily compared the mean Average Precision (mAP) on the validation set across different models and parameter configurations, focusing on their performance over 10 distinct categories in the dataset.
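For illustration, the training setup described above can be expressed with an Ultralytics-style call such as the sketch below; the configuration file names are hypothetical, and only the optimizer, image size, and learning-rate settings reflect values stated in this section.

```python
from ultralytics import YOLO

# Hypothetical model/data configuration names; the actual SRM-YOLO
# configuration files are not part of this sketch.
model = YOLO("srm-yolo-s.yaml")

model.train(
    data="VisDrone.yaml",   # dataset split described in Section 4.1
    imgsz=640,              # images resized to 640 x 640
    optimizer="SGD",        # Stochastic Gradient Descent
    lr0=0.01,               # initial learning rate
    lrf=0.12,               # final LR as a fraction of lr0 (decay to 12%)
)
```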
Table 6 presents the experimental results of the proposed SRM-YOLO-N and eight other benchmark algorithms. Compared with the one-stage object detection models TOOD and VFNet, SRM-YOLO demonstrates a notable improvement in detection accuracy. Despite having a significantly smaller number of parameters, SRM-YOLO achieves an overall mAP of 45.6%, which is substantially higher than TOOD and VFNet. In the Awning Tricycle category, which has the smallest number of instances, SRM-YOLO maintains a higher detection accuracy than these two models.
Against other YOLO-based detectors—including YOLOv8-s, YOLOv10-s, and YOLOX-s—SRM-YOLO-s consistently delivers improved accuracy across most categories. The performance gains are particularly evident in challenging classes such as Pedestrian, Bus, and Motor. Although SRM-YOLO-s entails a slightly larger parameter count, the accuracy improvement represents a favorable trade-off, confirming the model’s effectiveness in balancing detection performance and computational complexity.
Furthermore, SRM-YOLO-s surpasses several state-of-the-art detectors—D-FINE-s (42.3%), LUD-YOLO-s (41.7%), and RT-DETR-R18 (42.5%)—in overall detection accuracy. While Drone-YOLO-m achieves a marginally higher mAP of 46.9%, SRM-YOLO-s offers comparable accuracy with a considerably smaller model size. In key categories such as Truck and Motor, SRM-YOLO-s delivers competitive results, demonstrating strong adaptability in complex UAV scenarios.
In conclusion, SRM-YOLO enhances detection accuracy for both small and large objects in UAV aerial imagery, with particularly strong gains on small, complex targets. Despite a moderate increase in computational cost, its accuracy and robustness make it well suited for real-world UAV applications where precision is critical, and it remains feasible for deployment on UAV embedded systems. As shown in Figure 11, SRM-YOLO-s achieves superior detection performance for small objects such as Pedestrian and Tricycle, while also maintaining reliable accuracy on larger classes like Car and Bus.
4.6. Generalization Test
To further evaluate the effectiveness of our proposed method, we conducted comparative analyses with various YOLO-series object detection algorithms on the SSDD [54] and NWPU VHR-10 [55] datasets. For consistency and fairness, the same hyperparameters and training protocols used for the VisDrone2019 dataset were applied.
The SSDD dataset, designed for ship detection in satellite imagery, consists of 1160 high-resolution images containing 2456 annotated ship instances. The NWPU VHR-10 dataset includes 800 very-high-resolution remote sensing images (ranging from 500 × 500 to 1100 × 1100 pixels) sourced from Google Earth and Vaihingen, covering 3651 instances across ten object categories. The average object size in the NWPU VHR-10 dataset is approximately 6.4% of the image area.
Based on the results in Table 7, the proposed SRM-YOLO algorithm outperforms Faster R-CNN, SSD, YOLOv5n, YOLOv8n, and YOLOv10 in terms of precision (P), recall (R), mAP50, and the number of parameters on both the SSDD and NWPU VHR-10 datasets. Notably, compared to YOLOv8n, the SRM-YOLO model achieves an improvement of 1.7% in mAP50 on the SSDD dataset and 1.4% on the NWPU VHR-10 dataset. These improvements suggest that the SRM-YOLO model demonstrates decent generalization capability.