YOLO-SE: Improved YOLOv8 for Remote Sensing Object Detection and Recognition

Abstract: Object detection remains a pivotal aspect of remote sensing image analysis, and recent strides in Earth observation technology coupled with convolutional neural networks (CNNs) have propelled the field forward. Despite advancements, challenges persist, especially in detecting objects across diverse scales and pinpointing small-sized targets. This paper introduces YOLO-SE, a novel YOLOv8-based network that innovatively addresses these challenges. First, the introduction of a lightweight convolution SEConv in lieu of standard convolutions reduces the network's parameter count, thereby expediting the detection process. To tackle multi-scale object detection, the paper proposes the SEF module, an enhancement based on SEConv. Second, an ingenious Efficient Multi-Scale Attention (EMA) mechanism is integrated into the network, forming the SPPFE module. This addition augments the network's feature extraction capabilities, adeptly handling challenges in multi-scale object detection. Furthermore, a dedicated prediction head for tiny object detection is incorporated, and the original detection head is replaced by a transformer prediction head. To address adverse gradients stemming from low-quality instances in the target detection training dataset, the paper introduces the Wise-IoU bounding box loss function. YOLO-SE showcases remarkable performance, achieving an average precision at IoU threshold 0.5 (AP50) of 86.5% on the optical remote sensing dataset SIMD. This represents a noteworthy 2.1% improvement over YOLOv8, and YOLO-SE outperforms the state-of-the-art model by 0.91%. In further validation, experiments on the NWPU VHR-10 dataset demonstrated YOLO-SE's superiority with an accuracy of 94.9%, surpassing that of YOLOv8 by 2.6%. The proposed advancements position YOLO-SE as a compelling solution in the realm of deep learning-based remote sensing image object detection.


Introduction
Object detection serves as a prerequisite for advanced visual tasks such as scene understanding. Compared to object detection in videos, detecting objects in static images is more challenging. The utilization of object detection in optical remote sensing images has been widespread, encompassing diverse applications such as monitoring, traffic surveillance, agricultural development, disaster planning, and geospatial referencing [1][2][3][4]. This has attracted considerable interest from researchers in recent years.
Traditional remote sensing image object-detection algorithms can be categorized into several types: threshold-based methods, feature engineering-based methods, template matching methods, machine learning-based methods, segmentation-based methods, and spectral information-based methods [5].
(1) Threshold-Based Methods: These methods typically use image brightness or color information to set appropriate thresholds for separating targets from the background. When pixel values exceed or fall below specific thresholds, they are considered targets or non-targets. These methods are simple and user-friendly, but they are less stable under changing lighting conditions and complex backgrounds.
(2) Feature Engineering-Based Methods: These methods rely on manually designed features, such as texture, shape, and edges, to identify objects. Information is often extracted using filters, shape descriptors, and texture features. Subsequently, classifiers like support vector machines or decision trees are used to categorize the extracted features.
(3) Template Matching: Template matching is a method for identifying objects by comparing them with predefined templates or patterns. When the similarity between the target and the template exceeds a certain threshold, it is detected as an object. This method works well when there is a high similarity between the target and the template, but it is not very robust to target rotation, scaling, and other variations.
(4) Machine Learning-Based Methods: Machine learning algorithms, such as neural networks, support vector machines, and random forests, are employed to learn how to detect objects from data. Feature extraction and classifier parameters are often automatically learned from training data. This approach tends to perform well in complex object detection tasks, but it requires a significant amount of labeled data and computational resources.
(5) Segmentation-Based Methods: Object detection methods based on segmentation first divide the image into target and non-target regions and then perform further analysis and classification on each region. Segmentation methods can include region growing, watershed transform, and graph cuts. This approach works well when there are clear boundaries between objects and the background.
(6) Spectral Information-Based Methods: Remote sensing images typically contain information from multiple spectral bands, and this information can be leveraged for object detection. Methods based on spectral information often use spectral features like spectral reflectance and spectral angles to distinguish different types of objects.
The swift advancement of deep learning technologies, particularly the introduction of convolutional neural networks (CNNs), has ushered in new possibilities and applications for object detection in remote sensing images [6,7].
At present, the predominant frameworks for object detection in remote sensing images can be broadly classified into two major categories: single-stage and two-stage methods. Notable two-stage object detection algorithms encompass Spatial Pyramid Pooling Networks (SPP-Net) [8], Region-based CNN (R-CNN) [9], and Faster R-CNN [10]. Two-stage methods often achieve high detection accuracy but tend to be slower in terms of detection speed, with larger model sizes and parameter counts, due to their two-stage nature. Single-stage detection algorithms have effectively addressed these issues, with representative algorithms such as YOLO [11], SSD [12], RetinaNet [13], CornerNet [14], and RefineDet [15]. As these single-stage algorithms, such as YOLO, have matured, they not only outperform two-stage methods in terms of detection speed but also match or surpass them in terms of accuracy.
Despite the current state of the art, there is still room for improvement in the detection of objects at multiple scales and small targets. Several object categories in remote sensing display size variations, even within the same category. For instance, ships in ports can range in size from tens of meters to hundreds of meters. Additionally, the height of capture and distance from the target can affect an object's size in the image. Moreover, many small objects are present in aerial images, which are often filtered out in the pooling layers of convolutional neural networks (CNNs) due to their small size, making them challenging to detect and recognize.
To tackle these challenges, this paper suggests an improved network model built upon the foundation of YOLOv8. Our approach improves conventional convolution techniques, incorporates state-of-the-art attention mechanisms, and enhances the loss functions. The primary contributions of this paper are as follows:
(1) A lightweight convolution SEConv was introduced to replace standard convolutions, reducing the network's parameter count and speeding up the detection process. To address multi-scale object detection, the SEF module was proposed, based on SEConv.
(2) A novel EMA attention mechanism was introduced and integrated into the network, resulting in the SPPFE module, which enhances the network's feature extraction capabilities and effectively addresses multi-scale object detection challenges.
(3) To improve the detection of small objects, an additional prediction head designed for tiny-object detection was added.Furthermore, the original detection head was replaced by a transformer prediction head to capture more global and contextual information.
(4) To mitigate the adverse gradients generated by low-quality examples, the Wise-IoU loss function was introduced.
The paper unfolds as follows: Section 2 delves into a comprehensive review of existing work concerning object-detection networks in remote sensing images, with a particular focus on attention mechanisms. Section 3 provides an intricate overview of both the YOLOv8 network and our proposed YOLO-SE network. In Section 4, we present a detailed account of our experiments and conduct a thorough analysis of the results, using both the SIMD dataset and the NWPU VHR-10 dataset. Finally, Section 5 encapsulates our conclusions.

Object-Detection Networks for Remote Sensing Images
While deep learning has demonstrated remarkable success in object detection for remote sensing images, effectively detecting multi-scale and small objects continues to pose a substantial challenge. Researchers have made notable contributions to address these challenges. Ma et al. [16] improved upon Faster R-CNN by proposing a method for identifying medium- and small-sized animals in large-scale images. They utilized the HRNet feature extraction network to enhance small-object detection. Sun et al. [17] presented the partial-based convolutional neural network (PBNet) to address compound object detection in high-resolution optical remote sensing images. Lai et al. [18] devised a feature extraction module that integrates convolutional neural networks (CNNs) and multi-head attention, leading to an expanded receptive field and the development of the STC-YOLO algorithm. Additionally, they introduced the Normalized Gaussian Wasserstein Distance (NWD) metric to enhance sensitivity to small-object position deviations. Han et al. [19] proposed the Ghost module for building efficient neural network architectures. GhostNet, constructed using this new module, achieved a balance between efficiency and accuracy. Lin W et al. [20] introduced a Scale-Aware Aggregation Module (SMT) that effectively simulates the transition from local to global dependencies with network depth, offering better performance with fewer parameters. Wan et al. [21] presented the YOLO-HR algorithm for high-resolution optical remote sensing object recognition. This algorithm employs multiple detection heads for object detection and reuses output features from the feature pyramid, further enhancing detection performance. Xu et al. [22] proposed a multi-scale remote sensing object detection model based on YOLOv3. They improved the existing feature extraction network by introducing DenseNet. This method exhibited good performance in multi-scale remote sensing object detection. Cao et al. [23] proposed a GhostConv-based backbone lightweight YOLO network (GCL-YOLO). The network first establishes a lightweight backbone network based on ghost convolutions with a minimal number of parameters. Subsequently, a novel small-object prediction head is designed to replace the existing large-object prediction head used for natural scene objects. Finally, the network utilizes the focus-effective intersection over union (Focus-EIOU) loss as the localization loss.
The above-mentioned research endeavors have led to various improvements in object-detection algorithms for remote sensing images, contributing to the advancement of this field. In light of the challenges posed by multi-scale and small-object detection, this paper aims to further enhance the state-of-the-art YOLOv8 algorithm.

Attention Mechanism
The attention mechanism empowers neural networks to concentrate on crucial features while disregarding less pertinent ones [24]. Convolution operations combine channel information and spatial information to extract features, making attention-mechanism designs consider both channel and spatial aspects. Currently, there are three main types of attention mechanisms:
(1) Channel Attention: This type of attention mechanism prioritizes important features or channels within the data. For example, SENet [25] focuses on crucial channels for better feature representation.
(2) Spatial Attention: Spatial attention directs the model's focus to essential spatial positions within the data. Self-attention mechanisms, such as those used in deformable convolutional networks (DCNs) [26], excel at capturing spatial relationships.
(3) Mixed Attention Mechanisms: Some attention mechanisms, like the Convolutional Block Attention Module (CBAM) [27], combine both channel and spatial attention characteristics. CBAM can simultaneously attend to channels and spatial positions, contributing to improved model accuracy and noise suppression by considering both aspects of the data.
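As a concrete illustration of the channel-attention idea, the squeeze-and-excitation scheme behind SENet [25] can be sketched in plain Python. This is a didactic sketch with caller-supplied weight matrices standing in for the learned parameters, not SENet's actual implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_channel_attention(feature_maps, reduction_weights, expansion_weights):
    """Squeeze-and-excitation over C feature maps given as nested lists.

    reduction_weights / expansion_weights are illustrative stand-ins for the
    two learned fully connected layers of the SE block.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
                for fmap in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields a gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(ws, squeezed)))
              for ws in reduction_weights]
    gates = [sigmoid(sum(w * h for w, h in zip(ws, hidden)))
             for ws in expansion_weights]
    # Scale: reweight every value of each channel by its gate.
    return [[[v * g for v in row] for row in fmap]
            for fmap, g in zip(feature_maps, gates)]
```

Because each gate lies in (0, 1), informative channels are passed through largely intact while uninformative ones are suppressed, which is the "crucial channels" behavior described above.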
Additionally, we acknowledge the significance of other works. Many works also use neural attention to improve feature learning, such as Motion-attentive Transition for Zero-shot Video Object Segmentation [28] and Regional Semantic Contrast and Aggregation for Weakly Supervised Semantic Segmentation [29]. These works further demonstrate the diverse applications of attention mechanisms for feature learning.
These attention mechanisms play a crucial role in enhancing model performance by selectively emphasizing relevant information, which is particularly beneficial in tasks like object detection and image analysis.

Vision Transformer
The transformer [30] was initially devised for machine translation tasks within the realm of natural language processing (NLP). Owing to its potent representation capabilities, researchers have been investigating avenues to leverage transformers for computer vision tasks. Models based on transformers have demonstrated performance on diverse visual benchmarks comparable to or surpassing other network types such as convolutional and recurrent neural networks. The growing attention within the computer vision community toward transformers is attributed to their superior performance and reduced need for domain-specific feature engineering [31].
Chen et al. [32] conducted training on a sequence transformer for pixel-level regression, achieving results akin to CNNs in image-classification tasks. Another notable model, ViT, introduced by Dosovitskiy et al. [33], directly applies a pure transformer to sequences of image patches to classify entire images and attained state-of-the-art performance across various image recognition benchmarks.
While traditional visual transformers excel in capturing long-range dependencies between patches, they often neglect local feature extraction and project 2D patches onto vectors using simple linear layers. In response to this, recent research has concentrated on enhancing modeling capacity for local information. TNT [34], for instance, divides patches into multiple sub-patches and introduces a novel transformer-in-transformer architecture. This leverages inner transformer blocks to model relationships between sub-patches and outer transformer blocks for patch-level information exchange. Twins [35] and CAT [36] alternately perform local and global attention at different layers. Swin transformers [37] execute local attention within a window and introduce a shift-window partition method for cross-window connections. RegionViT [38] generates region tokens and local tokens from images, with local tokens receiving global information through attention to region tokens. Beyond local attention, some research suggests enhancing local information through local feature aggregation, such as T2T. Improved computations for self-attention layers have also garnered attention. DeepViT [39], for instance, proposes building cross-head communication to regenerate attention maps, fostering increased diversity at different levels.
Zhu et al. [40] integrated a transformer prediction head into the YOLOv5 structure, proposing the TPH-YOLOv5 model. This model introduces an additional prediction head to detect objects at different scales, utilizing a transformer prediction head (TPH) to leverage the predictive potential of self-attention mechanisms. Building upon this, Zhao et al. [41] enhanced the model with the TPH-YOLOv5++ version, significantly reducing computational costs and improving detection speed. In TPH-YOLOv5++, they introduced a cross-layer asymmetric transformer (CA-Trans) to replace the extra prediction head while maintaining its functionality. The use of the Sparse Local Attention (SLA) module effectively captures asymmetric information between additional heads and other heads, enriching the features of other heads.

YOLOv8
YOLOv8 utilizes a similar backbone to YOLOv5, but with some modifications in the CSPLayer, now referred to as the C2f module. The C2f module, which consists of a two-convolution cross-stage partial bottleneck, combines high-level features with contextual information to enhance detection accuracy. YOLOv8 employs an anchor-free model with decoupled heads to independently handle object-detection, classification, and regression tasks. This design allows each branch to focus on its specific task, leading to improved overall model accuracy. In the output layer of YOLOv8, they use the sigmoid function as the activation function for object scores, indicating the probability of an object being present within the bounding box. They use the softmax function to represent class probabilities, signifying the probability of an object belonging to each possible class.
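The activation choices described above can be made concrete with a small plain-Python sketch. The logit values are invented for illustration, and the sigmoid/softmax split follows the description in this section rather than any particular implementation:

```python
import math

def sigmoid(x):
    """Objectness: independent probability that a box contains an object."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Class scores: a normalized distribution over the possible classes."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw head outputs for one predicted box.
objectness = sigmoid(2.0)                # high confidence an object is present
class_probs = softmax([3.1, 0.2, -1.0])  # probabilities summing to 1.0
```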
For bounding box loss, YOLOv8 uses the CIoU [42] and DFL [43] loss functions, and for classification loss, it employs binary cross-entropy. These loss functions enhance object-detection performance, especially when dealing with smaller objects. For this paper, we selected YOLOv8 as the baseline model, which consists of three key components: the backbone network, the neck network, and the prediction output head. The backbone network is the core part of the YOLOv8 model and is responsible for extracting features from the input RGB color images. The neck network is positioned between the backbone network and the prediction output head. Its primary role is to aggregate and process the features extracted by the backbone network. In YOLOv8, the neck network plays a crucial role in integrating features of different scales. Typically, the neck network adopts a Feature Pyramid Network (FPN) structure, which effectively fuses features of various scales to construct a more comprehensive representation.
The prediction output head is the topmost part of the YOLOv8 model and is responsible for identifying and locating object categories in the images. The output head usually contains multiple detectors, with each detector responsible for predicting the position and category of objects. In YOLOv8, three sets of detectors are employed, each with a different scale, aiding the model in recognizing objects of various sizes. The network architecture of YOLOv8 is illustrated in Figure 1.
Figure 1. The architecture of YOLOv8.

YOLO-SE
To address the issues related to detecting small objects and objects at multiple scales with the YOLOv8 network, we propose the YOLO-SE algorithm, as discussed in this section. We first provide an overview of the YOLO-SE architecture. Building on this, we introduce the essential components of YOLO-SE, including the Efficient Multiscale Convolution Module (SEF), improvements to convolution through the introduction of the EMA attention mechanism in the SPPFE module, replacing the original YOLOv8 detection head with a transformer prediction head, and adding an additional detection head to handle objects at different scales. Finally, we replace the original CIoU bounding box loss function with Wise-IoU. The network structure of YOLO-SE is depicted in Figure 2.


SEF
We replaced the standard convolutions in C2f with a more lightweight and efficient multi-scale convolution module called SEF. This module introduces multiple convolutions with different kernel sizes, enabling it to capture various spatial features at multiple scales. Additionally, SEF extends the receptive field using larger convolution kernels, enhancing its ability to model long-range dependencies.
As shown in Figure 3, the SEConv2d operation partitions the input channels into four smaller channels. The first and third smaller channels remain unchanged, while the second and fourth channels undergo 3 × 3 and 5 × 5 convolution operations, respectively. Subsequently, a 1 × 1 convolution consolidates the features from these four smaller channels. By employing half of the features for convolution and then integrating them with the original features, the objective is to generate redundant features, decrease parameters and computational workload, and alleviate the influence of high-frequency noise. This approach aims to reduce the number of parameters and computational expenses while preserving essential feature information. Each distinct convolutional mapping learns to focus on features of varying granularities adaptively. SEF excels in capturing local details, preserving the nuances and semantic information of target objects as the network deepens. The structure of SEF is shown in Figure 4. In summary, the SEF module reduces the network's parameter count, accelerates detection speed, and effectively captures multi-scale features and local details of the target.
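The parameter savings behind SEConv can be estimated by simple weight counting. The sketch below follows our reading of the description above (four channel groups, two passed through, a 3 × 3 and a 5 × 5 branch, and a 1 × 1 fusion); biases and normalization layers are ignored, and the 256-channel sizes are hypothetical:

```python
def conv2d_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def seconv_params(c_in, c_out):
    """Rough weight count for SEConv as described in the text: the input is
    split into four groups, two pass through unchanged, the other two get
    3x3 and 5x5 convolutions, and a 1x1 convolution fuses the result."""
    g = c_in // 4
    return (conv2d_params(g, g, 3)            # second group, 3x3 branch
            + conv2d_params(g, g, 5)          # fourth group, 5x5 branch
            + conv2d_params(c_in, c_out, 1))  # 1x1 fusion over all groups

standard = conv2d_params(256, 256, 3)    # a plain 3x3 conv at 256 channels
lightweight = seconv_params(256, 256)    # the SEConv estimate
```

Under these assumptions the SEConv layer needs roughly a third of the weights of the standard 3 × 3 convolution it replaces, which is consistent with the lightweight claim, even though the exact figures for the published module may differ.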

SPPFE
The SPPF module in YOLOv8 has demonstrated its advantages in enhancing model performance through multi-scale feature fusion, particularly in certain scenarios. However, we must acknowledge that the SPPF module may have limitations in complex backgrounds and situations involving variations in target scales. This is because it still lacks a fine-grained mechanism to focus on task-critical regions.
To address the limitations of the SPPF module and enhance feature extraction capabilities, we introduce the Efficient Multi-Scale Attention (EMA) mechanism [44], which dynamically adjusts the weights in the feature maps based on the importance of each region in an adaptive manner. The EMA attention mechanism is employed to retain information on each channel while reducing computational costs. We achieve this by restructuring a portion of the channels into the batch dimension and grouping the channel dimensions into multiple sub-features, ensuring an even distribution of spatial semantic features within each feature group. This approach helps maintain channel-wise information while minimizing computational expenses. This allows the module to focus on task-critical regions, making it more targeted in complex scenes. The structure of SPPFE is depicted in Figure 5, and we incorporate the EMA attention mechanism into this module. The SPPFE module not only performs multi-scale feature fusion but also finely adjusts features at each scale, effectively capturing information at different scales. This enhancement significantly improves the model's ability to detect small objects.
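The multi-scale fusion that SPPFE inherits from SPPF rests on the fact that chaining small max-pools is equivalent to one larger pool. A pure-Python sketch (stride 1, windows clamped at the borders; function and variable names are our own) makes this concrete:

```python
def maxpool2d(grid, k):
    """Stride-1 max pooling with a k x k window, clamped at the borders."""
    h, w, pad = len(grid), len(grid[0]), k // 2
    return [[max(grid[i2][j2]
                 for i2 in range(max(0, i - pad), min(h, i + pad + 1))
                 for j2 in range(max(0, j - pad), min(w, j + pad + 1)))
             for j in range(w)]
            for i in range(h)]

# Two chained 5x5 pools cover the same window as a single 9x9 pool, which
# is how SPPF-style modules build large receptive fields cheaply: the
# intermediate results of the small pools are reused and concatenated.
feature = [[(3 * i + 7 * j) % 11 for j in range(8)] for i in range(8)]
chained = maxpool2d(maxpool2d(feature, 5), 5)
single = maxpool2d(feature, 9)
```

Concatenating the input with the outputs of each successive pool yields features at several effective kernel sizes for the cost of repeated small pools; SPPFE then applies EMA attention on top of this fusion.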

TPH
Due to the significant variation in object sizes within remote sensing images, including numerous extremely small instances, experimental results have shown that YOLOv8's original three detection heads do not adequately address the challenges presented by remote sensing imagery. As a result, we added an additional prediction head specifically designed for detecting tiny objects. When combined with the other three prediction heads, this approach enables us to capture relevant information about small targets more effectively while also detecting objects at different scales, thus improving overall detection performance.
We replaced the original detection head with a transformer prediction head to capture more global and contextual information. The structure of the Vision Transformer is depicted in Figure 6. It consists of two main blocks: a multi-head attention block and a feedforward neural network (MLP). The LayerNorm layer aids in better network convergence and prevents overfitting. Multi-head attention allows the current node to focus not only on the current pixel but also on the semantic context. While the additional prediction head introduces a considerable amount of computational and memory overhead, it has improved the performance of tiny-object detection.
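The attention block inside the transformer prediction head can be illustrated with a minimal single-head self-attention in plain Python. Identity projections stand in for the learned query/key/value matrices, so this is a sketch of the mechanism, not the trained head:

```python
import math

def softmax(scores):
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Each token attends to every token: scaled dot-product scores are
    softmax-normalized and used to average the value vectors, which is what
    lets a head mix in global context rather than a single pixel."""
    d = len(tokens[0])
    scale = math.sqrt(d)
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * tokens[j][t] for j, w in enumerate(weights))
                    for t in range(d)])
    return out
```

In the real head this computation sits between LayerNorm layers, several such heads run in parallel, and the MLP block follows; the sketch only shows how one head aggregates context across all positions.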

Wise-IoU Loss
YOLOv8 uses Complete Intersection over Union (CIoU) [42] as the default loss-cal culation method.CIoU builds upon Distance Intersection over Union (DIoU) by introduc ing the aspect ratio of the predicted bounding box and the ground-truth bounding box making the loss function more attentive to the shape of the bounding boxes.However, the computation of CIoU loss is relatively complex, leading to higher computational overhead during the training process.Weighted Intersection over Union (WIoU) [45] proposes a dynamic non-monotonic focus mechanism, replacing IoU with dissimilarity to assess the quality of anchor boxes.It adopts a gradient gain allocation strategy, reducing the com petitiveness of high-quality anchor boxes and mitigating harmful gradients caused by low-quality anchor boxes.This allows WIoU to focus on low-quality anchor boxes, ulti mately improving the overall performance of the detector.WIoU comes in three versions namely WIoUv1, which constructs an attention-based bounding box loss, and WIoUv2 and WIoUv3, which build upon 1 by adding gradient gain to the focus mechanism.
The formula for calculating  is as shown in Equation ( 2): The calculation formula for Region-based Weighted Intersection over Union (R WIoU) is as follows, as shown in Equation ( 3

Wise-IoU Loss
YOLOv8 uses Complete Intersection over Union (CIoU) [42] as the default losscalculation method.CIoU builds upon Distance Intersection over Union (DIoU) by introducing the aspect ratio of the predicted bounding box and the ground-truth bounding box, making the loss function more attentive to the shape of the bounding boxes.However, the computation of CIoU loss is relatively complex, leading to higher computational overhead during the training process.Weighted Intersection over Union (WIoU) [45] proposes a dynamic non-monotonic focus mechanism, replacing IoU with dissimilarity to assess the quality of anchor boxes.It adopts a gradient gain allocation strategy, reducing the competitiveness of high-quality anchor boxes and mitigating harmful gradients caused by low-quality anchor boxes.This allows WIoU to focus on low-quality anchor boxes, ultimately improving the overall performance of the detector.WIoU comes in three versions, namely WIoU v1 , which constructs an attention-based bounding box loss, and WIoU v2 and WIoU v3 , which build upon v1 by adding gradient gain to the focus mechanism.
The formula for calculating W IoU v1 is as shown in Equation ( 2): The calculation formula for Region-based Weighted Intersection over Union (R-WIoU) is as follows, as shown in Equation (3): The values of w, h, (x Bbox , y Bbox ), and (x Tbox , y Tbox ) are illustrated in Figure 7.To prevent  from producing gradients that hinder convergence, w and h ar separated from the computation graph. takes values in the range [1, e), signifi cantly amplifying the importance of low-quality anchors.Loss Intersection over Unio (LIoU), on the other hand, takes values in the range [0, 1], significantly reducing  for high-quality anchors and focusing on the distance between the center points whe Bbox and Tbox overlap.

The dynamic non-monotonic focus mechanism uses "outlyingness" to assess anchor box quality instead of IoU, and it provides a wise gradient gain allocation strategy.This strategy reduces the competitiveness of high-quality anchor boxes while also mitigating harmful gradients generated by low-quality examples.This allows WIoU to focus on ordinary-quality anchor boxes and improve the overall performance of the detector.
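The WIoU v1 loss described above can be sketched in plain Python. This is an illustrative sketch, not the paper's implementation: the corner-coordinate box format, the use of the smallest enclosing box for w and h, and the use of plain floats as a stand-in for detaching w and h from the computation graph are all assumptions.

```python
import math

def iou(box_p, box_g):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    return inter / (area_p + area_g - inter)

def wiou_v1_loss(box_p, box_g):
    """WIoU v1: L_WIoUv1 = R_WIoU * L_IoU, where R_WIoU weights the
    IoU loss by the normalized distance between the box centers."""
    l_iou = 1.0 - iou(box_p, box_g)          # L_IoU in [0, 1]
    # Centers of the predicted box (Bbox) and the ground-truth box (Tbox).
    cxp, cyp = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cxg, cyg = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    # w, h taken here as the size of the smallest enclosing box; in the
    # original formulation they are detached from the computation graph,
    # which plain floats emulate trivially.
    w = max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])
    h = max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])
    r_wiou = math.exp(((cxp - cxg) ** 2 + (cyp - cyg) ** 2) / (w ** 2 + h ** 2))
    return r_wiou * l_iou                    # R_WIoU lies in [1, e)
```

Because R_WIoU ≥ 1, the loss for a misaligned anchor is amplified relative to the plain IoU loss, while a perfectly aligned anchor still yields zero loss.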

Experimental Environment
The experimental platform used in this study is shown in Table 1.The SIMD dataset is a multi-class, open-source, high-resolution, and fine-grained remote sensing object-detection dataset.It consists of 5000 images with a total of 45,096 objects distributed across 15 different classes such as cars, airplanes, and helicopters.The distribution of classes and the sizes of objects in the training set are shown in Figure 8. SIMD is characterized by class imbalance and the presence of small objects.As Figure 8a illustrates, the "car" class has more than 16,000 objects, while some classes like "helicopter" and "fighter" have fewer than 500 objects.We partitioned the images into 80% for training and 20% for testing.
In this experiment, YOLOv8s' pretrained weights were used, and the training was conducted for 200 epochs with a batch size of 16 and an input image size of 1024 × 1024 pixels.
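The 80/20 partition of the SIMD images described above can be sketched as a simple shuffled split; the function name and the fixed seed are illustrative assumptions, not details from the paper.

```python
import random

def train_test_split(image_ids, train_frac=0.8, seed=0):
    """Shuffle image ids reproducibly, then split them into train/test."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

# SIMD has 5000 images, so an 80/20 split yields 4000 train / 1000 test.
train_ids, test_ids = train_test_split(range(5000))
```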

Evaluation Metrics
In target-detection tasks, metrics such as recall (R), precision (P), and average precision (AP) are commonly used for evaluation. Recall represents the proportion of correctly detected targets out of the total number of targets, calculated using the formula shown in Equation (4).
Precision represents the proportion of correctly detected targets out of the total predicted targets, as shown in Equation (5).
where TP represents the count of correctly identified targets, FP is the count of erroneously detected targets, and FN is the count of targets that remain undetected. The Average Precision (AP) is the measure of the area under the Precision-Recall (P-R) curve, where recall is plotted on the x-axis and precision on the y-axis. The calculation formula for AP is given by Equation (6).
To obtain the mean Average Precision (mAP) for multiple classes, the AP values for each class are averaged. The formula for calculating mAP is shown in Equation (7).
In this paper, mAP@0.5 is used as the primary evaluation metric. For convenience, throughout the rest of this paper, mAP@0.5 will be referred to as AP50.
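Equations (4)-(7) can be sketched numerically as follows. The monotone-envelope interpolation of the P-R curve is one common way to approximate the area in Equation (6); it is assumed here for illustration and is not necessarily the exact integration scheme used in the experiments.

```python
def precision_recall(tp, fp, fn):
    """Precision (Eq. 5) and recall (Eq. 4) from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the P-R curve (Eq. 6), approximated with the usual
    monotone-envelope interpolation of precision over recall."""
    pts = sorted(zip(recalls, precisions))
    rs = [0.0] + [r for r, _ in pts] + [1.0]
    ps = [0.0] + [p for _, p in pts] + [0.0]
    # Make precision non-increasing from right to left (the envelope).
    for i in range(len(ps) - 2, -1, -1):
        ps[i] = max(ps[i], ps[i + 1])
    return sum((rs[i + 1] - rs[i]) * ps[i + 1] for i in range(len(rs) - 1))

def mean_ap(ap_per_class):
    """mAP (Eq. 7): average the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```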

Results Analysis

Experimental Results
The AP50 across all categories in the SIMD dataset reached 86.5%. Table 2 presents the detection results for all categories using the YOLO-SE algorithm on the SIMD dataset. To further validate the effectiveness of the algorithm, we conducted experiments on the NWPU VHR-10 dataset. NWPU VHR-10 is a publicly available geospatial object-detection dataset comprising a total of 800 remote sensing (RS) images. We split the dataset into 80% for training and 20% for testing. Table 3 presents the detection results on the NWPU VHR-10 dataset; YOLO-SE achieved an accuracy of 94.9%. Figure 9 displays some visual results of the proposed method on the SIMD dataset. It is evident from Figure 9 that the algorithm introduced in this paper effectively addresses the challenges of multi-scale objects and noise in complex environments. Additionally, it demonstrates good performance in detecting small objects.
Figure 10 displays Precision-Recall (PR) curves for each class in the SIMD dataset, illustrating the average precision for each class. From the curves, we can observe that categories like 'airliner,' 'propeller,' 'trainer,' 'chartered,' 'fighter,' and 'boat' perform remarkably well, with average precisions all exceeding 0.95. In contrast, the 'other' category has an average precision of only 0.447. This could be attributed to the 'other' category being diverse, with no unified features for the network to learn.
Figure 11 shows the progression of various metrics during training and validation, including box loss, object loss, class loss, and metrics after each epoch, such as accuracy and recall.
Figure 12 presents the confusion matrix for our model. It is evident that 'other,' 'stair truck,' and 'pushback truck' are sometimes not detected and are considered as background. Moreover, 'stair truck' and 'pushback truck' have a similar appearance, making it challenging to distinguish them in remote sensing images.
The experimental results are presented in Table 4. The Fast R-CNN model has the highest parameter count, yet its AP50 is relatively low compared to that of other models. Subsequent YOLO models have not only increased AP50 but also reduced the parameter count, highlighting the excellence of YOLO models. YOLO-SE consistently outperforms the other algorithms in terms of detection accuracy; its AP50 reached 86.5%, outperforming the state-of-the-art model YOLO-HR by 0.91%. Additionally, YOLO-SE exhibits a lower parameter count, further demonstrating the superiority of the algorithm. The comparison results between YOLOv8 and YOLO-SE on the SIMD dataset are illustrated in Figure 13. From the figure, it is evident that YOLO-SE focuses well on small targets when there are large numbers of objects in an image. Additionally, when an image contains multi-scale targets, YOLO-SE effectively handles targets of various sizes. This comparison demonstrates that YOLO-SE outperforms the YOLOv8 algorithm in terms of detection effectiveness.

Ablation Test
To evaluate the impact of the various modules in the proposed method, we conducted ablation experiments on the SIMD dataset. The experimental results are shown in Table 5. Exp 1 represents the original YOLOv8 model without any modifications. Exp 2 introduces the SEF module on top of YOLOv8. Exp 3 incorporates the SPPFE module, Exp 4 adopts the WIoU loss function, and Exp 5 integrates the TPH module, all built upon the YOLOv8 framework. The results indicate that the SEF module, the SPPFE module, the transformer prediction head (TPH), and the introduced WIoU loss function all contribute to improving the algorithm's detection performance.
We conducted a total of four sets of module experiments, and from Table 5 it is evident that the SEF module has the most significant impact on the algorithm's detection performance, with an increase of 0.9% in AP50. The SPPFE module yielded an improvement of 0.2%, and the WIoU loss function and TPH module led to improvements of 0.2% and 0.3% in AP50, respectively.

Conclusions
To address the challenges of multi-scale and small-object detection in remote sensing images, this paper introduced the YOLO-SE network based on YOLOv8. First, we introduced the SEF module, a lightweight design that significantly reduces the network's parameter count and improves inference speed. The SEF module effectively handles multi-scale features, providing a solid foundation for the model's performance. Second, by introducing the SPPFE module, we not only improved the efficiency of feature extraction but also integrated the EMA attention mechanism. The multi-scale convolution operations of this module help capture information at different scales, thus enhancing the model's accuracy. We also added an additional prediction head specifically designed for detecting tiny objects; when combined with the other three prediction heads, this allowed us to capture relevant information about small targets more effectively. In addition, we replaced the original detection head with a transformer prediction head (TPH) to capture more global and contextual information. Finally, to improve the training process, we introduced the Wise-IoU bounding box loss function, which reduces the negative impact of low-quality instances during training and improves the model's stability and robustness. The experimental results demonstrated that YOLO-SE achieved a significant performance improvement on the optical remote sensing dataset SIMD, with an mAP value of 86.5%, a 2.1% increase compared to YOLOv8. This research demonstrated that the introduction of multi-scale convolutions, attention mechanisms, and transformer prediction heads can achieve higher performance in remote sensing image object detection while maintaining a certain level of efficiency, providing powerful tools and methods for remote sensing image analysis.
In our upcoming efforts, we will aim to further optimize the network model by prioritizing a balance between reduced complexity and improved detection accuracy.We intend to expand the application of the proposed network structure modifications to various object detection algorithms.Moreover, we will investigate alternative strategies for feature reuse and thoroughly address deployment and application challenges associated with the algorithm discussed in this paper.

Figure 3 .
Figure 3. The architecture of SEConv2d. The structure of SEF is shown in Figure 4. In summary, the SEF module reduces the network's parameter count, accelerates detection speed, and effectively captures multi-scale features and local details of the target.

Figure 4 .
Figure 4. The architecture of SEF.


Figure 6 .
Figure 6. The architecture of the Transformer Encoder module.


Figure 8 .
Figure 8. The distribution of targets in the SIMD dataset: (a) the distribution of the number of categories; (b) the distribution of target width and height; the color gradient from white to blue (from light to dark) signifies a more concentrated distribution.


Figure 9 .
Figure 9. Visual results on the SIMD dataset.


Figure 10 .
Figure 10. The P-R curve during training on the SIMD dataset.


Figure 11 .
Figure 11. Network convergence on the SIMD dataset.




Figure 13 .
Figure 13. The comparison of detection results between YOLOv8 and YOLO-SE on the SIMD dataset.

Table 2 .
Detection results on SIMD dataset.

Table 4 .
Comparison with other algorithms.


Table 5 .
Results of ablation test; "√" represents the selection of the corresponding method.