An Efficient Object Detection Algorithm Based on Improved YOLOv5 for High-Spatial-Resolution Remote Sensing Images

Abstract: Object detection (OD) in high-spatial-resolution remote sensing images (HSRIs) is a major research focus in remote sensing information processing. Compared with conventional natural images, OD in HSRIs is harder because of variations in object scale, complex backgrounds, dense arrangements, and uncertain orientations. To tackle these challenges, this paper introduces an OD algorithm built on an enhanced YOLOv5 framework. Incorporating RepConv, Transformer Encoder, and BiFPN modules into the original YOLOv5 network improves detection accuracy, particularly for objects of varying scales. The C3GAM module, designed by introducing the GAM attention mechanism, addresses the interference caused by complex background regions. To localize densely arranged objects precisely, the SIoU loss function is integrated into YOLOv5. The circular smooth label method is used to detect objects with uncertain orientations. The effectiveness of the proposed algorithm is confirmed on two commonly used datasets, HRSC2016 and UCAS-AOD. The average detection accuracies achieved on these datasets are 90.29% and 90.06%, respectively, surpassing the other compared OD algorithms for HSRIs.


Introduction
Technological advancements in remote sensing have led to substantial growth in the volume of high-spatial-resolution remote sensing images (HSRIs). These images contain a vast amount of valuable information for earth observation. As a result, effectively acquiring and utilizing this information has become a crucial area of research in remote sensing information processing. Object detection (OD) for HSRIs entails extracting features from such images to identify the categories of ground targets and obtain their rectangular bounding box coordinates. The topic has attracted considerable attention from researchers, and the results of OD in HSRIs are widely used in diverse domains, including urban planning, disaster prediction, natural disaster response, disaster assessment, and military decision-making [1].
The intricate characteristics of the earth's surface make OD in HSRIs a formidable challenge. Several factors impede the process, including variations in target scales, complex backgrounds, dense arrangements, and uncertain orientations [2]. These characteristics make it difficult to attain high precision for OD [3]. Currently, there are two primary categories of OD algorithms for HSRIs. The first category comprises traditional algorithms, including sliding window and template matching. The second category consists of deep learning (DL)-based algorithms, which have become dominant because they surpass traditional algorithms in both accuracy and speed. However, most DL-based OD algorithms cannot recognize the orientation of objects, which makes them unsuitable for more challenging HSRIs. Consequently, algorithms that detect arbitrary-oriented bounding boxes have become the prevailing standard for OD in HSRIs. These algorithms can be classified into two categories: two-stage [4][5][6] algorithms and one-stage [7,8] algorithms.
In two-stage algorithms, the detection process consists of two stages: the first stage generates candidate regions, while the second stage extracts features from these regions to aid in object recognition. To represent arbitrary-oriented bounding boxes, these detection algorithms commonly use the five-parameter representation (x, y, w, h, θ), where (x, y) is the center position of the bounding box, (w, h) is its width and height, and θ is its angle. The angle of the bounding box can be learned in either the first or the second stage. For example, R2CNN [9] and ROI Transformer [10] generate horizontal candidate regions in the first stage and then perform angle regression in the second stage to determine the bounding box orientation. In contrast, R2PN [11], R-DFPN [12], and ICN [13] directly generate oriented candidate regions in the first stage.
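The five-parameter representation can be made concrete by converting it to the four corner points of the box. The sketch below is illustrative only: the function name `obb_to_corners`, the use of radians, and the counter-clockwise rotation convention are assumptions, since the text does not fix an angle convention.

```python
import math

def obb_to_corners(x, y, w, h, theta):
    """Corners of a rotated box from (x, y, w, h, theta).

    Assumption: theta is in radians and rotates counter-clockwise
    about the box center (x, y).
    """
    c, s = math.cos(theta), math.sin(theta)
    # Corner offsets of the axis-aligned box, relative to its center.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each offset by theta, then translate to the center.
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in offsets]

axis_aligned = obb_to_corners(10.0, 20.0, 4.0, 2.0, 0.0)
rotated = obb_to_corners(10.0, 20.0, 4.0, 2.0, math.pi / 2)
```

The resulting eight coordinates are the same quantities that an eight-parameter representation regresses directly.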
Unlike two-stage detection algorithms, one-stage algorithms are designed as end-to-end detectors. They bypass the separate candidate region generation stage and directly classify and localize objects of any orientation in the image. They also have fewer parameters and converge more easily. Common one-stage OD algorithms include RetinaNet-O [14], DAL [15], RSDet [16], R3Det [17], and S2A-Net [18]. RetinaNet-O is an improved algorithm based on RetinaNet that achieves arbitrary-oriented OD through five-parameter regression. DAL adopts a dynamic anchor learning strategy, which assigns labels more efficiently by evaluating the localization potential of anchors. To tackle the loss discontinuity caused by the periodic nature of angles in five-parameter regression and the inconsistency of regression parameters, RSDet directly regresses the four corner points of rotated boxes using an eight-parameter representation. R3Det combines horizontal and rotated anchor boxes: in the initial detection stage, horizontal anchor boxes accelerate the process and generate more candidate boxes, and in the refinement stage, rotated anchor boxes adapt to dense target scenarios. S2A-Net consists of two modules, a feature alignment module and a detection module. The feature alignment module uses an anchor refinement network to produce high-quality anchors, while the detection module incorporates active rotation filters to encode orientation information, generating orientation-sensitive and orientation-invariant features. This approach alleviates the inconsistency between classification scores and localization accuracy.
When confronted with orientation uncertainty, most of the OD algorithms mentioned above employ angle regression to predict the orientation of detection boxes. However, this approach ignores the issue of boundary discontinuity: if the predicted result falls outside the predefined range, it leads to a large loss value, which destabilizes training and degrades the model's detection performance.
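A small numeric illustration of the boundary discontinuity may help. The sketch assumes an angle range of [-90°, 90°) with a period of 180°, which is one common convention and is not stated in the text; both function names are hypothetical.

```python
def naive_angle_error(pred_deg, gt_deg):
    """Plain regression error, unaware that angles wrap around."""
    return abs(pred_deg - gt_deg)

def periodic_angle_error(pred_deg, gt_deg, period=180.0):
    """Error measured along the circle, so the two range ends are neighbours."""
    d = abs(pred_deg - gt_deg) % period
    return min(d, period - d)

# Two boxes that differ by only 1 degree but sit on opposite ends of the range.
naive = naive_angle_error(89.0, -90.0)
periodic = periodic_angle_error(89.0, -90.0)
print(naive, periodic)
```

The naive error is two orders of magnitude larger than the true angular difference, which is the kind of loss spike described above.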
As a representative family of one-stage OD algorithms, the YOLO series has garnered widespread attention. Since the introduction of YOLOv1 [19], the series has been updated rapidly and has reached YOLOv8, with continuously improving accuracy and speed. Researchers have also built on the YOLO framework in their respective fields of study. For example, Choi et al. [20] proposed Gaussian YOLOv3, building upon YOLOv3 [21], and applied it to autonomous driving. Similarly, Wang et al. [22] improved YOLOv3 to address the challenges of pavement surface pothole measurement. Wu et al. [23] combined local fully convolutional neural networks with YOLOv5, achieving progress in small object detection in HSRIs. Zhang et al. [24] improved upon YOLOv5, effectively reducing the false detection rate of occluded vehicle targets. Zhao et al. [25] combined YOLOv5 with Transformers to address the challenge of OD in images captured by drones.
The progression from YOLOv1 to YOLOv8 can be divided into two stages: the first stage spans YOLOv1 to YOLOv5, while the second stage spans YOLOv6 [26] to YOLOv8. In the first stage, YOLOv5 is the most innovative and representative algorithm, achieving a good balance between accuracy and speed. The improvements in YOLOv6, YOLOv7 [27], and YOLOv8 are all based on YOLOv5. Therefore, by building on the robust OD capabilities of YOLOv5 and taking into account the distinctive traits of OD in HSRIs, this paper presents a refined algorithm based on YOLOv5 for detecting objects in HSRIs.
The primary contributions can be summarized as follows:
1. We design a RepConv module that enhances the detection accuracy of small-scale objects without introducing additional inference time. Additionally, we incorporate a Transformer Encoder structure to capture global contextual information, thereby improving the detection accuracy of large-scale objects. To balance feature information across scales and enhance the detection accuracy of multi-scale objects, we replace the PANet structure in YOLOv5 with BiFPN.
2. To address the interference caused by complex background regions in HSRIs, we design a C3GAM module by introducing the GAM attention mechanism, which helps the model localize regions that contain the target.
3. To enhance the localization accuracy of anchor boxes and improve the precision of boundary recognition in HSRIs with dense object arrangements, we incorporate the SIoU loss function.
4. To handle uncertain target orientations and mitigate the boundary discontinuity caused by angle regression, we adopt the circular smooth label method.

Network Structure of YOLOv5
YOLOv5 is one of the most representative algorithms in the YOLO family of object detectors and comes in five network structures: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Among them, YOLOv5n has the smallest depth and feature map width, and the remaining four networks progressively increase both. This paper selects YOLOv5s as the base model for enhancement. The network structure of YOLOv5s can be divided into three components: Backbone, Neck, and Head, as depicted in Figure 1.

Backbone
The Backbone consists of three modules: Conv, C3, and SPPF. The Conv module encapsulates three functional modules: the convolutional layer (Conv2d), normalization layer (BatchNorm2d), and activation function (SiLU). It applies convolution, normalization, and activation to the input features to produce output features. The C3 module comprises multiple Bottleneck modules and three standard convolutional layers, and the number of Bottlenecks varies with the network depth. The C3 module consists of two branches and plays a crucial role in learning residual features. One branch contains multiple stacked Bottlenecks and three standard convolutional layers, while the other branch contains a basic convolutional block. The results from the two branches are merged by concatenation. The SPPF module sequentially applies multiple small pooling kernels to fuse feature maps with varying receptive fields. This enhances the feature map representation and further improves computational speed.
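The reason sequential small pooling kernels can stand in for larger ones is that stacking stride-1 max pools grows the receptive field. The sketch below checks this on a single-channel map in plain Python. The kernel size of 5 and the helper names `max_pool2d` and `sppf_branches` are assumptions for illustration; the surrounding convolutions of the real module are omitted.

```python
import random

def max_pool2d(fm, k):
    """Stride-1 max pooling with 'same' output size (window clipped at borders)."""
    h, w, r = len(fm), len(fm[0]), k // 2
    return [[max(fm[a][b]
                 for a in range(max(0, i - r), min(h, i + r + 1))
                 for b in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)]
            for i in range(h)]

def sppf_branches(fm, k=5):
    """Apply the same small pool three times; each output feeds the next."""
    p1 = max_pool2d(fm, k)
    p2 = max_pool2d(p1, k)
    p3 = max_pool2d(p2, k)
    # In the real module these four maps would be concatenated along channels.
    return [fm, p1, p2, p3]

random.seed(0)
fm = [[random.random() for _ in range(16)] for _ in range(16)]
_, p1, p2, p3 = sppf_branches(fm)

# Two stacked 5x5 pools match one 9x9 pool; three match one 13x13 pool.
same_as_9 = p2 == max_pool2d(fm, 9)
same_as_13 = p3 == max_pool2d(fm, 13)
```

Reusing the intermediate results is what saves computation compared with running three independent pools of increasing size.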

Neck
The Neck comprises two elements: the Feature Pyramid Network (FPN) and the Path Aggregation Network (PAN). FPN is a top-down feature pyramid that integrates features at different levels hierarchically, from top to bottom. By combining the high resolution of low-level features with the abundant semantic information of high-level features, it makes independent predictions on multi-scale features. PAN enhances FPN by adding a bottom-up feature pyramid structure. It preserves more shallow positional features, further enhancing the overall feature extraction capability.

Head
The Head serves as the output layer of the algorithm and consists primarily of three Detect detectors. It performs OD using grid-based anchors on feature maps at different scales. Each Detect module receives features from the Neck at one of three scales. It uses convolutional operations to adjust the channel dimension of the output layer and then refines the position of the anchors based on the predicted results. Finally, the predictions pass through a postprocessing module and are mapped back to the original image. The final detection results are obtained by applying non-maximum suppression, which eliminates a large number of overlapping candidate boxes.
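The non-maximum suppression step can be sketched as a greedy loop. This is a minimal illustration on horizontal boxes in (x1, y1, x2, y2) form with an assumed IoU threshold of 0.5. Oriented boxes would need a rotated-IoU routine, which is not shown.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box survives.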

Our Work
This paper introduces an efficient OD algorithm for HSRIs built on the YOLOv5s framework. The network architecture of the proposed algorithm, depicted in Figure 2, closely resembles that of YOLOv5s and consists of the same three main components. The following sections describe the key modules of the algorithm in detail.

RepConv Module
Incorporating the RepConv module into YOLOv5 enhances the feature representation capability for small objects, owing to the multi-branch structure of RepConv, and thus improves the accuracy of small object recognition. Moreover, during inference, the parallel branches of RepConv are merged into a single branch using reparameterization techniques [28], which keeps the same structure as YOLOv5 and does not increase the inference time.
The module adds a parallel 1 × 1 Conv layer alongside the 3 × 3 Conv layer of the Backbone, effectively widening the convolutional module. During training, the RepConv module uses this multi-branch structure. Figure 3a illustrates the module's structure when the number of input feature channels differs from the number of output feature channels, and Figure 3b depicts the structure when the two numbers match. During inference, the parallel branches are folded into the 3 × 3 Conv layer, transforming the multi-branch structure into a single-path structure, as illustrated in Figure 3c. This modification enhances the detection accuracy of small objects in HSRIs without introducing any extra inference time.
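The branch merging relies on the linearity of convolution: a 1 × 1 kernel can be added to the center of a 3 × 3 kernel. The sketch below verifies this for a single channel in plain Python. It is a simplification under stated assumptions: batch normalization folding and multi-channel kernels are omitted, and the helper names are hypothetical.

```python
import random

def conv2d_same(img, kernel):
    """Single-channel cross-correlation with zero padding and 'same' output size."""
    r = len(kernel) // 2
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    a, b = i + di, j + dj
                    if 0 <= a < h and 0 <= b < w:
                        acc += img[a][b] * kernel[di + r][dj + r]
            out[i][j] = acc
    return out

def fuse_1x1_into_3x3(k3, k1):
    """Fold a scalar 1x1 kernel into the center tap of a 3x3 kernel."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1
    return fused

random.seed(0)
img = [[random.random() for _ in range(6)] for _ in range(6)]
k3 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
k1 = 0.37

# Training-time view: two parallel branches whose outputs are summed.
branch3 = conv2d_same(img, k3)
branch1 = conv2d_same(img, [[k1]])
two_branch = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(branch3, branch1)]

# Inference-time view: one 3x3 convolution with the fused kernel.
single = conv2d_same(img, fuse_1x1_into_3x3(k3, k1))
max_diff = max(abs(a - b) for ra, rb in zip(two_branch, single) for a, b in zip(ra, rb))
```

Because the two views agree up to floating-point error, the extra branch costs nothing once the kernels are fused.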

Transformer Encoder Module
The convolutional modules used in YOLOv5 primarily focus on local features, resulting in subpar detection performance for large-scale objects. The core of the Transformer Encoder module is its multi-head self-attention mechanism, which captures global features and is therefore well suited to detecting large-scale objects.
The CNN operator has a limited local receptive field when extracting features for OD. To capture global information, multiple layers need to be stacked. However, as the number of layers increases, information may degrade, and feature attention tends to concentrate in specific regions. Transformers, in contrast, use a self-attention mechanism that effectively captures global information. In addition, multiple heads map the features to different subspaces, thereby enhancing the model's expressive capacity.
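To show how every position can attend to every other position, here is a minimal multi-head self-attention sketch in plain Python. It assumes scaled dot-product attention with random projection matrices and leaves out the output projection, layer normalization, and feed-forward parts of a full encoder layer. All names and sizes are illustrative.

```python
import math
import random

def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(q, k, v):
    """Scaled dot-product attention; returns outputs and attention weights."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head_self_attention(x, wq, wk, wv, num_heads):
    """Project tokens, split channels into heads, attend per head, concatenate."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_head = len(q[0]) // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        out, _ = attention([r[sl] for r in q], [r[sl] for r in k], [r[sl] for r in v])
        heads.append(out)
    return [sum((head[t] for head in heads), []) for t in range(len(x))]

random.seed(0)
n_tokens, d_model = 4, 4
rand_matrix = lambda r, c: [[random.gauss(0, 1) for _ in range(c)] for _ in range(r)]
x = rand_matrix(n_tokens, d_model)  # each row stands for one spatial position
wq, wk, wv = (rand_matrix(d_model, d_model) for _ in range(3))

out = multi_head_self_attention(x, wq, wk, wv, num_heads=2)
_, weights = attention(x, x, x)
```

Each row of `weights` is a distribution over all positions, which is how a single layer gains a global receptive field.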

C3GAM Module
HSRIs often contain large areas of complex background, which interfere with the objects and reduce detection accuracy. To mitigate this interference, this paper adopts GAM attention, which reduces the weights of background features and thereby suppresses the interference from the background.
Complex background information significantly hinders the extraction of feature information for OD in HSRIs. To amplify the feature information within the target regions and mitigate the interference arising from the background, this paper embeds the Global Attention Mechanism (GAM) [29] within the C3 module, resulting in the C3GAM module. GAM amplifies global interdependent features while reducing information diffusion. Figure 5 illustrates the network structure of GAM. The input feature is denoted as $F_1 \in \mathbb{R}^{C \times H \times W}$, the intermediate state as $F_2$, and the output as $F_3$. The GAM module can be defined as follows:
$$F_2 = M_c(F_1) \otimes F_1 \tag{1}$$
$$F_3 = M_s(F_2) \otimes F_2 \tag{2}$$
where $M_c$ and $M_s$ represent the channel attention module and spatial attention module, respectively, and $\otimes$ denotes element-wise multiplication. The comparison between the improved C3GAM module and the C3 module in the YOLOv5s network architecture is depicted in Figure 6. The C3 module is composed of multiple stacked Bottleneck modules. This paper introduces GAM attention into the Bottleneck module, effectively suppressing the interference caused by the background.

SIoU Loss Function Module
CIoU [30] and DIoU [31] are two commonly used regression loss functions in YOLO. CIoU builds upon DIoU by adding constraints on aspect ratios, resulting in faster convergence than DIoU. SIoU [32] further improves upon CIoU by redefining the penalty term and introducing an angle loss. It achieves faster training and higher inference accuracy than CIoU.
YOLOv5s adopts CIoU as the regression loss function for bounding boxes. CIoU improves upon DIoU by incorporating scale loss and aspect ratio loss for the bounding boxes, making the predicted boxes more aligned with the ground truth boxes. However, CIoU neglects the orientation matching between the ground truth and predicted boxes, focusing solely on the aggregation of bounding box regression metrics. As a result, its training speed and prediction accuracy are lower than those of SIoU. Hence, SIoU is adopted as the regression loss function for the predicted boxes. It comprises four components: angle cost, distance cost, shape cost, and IoU cost. In the following, $B$ denotes the predicted box with center $(b_{c_x}, b_{c_y})$, and $B^{GT}$ denotes the ground truth box with center $(b_{c_x}^{gt}, b_{c_y}^{gt})$.
(1) Angle cost. $\sigma$ represents the distance between the center coordinates of $B$ and $B^{GT}$, as expressed by Equation (3). $C_h$ represents the height between the center coordinates of $B$ and $B^{GT}$, as expressed by Equation (4). $C_h / \sigma$ is equal to $\sin(\alpha)$, as indicated by Equation (5). The angle cost is then given by Equation (6).
$$\sigma = \sqrt{\left(b_{c_x}^{gt} - b_{c_x}\right)^2 + \left(b_{c_y}^{gt} - b_{c_y}\right)^2} \tag{3}$$
$$C_h = \max\left(b_{c_y}^{gt}, b_{c_y}\right) - \min\left(b_{c_y}^{gt}, b_{c_y}\right) \tag{4}$$
$$x = \frac{C_h}{\sigma} = \sin(\alpha) \tag{5}$$
$$\Lambda = 1 - 2\sin^2\left(\arcsin(x) - \frac{\pi}{4}\right) \tag{6}$$
When $\alpha$ equals $\frac{\pi}{2}$ or 0, the angle cost is 0. During training, if $\alpha < \frac{\pi}{4}$, $\alpha$ is minimized; otherwise, $\beta = \frac{\pi}{2} - \alpha$ is minimized.
(2) Distance cost. Following the SIoU definition [32], the distance cost is weighted by the angle cost through $\gamma$, as shown in Equations (7) and (8), where $c_w$ and $c_h$ denote the width and height of the smallest box enclosing $B$ and $B^{GT}$ (not to be confused with $C_h$ in Equation (4)).
$$\rho_x = \left(\frac{b_{c_x}^{gt} - b_{c_x}}{c_w}\right)^2, \quad \rho_y = \left(\frac{b_{c_y}^{gt} - b_{c_y}}{c_h}\right)^2, \quad \gamma = 2 - \Lambda \tag{7}$$
$$\Delta = \sum_{t = x, y}\left(1 - e^{-\gamma \rho_t}\right) \tag{8}$$
(3) Shape cost. According to Figure 7, $w$ and $h$ represent the width and height of $B$, while $w^{gt}$ and $h^{gt}$ represent the width and height of $B^{GT}$, respectively. $\omega_w$ represents the absolute difference between $w$ and $w^{gt}$ divided by the maximum of $w$ and $w^{gt}$, and $\omega_h$ represents the absolute difference between $h$ and $h^{gt}$ divided by the maximum of $h$ and $h^{gt}$, as shown in Equation (9). The term $\theta$ represents the importance of the shape cost. The final expression of the shape cost is shown in Equation (10).
$$\omega_w = \frac{\left|w - w^{gt}\right|}{\max\left(w, w^{gt}\right)}, \quad \omega_h = \frac{\left|h - h^{gt}\right|}{\max\left(h, h^{gt}\right)} \tag{9}$$
$$\Omega = \sum_{t = w, h}\left(1 - e^{-\omega_t}\right)^{\theta} \tag{10}$$
(4) IoU cost. According to Figure 9, $B \cap B^{GT}$ and $B \cup B^{GT}$ represent the intersection and union of $B$ and $B^{GT}$, respectively. The IoU expression is shown in Equation (11). The total loss value is then obtained as shown in Equation (12).
$$IoU = \frac{\left|B \cap B^{GT}\right|}{\left|B \cup B^{GT}\right|} \tag{11}$$
$$L_{SIoU} = 1 - IoU + \frac{\Delta + \Omega}{2} \tag{12}$$
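The four costs can be combined in a few lines. The sketch below follows the SIoU formulation of [32] for horizontal boxes given as (center x, center y, width, height): angle cost Λ = 1 − 2 sin²(arcsin(sin α) − π/4), distance cost weighted by γ = 2 − Λ, shape cost with exponent θ, and total loss 1 − IoU + (Δ + Ω)/2. The value θ = 4 and the small `eps` guard are assumptions for illustration, not settings reported in this paper.

```python
import math

def siou_loss(box, box_gt, theta=4.0, eps=1e-9):
    """SIoU loss for two axis-aligned boxes in (cx, cy, w, h) form."""
    x, y, w, h = box
    xg, yg, wg, hg = box_gt

    # Angle cost: depends on the direction of the line joining the centers.
    sigma = math.hypot(xg - x, yg - y) + eps
    sin_alpha = min(abs(yg - y) / sigma, 1.0)
    angle = 1.0 - 2.0 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2

    # Distance cost: center offsets normalized by the enclosing box size.
    enc_w = max(x + w / 2, xg + wg / 2) - min(x - w / 2, xg - wg / 2)
    enc_h = max(y + h / 2, yg + hg / 2) - min(y - h / 2, yg - hg / 2)
    gamma = 2.0 - angle
    rho_x = ((xg - x) / enc_w) ** 2
    rho_y = ((yg - y) / enc_h) ** 2
    distance = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape cost: relative width and height mismatch.
    omega_w = abs(w - wg) / max(w, wg)
    omega_h = abs(h - hg) / max(h, hg)
    shape = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    # IoU cost.
    inter_w = max(0.0, min(x + w / 2, xg + wg / 2) - max(x - w / 2, xg - wg / 2))
    inter_h = max(0.0, min(y + h / 2, yg + hg / 2) - max(y - h / 2, yg - hg / 2))
    inter = inter_w * inter_h
    iou = inter / (w * h + wg * hg - inter)

    return 1.0 - iou + (distance + shape) / 2.0

loss_same = siou_loss((5.0, 5.0, 4.0, 2.0), (5.0, 5.0, 4.0, 2.0))
loss_shifted = siou_loss((6.0, 6.0, 4.0, 2.0), (5.0, 5.0, 4.0, 2.0))
loss_far = siou_loss((12.0, 9.0, 4.0, 2.0), (5.0, 5.0, 4.0, 2.0))
```

A perfect match gives a loss of essentially zero, and the loss grows as the predicted box moves away from the ground truth.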

BiFPN Module
FPN [33] has been used extensively for multi-scale feature fusion since its introduction, leading to the development of various cross-scale feature fusion networks such as PANet [34] and NAS-FPN [35]. These networks typically treat inputs from different scales equally. BiFPN [36] enhances FPN by incorporating learnable weights that determine the significance of the various input features, thus achieving a better balance of information across scales. Hence, this paper uses BiFPN to replace the feature fusion approach in the Neck. Figure 10 illustrates the structure of the BiFPN model, where P1, P2, and P3 represent features of different scales generated by the Backbone. BiFPN_Add2 and BiFPN_Add3 are feature fusion modules that combine features from the current layer and the preceding layer using weighted aggregation.
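The weighted aggregation can be sketched as follows. This assumes the fast normalized fusion form described for BiFPN in [36], where each input is scaled by a non-negative learnable weight divided by the sum of all weights. The function name `bifpn_add`, the flattened feature lists, and the value of `eps` are illustrative choices, not the paper's implementation.

```python
def bifpn_add(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps with normalized non-negative weights.

    features: list of flattened feature maps (lists of equal length).
    weights:  one learnable scalar per input; ReLU keeps them non-negative.
    """
    w = [max(0.0, v) for v in weights]
    total = sum(w) + eps
    length = len(features[0])
    return [sum(wi * f[k] for wi, f in zip(w, features)) / total for k in range(length)]

# Two inputs (as in a BiFPN_Add2-style node) with equal weights: a plain average.
equal = bifpn_add([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])

# Three inputs (as in a BiFPN_Add3-style node) where the first input dominates.
skewed = bifpn_add([[1.0, 1.0], [5.0, 5.0], [9.0, 9.0]], [8.0, 1.0, 1.0])
```

With equal weights the fusion reduces to the equal treatment used by earlier pyramids, so the learnable weights strictly generalize it.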

CSL (Circular Smooth Label) Module
YOLOv5s is a conventional object detector that primarily focuses on horizontal box detection and cannot handle uncertain object orientations in HSRIs. To overcome this limitation, this paper integrates the CSL module into YOLOv5s, allowing target orientations to be predicted. CSL transforms the defined range of angles into categories and achieves more robust angle prediction through classification. Please refer to Figure 11.
CSL consists of cyclic circular label encoding, where the assigned label values are smooth and have a certain variance. The expression for CSL is as follows:
$$CSL(x) = \begin{cases} g(x), & \theta - r < x < \theta + r \\ 0, & \text{otherwise} \end{cases} \tag{13}$$
where $g(x)$ is a window function, $r$ is the radius of the window function, and $\theta$ is the angle of the current bounding box. A desirable window function exhibits the following characteristics: periodicity, symmetry, maximum value, and monotonicity. Commonly used window functions include the pulse function, rectangular function, triangular function, and Gaussian function. From Figure 11, it can be seen that the label values are continuous at the boundaries and are not affected by the periodicity of CSL, thus avoiding accuracy errors.
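The circular label encoding can be sketched with a Gaussian window. The bin count of 180 (one bin per degree), the radius of 6, and the window width `sigma` are assumptions chosen for illustration; the function name is hypothetical.

```python
import math

def circular_smooth_label(theta_bin, num_bins=180, radius=6, sigma=2.0):
    """Encode an angle bin as a smooth label that wraps around the range ends.

    Bins strictly within `radius` of theta_bin receive a Gaussian value
    g(x); all other bins stay at zero.
    """
    label = [0.0] * num_bins
    for offset in range(-radius + 1, radius):
        x = (theta_bin + offset) % num_bins  # modulo makes the label circular
        label[x] = math.exp(-(offset ** 2) / (2 * sigma ** 2))
    return label

# An angle at the very edge of the range.
label = circular_smooth_label(0)
```

For a box at bin 0, bins 179 and 1 receive the same label value, so a prediction just across the boundary is treated as a near miss rather than a large error.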

IDetect Module
Human analysis of the same object can be conducted from multiple perspectives. However, when training convolutional neural networks, typically only one perspective is provided, making it difficult for the obtained features to be applicable to other tasks. The primary factor contributing to this problem is that the model solely focuses on extracting neural features while neglecting the acquisition and utilization of implicit knowledge, which holds significant value in analyzing diverse tasks.
In the context of neural networks, the shallow features that the network observes directly are commonly referred to as explicit knowledge. The deep features, which are unobservable and unrelated to observations, are defined as implicit knowledge. Accordingly, the IDetect module is developed in this paper to blend implicit knowledge and explicit knowledge within the Head section, leading to a notable enhancement in the algorithm's overall performance. As shown in Figure 12, the structure of the IDetect module is divided into two branches: training and inference.
During training, the input data are first fused through the ImplicitA module (initialized as a learnable variable with a value of 0) using addition. The result then passes through the Conv module to adjust the output channels. Finally, it undergoes multiplication fusion through the ImplicitM module (initialized as a learnable variable with a value of 1) to obtain the output result. During inference, only one layer of the Conv module is applied to adjust the output channels.
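The two branches can be related with a small sketch for a single pixel and a 1 × 1 convolution. It assumes that the learned implicit vectors are folded into the Conv weights before inference, which is one way a single Conv layer can reproduce the training branch; the paper does not spell out this step, and the helper names (`conv1x1`, `idetect_train`, `fuse`) are hypothetical.

```python
def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution to one pixel: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def idetect_train(x, weight, bias, implicit_a, implicit_m):
    """Training branch: add ImplicitA, apply Conv, multiply by ImplicitM."""
    shifted = [v + a for v, a in zip(x, implicit_a)]
    y = conv1x1(shifted, weight, bias)
    return [m * v for m, v in zip(implicit_m, y)]

def fuse(weight, bias, implicit_a, implicit_m):
    """Fold both implicit vectors into one Conv (assumed inference step)."""
    # W(x + a) + b = Wx + (Wa + b); each output channel is then scaled by m.
    folded_bias = conv1x1(implicit_a, weight, bias)
    fused_w = [[m * w for w in row] for row, m in zip(weight, implicit_m)]
    fused_b = [m * b for b, m in zip(folded_bias, implicit_m)]
    return fused_w, fused_b

# Toy example: 2 input channels, 3 output channels.
x = [1.0, -2.0]
W = [[0.5, 1.0], [2.0, -1.0], [0.0, 3.0]]
b = [0.1, 0.2, 0.3]
a = [0.3, -0.7]        # ImplicitA after training (initialized to 0)
m = [1.5, 0.5, 2.0]    # ImplicitM after training (initialized to 1)

train_out = idetect_train(x, W, b, a, m)
fused_W, fused_b = fuse(W, b, a, m)
infer_out = conv1x1(x, fused_W, fused_b)
print(train_out, infer_out)
```

With the initial values (ImplicitA = 0, ImplicitM = 1) the training branch reduces to the plain Conv, so the implicit terms only change the output once they have been learned.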

Experiments
The experimental setup utilized CUDA 10.0 as the computing platform, Ubuntu 18.04 as the operating system, Intel i7-7700K as the processor, NVIDIA RTX 3090 with 24 GB of VRAM as the graphics card, and PyTorch 1.10.0 as the DL framework. The performance of the algorithm is assessed using two datasets designed for object detection in HSRIs: HRSC2016 and UCAS-AOD.

Dataset
HRSC2016, which was introduced by Northwestern Polytechnical University in 2016, is acknowledged as one of the most challenging datasets for detecting ships in remote sensing. The dataset comprises 1061 HSRIs obtained from Google Earth, accompanied by 2976 instances annotated with rotated bounding boxes. The image resolutions range from 2 m to 0.4 m, and the image sizes range from 300 × 300 to 1500 × 900, with the majority exceeding 1000 × 600. For the experimental setup, single-class object recognition is conducted using three different sets of images. The training set comprises 436 images, with a total of 1207 samples. The validation set comprises 181 images, with a total of 541 samples. Lastly, the test set comprises 444 images, with a total of 1228 samples. Figure 13 displays a subset of the HRSC2016 dataset, highlighting the significant scale variations and complex background challenges present in remote sensing images.


Experimental Parameter Settings
The Adam optimizer is employed in the experiments, using a momentum coefficient of 0.937 and a learning rate of 0.001. The IoU thresholds are set to 0.10, 0.20, 0.25, and 0.30, while the confidence thresholds for detecting targets are set to 0.10, 0.20, 0.30, and 0.40. As illustrated in Figure 15, the model's loss values reached a stable state after 350 iterations on the two experimental datasets. Hence, the number of iterations is set to 400.


Experimental Evaluation Metrics
The Intersection over Union (IoU) threshold has a direct impact on the output predicted boxes, with a higher threshold typically resulting in improved prediction accuracy. In this experiment, the mean Average Precision (mAP) metric is employed as the main evaluation indicator. mAP is calculated based on the precision-recall (P-R) curve in multi-class OD, measuring the precision and recall for each class individually. The precision (P), recall (R), and mAP values are computed using Equations (14)-(16).
$$P = \frac{TP}{TP + FP} \quad (14)$$
$$R = \frac{TP}{TP + FN} \quad (15)$$
$$mAP = \frac{1}{K} \sum_{i=1}^{K} AP_i \quad (16)$$
Among these, TP represents true positives, which signifies the count of correctly detected positive samples. False positives, denoted as FP, represent the count of negative samples erroneously identified as positive detections. FN represents false negatives, indicating the count of positive samples erroneously identified as negative detections. K represents the count of target classes, whereas AP denotes the average precision of a single class.
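The metric definitions above can be sketched in a few lines. The AP routine below assumes all-point interpolation of the P-R curve, which is a common convention but is not stated in this paper; the function names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the P-R curve (assumed all-point interpolation).

    `recalls` must be sorted in increasing order, with `precisions`
    holding the matching precision at each recall level.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    """mAP = average of the per-class AP values over the K classes."""
    return sum(ap_per_class) / len(ap_per_class)

p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([0.5, 1.0], [1.0, 0.5])
print(p, r, ap, mean_average_precision([ap, 1.0]))
```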

Analysis of the Experimental Results on HRSC2016
Table 1 compares the precision of the proposed OD algorithm against state-of-the-art one-stage and two-stage OD algorithms, and Figure 16 shows selected detection results achieved by the proposed algorithm. Based on the data displayed in Table 1, the proposed algorithm outperforms all the compared algorithms, achieving an mAP of 90.29%. Compared to the algorithms in Table 1, our algorithm performs OD on large-scale images of 1024 × 1024, which is advantageous for object recognition. By using a preset number of three anchors, the algorithm effectively reduces computational complexity and achieves higher detection accuracy at a lower cost.
Based on Table 1, Gliding Vertex achieves the highest detection accuracy among the compared two-stage object detection algorithms, at 88.20%. Our algorithm improves on this result by 2.09% while also using fewer predefined prior boxes than Gliding Vertex, and it therefore outperforms all the compared two-stage algorithms in the table by a significant margin. Among the one-stage object detection algorithms in the table, S²A-Net performs best, and our algorithm still achieves a 0.12% improvement in accuracy over it.
The detection results depicted in Figure 16 show that the proposed algorithm performs well on objects that exhibit significant scale variations. This observation highlights the effectiveness of integrating the RepConv, Transformer Encoder, and BiFPN modules into the algorithm. The algorithm also demonstrates efficient and accurate detection capabilities for objects in complex backgrounds, highlighting the effectiveness of utilizing the GAM and SIoU modules. In light of this analysis, it can be inferred that our algorithm exhibits strong performance in detecting objects across a wide range of scales or in complex backgrounds, which validates the effectiveness of the proposed approach.
Furthermore, we compared our algorithm with representative one-stage object detection algorithms such as YOLOv6, YOLOv7, and YOLOv8. Our algorithm outperforms YOLOv6, YOLOv7, and YOLOv8 by 4.87%, 4.18%, and 4.59% in accuracy, respectively, indicating that their direct application to HSRIs is not effective. The YOLO series algorithms are primarily developed for conventional datasets, while HSRIs present greater challenges due to large object aspect ratios, complex backgrounds, and frequent object clustering. Therefore, conventional horizontal box object detection algorithms like the YOLO series cannot achieve the desired results when they are applied to HSRIs.

Analysis of the Experimental Results on UCAS-AOD
Experiments were conducted to compare our algorithm with the latest OD algorithms. The precision comparison results for each algorithm are displayed in Table 2. The partial detection results of our algorithm on the UCAS-AOD dataset are depicted in Figure 17.
According to Table 2, the detection accuracy of our algorithm for Car and Airplane reaches 89.60% and 90.53%, respectively, with an overall mAP of 90.06%, which is higher than all the compared algorithms. The detection results depicted in Figure 17 demonstrate the strong performance of our algorithm in detecting densely arranged objects, confirming the effectiveness of the GAM and SIoU modules introduced in this paper. The algorithm also exhibits efficient and accurate detection capability for objects with uncertain orientations, confirming the effectiveness of the circular smooth label approach for handling angle-related issues. According to this analysis, it can be inferred that our algorithm exhibits strong detection performance for densely arranged objects and objects with uncertain orientations, which serves as evidence for the effectiveness of the proposed approach.
Furthermore, we compared our algorithm with YOLOv6, YOLOv7, and YOLOv8 in terms of accuracy. YOLOv8 achieves the highest detection accuracy of 89.28% for the "Car" category, while YOLOv6 achieves the highest detection accuracy of 90.46% for the "Airplane" category. In comparison, our algorithm demonstrates accuracy improvements of 0.32% and 0.07%, respectively, providing further evidence of the effectiveness of our algorithm compared to YOLOv6, YOLOv7, and YOLOv8.

Ablation Experiments
To evaluate the rationality and effectiveness of the newly incorporated functional modules in our OD algorithm, we perform ablation experiments on the two experimental datasets. Tables 3 and 4 display the experimental findings. On the HRSC2016 dataset, the baseline model YOLOv5s achieved an mAP of 88.57%. As shown in Table 3, the introduction of the SIoU, GAM, Transformer Encoder, BiFPN, RepConv, and IDetect modules resulted in mAP improvements of 0.46%, 0.52%, 0.25%, 0.27%, 0.05%, and 0.17%, respectively. The proposed algorithm achieved an mAP of 90.29%.
The baseline model YOLOv5s achieved an mAP of 86.90% on the UCAS-AOD dataset. Similarly, as shown in Table 4, the introduction of the SIoU, GAM, Transformer Encoder, BiFPN, RepConv, and IDetect modules resulted in mAP improvements of 0.42%, 0.61%, 0.67%, 0.84%, 0.47%, and 0.15%, respectively. The proposed algorithm achieved an mAP of 90.06%. By analyzing both Tables 3 and 4, the positive impact of each newly introduced functional module in our algorithm on improving the accuracy of object recognition in HSRIs can be observed.
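The ablation figures can be cross-checked with simple arithmetic. The sketch below assumes that the modules are added cumulatively in the order listed in the text, so that the per-module gains sum to the gap between the YOLOv5s baseline and the full algorithm; the helper name `accumulate` is illustrative.

```python
MODULES = ["SIoU", "GAM", "Transformer Encoder", "BiFPN", "RepConv", "IDetect"]

def accumulate(baseline, gains):
    """Return the running mAP (in %) after each module is added in turn."""
    running, history = baseline, []
    for gain in gains:
        running += gain
        history.append(round(running, 2))
    return history

# Baselines and per-module gains as quoted in the ablation text.
hrsc = accumulate(88.57, [0.46, 0.52, 0.25, 0.27, 0.05, 0.17])
ucas = accumulate(86.90, [0.42, 0.61, 0.67, 0.84, 0.47, 0.15])

for name, h, u in zip(MODULES, hrsc, ucas):
    print(f"+{name}: HRSC2016 {h:.2f}%, UCAS-AOD {u:.2f}%")
```

Under this assumption the final entries match the reported totals of 90.29% and 90.06%, which is consistent with the gains being measured incrementally rather than each against the baseline.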

Conclusions and Future Works
OD in HSRIs encounters various challenges due to the intricate nature of the earth's surface and the specific shooting distances and angles involved. Conventional OD algorithms often fail to meet application requirements because they neglect the specific characteristics of HSRIs. In response to this, the present paper introduces an enhanced OD algorithm based on YOLOv5 specifically designed for HSRIs. By incorporating multiple functional modules, this algorithm preserves the strong OD capability of the original YOLOv5 while significantly improving the accuracy in detecting objects with diverse scales, complex backgrounds, dense arrangements, and uncertain orientations within HSRIs. The high detection accuracy achieved on the two experimental HSRI datasets validates the proposed algorithm's effectiveness. Nonetheless, the algorithm still misses some small objects and has a comparatively long processing time.
To tackle the problem of missed detections for certain small objects, our forthcoming efforts will concentrate on enhancing the precision of small object detection. This will be accomplished by employing multi-scale detection, improving feature representation, and implementing techniques such as data augmentation and sample balancing. Meanwhile, it is crucial to invest efforts in the development of OD algorithms for HSRIs that are both fast and accurate. Our plan entails exploring pruning and distillation techniques to reduce processing time while maintaining detection accuracy. Research on faster and more accurate OD algorithms can effectively cater to the requirements of real-world applications that involve extensive HSRI datasets.

Figure 3. Architecture of the designed RepConv module.
To bolster the global feature extraction capabilities of YOLOv5s, this paper introduces a Transformer Encoder module into the Backbone component, as depicted in Figure 4. The figure shows that each Transformer Encoder is composed of two sub-layers. The first sub-layer includes LayerNorm, Multi-Head Attention, and Dropout. The input to this layer is data with dimensions (n, b, c), where n is the product of the width and height of the feature map, b is the number of input images in the batch, and c is the number of feature channels. The input data undergo normalization through the LayerNorm layer, followed by Multi-Head Attention to compute similarities between targets. Lastly, the data flow through the Dropout layer to alleviate overfitting. The second sub-layer consists of LayerNorm and a Multi-Layer Perceptron (MLP). LayerNorm serves the same purpose as in the first sub-layer, while the MLP employs fully connected layers for linear transformations. Residual connections are established around each sub-layer. With the inclusion of the Transformer Encoder module, the improved algorithm acquires enhanced capabilities to capture global information and contextual details.
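The first sub-layer can be illustrated with a heavily simplified sketch. It assumes a single attention head, identity query/key/value projections, a batch of one image (b = 1), and no dropout (as at inference), so it shows only the data flow of LayerNorm, attention, and the residual connection rather than the module used in the paper. All function names are illustrative.

```python
import math

def flatten_feature_map(feature_map):
    """Turn an (h, w, c) nested list into n = h * w tokens of c channels."""
    return [pixel for row in feature_map for pixel in row]

def layer_norm(vec, eps=1e-5):
    """Normalize one token to zero mean and unit variance over channels."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(tokens):
    """Scaled dot-product similarities between every pair of tokens."""
    scale = math.sqrt(len(tokens[0]))
    return [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in tokens])
            for q in tokens]

def attention_sublayer(tokens):
    """LayerNorm -> self-attention -> residual add (dropout omitted)."""
    normed = [layer_norm(t) for t in tokens]
    weights = attention_weights(normed)
    channels = len(normed[0])
    mixed = [[sum(w * v[c] for w, v in zip(row, normed)) for c in range(channels)]
             for row in weights]
    return [[x + y for x, y in zip(t, m)] for t, m in zip(tokens, mixed)]

# A 2 x 2 feature map with 3 channels becomes n = 4 tokens.
fmap = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
        [[2.0, 0.0, 1.0], [1.0, 1.0, 4.0]]]
tokens = flatten_feature_map(fmap)
out = attention_sublayer(tokens)
print(len(out), len(out[0]))
```

Because every token attends to every other token, each output position mixes information from the whole feature map, which is the sense in which the module captures global context.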

Figure 4. Architecture of the Transformer Encoder module.

Figure 5. Architecture of the GAM module.

Figure 6. Comparison of architectures between C3 module and C3GAM module.

Remote Sens. 2023, 15, x FOR PEER REVIEW

(1) Angle cost
The schematic diagram in Figure 7 illustrates the angle cost, where $B$ represents the predicted box with center coordinates $(b_{c_x}, b_{c_y})$, and $B^{GT}$ represents the ground truth box with center coordinates $(b^{gt}_{c_x}, b^{gt}_{c_y})$. $\sigma$ represents the distance between the center coordinates of $B$ and $B^{GT}$, that is, $\sigma = \sqrt{(b^{gt}_{c_x} - b_{c_x})^2 + (b^{gt}_{c_y} - b_{c_y})^2}$.
(2) Distance cost
The distance cost is shown in Figure 8, where $B$ represents the predicted box with center coordinates $(b_{c_x}, b_{c_y})$, and $B^{GT}$ represents the ground truth box with center coordinates $(b^{gt}_{c_x}, b^{gt}_{c_y})$; $C_w$ and $C_h$ denote the width and height of the minimum bounding rectangle of $B$ and $B^{GT}$, respectively. The term $\rho_x$ represents the squared ratio of the difference in x-axis coordinates between $B$ and $B^{GT}$ to $C_w$, while $\rho_y$ represents the squared ratio of the difference in y-axis coordinates between $B$ and $B^{GT}$ to $C_h$, as described by Equation (7). The final expression of the distance cost is shown in Equation (8).
$$\rho_x = \left(\frac{b^{gt}_{c_x} - b_{c_x}}{C_w}\right)^2, \quad \rho_y = \left(\frac{b^{gt}_{c_y} - b_{c_y}}{C_h}\right)^2 \quad (7)$$
$$\Delta = \sum_{t=x,y} \left(1 - e^{-\gamma \rho_t}\right) \quad (8)$$
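The distance cost can be evaluated numerically with a short sketch. Boxes are assumed to be axis-aligned and given as (x1, y1, x2, y2) corner tuples, and γ is passed in as a plain argument because its definition lies outside this passage; the function names are illustrative.

```python
import math

def box_center(box):
    """Center (cx, cy) of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def enclosing_size(pred, gt):
    """Width C_w and height C_h of the minimum rectangle enclosing both boxes."""
    c_w = max(pred[2], gt[2]) - min(pred[0], gt[0])
    c_h = max(pred[3], gt[3]) - min(pred[1], gt[1])
    return c_w, c_h

def distance_cost(pred, gt, gamma):
    """Distance cost: sum over t in {x, y} of 1 - exp(-gamma * rho_t)."""
    (px, py), (gx, gy) = box_center(pred), box_center(gt)
    c_w, c_h = enclosing_size(pred, gt)
    rho_x = ((gx - px) / c_w) ** 2  # squared normalized x offset
    rho_y = ((gy - py) / c_h) ** 2  # squared normalized y offset
    return sum(1.0 - math.exp(-gamma * rho) for rho in (rho_x, rho_y))

pred_box = (0.0, 0.0, 2.0, 2.0)
gt_box = (1.0, 0.0, 3.0, 2.0)   # same size, shifted one unit along x
print(distance_cost(pred_box, gt_box, gamma=2.0))
```

The cost is zero when the two centers coincide and approaches 2 as the normalized offsets grow, so it penalizes center misalignment independently of box overlap.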

Figure 9. Illustration of Intersection and Union.

In the window function figure, lines of different colors represent different window functions: the yellow line represents the pulse function, the green line represents the rectangular function, the blue line represents the triangular function, and the red line represents the Gaussian function.

Figure 13. Sample images from the HRSC2016 dataset.

UCAS-AOD is a dataset specifically designed for aircraft and car detection, consisting of 1000 images containing 7482 instances of aircraft and 510 images containing 7114 instances of cars. The dataset is split into training, validation, and test sets in a ratio of 5:2:3. The training set consists of 755 images, the validation set contains 302 images, and the test set comprises 453 images. All images have dimensions close to 1280 × 659. Figure 14 displays a subset of the UCAS-AOD dataset, primarily highlighting the dense arrangement of objects and the issue of orientation uncertainty in HSRIs.
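The UCAS-AOD split sizes follow directly from the 5:2:3 ratio applied to the 1510 images. A minimal check is sketched below; the helper `split_counts` is illustrative, and assigning any leftover images to the training set is an assumption that does not come into play for these numbers.

```python
def split_counts(total_images, ratio):
    """Split an image count by an integer ratio such as (5, 2, 3)."""
    parts = sum(ratio)
    counts = [total_images * r // parts for r in ratio]
    counts[0] += total_images - sum(counts)  # leftover images go to training
    return counts

ucas_total = 1000 + 510  # aircraft images + car images
train, val, test = split_counts(ucas_total, (5, 2, 3))
print(train, val, test)
```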

Figure 15. Loss function curves for the HRSC2016 and UCAS-AOD datasets.

Figure 16. Detection results obtained from the HRSC2016 dataset.

Figure 17. Detection results obtained from the UCAS-AOD dataset.

Table 1. Accuracy comparison of different OD algorithms on the HRSC2016 dataset.

Table 2. Comparison of accuracy of different OD algorithms on the UCAS-AOD dataset.

Table 3. Changes in recognition accuracy as modules are added on the HRSC2016 dataset.