RSI-YOLO: Object Detection Method for Remote Sensing Images Based on Improved YOLO

With the continuous development of deep learning technology, object detection has received extensive attention across various computer fields as a fundamental task of computer vision. Effectively detecting objects in remote sensing images is challenging owing to their small size and low resolution. In this study, a remote sensing image detection approach (RSI-YOLO) is proposed based on YOLOv5, one of the most representative and effective object detection algorithms. Channel attention and spatial attention mechanisms are used to strengthen the features fused by the neural network. The multi-scale feature fusion structure of the original network, based on a PANet structure, is improved to a weighted bidirectional feature pyramid structure to achieve more efficient and richer feature fusion. In addition, a small object detection layer is added, and the loss function is modified to optimise the network model. Experimental results on four remote sensing image datasets, including DOTA and NWPU-VHR-10, indicate that RSI-YOLO outperforms the original YOLO in detection performance. The proposed RSI-YOLO algorithm also demonstrates superior detection performance compared to other classical object detection algorithms, validating the effectiveness of the improvements introduced into the YOLOv5 algorithm.


Introduction
Remote sensing is a technology for obtaining information about a target by means of remote sensing platforms, such as spaceborne and airborne sensors, without direct contact with the target. The technology enables comprehensive earth observation, with a large amount of target information contained in the images, and is widely used in agriculture, environmental monitoring, urban planning, disaster management, and other fields [1][2][3]. With the development and progress of remote sensing technology, remote sensing image observation and acquisition have become more efficient. Since objects in remote sensing images are generally small and may be densely distributed, using the human eye to obtain effective information is inefficient and error-prone, and manually extracting this information is impractical. Consequently, object detection technology has been widely applied in remote sensing [4].
Traditional remote sensing image object detection methods mainly rely on hand-designed features and traditional machine learning algorithms. However, owing to the complexity and diversity of remote sensing image data, traditional methods encounter many difficulties in processing these data. The rapid development of deep learning technology, especially the emergence of convolutional neural networks (CNNs), has brought new opportunities and applications for remote sensing image object detection [5,6]. Compared with traditional methods based on manual feature design, deep learning can automatically learn higher-level semantic features through stronger feature representation capabilities and better adapt to the complex variations in remote sensing images. Remote sensing image target detection also needs to consider the context information around the target in order to identify and locate the target more accurately. Deep learning models can capture target context information through local receptive fields and pooling operations in convolutional neural networks, improving the accuracy and robustness of target detection. In addition, deep learning models often require large amounts of training data to realise their advantages. Remote sensing image data contain rich temporal and spatial information with large-scale coverage, providing abundant training data. This enables deep learning to learn more accurate and generalisable models from large-scale data to cope with the diversity and complexity of object detection in remote sensing images.
There are two main types of deep-learning-based object detection algorithms. The first category comprises two-stage object detection algorithms, such as the spatial pyramid pooling network (SPP-Net) [7], regions with convolutional neural networks (R-CNN) [8], and Faster R-CNN [9]. While these algorithms generally exhibit high detection accuracy, their inherent two-stage nature often creates bottlenecks in detection speed. The second category comprises one-stage detection algorithms, represented by you only look once (YOLO) and the single-shot detector (SSD) [10]. These algorithms have fast detection speeds but may sacrifice detection accuracy compared to two-stage algorithms. As high detection speed is required in most complex tasks, one-stage algorithms have advantages in practical applications. This study improved the YOLOv5 algorithm, and the key points of our work are as follows: (1) The YOLOv5 algorithm was improved by incorporating an attention mechanism into the backbone network. The design of the attention mechanism was inspired by the human visual system. When people view an image, their gaze remains focused on the parts they are interested in, with the visual system automatically filtering out the unimportant parts. This is the manifestation of attention, which involves a conscious focus on the important content in an image. From a computational perspective, the attention mechanism enables the algorithm to filter out extraneous information and focus primarily on processing the most relevant information. This study introduces the convolutional block attention module (CBAM) [11] into the YOLO object detection algorithm to explore the feasibility of improving the model. (2) The bi-directional feature pyramid network (Bi-FPN) was used to modify the neck section of the network.
The FPN structure was proposed to address the challenge of preserving the features of smaller objects during the continuous convolution process, since larger objects have more pixels and their features are therefore easier to extract and preserve. In this study, the weighted Bi-FPN of EfficientDet [12], which supports multi-scale fusion more naturally and quickly, was introduced into the neck section of the YOLOv5s network to obtain a more efficient multi-scale fusion method. (3) A small object detection layer was added. When a neural network learns feature information, it can be difficult to learn useful information if the target is too small. If the downsampling factor is too large, information is easily lost; if it is too small, extensive GPU resources are required to store feature map information during forward propagation, considerably reducing the speed of training and inference. In this paper, an additional detection layer was added to the original three detection layers of YOLO to improve the detection of small objects. (4) Efficient intersection over union (EIOU) loss was introduced to improve the loss function.
Loss function optimisation is a common way to improve object detection performance. In the original YOLOv5 loss function, when regressing the predicted values, the term representing the aspect ratio in the complete IOU (CIOU) loss [13] loses its effect when the aspect ratios of the predicted and ground-truth boxes are in a linear relationship (e.g., identical). This study introduces EIOU loss [14] to replace the original CIOU loss and improve the training performance of the network.
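The aspect-ratio issue described above can be illustrated numerically. The following sketch (an illustration, not the exact formulation used in the paper) compares the CIOU aspect-ratio term with EIOU's separate width/height penalties for a predicted box that has the correct aspect ratio but the wrong size:

```python
import math

def ciou_aspect_term(pred, gt):
    # v term of CIOU: penalises the difference in aspect ratio only
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    return (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2

def eiou_wh_terms(pred, gt):
    # EIOU penalises width and height differences separately,
    # normalised by the dimensions of the smallest enclosing box
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])  # enclosing box width
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])  # enclosing box height
    return (wp - wg) ** 2 / cw ** 2, (hp - hg) ** 2 / ch ** 2

# Predicted box with the same 2:1 aspect ratio as the ground truth but half
# its size: the CIOU aspect term vanishes, giving no gradient signal for the
# size error, whereas the EIOU width/height terms remain non-zero.
gt = (0, 0, 4, 2)
pred = (0, 0, 2, 1)
```

This is the failure mode EIOU is designed to remove: the size error is still penalised even when the aspect ratios already agree.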

YOLOv5 provides several network structures with different numbers of parameters. The main difference between these structures is the depth of the model and the number of channels in the convolutional layers; the overall architecture is similar. This study used the YOLOv5s structure, which consists of four parts: input, backbone, neck, and head, as shown in Figure 1. Each part is introduced in detail in the following sections.

Input
The input module normalises images of different sizes into a 640 × 640 × 3 tensor for input into the network and generates initial prediction boxes using the anchor box mechanism. YOLOv5 provides three sets of pre-set anchor box sizes. If the initial anchor boxes and the target sizes of the dataset do not satisfy certain conditions, the program uses k-means [23] and a genetic evolution algorithm to determine the anchor box sizes that best match the targets of the current dataset.
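For illustration, the anchor-clustering step can be sketched as follows. This is a simplified NumPy version of IoU-based k-means on box widths and heights (YOLOv5's autoanchor additionally applies a genetic refinement, which is omitted here); the dataset below is hypothetical.

```python
import numpy as np

def kmeans_anchors(wh, k, iters=50, seed=0):
    # Cluster (width, height) pairs into k anchor sizes, assigning each box
    # to the anchor with which it has the highest IoU (boxes and anchors are
    # compared as if they shared a top-left corner).
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0]) *
                 np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = (wh[:, None, 0] * wh[:, None, 1] +
                 anchors[None, :, 0] * anchors[None, :, 1] - inter)
        assign = np.argmax(inter / union, axis=1)  # nearest anchor = highest IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors

# Hypothetical dataset with two obvious size clusters
wh = np.array([[10, 12], [11, 11], [9, 13], [50, 48], [52, 51], [49, 50]])
anchors = kmeans_anchors(wh, k=2)
```

With this toy data, the two recovered anchors settle near the small and large size clusters, which is exactly the behaviour the anchor mechanism relies on.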

Backbone
The backbone module is responsible for feature extraction and is designed based on CSP-Darknet53. It consists of four modules: Conv, Focus, C3, and SPP [24]. Each of these modules is described below. Figure 2 shows the Conv module, which is composed of a convolution operation, a BN (batch normalisation) [25] layer, and the SiLU activation function; for this reason, the module is also called the CBL module. As the basic module in the YOLOv5 network, it constitutes all convolution operations in the network.
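As a minimal sketch (in NumPy, with a 1 × 1 convolution standing in for the general convolution and randomly initialised weights purely for illustration), the Conv module's Conv → BN → SiLU pipeline looks like this:

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalise each channel over the batch and spatial axes (N, H, W)
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def conv_bn_silu(x, weight):
    # 1x1 convolution as a channel-mixing matrix multiply, then BN, then SiLU;
    # a stand-in for the Conv (CBL) block: Conv -> BN -> SiLU
    y = np.einsum('nchw,oc->nohw', x, weight)
    return silu(batch_norm(y))

x = np.random.default_rng(0).standard_normal((2, 3, 4, 4))
w = np.random.default_rng(1).standard_normal((8, 3))  # 3 -> 8 channels
out = conv_bn_silu(x, w)
```

In the actual network this block is implemented with PyTorch's Conv2d/BatchNorm2d/SiLU layers; the sketch only shows the order of operations.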

Figure 3 shows the Focus module. The idea behind this module is to slice the image in a way similar to subsampling and then splice the slices together. In this way, the width and height information of the original image is segmented and aggregated into the channel dimension. This reduces the amount of computation and improves the training speed of the network.
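The slicing operation can be sketched as follows (a minimal NumPy illustration); the 4× channel expansion shows why a subsequent convolution can work on a smaller spatial grid without losing any pixels:

```python
import numpy as np

def focus(x):
    # Focus slicing: split an (N, C, H, W) map into four H/2 x W/2 slices
    # and stack them on the channel axis, giving (N, 4C, H/2, W/2).
    # No information is lost; spatial detail moves into channels.
    return np.concatenate(
        [x[..., ::2, ::2],    # even rows, even cols
         x[..., 1::2, ::2],   # odd rows, even cols
         x[..., ::2, 1::2],   # even rows, odd cols
         x[..., 1::2, 1::2]], # odd rows, odd cols
        axis=1,
    )

x = np.arange(1 * 3 * 8 * 8).reshape(1, 3, 8, 8)
y = focus(x)  # shape (1, 12, 4, 4), same elements rearranged
```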

Figure 4 shows the C3 module, a simplified version of the CSP bottleneck structure. It consists of a Bottleneck module and three Conv structures. The Bottleneck aims to further extract and enhance feature information. The C3 module is designed to improve feature representation and deepen the network without increasing the amount of computation. Figure 5 shows the SPP module, which first halves the input channels through a standard convolution module, then performs max-pooling operations with different kernel sizes, and finally concatenates the results so that more features of different sizes can be fused and more network information can be obtained before it is sent to the neck.
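A minimal sketch of the SPP idea (NumPy, stride-1 pooling with kernel sizes 5, 9, and 13 as in YOLOv5's SPP; the surrounding 1 × 1 convolutions that halve and restore the channels are omitted for brevity):

```python
import numpy as np

def maxpool_same(x, k):
    # Stride-1 max pooling with same padding, applied per channel
    p = k // 2
    xp = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), constant_values=-np.inf)
    n, c, h, w = x.shape
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, :, i, j] = xp[:, :, i:i + k, j:j + k].max(axis=(2, 3))
    return out

def spp(x):
    # Pool the input at several kernel sizes and concatenate the results
    # with the input itself on the channel axis, fusing receptive fields
    # of different sizes into one feature map.
    return np.concatenate([x] + [maxpool_same(x, k) for k in (5, 9, 13)], axis=1)

x = np.random.default_rng(0).standard_normal((1, 4, 16, 16))
y = spp(x)  # channels grow from 4 to 16 (input + three pooled copies)
```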

Neck
The neck module further enhances the diversity and robustness of features. While YOLOv5 has undergone some minor adjustments compared with its predecessor, it still adopts the FPN [26] + path aggregation network (PAN) [27] structural design. The upsample module performs standard upsampling operations, and the concat module concatenates tensors along a given dimension, which is not equivalent to an add operation. The add operation adds corresponding elements of tensors of the same size without changing the tensor size, whereas the concat module merges two tensors into one, and during this process the size of the resulting tensor may change. The remaining modules are the same as those described in the backbone section.
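The difference between the two fusion operations can be seen directly on dummy feature maps:

```python
import numpy as np

a = np.ones((1, 64, 80, 80))  # feature map in (N, C, H, W) layout
b = np.ones((1, 64, 80, 80))

added = a + b                             # add: element-wise sum, shape unchanged
merged = np.concatenate((a, b), axis=1)   # concat: channel dimension doubles
```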

Head
The head outputs the detection results of the network. It consists of three detection modules for large, medium, and small objects. The most important part of the head is the loss function. For machine learning algorithms, learning refers to autonomously improving the performance of the model on its task, and the loss function measures how well the model is doing, thereby quantifying its effectiveness. The loss function is also often called the objective function, and the machine learning task aims to optimise this function to its minimum. The YOLOv5 loss function is expressed in Equation (1) and consists of classification, objectness, and localisation losses, where the various λ_i are balancing coefficients. In the literature, localisation and objectness losses are sometimes referred to as bounding box and confidence losses, respectively.

In traditional simple classification tasks, class labels are mutually exclusive. For example, a target could be a chicken, a duck, or a goose, but it can only belong to one of these classes. Therefore, the softmax function is often used to convert predicted values into probability values that add up to 1, with the highest value taken as the result. In YOLOv3 and later versions, the algorithm considers the possibility that a target may belong to multiple classes. For example, when recognising a person, it may output both "child" and "running" as results. In this case, the probability of each class is treated independently, and the classification loss utilises binary cross-entropy (BCE) loss [28], which is calculated using Equation (2). Here, N represents the total number of classes; C_ij is the true class value; Ĉ_ij is the class value after activation; and x_i is the current class prediction value.
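A minimal sketch of this multi-label BCE computation (in the spirit of Equation (2); the logits and targets below are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(logits, targets):
    # Multi-label BCE: each class probability is computed independently with
    # a sigmoid (no softmax), so a target may belong to several classes.
    p = sigmoid(logits)
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

# Hypothetical prediction over three non-exclusive classes
logits = np.array([2.0, -1.0, 3.0])
targets = np.array([1.0, 0.0, 1.0])
loss = bce_loss(logits, targets)
```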
As part of the objectness loss calculation, BCE loss is calculated using Equation (3). Here, C_ij represents the true value of the presence of an object, where 1 represents presence and 0 represents absence. Ĉ_ij represents the probability of the presence of an object after the activation function, and its calculation is the same as that of the sigmoid function in the classification loss equation.
Localisation loss is calculated using CIOU loss, which considers three geometric factors in object detection tasks: the overlap area, the distance between centre points, and the aspect ratio. As shown in Equation (4), the localisation loss equals 1 minus the CIOU value, where ρ(b, b^gt) represents the Euclidean distance between the centre points of the predicted box and the true box; c represents the diagonal length of the minimum closed bounding box that contains both the true and predicted boxes; υ is a parameter that measures aspect-ratio consistency; and α is a weight function.
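For illustration, the CIOU loss of Equation (4) can be sketched for axis-aligned boxes as follows (a simplified standalone version, not the paper's exact implementation):

```python
import math

def ciou_loss(pred, gt):
    # CIOU loss = 1 - IoU + rho^2/c^2 + alpha*v, combining overlap, centre
    # distance, and aspect-ratio consistency. Boxes are (x1, y1, x2, y2).
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter)

    # squared distance between box centres
    rho2 = ((pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 + \
           ((pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    # squared diagonal of the smallest enclosing box
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2

    # aspect-ratio term v and its weight alpha
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v

perfect = ciou_loss((0, 0, 4, 2), (0, 0, 4, 2))  # identical boxes
shifted = ciou_loss((1, 1, 5, 3), (0, 0, 4, 2))  # displaced box
```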
Regarding the loss function, there are two points to consider. First, the objectness loss uses three different weights to compute the final result for the three prediction feature maps covering objects of different sizes. These weights, denoted by the parameters in front of each prediction feature map term in Equation (5), are hyperparameters set based on the common objects in context (COCO) [29] dataset. Second, not all losses are calculated over all samples. For instance, the classification and localisation losses are calculated only for positive samples, whereas the objectness loss, which is based on the CIOU of the target bounding box, is calculated for all samples.
The YOLOv5 algorithm uses the PyTorch framework, which is easier to deploy than the Darknet framework used by earlier versions. In this study, this version of the YOLO object detection algorithm was used for development and improvement, and the effectiveness of the algorithmic improvements was evaluated through experiments.

Introducing Attention Mechanism
The attention mechanism enables neural networks to focus on important features and ignore unimportant ones [30]. Convolutional operations extract features by mixing channel and spatial information, so the design of attention mechanisms focuses on both of these aspects. CBAM consists of two parts: a channel attention module (CAM) and a spatial attention module (SAM), as shown in Figure 6. By redistributing weights, these two modules help the network better learn the category features and location information of target objects.

The channel attention module focuses on the category of the input image target, as shown in Figure 7. The module takes the input feature map and applies global max pooling and global average pooling to obtain two feature maps. These two 1 × 1 × C feature maps are then fed into a multilayer perceptron (MLP) [31], a fully connected neural network with two layers. The first layer has C/r neurons, where r is the reduction ratio, and employs a leaky ReLU activation function. The second layer consists of C neurons, and together the two layers constitute a shared fully connected layer. The output features of the shared layer are added together, and a sigmoid normalisation operation is applied to generate the final channel attention features. This process is expressed mathematically in Equation (6), where M_c generates the feature; AvgPool and MaxPool are the global average pooling and global max pooling operations, respectively; σ is the sigmoid function; and W_1 and W_0 are the shared parameters of the MLP network.
The spatial attention module focuses on the position of the input image target, as shown in Figure 8. The module accepts the feature map output of the channel attention module as its input. First, the feature map is passed through a global max pooling layer and a global average pooling layer, and the two results are concatenated along the channel axis to obtain a single feature map. This feature map is then reduced to a single channel using a convolutional operation, and a sigmoid normalisation operation is applied to generate the final spatial attention feature. This process is expressed mathematically in Equation (7), where M_s generates the feature, and f^{7×7} is a convolutional operation with a kernel size of 7 × 7.
The final output feature is obtained by multiplying the feature output of the spatial attention module with the input feature. The newly generated feature map can improve the connection between various features in both channel and spatial dimensions, making it more effective in extracting the relevant features of the target.
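A compact NumPy sketch of the whole CBAM pass (for brevity, the MLP weights are random, and the 7 × 7 convolution of the spatial module is replaced by a simple mean over the two pooled maps; both substitutions are for illustration only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w0, w1):
    # Global average- and max-pool to (N, C), pass both through the shared
    # two-layer MLP (w0: C -> C/r, w1: C/r -> C), add, then sigmoid
    avg = x.mean(axis=(2, 3))
    mx = x.max(axis=(2, 3))
    def mlp(v):
        h = v @ w0
        h = np.where(h > 0, h, 0.01 * h)  # leaky ReLU hidden layer
        return h @ w1
    return sigmoid(mlp(avg) + mlp(mx))[:, :, None, None]

def spatial_attention(x):
    # Pool along the channel axis, stack the two maps, and reduce them to one
    # channel (a fixed mean stands in for the 7x7 convolution here)
    avg = x.mean(axis=1, keepdims=True)
    mx = x.max(axis=1, keepdims=True)
    return sigmoid(np.concatenate([avg, mx], axis=1).mean(axis=1, keepdims=True))

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8, 16, 16))
w0 = rng.standard_normal((8, 4))   # reduction ratio r = 2
w1 = rng.standard_normal((4, 8))
y = x * channel_attention(x, w0, w1)  # reweight channels first
y = y * spatial_attention(y)          # then reweight positions
```

The two sigmoid outputs are always in (0, 1), so the module can only attenuate features, never amplify them; this is what "redistributing weights" means in practice.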

Feature Fusion Structure Improvement
The neck section of the YOLOv5 network adopts the FPN+PAN structure. The higher layers of the convolutional neural network are more sensitive to semantic features; however, because of the smaller feature map size, they carry fewer position features, which is not conducive to locating targets. The lower layers of the network are more sensitive to position features because of the larger feature map size; however, they carry fewer semantic features, which is not conducive to classifying targets. FPN is a structure that integrates deep and shallow features; its structure is shown in the blue dotted box of Figure 9. FPN first processes feature maps bottom-up, generating a pyramid-like structure by sending them through a pre-trained network. Then, in a top-down pass, it duplicates the high-level feature maps and enlarges them by upsampling. Subsequently, it reduces the dimensions of the corresponding feature maps from the bottom-up pass and performs element-wise addition to obtain a new layer. The FPN structure allows the network to obtain better semantic features, whereas the PAN structure allows the network to better transmit position information. The PAN structure is shown in the red dotted box of Figure 9. PAN first duplicates the lowest layer of the feature pyramid, performs downsampling, and concatenates the output with the second layer of the FPN pyramid to fuse features, which is essentially the reverse operation of the FPN structure.
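The two pathways can be sketched on dummy feature maps as follows (nearest-neighbour upsampling and plain subsampling stand in for the actual neck modules, and channel counts are unified for brevity):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour upsampling, as used in the YOLOv5 neck
    return x.repeat(2, axis=2).repeat(2, axis=3)

def downsample2x(x):
    # Stride-2 subsampling stands in for the strided conv used in PAN
    return x[:, :, ::2, ::2]

# Backbone outputs at three scales
c3 = np.ones((1, 64, 80, 80))
c4 = np.ones((1, 64, 40, 40))
c5 = np.ones((1, 64, 20, 20))

# FPN: top-down pathway, enlarging deep maps and fusing with shallow ones
p4 = c4 + upsample2x(c5)
p3 = c3 + upsample2x(p4)

# PAN: bottom-up pathway, pushing position information back up (YOLOv5
# fuses by channel concatenation here rather than addition)
n4 = np.concatenate([downsample2x(p3), p4], axis=1)
```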
The final output feature is obtained by multiplying the feature output of the spatial attention module with the input feature. The newly generated feature map can improve the connection between various features in both channel and spatial dimensions, making it more effective in extracting the relevant features of the target.

Feature Fusion Structure Improvement
The neck section of the YOLOv5 network adopts the FPN+PAN structure. The higher layers of a convolutional neural network are more sensitive to semantic features; however, because of the smaller feature map size, they carry fewer position features, which is not conducive to locating targets. The lower layers are more sensitive to position features because of the larger feature map size; however, they carry fewer semantic features, which is not conducive to classifying targets. FPN is a structure that integrates deep and shallow features. The structure diagram of FPN is shown in the blue dotted box of Figure 6. FPN first processes feature maps bottom-up, generating a pyramid-like structure by sending them through a pre-trained network. Then, in a top-down pass, it duplicates the high-level feature maps and enlarges them by upsampling. Subsequently, it reduces the dimensions of the corresponding bottom-up feature maps and performs element-wise addition to obtain a new layer. The FPN structure allows the network to obtain richer semantic features, whereas the PAN structure allows the network to better transmit position information. The PAN structure is shown in the red dotted box of Figure 9. PAN first duplicates the lowest layer of the feature pyramid, performs downsampling, and concatenates the output with the second layer of the FPN pyramid to fuse features, which is essentially the reverse of the FPN operation. The PAN structure adopted by YOLOv5 follows the feature fusion idea of the original PANet. Unlike PANet, which fuses features by pixel-wise summation (add operation), YOLOv5 uses channel concatenation (concat operation). In this project, the weighted bidirectional feature pyramid network (Bi-FPN) structure is used to replace the PAN structure of the YOLOv5 network.
PANet does not set weights when fusing features of different scales, whereas Bi-FPN introduces weights to balance the information carried by features of different scales. The Bi-FPN structure is shown in Figure 10, where Pi represents the feature map obtained from the backbone network. By adding the Bi-FPN structure, the algorithm achieves more efficient and richer feature fusion.
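The weighted fusion that distinguishes Bi-FPN from PANet can be illustrated with Bi-FPN's fast normalized fusion: each input feature map receives a non-negative learnable weight, and the weights are normalised to sum to one. In this sketch the weights are fixed numbers standing in for the learned parameters.

```python
import numpy as np

def weighted_fusion(features, weights, eps=1e-4):
    """Fast normalized fusion as used by Bi-FPN.

    features: list of same-shaped feature maps
    weights:  one scalar per input; clipped to be non-negative and
              normalised so the fused output is a convex combination."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # keep weights >= 0
    w = w / (w.sum() + eps)                                # normalise to ~1
    return sum(wi * f for wi, f in zip(w, features))

# Two same-sized maps, e.g. a top-down path output and a lateral input
p_td = np.full((64, 64), 2.0)
p_in = np.full((64, 64), 4.0)
fused = weighted_fusion([p_td, p_in], weights=[1.0, 3.0])  # ~0.25*p_td + 0.75*p_in
```

Unweighted fusion (as in PANet) is the special case where all weights are equal; the learned weights let the network decide how much each resolution should contribute at every fusion node.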

Adding the Small Object Detection Layer
Detecting small objects has always been a challenging problem in object detection. The small number of pixels occupied by a small object limits the size of its receptive field, making it difficult to learn the target's features. If the image is large, information is easily lost during high-magnification downsampling, whereas downsampling at low magnification forces the network to keep a large amount of feature information in memory, which can easily interrupt the detection task because of resource exhaustion on the graphics card. Building on the feature extraction process of the YOLOv5 algorithm shown in Figure 11, this study improves the algorithm's ability to recognise small objects by adding a small object detection layer. The network structure of this study, with the small object detection layer added to the original YOLOv5 network model, is shown in Figure 12; the modified part is highlighted by a green dashed box. The upsampling operation after the 17th layer of the network enlarges the feature map to 160 × 160, the same size as the feature map generated by the second layer of the backbone network; a feature map of this size has a relatively small receptive field. The feature maps from these two layers are then concatenated (concat operation) to obtain a larger feature map. Finally, the detection layers of the original network, which use the 17th, 20th, and 23rd layers, are modified so that the 21st, 24th, 27th, and 30th layers serve as detection layers. These modifications add a new detection layer to the network for small object detection.
Adding detection layers increases the computational complexity of the network and consequently slows down the inference speed of the model. However, during training, the recall rate improves and the target loss is significantly reduced.
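The upsample-and-concatenate step that feeds the new 160 × 160 detection head can be sketched with plain arrays. The channel counts below are hypothetical placeholders for illustration, not the actual widths of the modified network.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map,
    mirroring the step that grows an 80x80 map to 160x160."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Hypothetical channel counts; the real values come from the
# modified YOLOv5 configuration.
neck_out = np.random.rand(128, 80, 80)       # map after the 17th layer
backbone_p2 = np.random.rand(64, 160, 160)   # shallow backbone feature map

up = upsample2x(neck_out)                         # (128, 160, 160)
fused = np.concatenate([up, backbone_p2], axis=0) # channel-wise concat
print(fused.shape)                                # (192, 160, 160)
```

The fused 160 × 160 map preserves the fine spatial detail of the shallow layer while carrying the semantics of the deeper one, which is what the new small-object head detects on.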

Improvement of Loss Function
Improving the loss function is a common approach to optimising object detection models. As mentioned earlier, YOLOv5 uses the CIOU loss as the localisation loss function, which represents the width and height losses of the predicted box through the aspect ratio. However, the aspect ratio may not accurately represent the true width and height: if the true and predicted aspect ratios are linearly related, the penalty term υ contributes little. Moreover, according to the formula, the width and height of the predicted box are inversely coupled, so when one increases the other decreases, making it difficult for the model to optimise both dimensions simultaneously. In this project, the EIOU loss is used to solve this problem of the CIOU loss, as expressed in Equation (8).
The EIOU loss consists of three parts: the overlap loss between the predicted and true boxes, the centre-distance loss, and the width and height loss. The first two components are identical to those of the CIOU loss, but the third replaces the aspect ratio by treating the differences between the predicted and true box widths and heights separately. In the formula, Cw² and Ch² represent the squared width and height of the minimum bounding rectangle of the predicted and true boxes, respectively. The iterative regression of the predicted box under the CIOU and EIOU losses is compared in Figure 13: with the CIOU loss, the width and height of the predicted box cannot increase or decrease simultaneously, whereas with the EIOU loss they can. The EIOU loss function improves the regression accuracy of the predicted box, which in turn helps the model converge faster during training.
Figure 13. Comparison of the iterative process for box prediction using CIOU and EIOU losses. The blue boxes are the true boxes, and the black boxes are the initial predicted boxes. The red and green boxes are the predicted boxes during the convergence process, respectively.
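Under the usual EIOU definition (1 − IoU, plus a centre-distance penalty and separate width/height penalties, each normalised by the enclosing box), the loss for a pair of axis-aligned boxes can be sketched as follows; this is a minimal reference implementation, not the paper's training code.

```python
def eiou_loss(box_p, box_g):
    """EIOU loss for axis-aligned boxes given as (x1, y1, x2, y2).

    Three parts: overlap (1 - IoU), centre-distance, and separate
    width/height terms normalised by the enclosing box (Cw, Ch)."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g

    # IoU term
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter + 1e-9)

    # Smallest enclosing box: width Cw, height Ch, diagonal^2 c2
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + 1e-9

    # Centre-distance term
    pcx, pcy = (px1 + px2) / 2, (py1 + py2) / 2
    gcx, gcy = (gx1 + gx2) / 2, (gy1 + gy2) / 2
    dist2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2

    # Width/height terms handled separately (unlike CIOU's aspect ratio)
    wp, hp = px2 - px1, py2 - py1
    wg, hg = gx2 - gx1, gy2 - gy1
    return (1 - iou
            + dist2 / c2
            + (wp - wg) ** 2 / (cw ** 2 + 1e-9)
            + (hp - hg) ** 2 / (ch ** 2 + 1e-9))

print(eiou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # ≈ 0 for identical boxes
```

Because width and height are penalised independently, gradient descent can shrink one dimension while growing the other, which is exactly the behaviour Figure 13 shows that CIOU cannot achieve.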

Experimental Environment and Parameter Settings
This project was developed using Python, and the detection task was performed on a Windows 10 operating system. However, the available Windows 10 computers could not meet the minimum requirements of the high-performance training task; therefore, the training was carried out on a rented cloud server provided by AutoDL. The specific experimental environment configuration is shown in Table 1. The project primarily used the YOLOv5 algorithm, selecting the official YOLOv5s model pre-trained on the COCO dataset as the training model. The specific experimental parameter settings are shown in Table 2.

Evaluation Metrics
In object detection tasks, recall, precision, and average precision are commonly used as evaluation metrics. The basic elements of these evaluation metrics are shown in Table 3. (1) Recall represents the proportion of correctly detected targets to the total number of targets, as shown in Equation (9).
(2) Precision represents the proportion of correctly detected targets to the total number of predicted targets, as shown in Equation (10).
(3) Average precision (AP) represents the area enclosed by the P-R curve, which has recall and precision as the x and y axes, respectively. The formula for calculating AP is shown in Equation (11). To obtain the mean average precision (mAP) over multiple categories, the AP values of the individual categories are averaged. In this project, mAP@0.5 and mAP@0.5:0.95 were used as the main precision metrics; mAP@0.5 is the mean average precision at an IOU threshold of 0.5, whereas mAP@0.5:0.95 averages the precision over IOU thresholds from 0.5 to 0.95 in increments of 0.05.
For convenience, mAP@0.5 is referred to as AP0.5 and mAP@0.5:0.95 as AP0.5:0.95 in the remainder of this paper.
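The metrics above can be computed directly from detection counts; a minimal sketch (the counts and P-R points are illustrative, not taken from the experiments):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, and
    false negatives, as in Equations (9) and (10)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

def average_precision(recalls, precisions):
    """AP as the area under the P-R curve, approximated here by the
    trapezoidal rule over (recall, precision) points sorted by recall."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap

# mAP@0.5:0.95 averages AP over the ten IOU thresholds 0.50, 0.55, ..., 0.95
thresholds = [0.5 + 0.05 * i for i in range(10)]

p, r = precision_recall(tp=80, fp=20, fn=20)
print(p, r)  # 0.8 0.8
```

In practice, COCO-style evaluation interpolates the P-R curve before integrating; the trapezoidal approximation above conveys the same idea in its simplest form.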

DOTA Dataset
The DOTA [32] dataset was generated from Google Earth, as well as JL-1 and GF-2 satellite images. It includes fifteen categories: airplanes, ships, storage tanks, baseball fields, tennis courts, basketball courts, athletic fields, harbours, bridges, large vehicles, small vehicles, helicopters, roundabouts, soccer fields, and swimming pools. The original images in the dataset were too large for this project, as they caused high GPU utilisation during training. Therefore, the images were cropped to a size of 1024 × 1024. The final dataset for this project contained 21,046 images.
RSI-YOLO was trained on the DOTA dataset, taking a total of 14 h and 31 min. The AP0.5:0.95 and AP0.5 scores over all categories for the improved algorithm were 45.3% and 71.2%, respectively. The detection results for all categories in the dataset are shown in Table 4. RSI-YOLO was then compared with the original YOLOv5s algorithm on several representative target categories; the results in Table 5 show that RSI-YOLO achieved better performance on most metrics. Specifically, for the "airplane" category, the recall rate increased by 1.2%, and the AP0.5 and AP0.5:0.95 scores increased by 1.1% and 2.3%, respectively. For the "ship" category, the recall rate increased by 2.2%, and the AP0.5 and AP0.5:0.95 scores increased by 1.3% and 2.8%, respectively. For the "small vehicle" category, the recall rate increased by 5.6%, and AP0.5:0.95 increased by 2.5%. Overall, the recall rate across all categories increased by 4.6%, while AP0.5 and AP0.5:0.95 increased by 1.3% and 1.6%, respectively. In remote sensing image object detection, targets are often far from the detection devices and may move slowly within the field of view; the algorithm's detection accuracy and related parameters must therefore be optimised to accurately detect the targets and their speed. The experimental results demonstrate that the improvements proposed for YOLOv5s effectively raised detection performance on typical small object detection tasks such as remote sensing image object detection. Figure 14 presents the detection results for several test images, showing that RSI-YOLO effectively detected most targets in the dataset, including small objects such as vehicles, airplanes, and ships, as well as larger targets such as baseball fields, basketball courts, and bridges.
In the second column of Figure 14, RSI-YOLO detected two airplane targets located in the upper right of the image, which were missed by the original YOLOv5 algorithm, indicating the robustness of the proposed algorithm. A similar situation appears in the fourth column, where, owing to the densely packed ships, the baseline YOLOv5 model failed to detect the harbour that RSI-YOLO detected.
In the third column of Figure 14, the confidence values of the predictions for the baseball field, basketball court, and tennis court targets using RSI-YOLO were all greater than those of the original YOLOv5 algorithm, indicating that the RSI-YOLO model trained on the DOTA dataset has strong feature extraction capabilities. In the fifth column of Figure 14, the comparison of detection confidence on the bridge target likewise shows that RSI-YOLO detects better than the baseline-trained YOLOv5.

DIOR Dataset
The DIOR [33] dataset was also used to validate the effectiveness of the algorithmic improvements. This dataset contains 23,463 images and 192,472 targets with an image size of 800 × 800 pixels, covering 20 target types: airplanes, airports, baseball fields, basketball courts, bridges, chimneys, dams, highway service areas, highway toll booths, golf courses, ground athletic fields, ports, overpasses, ships, stadiums, storage tanks, tennis courts, train stations, vehicles, and wind turbines. The training took a total of 8 h and 27 min. The AP0.5 and AP0.5:0.95 over all categories for RSI-YOLO were 79.4% and 57.2%, respectively. The detection results for all categories in the dataset are shown in Table 6. After verification, the model trained using RSI-YOLO was found to detect better than the model trained using the original YOLOv5, as shown in Figure 15. Comparing the model trained with the baseline YOLOv5s network against the one trained with RSI-YOLO revealed several missed and false detections. In the first row of the detection results, for the three aircraft targets on the left, the detection confidence values of the RSI-YOLO model were 0.74, 0.88, and 0.54, whereas those of the baseline model were only 0.19, 0.18, and 0.18. Furthermore, one aircraft target on the right was not detected by the baseline model at all, resulting in a missed detection. In the second row, the baseline model detected a sidewalk as a bridge, whereas RSI-YOLO produced no such false detection. RSI-YOLO therefore also displayed good robustness on the DIOR dataset.

NWPU VHR-10 Dataset
In addition to the DOTA and DIOR datasets mentioned above, the NWPU VHR-10 [34] dataset was used to validate the algorithmic improvements. Compared to the previous two datasets, NWPU VHR-10 is relatively small, with 650 images containing targets and 150 images without targets. The dataset includes 10 target categories: airplanes, ships, oil tanks, baseball fields, tennis courts, basketball courts, athletic fields, ports, bridges, and vehicles. The improved YOLOv5 algorithm was trained on the NWPU VHR-10 dataset, taking a total of 24 min; because the dataset is much smaller than DOTA and DIOR, training is correspondingly faster. The AP0.5 and AP0.5:0.95 for all categories of RSI-YOLO were 95.8% and 57.9%, respectively. The detection results for all categories in the dataset are shown in Table 7.

Some classes, such as "airplane" and "tank", achieved nearly 100% AP0.5 even with baseline training, which is attributed to the small size of the dataset. However, the effectiveness of the algorithm can still be validated by comparing the results in the other classes. Several representative target categories were selected, and the results are shown in Table 8. RSI-YOLO improved most of the metrics: the precision of the "tennis court" category increased by 6.9%, with AP0.5 and AP0.5:0.95 increasing by 1.7% and 3.9%, respectively; the precision of the "car" category increased by 1.2%, with AP0.5 and AP0.5:0.95 increasing by 2.2% and 1.6%, respectively; and the AP0.5 and AP0.5:0.95 of the "bridge" category increased by 7.8% and 1.3%, respectively.

MAR20 Dataset
Unlike most remote sensing image datasets that contain a variety of different targets such as vehicles, airplanes, and ships, among others, MAR20 [35] is a military aircraft target recognition dataset that provides fine-grained model information. The dataset consists of 3842 high-resolution remote sensing images collected from 60 military airports, including 20 different types of airplanes, denoted by A1-A20. The image size for most of the dataset is 800 × 800 pixels.
Partial detection results and comparisons with the original images are shown in Figure 16. The training took a total of 2 h and 41 min. RSI-YOLO achieved good detection results across the different aircraft models in this dataset.
To compare RSI-YOLO with the original YOLOv5s, several representative target categories were selected, as shown in Figure 17. On this dataset, both the initial YOLOv5s and RSI-YOLO achieved good AP0.5 performance, with values of approximately 98% and 99%, respectively. The AP0.5 values did not change for the "A12" and "A20" categories, and there was little difference in AP0.5 among the other categories. On the AP0.5:0.95 metric, however, RSI-YOLO performed better than the baseline: the AP0.5:0.95 of the "A2" category increased by 4.6%, that of "A9" by 3.6%, and that of "A20" by 3%. These results show that RSI-YOLO provides an improvement over the original YOLOv5s.

Ablation Experiments
To investigate the effectiveness of the different improvement methods, ablation experiments were conducted on the DOTA dataset. The YOLOv5 algorithm was separately modified by adding the CBAM attention mechanism, Bi-FPN, the small object detection layer, and the EIOU loss function. The results of training the YOLOv5s model with each modification are shown in Table 9, where a check mark "√" indicates that the corresponding modification was applied. Most of the detection evaluation metrics improved after the addition of each module. Specifically, after adding the CBAM module, AP0.5 and AP0.5:0.95 increased by 1.0% and 0.9%, respectively, whereas after adding the Bi-FPN module, AP0.5 and AP0.5:0.95 both increased by 1.2%. Although the AP0.5 value did not change after adding the small object detection layer, AP0.5:0.95 increased by 1.5%. Changing the loss function to EIOU increased AP0.5 and AP0.5:0.95 by 0.2% and 1.4%, respectively. When the different improvement methods were combined, the resulting network model increased AP0.5 by 1.3% and AP0.5:0.95 by 1.6%, achieving the highest performance in the experiment, as expected.

Comparison with Other Object Detection Algorithms
To further verify the effectiveness of the improvement introduced into the YOLOv5s algorithm, RSI-YOLO was compared with other excellent object detection algorithms on the DIOR, NWPU VHR-10 (referred to as VHR-10 in the table below), and MAR20 datasets. Figure 18 presents partial experimental results on the DIOR dataset, comparing the experimental results of RSI-YOLO, Faster R-CNN, and SSD. Although the precision index of "athletic field" was lower than that of SSD, there was an improvement in AP 0.5 by 6.4% for RSI-YOLO. For the "basketball" court category, the AP 0.5 improved by 10.9% and 9.0% for RSI-YOLO and Faster R-CNN, respectively. Moreover, the overall category's AP 0.5 increased by 8.5% and 5.8% for RSI-YOLO and Faster R-CNN, respectively. experimental results of RSI-YOLO, Faster R-CNN, and SSD. Although the precision index of "athletic field" was lower than that of SSD, there was an improvement in 0.5 AP by 6.4% for RSI-YOLO. For the "basketball" court category, the 0.5 AP improved by 10.9% and 9.0% for RSI-YOLO and Faster R-CNN, respectively. Moreover, the overall category's 0.5 AP increased by 8.5% and 5.8% for RSI-YOLO and Faster R-CNN, respectively. Part of the experimental results on the MAR20 dataset is presented in Figure 19. Comparing Faster R-CNN and SSD, the 0.5 AP of category "A13" increased by 11.4% and 3.5%, respectively; the 0.5 AP of category "A15" increased by 16.4% and 20.4%, respectively; and that of the overall category increased by 2.4% and 3.3%, respectively. For other accuracy and recall rate indexes on this dataset, most values of RSI-YOLO were higher than those of the other two algorithms. Based on the experimental results of RSI-YOLO compared with the other two algorithms, the RSI-YOLO algorithm exhibited superior performance. Part of the experimental results on the NWPU-VHR 10 dataset is shown in Figure 20. The experimental results of RSI-YOLO, Faster R-CNN, and SSD were compared. 
6.4% for RSI-YOLO. For the "basketball court" category, the AP 0.5 improved by 10.9% and 9.0% for RSI-YOLO and Faster R-CNN, respectively. Moreover, the overall category's AP 0.5 increased by 8.5% and 5.8% for RSI-YOLO and Faster R-CNN, respectively.
Part of the experimental results on the MAR20 dataset is presented in Figure 19. Compared with Faster R-CNN and SSD, the AP 0.5 of category "A13" increased by 11.4% and 3.5%, respectively; the AP 0.5 of category "A15" increased by 16.4% and 20.4%, respectively; and that of the overall category increased by 2.4% and 3.3%, respectively. For the other precision and recall indexes on this dataset, most values of RSI-YOLO were higher than those of the other two algorithms, indicating that RSI-YOLO exhibited superior performance.
Part of the experimental results on the NWPU-VHR 10 dataset is shown in Figure 20, where RSI-YOLO, Faster R-CNN, and SSD were compared. The precision of RSI-YOLO was lower than that of the SSD algorithm; however, in terms of AP 0.5, RSI-YOLO outperformed SSD: the AP 0.5 of the "ship" category increased by 7.2%, and that of the total category increased by 16.6%. Compared with Faster R-CNN, the AP 0.5 of the "bridge" and total categories also improved by 4.3% and 10%, respectively.
The above experimental results show that, compared with other classical object detection algorithms, the improved algorithm exhibited better detection ability in remote sensing image object detection tasks.
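The AP 0.5 values compared throughout this section are per-class average precisions evaluated at an IoU threshold of 0.5. As a minimal, illustrative sketch of how such a value is obtained (all-point interpolation of the precision-recall curve; this is not the evaluation code used for these experiments):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(scores, is_tp, n_gt):
    """AP for one class via all-point interpolation of the P-R curve.
    is_tp[i] is True when detection i matches a previously unclaimed
    ground truth at IoU >= 0.5 (which is what AP 0.5 measures)."""
    order = np.argsort(-np.asarray(scores))       # rank by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / n_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # pad the curve, take the monotone precision envelope, integrate over recall
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

For example, with two ground-truth objects and three detections of which the highest- and lowest-scoring are true positives, the sketch yields an AP of 5/6.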

Conclusions
This study focused on object detection in remote sensing images, addressing the problems posed by small target sizes and improving the algorithm's performance metrics. The feature extraction ability of the original YOLOv5 algorithm was improved by introducing channel attention and spatial attention modules and by replacing the PAN feature fusion structure in the network's neck with a weighted Bi-FPN structure, thereby achieving more efficient and richer feature fusion. To address the small size of targets in remote sensing images, a small target detection layer was added to the network structure to improve the robustness of the model. Additionally, the EIOU loss function was incorporated to address the limitations of the localisation loss in YOLOv5, improving the network's convergence during training and, in turn, the detection accuracy. The experimental results show that the proposed RSI-YOLO effectively improved detection accuracy and recall compared with YOLOv5 and other typical object detection algorithms.
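The EIOU localisation loss referred to above augments the IoU term with a penalty on centre distance plus separate penalties on width and height differences, each normalised by the smallest enclosing box. A minimal single-pair sketch, assuming boxes in (x1, y1, x2, y2) format (YOLOv5 computes this in batched tensor form):

```python
def eiou_loss(pred, target):
    """EIOU loss: 1 - IoU + centre term + width term + height term."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    # IoU term
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / union
    # smallest enclosing box
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2                     # squared enclosing diagonal
    # squared distance between box centres
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2
    # width/height terms: the additions EIOU makes over DIoU/CIoU
    dw2 = ((px2 - px1) - (tx2 - tx1)) ** 2
    dh2 = ((py2 - py1) - (ty2 - ty1)) ** 2
    return 1.0 - iou + rho2 / c2 + dw2 / cw ** 2 + dh2 / ch ** 2
```

For identical boxes the loss is 0; it grows with centre offset and with width/height mismatch, which is what gives faster, better-aligned convergence than a plain IoU loss.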
Although the proposed RSI-YOLO detects small targets with improved accuracy, the results suggest that there is still room to raise its average precision across a variety of datasets. Moreover, the added modules and detection layer increase the number of parameters and hence the FLOPs (floating-point operations), making the model more complex; in engineering applications, model size and parameter count are often constrained. Future research can therefore focus on further improving small-target detection accuracy, for example by enriching the training set with additional datasets and data augmentation. To lighten the network, model pruning and knowledge distillation can be considered to reduce the model's size and speed up detection while maintaining high detection accuracy.
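As one concrete form the distillation direction mentioned above could take, a Hinton-style soft-label loss trains a small student network to match the temperature-softened output distribution of a larger teacher. This is a generic sketch over classification logits, not part of the proposed method:

```python
import math

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Knowledge-distillation loss: KL divergence between the
    temperature-softened teacher and student distributions, scaled
    by T^2 so gradients keep a comparable magnitude across T."""
    def softmax(zs, T):
        m = max(z / T for z in zs)                      # for numerical stability
        exps = [math.exp(z / T - m) for z in zs]
        s = sum(exps)
        return [e / s for e in exps]
    p = softmax(teacher_logits, T)                      # teacher targets
    q = softmax(student_logits, T)                      # student predictions
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise; in practice it is mixed with the ordinary hard-label task loss.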