YOLO-HR: Improved YOLOv5 for Object Detection in High-Resolution Optical Remote Sensing Images

Abstract: Object detection is essential to the interpretation of optical remote sensing images and can serve as a foundation for research into additional visual tasks that utilize remote sensing. However, the object-detection networks currently employed for optical remote sensing images underutilize the output of the feature pyramid, so there remains potential for improved detection. At present, a suitable balance between detection efficiency and detection effect is difficult to attain. This paper proposes an enhanced YOLOv5 algorithm for object detection in high-resolution optical remote sensing images, utilizing multiple layers of the feature pyramid, a multi-detection-head strategy, and a hybrid attention module to improve the effect of object-detection networks for optical remote sensing images. On the SIMD dataset, the mAP of the proposed method was 2.2% better than YOLOv5 and 8.48% better than YOLOX, achieving an improved balance between the detection effect and speed.


Introduction
With the rapid development of remote sensing technology, high-resolution optical remote sensing images have been used to depict numerous items on the Earth's surface, including aircraft, automobiles, buildings, etc. [1]. Object detection plays a crucial role in the interpretation of remote sensing images and can be used for their segmentation [2,3], description [4,5], and target tracking [6]. However, aerial optical remote sensing images exhibit scale diversity, viewpoint specificity, random orientations, and high background complexity due to their relatively large field of view and the high altitudes at which they are acquired [7,8], whereas the majority of conventional datasets contain ground-level views. Therefore, object-detection techniques built on handcrafted features have traditionally had a poor track record for accuracy and speed. Target-detection algorithms based on convolutional neural networks are significantly more efficient and effective than traditional ones. Driven by the needs of society and supported by the development of deep learning, the use of neural networks for target detection in optical remote sensing images has become a necessity.
Current object-detection algorithms incorporating deep learning to analyze optical remote sensing images can be classified as supervised, weakly supervised, or unsupervised. However, due to the complexity and instability of unsupervised and weakly supervised algorithms, supervised algorithms are the most frequently used. Moreover, supervised object-detection algorithms can be classified as either single-stage or two-stage. For instance, the two-stage capsule network SAHR-CapsNet [9] can accurately detect targets in remote sensing images. Due to the comparatively late emergence of capsule networks, however, the vast majority of modern two-stage object-detection algorithms have been based on the RCNN [10–12] series. Previous researchers [13] integrated dilation convolution [14] and OHEM [15] into the Faster RCNN [12] framework to improve the detection precision of small, densely packed objects in optical remote sensing images; a similar practice is described in reference [16]. Wang et al. [17] proposed a fine detector with contextual information to improve the region proposal network (RPN) in Faster RCNN to overcome background clutter and difficulty in recognizing foreground items in remote sensing images. The detection of airports and ports in downsampled satellite images, followed by mapping the discovered items back to the original ultra-high-resolution satellite images, can successfully enable the concurrent detection of objects of varying sizes, according to research [18], based on the assumption that airplanes are usually located in airports and ships in ports and oceans. Weng et al. [19] proposed a rotating object-detection approach based on RCNN [10] to improve the accuracy of object detection in remote sensing images by addressing the randomness of target orientation.
Although two-stage object-detection algorithms are comparatively advantageous in terms of accuracy, they tend to involve complex models and operate at a slower speed, so researchers have focused on single-stage algorithms that trade off speed against accuracy, such as YOLO [20–25], SSD [26,27], and MobileNet [28–30], often adding an attention mechanism to improve the network's ability to detect objects in remote sensing images. Previous research [31] integrated the attention mechanism CBAM [32] into the lightweight YOLOX network [33] to improve its detection accuracy for small targets in remote sensing image datasets. Another study [34] replaced the backbone of YOLOv3 with that of MobileNetV3 and combined it with an attention mechanism to develop SeMo-YOLO, a lightweight single-stage object-detection network with an increased detection speed for remote sensing. However, instead of using the outputs of FPN and PANet simultaneously, these YOLO-based networks use only one, so the feature pyramid's output is underutilized and the detection effect could still be enhanced. The single-stage object-detection network is a dense anchor-box network that suffers from an imbalance between positive and negative samples [35], so previous researchers [36] added focal loss [35] to the training process to alleviate this imbalance and used the self-attention mechanism to extract high-level semantic information from the depth feature map for target spatial localization, thereby improving the localization accuracy of the SSD model [26]. The aforementioned single-stage techniques detect targets faster than two-stage networks because they fix the network input to a particular size, such as 640 × 640 or 512 × 512. However, when applied to high-resolution remote sensing datasets, small targets (dozens of pixels or even fewer) tend to be lost, leading to poor small-target detection and a deterioration in the model's overall detection effect caused by the reduction in resolution [36–38].
The aforementioned techniques are crucial for object detection in remote sensing images, yet the following issues remain.
(1) Some researchers employed a two-stage model for object recognition, which is characterized by a high model complexity, a large number of parameter calculations, and a sluggish performance.
(2) Some researchers used single-stage networks for object detection in optical remote sensing images. However, most of these scale the input down to a lower resolution, which diminishes the effectiveness of model detection.
(3) The majority of YOLO-based networks utilize either the output of FPN [39] or PAFPN [40], but not both.
(4) Some researchers have demonstrated that hybrid attention mechanisms can enhance the precision of object detection in optical remote sensing images. Nevertheless, hybrid attention modules are currently scarce.
We propose a lightweight object-detection network for high-resolution remote sensing images based on the YOLOv5 framework, in order to balance the detection accuracy, speed, and number of model parameters while making better use of the existing features. The contributions of this paper are summarized as follows: (1) Based on the YOLOv5 network topology, a single-stage object-detection network named YOLO-HR for high-resolution optical remote sensing images is proposed.
(2) A multi-detection-head approach that can exploit the features of both FPN and PANET was proposed.
(3) A lightweight hybrid attention module was proposed.
(4) The model's efficiency and viability were validated using the SIMD [41] dataset.
The article is structured as follows: Section 2 presents a brief summary of related work on remote sensing object-detection datasets, object-detection networks, and attention mechanisms. Based on an analysis of the existing detection-head output strategies in single-stage algorithms, Section 3 presents the multi-detection-head strategy and the YOLO-HR network. Section 4 evaluates the proposed technique on the SIMD dataset and presents the experimental results. The paper is summarized in Section 5.

Datasets of Optical Remote Sensing Image Object Detection
Traditional remote sensing image datasets for object detection are reviewed in Table 1, where Dataset is the name of the dataset, Categories denotes the number of categories, Images is the number of images, Instances is the total number of targets, and Year is the release year of the dataset. The recently published SIMD [41] dataset was utilized for this study, with the majority of its images measuring 1024 × 768 pixels.

Attention Mechanism
Currently, attention mechanisms in deep learning are commonly categorized as soft attention, hard attention, and self-attention. The soft attention mechanism assigns a weight between 0 and 1 to each input item, evaluating most of the data but not equally. The hard attention mechanism assigns a weight of 0 or 1 to each input item; unlike soft attention, hard attention considers only the component that demands attention and promptly discards unnecessary information. The self-attention mechanism assigns a weight to each input item based on the interactions between the input items, i.e., the "voting" between the input items determines which of them receive attention. The soft attention method is the most widespread in the field of remote sensing image object detection, with representative works including SE [53], CBAM [32], ECA [54], Co-Attention [55], Reverse Attention [56], Cross Attention [57], etc. Numerous articles [31,58–62] have demonstrated that mixed attention mechanisms improve the effect of remote sensing target-detection networks, strengthening the detection effect and increasing the detection accuracy. In this study, the hybrid attention module MAB likewise consisted of hybrid soft attention mechanisms.
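The contrast between the soft and hard weighting schemes described above can be made concrete with a small sketch (illustrative only; function names are ours, not from any library):

```python
import numpy as np

def soft_attention(scores):
    """Soft attention: every item receives a weight in (0, 1), summing to 1."""
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

def hard_attention(scores):
    """Hard attention: a one-hot weight -- keep only the top item, discard the rest."""
    w = np.zeros_like(scores)
    w[np.argmax(scores)] = 1.0
    return w

scores = np.array([0.1, 2.0, 0.5])
soft = soft_attention(scores)   # all items contribute, but not equally
hard = hard_attention(scores)   # only the highest-scoring item survives
```

Self-attention differs in that the weights are computed from pairwise interactions between the items themselves rather than from a fixed scoring of each item in isolation.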

Object Detection Networks in Remote Sensing Images
Remote sensing image object detection falls into two types: radar-based object detection and optical remote sensing object detection. Optical sensors require favorable weather conditions and ample sunshine to produce high-quality images, whereas the most notable advantage of radar sensors is that they are unaffected by the weather [63]. For example, synthetic aperture radar (SAR) is a high-resolution imaging radar that can detect camouflage and penetrate masking objects in all weather conditions. One current hot topic in remote sensing is the application of neural networks to detect SAR images with complex and variable scenes [63–66]. This paper focuses on object detection in optical remote sensing images. Typically, the image data are derived from satellite imagery, such as Google Earth, or aerial imagery, such as UAS. Recent applications of deep learning to recognize objects in optical remote sensing images have produced satisfactory results. Wang et al. [67–69] took advantage of advancements in the Faster RCNN [12], RetinaNet [35], and YOLOv3 [22] networks to detect wildlife in high-resolution UAV images. Sun et al. proposed a comprehensive part-based convolutional neural network called PBNet for composite object detection in high-resolution optical remote sensing images [70]. In the past, RCNN was used to identify aircraft targets in very-high-resolution remote sensing images with poor precision and sluggish speed; therefore, a mix of dense convolutional networks, multi-scale representation methods, and a number of enhancement techniques was utilized to strengthen the fundamental VGG16-Net structure, raise accuracy, and more effectively recognize targets in satellite optical remote sensing images [13]. The experiments mentioned above [13,35,63–70] used horizontal bounding boxes (HBB), which sometimes do not offer precise direction and scale information and include an excessive number of superfluous background pixels. In addition, the combination of HBB and non-maximum suppression (NMS) usually leads to missed detections for objects with high aspect ratios and dense parking. In recent years, the detection of oriented bounding boxes (OBB) in RS images has garnered growing attention [71–76]. However, OBBs are often slower than HBBs in training and deployment, so HBBs remain the focus of current research. The algorithm of this study was also based on HBB.

Comparison of Prediction Head
The majority of the current YOLO-series detection heads are based on the output features of FPN and PAFPN. FPN-based networks, such as YOLOv3 and its variants, directly use the one-way fused features for the output, as shown in Figure 1a. The PAFPN-based algorithms YOLOv4 and YOLOv5 add a low-level-to-high-level channel on top of this, which directly transmits the low-level information upwards (Figure 1b). As demonstrated in Figure 1c, and similarly in some studies [87–89], Zhu et al. added a detection head for a particular detection task in the TPH-YOLOv5 model. In Figure 1b,c, only the PAFPN features are used for the output, while the FPN features are underutilized. Therefore, YOLOv7 attaches three auxiliary heads to the FPN output, as depicted in Figure 1d, although the auxiliary heads are only used for a "rough selection" and have a low weight. The detection head of SSD, depicted in Figure 1e, was proposed to improve the overly coarse anchor set of the YOLO network; its design consists mostly of dense anchors with multiple aspect ratios at multiple scales. Inspired by Figure 1c–e, this paper proposes a multi-detection-head strategy for the YOLO detection head, as depicted in Figure 1f, which utilizes the feature information of PANet and FPN simultaneously. Additionally, an output head was added directly at the 64-fold downsampling, which allows the network to incorporate prior global information.
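The head arrangement of Figure 1f can be sketched as follows (an illustrative outline under our own naming, not the paper's code): predictions are read from both the FPN and PAFPN branches, plus one extra head on the 64-fold-downsampled map carrying global context.

```python
import numpy as np

def multi_head_outputs(fpn_feats, pafpn_feats, global_feat):
    """Collect detection-head inputs from BOTH pyramid branches (Figure 1f),
    plus the added 64x downsampling head with prior global information."""
    heads = []
    for name, feat in fpn_feats.items():       # FPN branch heads
        heads.append(("fpn_" + name, feat))
    for name, feat in pafpn_feats.items():     # PAFPN branch heads
        heads.append(("pafpn_" + name, feat))
    heads.append(("global_64x", global_feat))  # extra 64-fold downsampling head
    return heads

# Dummy grids for a 1024 x 1024 input: strides 8/16/32 give 128/64/32 cells
# per side, and stride 64 gives a 16 x 16 global grid.
fpn = {n: np.zeros((128 // 2**i, 128 // 2**i)) for i, n in enumerate(["p3", "p4", "p5"])}
pafpn = {n: np.zeros((128 // 2**i, 128 // 2**i)) for i, n in enumerate(["n3", "n4", "n5"])}
heads = multi_head_outputs(fpn, pafpn, np.zeros((16, 16)))  # 7 heads in total
```

This doubles the number of heads relative to a plain PAFPN-only YOLO (three per branch plus the global head), which is the strategy's intended effect.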

Overall Structure of YOLO-HR
The multi-detection-head method could efficiently use the network's output features. YOLO-HR is an object-detection network for high-resolution remote sensing images. As depicted in Figure 2, the YOLO-HR network described in this paper can be separated into Backbone, Neck, and Head. The basic structure of the Backbone is a CSP-DenseNet with C3 and Convolutional modules at its core. After data enhancement, images were fed into the network, and numerous convolutional modules extracted features after channel mixing by the Conv module with a kernel size of 6. These were connected to PANet in the Neck after the feature-enhancement module named SPPF. Bidirectional feature fusion was undertaken to enhance the network's detection capability. Conv2d was used to independently scale the fused feature layers to generate the multi-layer outputs. As depicted in Figure 3a, the NMS algorithm combined the outputs of all single-layer detectors to produce the final detection frame.
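A minimal NumPy sketch of the greedy NMS step that merges all single-layer detector outputs (illustrative only; YOLOv5's actual implementation is batched and class-aware):

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS over the concatenated outputs of all detection heads.
    boxes: (N, 4) as x1, y1, x2, y2; scores: (N,). Returns kept indices."""
    order = scores.argsort()[::-1]          # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box against the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-9)
        order = order[1:][iou <= iou_thr]   # drop near-duplicates
    return keep

# Two heads firing on the same object, plus one distinct object.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the overlapping duplicate from the second head is suppressed
```

Because YOLO-HR reads the same object from several heads, this suppression step is what makes the multi-head outputs combinable into a single final detection frame.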
The MAB module is depicted in Figure 2. The ECA [54], shown in the bottom left corner, performs channel-level global average pooling without dimension reduction and then a fast 1D convolution of size k to capture local cross-channel interaction information, taking into account each channel's relationship with its k neighbors. The CA attention mechanism [55], shown in the lower right corner, encodes each channel along the horizontal and vertical coordinates using channel-level global average pooling with an (H, 1) or (1, W) pooling kernel. These two transformations collect features along two spatial directions to produce a pair of direction-aware feature maps, which are then concatenated and passed through convolution and Sigmoid functions to produce the attention output.
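The ECA branch described above can be sketched in NumPy, with a uniform averaging kernel standing in for the learned 1D convolution weights (an illustrative simplification, not the paper's implementation):

```python
import numpy as np

def eca(feat, k=3):
    """ECA sketch: channel-level global average pooling (no dimension
    reduction), then a 1D convolution of size k across the channel vector
    so each channel interacts with its k neighbours, then a sigmoid gate
    applied back to the feature map.  feat: (C, H, W) array."""
    gap = feat.mean(axis=(1, 2))                    # (C,) channel descriptor
    kernel = np.full(k, 1.0 / k)                    # stand-in for learned weights
    mixed = np.convolve(gap, kernel, mode="same")   # local cross-channel interaction
    gate = 1.0 / (1.0 + np.exp(-mixed))             # sigmoid attention weights
    return feat * gate[:, None, None]               # re-weight each channel

x = np.random.rand(16, 8, 8)
y = eca(x)   # same shape as x, channels re-weighted in (0, 1)
```

The CA branch would instead pool with (H, 1) and (1, W) kernels to obtain two direction-aware descriptors before the convolution-plus-Sigmoid step.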
Table 2 displays the parameter settings for the entire network structure, where Input is the input size of the image, Output is the output size of the current layer, Argvs are the input parameters of the current module, From is the input source of the current layer, N is the number of repetitions of the current module, and Parameters is the number of parameters of the current layer.

Data Augmentation
The essence of data augmentation is to artificially introduce prior human visual knowledge; it can improve the performance of the model considerably and has essentially become standard for model training. Commonly used geometric transformations include flip, rotate, crop, scale, pan, jitter, etc. Pixel transformations include adding salt-and-pepper or Gaussian noise, applying a Gaussian blur, and adjusting the HSV contrast, brightness, saturation, histogram equalization, white balance, etc. In addition to the above methods, this paper also uses a variety of data enhancement methods in the training phase, each with a different random ratio, such as Mosaic, Cutout, small-target replication, etc. Among them, flip and rotation address the angle diversity of remote sensing images, zoom and shift address their multi-scale nature, jittering and added noise mitigate their complex backgrounds, and small-target replication expands the samples and improves the detection of small targets.
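A few of the augmentations above can be sketched as a random pipeline (an illustrative outline with our own probabilities; the paper's actual ratios, and its Mosaic, Cutout, and small-target copy-paste steps, would follow the same pattern):

```python
import random
import numpy as np

def augment(img):
    """Random augmentation sketch for remote sensing images.
    Flip/rotate address angle diversity; Gaussian noise mimics complex
    backgrounds.  img: (H, W, 3) uint8 array, H == W assumed for rot90."""
    if random.random() < 0.5:
        img = img[:, ::-1]                           # horizontal flip
    if random.random() < 0.5:
        img = np.rot90(img, k=random.randint(1, 3))  # 90/180/270-degree rotation
    if random.random() < 0.3:
        noise = np.random.normal(0, 8, img.shape)    # mild Gaussian noise
        img = np.clip(img.astype(float) + noise, 0, 255).astype(np.uint8)
    return np.ascontiguousarray(img)

out = augment(np.zeros((64, 64, 3), np.uint8))       # shape and dtype preserved
```

Bounding-box labels must of course be transformed alongside the pixels for the geometric operations; the pixel-level operations leave the labels untouched.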

Loss Function
The loss function of YOLO-HR was composed of three components: target confidence loss, target category loss, and target positioning loss. The total loss L_all contained three hyperparameters weighting each component, which could be modified before training based on the actual circumstances; in this work, the corresponding weights of the three parts were 1.0, 0.5, and 0.05. The target confidence loss utilized the BCE (binary cross-entropy) loss. In it, K × K could take on four distinct values, with the particular size depending on the image size; taking 1024 × 1024 as an example, they were 16 × 16, 32 × 32, 64 × 64, and 128 × 128, illustrating the number of grids on the feature maps generated by YOLO-HR at four different scales. B represented the number of prior boxes. I_ij^obj specified whether the jth prior box of the ith grid contained a prediction target: it was 1 if the condition was met and 0 otherwise. I_ij^noobj indicated whether the jth prior box of the ith grid did not contain a predicted target, and was defined analogously. C_i^j and Ĉ_i^j represented the ground-truth and predicted confidence values, respectively. λ_noobj was a constant coefficient, typically set to 0.5, used to balance the positive and negative samples.
The target category loss also used the BCE loss, where K × K, B, and I_ij^obj were consistent with Equation (1), c was the target category, and P_i^j(c) and P̂_i^j(c) were the ground-truth and predicted probabilities that the target in the jth prediction box of the ith grid belonged to category c.
For the target positioning loss, the SIoU loss [90] replaced the CIoU loss [91] in order to increase the training speed and inference precision, where ∆ represented the distance cost, γ the angle cost, Ω the shape cost, and θ expressed the degree of attention paid to the shape cost.
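The equations referenced above did not survive extraction. A reconstruction consistent with the standard YOLO-style BCE losses and the published SIoU formulation, using the symbols defined in the text (the exact form in the original paper may differ in minor details), is:

```latex
L_{all} = \lambda_{1} L_{obj} + \lambda_{2} L_{cls} + \lambda_{3} L_{loc},
\qquad (\lambda_{1},\lambda_{2},\lambda_{3}) = (1.0,\ 0.5,\ 0.05)

L_{obj} = -\sum_{i=0}^{K \times K}\sum_{j=0}^{B}
  I_{ij}^{obj}\!\left[\hat{C}_{i}^{j}\log C_{i}^{j}
  + (1-\hat{C}_{i}^{j})\log(1-C_{i}^{j})\right]
  - \lambda_{noobj}\sum_{i=0}^{K \times K}\sum_{j=0}^{B}
  I_{ij}^{noobj}\!\left[\hat{C}_{i}^{j}\log C_{i}^{j}
  + (1-\hat{C}_{i}^{j})\log(1-C_{i}^{j})\right]

L_{cls} = -\sum_{i=0}^{K \times K}\sum_{j=0}^{B} I_{ij}^{obj}
  \sum_{c}\left[\hat{P}_{i}^{j}(c)\log P_{i}^{j}(c)
  + \bigl(1-\hat{P}_{i}^{j}(c)\bigr)\log\bigl(1-P_{i}^{j}(c)\bigr)\right]

L_{loc} = 1 - IoU + \frac{\Delta + \Omega}{2}
```

In SIoU, the angle cost γ enters the distance cost ∆, and θ controls how strongly the shape cost Ω is weighted.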

Experimental Platform
The experimental platform of this paper is shown in Table 3. The SIMD dataset is a multi-category, open-source, high-resolution remote sensing object-detection dataset containing a total of 15 classes, as illustrated in Figure 4. Additionally, the SIMD dataset has a high proportion of small- and medium-sized targets (w < 0.4, h < 0.4), and the YOLO-HR network proposed in this paper covers this range with double the number of detection heads of the common YOLO algorithm, so YOLO-HR has greater advantages on this dataset.

Related Indexes
The network performance evaluation is mainly based on the mAP (mean average precision) during training and the performance of the trained network on the validation set. To measure the detection results quantitatively, Precision (P), Recall (R), and mAP are used here as the performance evaluation metrics. The expressions of P and R are as follows.
where true positives (TP) are the number of samples that are actually positive and classified as positive by the classifier; true negatives (TN) are the number of samples that are actually negative and classified as negative; false positives (FP) are the number of samples that are actually negative but classified as positive; and false negatives (FN) are the number of samples that are actually positive but classified as negative. Average Precision (AP) is the area enclosed by the P-R curve; usually, the better the classifier, the higher the AP value. Mean Average Precision (mAP) takes the AP of each category separately and then averages the APs over all categories, representing a composite measure of the average precision of the detected targets. In the later text, AP50 means the AP at an IoU threshold of 0.5, and mAP denotes mAP@0.5:0.95 with a step of 0.05.
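The expressions of P and R referenced above (dropped in extraction) follow directly from these definitions, and AP is the area under the resulting P-R curve; a small sketch (the trapezoidal AP here is a simplification of the interpolated schemes used in practice):

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP),  R = TP / (TP + FN), as defined in the text."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """AP as the trapezoidal area under the P-R curve.  COCO-style
    mAP@0.5:0.95 averages this over IoU thresholds 0.5..0.95, step 0.05."""
    r = np.asarray(recalls, float)
    p = np.asarray(precisions, float)
    order = np.argsort(r)                      # integrate in recall order
    r, p = r[order], p[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

p, r = precision_recall(tp=80, fp=20, fn=20)   # both 0.8 in this toy case
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
```

mAP is then the mean of these per-category AP values.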

Experiments on SIMD

Ablation Test
It was possible to connect the output of the SPPF module to the output head and thus identify large targets in the image. However, the output of the SPPF module had multiple connections and attends to targets of multiple scales, so using it directly as the detection head for large objects would result in poor model representation, as shown in Figure 5. Figure 5 depicts a visual comparison of the heat maps of some detection results before and after the addition of the MAB module. After adding the MAB module, this detection head focused on detecting large objects, while the prediction of small targets was assigned to the other prediction heads; the expressive power of the model was improved, which is also more consistent with the YOLO algorithm's practice of dividing detection heads by target size. Using the results of the YOLOv5s algorithm as a reference, the effects of the MPH output strategy and the MAB module were examined on the SIMD dataset at a 1024 × 1024 image resolution, as shown in Table 4 and Figure 6, with the modules added incrementally from top to bottom. Finetune means the model was pre-trained on the ImageNet dataset and then fine-tuned on the SIMD dataset. The results showed that after the addition of the MPH strategy and MAB module, the number of parameters in the model increased by 2.5 M; still, this increase was negligible compared with storage capacities of hundreds of gigabytes. The speed was not significantly affected, but the AP50 of the model increased by 2.1%, the mAP increased by 2.2%, the precision increased by 1.5%, and the recall increased by 1.19%.

Comparison Experiments
Simultaneously, the classic YOLOv3-Tiny, Faster RCNN, YOLOv7, and YOLOX models were selected for comparison tests in this paper. The YOLOv3-Tiny, YOLOv5 (DenseNet + PAN), and YOLO-HR codes and pre-training models utilized in this experiment were obtained from the YOLOv5 open-source framework, and the YOLOv7 models from the YOLOv7 open-source framework. The YOLOX (Darknet-53 + FPN) algorithm was derived from the literature [33], while the other models, including Faster RCNN (Resnet-50 + FPN), were derived from the MMDetection [93] open-source framework. We tested and compared YOLO-HR and the other algorithms at a 1024 × 1024 image resolution. We compared only the number of parameters, to prevent variations caused by the model storage formats of different frameworks; for instance, YOLOv5s has 7.11 M parameters, but its PyTorch model file is 14.4 MB. To rule out randomness, the running time was computed as the average time for testing 1000 images, as shown in Table 5. The proposed approach outperformed YOLOv5, YOLOv3-Tiny, YOLOv7-Tiny, YOLOX, and the Faster RCNN model using Resnet-50 as its backbone in terms of the detection results (mAP and AP50). Although the model is slightly larger than the YOLOv5 and YOLOX models, the parameter increase of a few megabytes was minimal even compared with the modest storage space of edge devices such as the Nvidia TX2 and NX, which is only 32 gigabytes, so it was more than sufficient. In terms of speed, it was superior to YOLOv3-Tiny and Faster RCNN, and its inference time was only 0.5 ms longer than that of YOLOv5, without a substantial reduction in detection speed. Overall, the detection results of the YOLO-HR method proposed in this paper offered benefits over the aforementioned algorithms. The experimental results indicated that the YOLO-HR algorithm struck a more suitable balance between the parameter count, speed, and detection effectiveness. Some of the detection results are shown in Figure 7.
Examining each detection result individually, there was not much difference from the other algorithms; however, compared with them, the algorithm in this paper improved the detection effect of the model while ensuring no significant increase in time consumption, and it enhanced the expressive power of the model by using the attention mechanism.

Conclusions
To address the issue that the majority of current models for optical remote sensing image object detection underutilize the output features of the feature pyramid, we proposed a multi-head strategy based on prior work, together with a hybrid attention module, MAB, to address the scarcity of hybrid attention mechanisms. Finally, we embedded the aforementioned two methods into the YOLOv5 network and presented a high-resolution optical remote sensing target recognition algorithm named YOLO-HR. The YOLO-HR algorithm employed several detection heads for object detection and recycled the output features of the feature pyramid, allowing the network to enhance the detection effect further. The experiments indicate that the YOLO-HR algorithm allows for a greater number of downsampling multiples and better detection results than the other algorithms while preserving the original detection speed. In subsequent work, we plan to extend the concept of modifying the network structure presented in this paper to other object-detection algorithms, study other feature reuse strategies, and investigate the deployment and application of the algorithm in greater depth.



Figure 2. The overall structure of YOLO-HR.


Figure 3. Composition modules of YOLO-HR. (a) The principle of the YOLO-HR multi-head output; (b) the other composition modules of YOLO-HR.

Figure 3b depicts the structural composition of each module of the YOLO-HR network. Conv comprises a 2D convolutional layer, a batch normalization (BN) layer, and a SiLU activation function; C3 comprises two 2D convolutional layers plus a bottleneck layer; and Upsample is the upsampling layer. The SPPF module is a sped-up version of the SPP module, and the MAB module is depicted in Figure 2, where the ECA [54] is shown in the bottom left corner. After channel-level global average pooling without dimension reduction, the ECA is efficiently computed using a fast 1D convolution of size k to capture local cross-channel interaction information, taking into account each channel's relationship with its k neighbors. The CA attention mechanism [55] is depicted in the lower right corner of Figure 1; it encodes each channel along the horizontal and vertical coordinates using channel-level global average pooling with a (H, 1) or (1, W) pooling kernel. These two transformations collect features along two spatial directions to produce a pair of direction-aware feature maps, which are then concatenated and processed with convolution and Sigmoid functions to produce the attention output. Table 2 displays the parameter settings of the entire network structure; Input gives the input size of the image and Output the output size of the current layer.
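The two attention branches described above can be sketched as follows. This is a minimal PyTorch reimplementation following the textual description, not the authors' released code; module names, channel widths, and the reduction ratio in `CoordAttention` are illustrative assumptions.

```python
# Illustrative sketch of the ECA and CA branches of the MAB module,
# assuming PyTorch. Shapes are noted as (B, C, H, W).
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: channel-level global average pooling
    (no dimension reduction) followed by a fast 1D convolution of size k,
    capturing each channel's interaction with its k neighbors."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = self.pool(x)                              # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(1, 2))  # 1D conv over channels
        y = torch.sigmoid(y.transpose(1, 2).unsqueeze(-1))
        return x * y                                  # re-weight channels

class CoordAttention(nn.Module):
    """Coordinate Attention: (H, 1) and (1, W) pooling collect features
    along the two spatial directions; the results are concatenated and
    turned into direction-aware attention maps via conv + Sigmoid."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.act = nn.SiLU()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        xh = self.pool_h(x)                          # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)      # (B, C, W, 1)
        y = self.act(self.conv1(torch.cat([xh, xw], dim=2)))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                      # (B, C, H, 1)
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * ah * aw                           # direction-aware re-weighting

x = torch.randn(2, 16, 8, 8)
eca, ca = ECA(k=3), CoordAttention(16)
```

Both branches preserve the input shape, so they can be inserted at any point in the feature pyramid without changing the surrounding layer configuration.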

Figure 4. Distribution of targets in the SIMD dataset. (a) The distribution of the number of categories; (b) the distribution of target width and height in the image; the color from white to blue (from light to dark) indicates a more concentrated distribution.


Figure 5. Heat map visualization of partial detection results before and after adding the MAB module. Both are visualized with Grad-CAM [92].


Figure 7. Comparison of the detection effect of the different models.


4.2.3. Qualitative Results
Some qualitative results of the YOLO-HR algorithm proposed in this paper on the SIMD dataset are shown in Figure 8. As shown in the figure, the YOLO-HR model could better detect objects in remote sensing images with special viewing angles, including objects with complex backgrounds, random directions, and different scales.

Figure 8. Some detection results of YOLO-HR on the SIMD dataset.

Author Contributions: D.W. and R.L. conceived and designed the experiments; D.W. and S.W. performed the experiments; T.X. and S.S. analyzed the data; X.L. contributed analysis tools; D.W. wrote the paper; R.L. supervised this work. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the National Natural Science Foundation of China (NSFC) (Grant No. 51875164) and the National Key Research and Development Program of China (No. 2018YFB2003801).

Data Availability Statement: The datasets presented in this study are available at https://github.com/ihians/simd (accessed on 12 August 2022).

Conflicts of Interest: The authors declare no conflicts of interest.


Table 1. Universal remote sensing image object detection dataset.

Table 2. Parameter settings of the network structure.

Table 4. Performance improvement of each part of the design on the results.


Table 5. Comparison with other algorithms.