YOLO-DSD: A YOLO-Based Detector Optimized for Better Balance between Accuracy, Deployability and Inference Time in Optical Remote Sensing Object Detection

Abstract: Many deep learning (DL)-based detectors have been developed for optical remote sensing object detection in recent years. However, most recent detectors are developed in pursuit of higher accuracy, with little attention to the balance between accuracy, deployability and inference time, which hinders the practical application of these detectors, especially on embedded devices. In order to achieve a higher detection accuracy and to reduce the computational consumption and inference time simultaneously, a novel YOLO-based detector named YOLO-DSD is proposed in this study.


Introduction
Object detection in optical remote sensing images (ORSIs) is a crucial but challenging task for remote sensing technology and has been widely applied in many fields, such as military, natural resources exploration, urban construction, agriculture and mapping [1,2]. The development of a cost-effective detector that considers the characteristics of ORSIs is a persistently pursued direction, and has attracted a large amount of attention from scholars and practitioners.
The approaches for object detection can be roughly divided into traditional detectors and deep learning (DL)-based detectors. DL-based detectors, especially convolutional neural network (CNN) detectors, have gradually replaced traditional detectors since they possess better adaptability and generalization in different application scenarios. There are two categories of DL-based detectors: one-stage [3][4][5][6][7][8][9] and two-stage [10][11][12][13]. One-stage detectors directly regress bounding boxes and probabilities for each object simultaneously without region proposals; thus, they perform well in terms of inference speed. Two-stage detectors employ region proposals to improve the location and detection accuracy, at the sacrifice of inference speed. With the emergence of large-scale natural scene image (NSI) datasets for object detection tasks such as Pascal VOC [14] and MS COCO [15], DL-based detectors have been further developed for a better tradeoff between accuracy and cost, including Faster-RCNN [12], the single shot multibox detector (SSD) [3], the You Only Look Once (YOLO) series [4][5][6][8], CenterNet [7], EfficientDet [9] and RetinaNet [16]. These continuously improved detectors have been widely applied in various natural scene visual detection tasks.
Since ORSIs are photographed from an overhead perspective at different heights, whereas NSIs are shot from a horizontal perspective at a relatively close distance, three main differences emerge. First, the available features of most detected objects in ORSIs are less obvious than those in NSIs, which leads to greater inter-class similarity. Second, the intra-class difference is more prominent, since object scales of the same category in ORSIs usually vary more greatly. Third, the background in ORSIs is more complex and abundant than that in NSIs. Differences between ORSIs and NSIs with instances are shown in Figure 1. These differences make object detection in ORSIs more difficult, and most of the well-designed detectors for NSIs are not elaborately optimized for ORSIs. To address the greater intra-class difference and inter-class similarity caused by the characteristics of objects in ORSIs, the detector needs to extract more abundant object features with high-level semantics. However, the features of objects in ORSIs are easily submerged by redundant and complex background information and thus will decrease or even disappear when transmitted through the detector. Thus, DL-based detectors also require a stronger feature extraction and transmission ability.
Appl. Sci. 2022, 12, x FOR PEER REVIEW 2 of 25

With the popularity and wide application of embedded devices such as unmanned aerial vehicles (UAVs), the demand for real-time optical remote sensing object detection deployed on edge devices has increased rapidly. UAVs, with far less computing resources and storage space than computers, are involved in wide application scenarios such as rescue, military and surveying tasks, which require a high detection accuracy, flexible equipment deployment and less inference time for detectors [17].
In recent years, several outstanding achievements have been made by researchers in fields related to ORSIs, and they can be roughly divided into heavyweight [18][19][20][21] and lightweight detectors [22][23][24][25]. Most heavyweight detectors have a high accuracy but require a large computational cost, which hinders their real-time response and deployment on UAVs, whereas lightweight detectors have practical deployability and a fast inference speed but find it difficult to achieve an accuracy as competitive as that of heavyweight detectors, especially for large multi-category object detection tasks [23,24,26]. Therefore, optimizing the structure of heavyweight detectors toward a better balance between accuracy, deployability and inference time is an issue well worth investigating. To establish such a detector, a novel detector called YOLO-DSD for real-time optical remote sensing object detection based on YOLOv4 was developed in this study. The main contributions are as follows: (1) a new feature extraction module named the dense residual (DenseRes) Block was designed for better feature extraction and to reduce the computational cost and inference time in the backbone network. (2) The convolution layer-batch normalization layer-leaky ReLU (CBL)×5 modules in the neck were improved with a short-cut connection, named S-CBL×5, to strengthen the transmission of object features. (3) A novel low-cost attention mechanism called the dual channel attention (DCA) Block was proposed to enhance the representation of object features. The experimental results on the DIOR dataset indicate that YOLO-DSD outperforms YOLOv4 by increasing mAP0.5 from 71.3% to 73.0%, with a 23.9% and 29.7% reduction in Params and FLOPs, respectively, and a 50.2% improvement in FPS. On the RSOD dataset, the mAP0.5 of YOLO-DSD is increased from 90.0~94.0% to 92.6~95.5% under different input sizes. Compared with SOTA detectors, YOLO-DSD achieves a better balance between accuracy, deployability and inference time.

DL-Based Detectors for Optical Remote Sensing Object Detection
DL-based detectors have been widely applied in natural scene visual tasks. However, detectors established on NSIs need to further improve their feature extraction ability for optical remote sensing object detection tasks due to the problems of a greater intra-class difference, inter-class similarity and feature loss in ORSIs. Therefore, some heavyweight detectors have been improved and applied to ORSIs by many scholars. Xu et al. [18] modified YOLOv3 with a multi-receptive field to take full advantage of the feature information and to detect optical remote sensing objects effectively. Cheng et al. [19] designed an end-to-end cross-scale feature fusion framework for ORSI object detection based on Faster R-CNN with a feature pyramid network (FPN) [16]. Yin et al. [20] proposed a multi-scale feature extraction network based on RetinaNet, which strengthens the detection performance for irregular objects in ORSIs. Yuan et al. [21] established a multi-FPN that performs well in object detection with a complex background. The above research has successfully made obvious improvements in detection accuracy, but these come with a non-ignorable sacrifice of deployability or inference speed, which further hinders the application of detectors in edge devices. As a consequence, some lightweight DL-based detectors have been elaborately designed and improved to facilitate their application in edge devices. Li et al. [22] designed a lightweight detector by taking advantage of YOLOv3 and DenseNet [27]. Lang et al. [23] employed the backbone network of ThunderNet [28] and constructed a six-layer feature fusion pyramid to enhance the detection performance. The improved YOLOv4-tiny proposed by Lei et al. [24] was constructed with an efficient channel attention mechanism to enhance the information sensitivity in each channel. Li et al. [25] established a lightweight detector for vehicle and ship detection using a semantic transfer block and the distillation loss. Although these lightweight detectors have a better accuracy after improvement, there is still an obvious gap in detection accuracy compared with heavyweight detectors.
Our motivation is to propose an end-to-end detector that can achieve a higher detection accuracy, better deployability and less inference time in order to meet the requirements of real-time detection on edge devices. YOLOv4 [8] is one of the most widely used one-stage detectors, with an impressive performance in accuracy, deployability and inference time. It has been improved and applied in various fields, such as agriculture, industry and transportation [29][30][31][32], which verifies its excellent generalization. In this study, YOLOv4 was utilized as the basic framework, and it was optimized in terms of the feature extraction modules, the structure of the neck and the attention mechanism for a better application in optical remote sensing object detection.

Feature Extraction Modules in Backbone
The backbone, which is utilized to extract high-level semantic features of images, is the first part of a DL-based detector. It comprises several feature extraction modules. VGG [33] is one of the earliest backbones for object detection and utilizes 3 × 3 convolution layers as the feature extraction module. However, its heavy computation burden and shallow depth hinder the deployability and performance of detectors.
To solve this problem, He et al. [34] introduced a new feature extraction module named the Res Block to deepen backbones by adding short-cut connections. ResNet, based on the Res Block, achieves a better accuracy than VGG on natural scene datasets, with a lower computation burden and deeper depth. The backbone DarkNet53 of YOLOv3 [6] also uses the Res Block as the main feature extraction module. Since then, many feature extraction modules based on the Res Block, such as the ResNeXt Block [35], Res2 Block [36], Dense Block [27] and CSP Block [37], have been improved and developed. The trunk of the ResNeXt Block is split into 32 paths that transform the input from high to low dimensions and back to high dimensions using the same topology, and aggregates them through element-wise addition. Although the ResNeXt Block outperforms the Res Block with fewer parameters and a higher detection accuracy on natural scene datasets, the semantic relevance between the background and detected objects in ORSIs is stronger than that in NSIs [38]; the operation of the ResNeXt Block easily breaks this relevance and is thus not conducive to detection performance in ORSIs. The Res2 Block can generate multi-scale features through hierarchical short-cut connections and an increase in receptive fields, thus improving the detection accuracy and reducing the computational consumption. However, its structure, with parallel convolution and interactive operations, significantly increases the inference time. The Dense Block contains several dense layers. The output of each dense layer is concatenated with its input, and the concatenated feature map serves as the input of the next dense layer. This structure takes full advantage of the short-cut, which can better retain features and reduce the computation burden. However, the Dense Block will deteriorate in situations where the background submerges the features of detected objects in ORSIs, since the background information is more redundant and complex. Meanwhile, the structure of the Dense Block reduces its inference speed due to the asymmetry between the numbers of input and output channels of a convolution operation. The CSP Block is the feature extraction module of the backbone CSP DarkNet in YOLOv4. It is mainly composed of several Res units based on a short-cut and a cross-stage part containing a 1 × 1 convolution layer. Although this structure can double the number of gradient paths and improve the detection accuracy through a splitting and merging strategy, the parallel convolution and the trunk of the CSP Block being stacked alternately with excessive convolution layers significantly increase the degree of network fragmentation and thus decrease the inference speed [39].
In order to alleviate the shortcomings of the above Blocks, a novel feature extraction module, the DenseRes Block, is proposed in this study to improve the backbone of YOLOv4. Firstly, the input feature map of the DenseRes Block is compressed in order to increase the proportion of object feature information. Then, a series-connected residual structure with the same topology is utilized not only to obtain the high-level semantics of object features but also to reduce the computational consumption and inference time. Finally, the feature map output from the residual structure is combined with the input of the DenseRes Block to enhance the semantic relevance between the background and detected objects.

Structure of the Neck
In the neck, feature maps output from the backbone are processed and transmitted to the prediction part of the detector. The neck of early DL-based detectors only directly transmits the last feature map of the backbone to the prediction part. The shallow feature map contains rich location information but low-level semantic information, whereas the deep feature map is the opposite; thus, this structure is not conducive to object detection, especially for small objects. In order to improve the detection performance for small objects, Liu et al. [3] proposed a neck structure that directly transfers the feature maps of different levels from the backbone to the prediction part of the detector for multi-scale detection, and proved that the utilization of shallow feature maps is beneficial for small object detection. However, shallow feature maps still lack high-level semantic information, while deep feature maps are still short of location information. FPN [16] is designed to transfer high-level semantic information to the shallow feature maps through a bottom-up structure to further improve the detection performance for small objects. In order to make the deep feature map possess both rich location information and high-level semantic information, BFPN [40] was developed to fuse the penultimate and last feature maps based on FPN, while PANet [41] adds a top-down structure based on FPN to transmit location information to the deep feature map. Both BFPN and PANet can improve the detection performance for middle and large objects while maintaining a high detection accuracy for small objects.
YOLOv4 adopts PANet as the framework of its neck. However, YOLOv4 suffers from the problem of feature loss in ORSIs due to the many convolution operations in the neck. Therefore, a short-cut connection based on residual learning is introduced into each CBL×5 in the neck to strengthen the transmission of object features without an increase in the computational burden or inference time.
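The short-cut idea can be sketched in PyTorch as below. This is a minimal sketch, not the authors' implementation: the channel layout follows the common YOLOv4 neck CBL×5 (alternating 1 × 1 and 3 × 3 convolutions), and the placement of the two parameter-free short-cuts (around each 3 × 3/1 × 1 pair, where input and output channels match) is our assumption, since Figure 5 is not reproduced here.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Convolution + Batch Normalization + LeakyReLU."""
    def __init__(self, c_in, c_out, k):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SCBL5(nn.Module):
    """S-CBLx5: the five-layer 1x1/3x3 CBL stack of the YOLOv4 neck
    with two residual short-cuts that add no extra parameters."""
    def __init__(self, c_in, c):
        super().__init__()
        self.cbl1 = CBL(c_in, c, 1)
        self.cbl2 = CBL(c, 2 * c, 3)
        self.cbl3 = CBL(2 * c, c, 1)
        self.cbl4 = CBL(c, 2 * c, 3)
        self.cbl5 = CBL(2 * c, c, 1)

    def forward(self, x):
        y1 = self.cbl1(x)
        y3 = self.cbl3(self.cbl2(y1)) + y1   # short-cut 1 (channels match: c)
        y5 = self.cbl5(self.cbl4(y3)) + y3   # short-cut 2 (channels match: c)
        return y5

x = torch.randn(1, 256, 16, 16)
print(SCBL5(256, 128)(x).shape)  # torch.Size([1, 128, 16, 16])
```

Because each short-cut connects points with identical channel counts, the additions introduce no parameters and negligible latency, matching the claim above.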

Attention Mechanism
The attention mechanism assigns different weights to pixels according to the spatial or channel relationships between pixels in the feature map to enhance the representation of features. It mainly includes three categories: channel attention mechanisms (e.g., the SE Block [42] and ECA Block [43]), spatial attention mechanisms (e.g., the CA Block [44]) and hybrid attention mechanisms (e.g., the CBAM Block [45]). The attention mechanism can improve the detection accuracy of detectors in NSIs with only a small increase in parameters and computation burden. The SE Block squeezes and then extends channel information through two full connection layers in order to learn the relationship of global channel information and effectively improve the detection performance, but the relationship between local channel information is not considered. The ECA Block learns the relationship between local channels through 1-D convolution with an adaptive convolution kernel, which promotes the detection performance but ignores the relationship of global channel information. In the CA Block, information is extracted by average pooling in the horizontal and vertical directions, respectively, and then concatenated and fused by 2-D convolution. The fused information is split into two parts, and each part is further extracted by a convolution layer. The hybrid attention mechanism CBAM Block combines the channel and spatial attention mechanisms. Both the CA Block and CBAM Block bring an obvious improvement in detection accuracy on natural scene datasets, but their complex structures increase the inference time. Meanwhile, it is difficult for them to extract the spatial information of ORSIs with few parameters, due to the more complex background and the less spatial feature information of detected objects in ORSIs.
In order to more efficiently and robustly highlight the features related to the detection task in ORSIs, a novel channel attention mechanism named the DCA Block was proposed to enhance the representation of object features in ORSIs by combining global and local channel information, with only a slight increase in inference time.

Method Overview
The structure of YOLOv4 is given in Figure 2. YOLOv4 consists of a backbone, neck and prediction part. YOLOv4 was established for NSIs and is not practical enough to be adopted for ORSIs directly. Specifically, the backbone CSP DarkNet in YOLOv4 utilizes the CSP Block [37] as the feature extraction module and performs well in detection accuracy, but its model complexity and computational burden can be further reduced to improve its deployability and inference speed for ORSIs. The neck PANet [41] employed in YOLOv4 can strengthen the integration of shallow and deep feature maps, but its CBL×5 modules easily cause the problem of feature loss, which is not conducive to information transmission for objects in ORSIs. Moreover, attention mechanisms that can enhance the feature representation are not utilized in YOLOv4.
The proposed detector YOLO-DSD based on YOLOv4 is shown in Figure 3. Three new modules are presented to improve the performance of YOLOv4. In the backbone, we developed the DenseRes Block as the main module for better feature extraction and a reduction in computational cost. In the neck, S-CBL×5 was proposed to handle the information loss problem, and the proposed attention mechanism, the DCA Block, was added after each S-CBL×5 module to enhance the representation of features.



Improvement in the Backbone
YOLOv4 adopts a CSP Block, shown in Figure 4a, to extract features of images in the backbone. Although the CSP Block performs well in detection accuracy, its structure, containing a parallel convolution operation for reusing the feature of the 'Input' and excessive convolution layers caused by the 'Res Unit', takes up a large amount of computing resources and inference time [39]. Aiming at this problem of the CSP Block, we proposed the DenseRes Block, shown in Figure 4b, and employed it in the backbone for feature extraction.
For the feature map Input ∈ R^(W×H×C), W, H and C indicate the height, width and channel number of the map, respectively. Since the features of detected objects in ORSIs are easily overwhelmed by those of the background when transmitted, we utilized a feature map with fewer channels as the output of the first convolution operation to compress the 'Input', in order to focus on object features and reduce the proportion of background information. Therefore, the feature map y_1 ∈ R^(W×H×G) was computed by

y_1 = f_3×3^(1)(Input) (1)

where f_3×3^(1) contains the 3 × 3 convolution layer that compacts the number of channels from C to G, the BN layer and the leaky ReLU activation function. If n = 1, the DenseRes Block is the same as the Res Block. When n > 1, the DenseRes Block will compress the 'Input' and perform feature extraction. It was proven in Ref. [39] that the following operations can effectively reduce the memory access cost and the inference time of the model: (1) the input channel and output channel of the convolution layer should be equal as much as possible; (2) the number of fragmented operators (i.e., the number of individual convolution or parallel operations in one building block) should be reduced. Therefore, y_j (1 < j ≤ n) ∈ R^(W×H×G) could be designed as

y_j = f_3×3^(j)(y_j-1) ⊕ y_j-1, 1 < j ≤ n (2)

where f_3×3^(j) (1 < j ≤ n) contains the 3 × 3 convolution layer with the same number G of input and output channels, the BN layer and the leaky ReLU activation function, and ⊕ indicates element-wise addition. The DenseRes Block is only composed of several series-connected 3 × 3 convolution operations f^(j). From the comparison between the CSP Block and the DenseRes Block shown in Figure 4, the output of each 'Res Unit' in the CSP Block goes through two convolution layers with different kernel sizes, whereas that of each 'y_j' in the DenseRes Block only goes through one 3 × 3 convolution layer. Therefore, the degree of fragmentation can be decreased. Moreover, we used a short-cut based on residual learning to connect y_j (1 < j ≤ n) and y_j-1 (1 < j ≤ n) to mitigate the problem of feature loss in the process of feature extraction.
In ORSIs, there is potential semantic relevance between the object and the background [21,38]. For example, cars and airplanes tend to park on land whereas ships tend to sail on the sea, and bridges are built over water whereas overpasses are built over land. In order to make the network better learn this high-level semantic relevance, the Output ∈ R^(W×H×C) was designed as

Output = f_concat(y_1, y_2, ..., y_n) ⊕ Input (3)

where f_concat concatenates y_1, y_2, ..., y_n in the channel dimension into a feature map with the same size as the Input. This feature map, possessing more object information, was combined directly with the Input ∈ R^(W×H×C), which holds more background information, by element-wise addition to improve the detection accuracy. Compared with the CSP Block, such a designed structure in the DenseRes Block not only reuses the feature of the 'Input' but also omits a parallel convolution operation, which can further reduce the degree of fragmentation in the backbone.
The DenseRes Block was utilized to replace the original module, the CSP Block, in the backbone. The architecture and complexity of the restructured backbone, named DarkNet-DenseRes, are shown in Table A1, Appendix A.
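To make the design concrete, the following is a minimal PyTorch sketch of a DenseRes Block as described above: the input is compressed from C to G = C/n channels, passed through series-connected residual 3 × 3 convolutions, and the concatenation of y_1, ..., y_n is added element-wise to the input. This is our own illustrative sketch, not the authors' code; module names and details such as the LeakyReLU slope are assumptions.

```python
import torch
import torch.nn as nn

class ConvBL(nn.Module):
    """Conv + BatchNorm + LeakyReLU (the paper's CBL unit)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class DenseResBlock(nn.Module):
    """DenseRes Block sketch: compress C -> G = C/n, run n series-connected
    residual 3x3 convolutions, concatenate y_1..y_n back to C channels,
    then add the block input (Equations (1)-(3))."""
    def __init__(self, c, n):
        super().__init__()
        assert c % n == 0, "C must be divisible by n so that n*G == C"
        g = c // n
        self.first = ConvBL(c, g)                              # y_1 = f^(1)(Input)
        self.rest = nn.ModuleList(ConvBL(g, g) for _ in range(n - 1))

    def forward(self, x):
        ys = [self.first(x)]
        for f in self.rest:                  # y_j = f^(j)(y_{j-1}) + y_{j-1}
            ys.append(f(ys[-1]) + ys[-1])
        return torch.cat(ys, dim=1) + x      # Output = concat(y_1..y_n) + Input

x = torch.randn(1, 64, 32, 32)
print(DenseResBlock(64, n=4)(x).shape)  # torch.Size([1, 64, 32, 32])
```

Note that every inner convolution has equal input and output channels (G), and the only concatenation happens once at the end, which is exactly the low-fragmentation property argued for above.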

Improvement in the Neck
YOLOv4 uses the feature pyramid structure of PANet in the neck to fuse feature maps of different levels and extract features, which performs well in object detection in natural scenes. However, the feature information of objects in ORSIs is usually far less obvious than that of objects in natural scenes, and the information loss caused by excessive convolutional operations in PANet limits the detection performance of the network for objects in ORSIs. In order to solve this problem, S-CBL×5 was utilized to replace each CBL×5 in the original neck, as shown in Figure 3. The structure comparison between CBL×5 and S-CBL×5 is given in Figure 5. S-CBL×5 adds two short-cuts to CBL×5 and does not add additional parameters or inference time.
To highlight significant features related to the detection task, the DCA Block was proposed to optimize the weight distribution of each feature map in the channel dimension by combining the local and global relationships between channels, with a slight increase in computational cost and inference time. The structure of the DCA Block is shown in Figure 6.
Firstly, in the 'Local Extraction Path', a 1-D convolution was applied over the channel descriptor of the input feature map X to build the local relationship between neighboring channels:

w_L = f_Conv1D^k(GAP(X)) (4)

where GAP denotes global average pooling over the spatial dimensions and f_Conv1D^k represents the 1-dimensional convolution layer with kernel size k. Since each feature map has a different number of channels and the kernel size of the convolution layer is proportional to the number of channels [43], the mapping between the kernel size (k) and the number of input channels (C) is given in Equation (5):

k = |log2(C)/γ + b/γ|_odd (5)

where |·|_odd denotes the nearest odd integer, following [43]. f_Conv1D^k can adaptively select the kernel size according to the non-linear mapping in Equation (5); thus, it can extract the local relationship between the covered channels more effectively than a convolution layer with a hand-given kernel size.
At the same time, two full connection layers were used as a bottleneck in the 'Global Extraction Path' to build the global relationship of each channel:

w_G = f_FC^(2)(f_FC^(1)(GAP(X))) (6)

where f_FC^(1) is the first full connection layer, which compresses the channel number from C to C/R, and f_FC^(2) is the second full connection layer, which extends the channel number from C/R to C. The value of the zoom factor R, which reduces the complexity of the structure, was set to 32 according to the experimental results in Section 4.4.1. The structure of the 'Global Extraction Path' with two full connection layers has a stronger non-linearity and can better fit the complex global relationship between channels.
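As a small illustration, the adaptive kernel size mapping can be sketched as a plain Python function; we assume here the same form and defaults (γ = 2, b = 1) as proposed in [43], since the exact constants used in this work are not shown in this excerpt.

```python
import math

def adaptive_kernel_size(c, gamma=2, b=1):
    """Map channel count C to an odd 1-D convolution kernel size k,
    k = |log2(C)/gamma + b/gamma|_odd, following the ECA mapping [43]."""
    t = int(abs((math.log2(c) + b) / gamma))
    return t if t % 2 else t + 1  # force the nearest odd integer

for c in (64, 128, 256, 512):
    print(c, adaptive_kernel_size(c))  # e.g. 64 -> 3, 512 -> 5
```

Larger channel counts thus receive a wider (but always odd) receptive field over neighboring channels, without any hand-tuned kernel size.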
Thirdly, the outputs of the 'Global Extraction Path' and 'Local Extraction Path' were combined by element-wise addition, and the sigmoid function was applied to generate the weight w ∈ R^(1×1×C). Finally, the output of the DCA Block was calculated as: where ⊗ represents the element-wise product. As shown in Figure 3, we added the proposed DCA Block after each S-CBL×5 to generate an improved PANet (shown in Figure 7), with a structure that is more suitable for optical remote sensing object detection and a nearly equal computational cost compared to the original structure.
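The DCA forward pass described above (GAP, a 1-D convolution local path, a two-FC global path with scaling factor R, element-wise addition, sigmoid, channel reweighting) can be sketched as follows. The wiring is reconstructed from the text; the uniform 1-D conv weights, the ReLU in the bottleneck and the name `dca_block` are illustrative assumptions, since the learned parameters and exact activations are not given here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dca_block(x, w1, w2, k=3):
    """Sketch of the DCA Block forward pass.

    x  : feature map, shape (C, H, W)
    w1 : FC weights (C/R, C) -- global path, channel compression
    w2 : FC weights (C, C/R) -- global path, channel expansion
    k  : odd 1-D conv kernel size for the local path
    """
    C, H, W = x.shape
    s = x.mean(axis=(1, 2))                      # GAP -> channel descriptor (C,)

    # 'Local Extraction Path': k-tap 1-D conv across neighbouring channels
    # (uniform weights here purely for illustration; learned in the block).
    pad = k // 2
    s_pad = np.pad(s, pad, mode="edge")
    local = np.array([s_pad[i:i + k].mean() for i in range(C)])

    # 'Global Extraction Path': FC bottleneck C -> C/R -> C.
    hidden = np.maximum(w1 @ s, 0.0)             # ReLU assumed
    global_ = w2 @ hidden

    w = sigmoid(local + global_)                 # channel weights (C,)
    return x * w[:, None, None]                  # element-wise product

# Shape check with random weights (C = 64, R = 32):
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 13, 13))
w1 = rng.standard_normal((64 // 32, 64)) * 0.1
w2 = rng.standard_normal((64, 64 // 32)) * 0.1
y = dca_block(x, w1, w2)
print(y.shape)  # (64, 13, 13)
```

Because the weight tensor has shape (C, 1, 1), the reweighting broadcasts over the spatial dimensions, which is why the block adds almost no computational cost.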

Prediction
Decoding and obtaining the detection results were processed in the prediction stage. As shown in Figure 3, each output of the neck passed through a CBL module and a 1 × 1 convolution layer, and three feature maps, P1 ∈ R^(52×52×num_class), P2 ∈ R^(26×26×num_class) and P3 ∈ R^(13×13×num_class), were generated. Then, as shown in Figure 8, P1, P2 and P3 were decoded to obtain the final detection results.
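The spatial sizes of P1-P3 follow from the standard YOLO strides of 8, 16 and 32 at a 416 × 416 input; note that in the usual YOLO head layout the per-location channel count is 3 × (5 + num_class) (three anchors per scale, each with four box offsets, objectness, and class scores), which the excerpt abbreviates as "num_class". A small sketch under that convention (`head_shapes` is an illustrative name):

```python
def head_shapes(input_size=416, num_classes=20, anchors_per_scale=3):
    """Grid size and channel count of the three YOLO prediction heads.

    Assumes the standard strides (8, 16, 32) and the conventional
    anchors_per_scale * (4 box + 1 objectness + num_classes) channels.
    """
    strides = (8, 16, 32)  # downsampling factors producing P1, P2, P3
    ch = anchors_per_scale * (5 + num_classes)
    return [(input_size // s, input_size // s, ch) for s in strides]

print(head_shapes())  # [(52, 52, 75), (26, 26, 75), (13, 13, 75)]
```

With the 20 DIOR categories this gives the 52 × 52, 26 × 26 and 13 × 13 maps cited in the text.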

Loss Function
The loss function of YOLOv4 includes three parts: confidence, classification and bounding box regression loss. YOLOv4 employs the complete intersection over union (IoU) loss (CIoU) [46], replacing the mean squared error loss adopted in YOLOv3 as the bounding box regression loss. CIoU takes the overlap area, center point distance and aspect ratio into consideration simultaneously, improving both the convergence speed and the detection accuracy. CIoU introduces a penalty item αν based on the distance IoU loss to impose consistency of the aspect ratio between the ground truth box b^gt and the predicted bounding box b. The loss of CIoU can be defined as Equation (9).
where b^gt and b are the centers of the ground truth and bounding box, respectively, ρ denotes the Euclidean distance, c represents the diagonal length of the smallest enclosing rectangle covering both boxes, α is a positive trade-off value and ν measures the consistency of the aspect ratio. w^gt and w are the widths, and h^gt and h the heights, of the ground truth and bounding box, respectively. CIoU can directly minimize the distance between the bounding box and the ground truth and accelerate model convergence. Previous works [47-49] have proved that CIoU performs better in detecting objects with diverse sizes, which matches well with the characteristics of remote sensing object detection tasks.
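Equation (9) is not reproduced in this excerpt; the standard CIoU definition from [46] is L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αν, with ν = (4/π²)(arctan(w^gt/h^gt) − arctan(w/h))² and α = ν/((1 − IoU) + ν). A self-contained sketch of that formula:

```python
import math

def ciou_loss(box, box_gt):
    """CIoU loss for axis-aligned boxes given as (x1, y1, x2, y2).

    L = 1 - IoU + rho^2(b, b_gt) / c^2 + alpha * v, where rho is the
    centre distance, c the diagonal of the smallest enclosing box,
    v the aspect-ratio consistency term, alpha = v / ((1 - IoU) + v).
    """
    x1, y1, x2, y2 = box
    g1, h1, g2, h2 = box_gt

    # Overlap area and IoU
    iw = max(0.0, min(x2, g2) - max(x1, g1))
    ih = max(0.0, min(y2, h2) - max(y1, h1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (g2 - g1) * (h2 - h1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared centre distance over squared enclosing-box diagonal
    rho2 = ((x1 + x2) / 2 - (g1 + g2) / 2) ** 2 \
         + ((y1 + y2) / 2 - (h1 + h2) / 2) ** 2
    cw = max(x2, g2) - min(x1, g1)
    ch = max(y2, h2) - min(y1, h1)
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency penalty alpha * v
    v = (4 / math.pi ** 2) * (math.atan((g2 - g1) / (h2 - h1))
                              - math.atan((x2 - x1) / (y2 - y1))) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0

    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # 0.0 for identical boxes
```

Unlike plain IoU, the loss stays informative (via the ρ²/c² term) even when the boxes do not overlap, which is what accelerates convergence.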

Experiments and Discussion
In this section, we conduct ablation and comparative experiments on a public optical remote sensing dataset, DIOR [2], with 20 categories to validate the proposed YOLO-DSD, considering the accuracy, deployability and speed indicators. Another optical remote sensing dataset, RSOD [50], with 4 categories was utilized to further verify the effectiveness of the proposed YOLO-DSD compared with YOLOv4.

DIOR Dataset
According to the definition in MS COCO [15], objects with a ground truth area of less than 32 × 32 pixels, between 32 × 32 and 96 × 96 pixels, and larger than 96 × 96 pixels are taken as small, middle and large-sized objects, respectively. Each category and the size distribution of objects in DIOR are shown in Figure 10. It can be seen that objects in DIOR possess great size differences and are concentrated in the small and middle sizes.
Moreover, since images in DIOR are carefully collected under various environmental conditions, such as different weathers and seasons, these images possess richer variations in viewpoint, background, occlusion, etc. The problems of intra-class diversity and inter-class similarity are more laborious due to the above characteristics. The main difficulties in real-world tasks are well reflected by DIOR; thus, the ablation experiments of YOLO-DSD and the comparative experiments with SOTA detectors were conducted on the DIOR dataset.

RSOD Dataset
RSOD [50] contains 976 images that have been clipped into approximately 1000 × 1000 pixels, and the spatial resolution of these images ranges from 0.3 m to 3 m.There are 6950 object instances in this dataset in total, covered by 4 common classes in ORSIs, including 4993 aircraft, 1586 oil tanks, 180 overpasses and 191 playgrounds.Each instance of classes is shown in Figure 11.
In addition, instances in the RSOD dataset are under various scenes, including urban, grasslands, mountains, lakes, airports, etc. Although the scale of RSOD is not as large as that of DIOR, the characteristics of images in optical remote sensing object detection tasks can also be reflected by the RSOD dataset. Therefore, we further analyzed the effectiveness of YOLO-DSD compared with YOLOv4 on the RSOD dataset.

Evaluation Indicator
Detectors in this study were analyzed from three perspectives: detection accuracy, deployability and speed. The evaluation indicators for each are shown in Table 1. The higher the mAP and FPS, and the lower the Params and Flops, the better the detector.

Indicator Class | Indicator | Description
Accuracy | mAP 0.5 (%) | Average precision when IoU = 0.5; the most used indicator in remote sensing object detection.
Accuracy | mAP 0.5:0.95 (%) | Mean of the mAPs at each IoU threshold, taken at an interval of 0.05 between 0.5 and 0.95.
Accuracy | mAP S, mAP M, mAP L (%) | The mAP 0.5:0.95 of small, middle and large-sized objects as defined in MS COCO.
Deployability | Params | Number of detector parameters.
Deployability | Flops | Floating point operations.
Speed | FPS (img/s) | Frames processed per second.
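The mAP 0.5:0.95 indicator in Table 1 averages mAP over ten IoU thresholds. A tiny illustration of the averaging (the per-threshold mAP values below are hypothetical, purely to show the computation):

```python
# mAP 0.5:0.95 averages the mAP computed at ten IoU thresholds,
# 0.50, 0.55, ..., 0.95 (interval 0.05), following MS COCO.
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
print(thresholds)  # [0.5, 0.55, ..., 0.95]

# Hypothetical per-threshold mAPs: accuracy drops as the IoU
# requirement tightens, so mAP 0.5:0.95 is always <= mAP 0.5.
maps = [0.70, 0.68, 0.65, 0.61, 0.56, 0.50, 0.43, 0.34, 0.23, 0.10]
map_50_95 = sum(maps) / len(maps)
print(round(map_50_95, 3))
```

This is why mAP 0.5:0.95 is described as the "more rigorous" indicator later in the section.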

Experiment Setting
In this study, the deep learning framework PyTorch 1.7.1 was utilized to implement all of the detectors. The experimental environment was Ubuntu 18.04, CUDA 11.1, cuDNN 8.0.5 and an NVIDIA GeForce RTX 3080. In order to ensure enough training samples and to make the test set reflect the characteristics of each dataset well, the training and test sets in DIOR were split 1:1, whereas those in RSOD were split 4:1 randomly. A total of 90% of the training set was utilized for training the detectors, and 10% was used for monitoring to avoid overfitting. The input size and batch size of the detectors were set to 416 × 416 and 7, respectively. The Adam optimizer was employed to update the parameters, with a weight decay of 2 × 10^-4. The relationship between learning rate and epoch is shown in Figure 12. For anchor-based detectors, K-means was utilized to optimize the size of the anchors before training.

The detector improved by the DenseRes Block in the backbone reduces the Params and Flops relative to the baseline, while achieving a 0.2% higher mAP 0.5 and almost the same mAP 0.5:0.95 compared with YOLOv4 as the baseline. The detector further improved by S-CBL×5 in the neck based on "+DenseRes Block" benefits mAP 0.5 and mAP 0.5:0.95, brought about by the increase in mAP M and mAP L without affecting the deployability and inference speed. However, the mAP S slightly decreased by 0.3% because the short-cut utilized in S-CBL×5 strengthened the transmission of features and thus additionally introduced background features, which attenuated the feature representation of small-sized objects. The detector further improved by the DCA Block achieved a significant increase in mAP due to the enhancement of feature expression, and made up for the loss of mAP S caused by the short-cut with the same Params and Flops, while the FPS was only slightly reduced by 5.3 img/s.
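The K-means anchor optimization mentioned in the experiment setting is commonly implemented with 1 − IoU between box and anchor width/height pairs as the distance metric. The exact variant used here is not specified, so the sketch below is an assumption (names `iou_wh` and `kmeans_anchors` are illustrative):

```python
import random

def iou_wh(wh, anchor):
    """IoU of a box and an anchor compared at a shared top-left corner."""
    iw = min(wh[0], anchor[0])
    ih = min(wh[1], anchor[1])
    inter = iw * ih
    return inter / (wh[0] * wh[1] + anchor[0] * anchor[1] - inter)

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """K-means over (w, h) pairs using 1 - IoU as the distance metric."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)
    for _ in range(iters):
        # Assign each box to the anchor with the highest IoU.
        clusters = [[] for _ in range(k)]
        for wh in boxes:
            best = max(range(k), key=lambda i: iou_wh(wh, anchors[i]))
            clusters[best].append(wh)
        # Move each anchor to the per-dimension median of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                ws = sorted(w for w, _ in cl)
                hs = sorted(h for _, h in cl)
                anchors[i] = (ws[len(ws) // 2], hs[len(hs) // 2])
    return sorted(anchors, key=lambda a: a[0] * a[1])

# Toy ground-truth sizes clustered around three scales:
boxes = [(10, 12), (12, 10), (11, 11), (40, 42), (42, 38), (41, 40),
         (90, 95), (95, 88), (92, 90)]
anchors = kmeans_anchors(boxes, k=3)
print(anchors)
```

Using 1 − IoU instead of Euclidean distance keeps large boxes from dominating the clustering, which matters for datasets like DIOR with great size differences.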
In summary, YOLO-DSD outperforms YOLOv4 in the detection accuracy, deployability and speed indicators. YOLO-DSD increases the commonly used indicator mAP 0.5 by 1.7% and the more rigorous indicator mAP 0.5:0.95 by 0.9% over YOLOv4. Specifically, YOLO-DSD has a greater advantage in mAP M and mAP L, while achieving a similar, competitive mAP S compared with YOLOv4. In terms of deployability, the Params and Flops of YOLO-DSD are 23.9% and 29.7% lower than those of YOLOv4, respectively. YOLO-DSD also performs well in inference speed: it is 50.2% faster than YOLOv4 in FPS.

We further analyzed the performance of the DenseRes Block. The ablation results of the DenseRes Block are shown in Table 3.
The structure of the DenseRes Block in each detector is shown in Figure 13. Model 1 is the detector improved by the DenseRes Block without the 'Short-cut' and 'Combine' structures. The 'Short-cut' and 'Combine' are introduced to the DenseRes Block in Model 2 and Model 3, respectively. Model 4 utilizes the complete DenseRes Block to improve the backbone of YOLOv4. From the comparison between Model 1 and Model 2, the 'Short-cut' introduced to the DenseRes Block for the mitigation of feature loss can improve the mAP of objects of each size. After adding the 'Combine' to the DenseRes Block, Model 3 performs better on middle and large-sized objects, while the mAP S decreases slightly by 0.1%. A possible reason is that the features of middle and large-sized objects are obvious enough to build high-level semantic relevance with the background features, while the features of small objects are not obvious enough and are thus easily overwhelmed. Model 4, improved by the complete DenseRes Block, achieves the highest mAP and a significant increase in mAP S, mAP M and mAP L. It is probable that, on the basis of the 'Short-cut', the features of objects of each size are better retained when transmitted through the DenseRes Block, and can thus
benefit the building of high-level semantic relevance with the background features through 'Combine'. Note: 'Short-cut*' indicates a short-cut connecting yi (1 < i ≤ n).
The experimental results of the DCA module ablation are shown in Tables 4 and 5. Table 4 shows the influence of the scaling factor R on the performance of the DCA Block. The results show that DCA achieves its best performance when R = 32. Table 5 exhibits the influence of the three different fusion methods shown in Figure 14 on the performance of the DCA Block. The results show that the DCA Block with any of the fusion methods can effectively improve the detection accuracy. Specifically, compared with DCA in series, DCA in parallel has a more obvious advantage on small and middle-sized objects, while the FPS is slightly reduced by 0.7 img/s. This may be because, when employing the same number of operation layers in one building block, the parallel structure, although more fragmented, preserves the integrity of the features better than the serial one. For the proposed DCA Block, which has a small structural complexity, the parallel structure enhances feature expression better without an obvious sacrifice of inference time.
Comparative experiment for different backbones: The performances of the CSP DarkNet improved by the proposed DenseRes Block (DarkNet-DenseRes) and of other backbones are demonstrated in Table 6. Based on the CSP DarkNet framework, the proposed DenseRes Block outperforms the ResNeXt Block and Dense Block in all indicators. Although the mAP 0.5 and mAP 0.5:0.95 of DarkNet-DenseRes are slightly lower than those of DarkNet-Res by 0.1% and 1.3%, the Params and Flops of DarkNet-DenseRes are only approximately 1/3 and 1/4 of those of DarkNet-Res, while the FPS of DarkNet-DenseRes is approximately 1/4 higher than that of DarkNet-Res. Similarly, the mAP 0.5 and mAP 0.5:0.95 of DarkNet-DenseRes are 0.9% and 1.1% lower than those of DarkNet-Res2; however, the Params and Flops of DarkNet-DenseRes are only approximately 1/3 and 1/2 of those of DarkNet-Res2, while the inference speed is 2.3 times that of DarkNet-Res2 according to FPS. The superiority of DarkNet-DenseRes compared with CSP DarkNet was analyzed and proved in the ablation experiments. DarkNet-DenseRes also has an obvious advantage in all indicators compared with ResNet50. Although DarkNet-DenseRes has a similar accuracy and speed to VGG16, VGG16 has seven times the Flops of DarkNet-DenseRes. Therefore, DarkNet-DenseRes achieves the optimal balance of accuracy, deployability and speed.

Comparative experiment for different necks: Table 7 shows the performance of each neck structure, tested by applying no feature pyramid structure (None), FPN, BFPN, PANet (baseline) and S-PANet to the modified YOLOv4 with the DenseRes Block in the backbone. 'None' has the lowest Params (18.83 M) and Flops (4.89 G) and the highest FPS (85.5 img/s), but it does not perform well in detection accuracy; in particular, its mAP S is only 8.1%, whereas that of the other four necks ranges from 9.1% to 9.5%. Therefore, the feature pyramid structure is vital for detection accuracy, particularly for small-sized objects, which
occupy more than 50% of the objects in DIOR. Although FPN and BFPN are slightly better than PANet in deployability and inference speed, they are more than 2.6% inferior in the mAP of middle and large-sized objects, which together account for approximately 50% of the objects in DIOR. This proves that the structure of PANet is important to the detection accuracy of YOLOv4 for ORSIs. PANet and S-PANet have almost the same Params, Flops and FPS, but our S-PANet performs better than PANet in mAP 0.5 and mAP 0.5:0.95. In conclusion, S-PANet is more suitable for optical remote sensing object detection than the other necks.

Comparative experiments for different attention mechanisms: Taking the modified YOLOv4 with the DenseRes Block in the backbone and S-PANet in the neck as the baseline (None), the indicator values of the different attention mechanisms are exhibited and compared in Table 8. The CA Block and CBAM Block, which contain a spatial attention mechanism, fail to improve the detection accuracy, and the FPS decreases significantly due to their complex structures. Most channel attention mechanisms, including the SE Block, ECA Block and DCA Block, can improve the detection accuracy. The DCA Block improves the detection accuracy for small, middle and large-sized objects and achieves the highest mAP 0.5 = 73.0% and mAP 0.5:0.95 = 40.0%, an increase of 1.1% and 0.8% compared with 'None', respectively, when R = 32, while the FPS only decreases by 5.3 img/s. In the case of the SE Block, mAP 0.5 and mAP 0.5:0.95 increase by 0.2% and 0.1%, and the FPS decreases by 3.4 img/s. The ECA Block improves both mAP 0.5 and mAP 0.5:0.95 by 0.1% and decreases the FPS by 2.8 img/s. Therefore, the proposed DCA Block achieves the best balance between accuracy and speed.

Comparative experiments for different detectors: The performances of the proposed YOLO-DSD and eight SOTA detectors are demonstrated in Table 9.
RetinaNet and EfficientDet have a better deployability than YOLO-DSD, but their detection accuracy, especially for small-sized objects, and their speed are far behind those of YOLO-DSD, which hinders the application of these detectors in optical remote sensing object detection. The large Flops of SSD and Faster-RCNN require a huge amount of computing resources, which greatly increases the difficulty of deploying them on edge devices. Although the Params and Flops of CenterNet are 67% and 69% of those of YOLO-DSD, and its FPS is 46% faster, the detection accuracy of CenterNet is significantly lower than that of YOLO-DSD (mAP 0.5:0.95: 35.8% vs. 40.0%), and its mAP S is only 62.5% of that of YOLO-DSD. YOLO-Lite has an obvious disadvantage in detection accuracy for small and large-sized objects, even though it has a better deployability than YOLO-DSD. The inference speed of YOLOv3 is nearly the same as that of YOLO-DSD, but the deployability and detection accuracy of YOLOv3 are obviously inferior to those of YOLO-DSD. The superiority of YOLO-DSD compared with YOLOv4 was analyzed and proved in the ablation experiments. Therefore, YOLO-DSD outperforms the other SOTA detectors in the balance of accuracy, deployability and speed.

Figures 15-17 exhibit the detection performance of Faster-RCNN, CenterNet, YOLOv4 and YOLO-DSD on DIOR. The detection results for the small-sized instances in Figure 15 indicate that both Faster-RCNN and CenterNet obviously miss detections. Although YOLOv4 could completely detect the airplanes, it incorrectly detected a storage tank. Our YOLO-DSD can correctly detect all airplanes without any false detection. Figure 16 presents the detection results of an instance with a complex urban background. We can see that Faster-RCNN only detects one ground track field, and that CenterNet misses two bridges and two ground track fields and misdetects an overpass. YOLOv4 misses one bridge and one ground track field, whereas YOLO-DSD detects all objects correctly. The detection results of
instances in a complex suburban background are given in Figure 17.It can be seen that Faster-RCNN detects only one Expressway-Service-Area, CenterNet has two false detections of an overpass and windmill, YOLOv4 detects two Expressway-Service-Areas as one, and YOLO-DSD correctly detects all objects.The above instances verify that YOLO-DSD can handle object detection under different complex backgrounds well.
The precision-recall curves and AP (IoU = 0.5) of YOLOv4 and YOLO-DSD in each category are given in Figure 18 for a better illustration
of the difference in detection accuracy. It can be seen that YOLO-DSD detects better than YOLOv4 in 11 categories, including airplane, airport, baseballfield, chimney, dam, Expressway-Service-Area, golffield, groundtrackfield, stadium, storagetank and trainstation. In particular, the AP of YOLO-DSD in airport, baseballfield, Expressway-Service-Area and groundtrackfield is over 2% higher than that of YOLOv4. The AP of YOLO-DSD in airplane, trainstation and stadium significantly increases by 6.63%, 5.21% and 17.02%, respectively. For the other nine categories, the AP of YOLO-DSD only slightly decreases by 0.35~1.78% compared with YOLOv4, but it still has a competitive accuracy. Therefore, YOLO-DSD has a better overall accuracy performance than YOLOv4 on the large-scale ORSIs dataset DIOR.

Experiment Results and Discussion in RSOD Dataset
In order to further exhibit the superiority of the proposed YOLO-DSD over YOLOv4 in optical remote sensing object detection applications, another comparison experiment between YOLO-DSD and YOLOv4 was conducted on a four-category dataset, RSOD [50], which contains aircraft, oil tank, playground and overpass. The experimental results are shown in Table 10. It can be seen that YOLO-DSD outperforms YOLOv4 in accuracy and inference time under different input sizes, including 416 × 416, 512 × 512 and 608 × 608. Specifically, YOLO-DSD increases mAP 0.5, mAP 0.5:0.95 and FPS by 2.6%, 0.8% and 50.2%, respectively, under the input size 416 × 416, while, under the input size 512 × 512, mAP 0.5, mAP 0.5:0.95 and FPS are improved by 2.1%, 1.9% and 54.9%, respectively. In terms of the input size 608 × 608, the mAP 0.5, mAP 0.5:0.95 and FPS of YOLO-DSD are 1.5%, 1.2% and 59.3% higher than those of YOLOv4.
However, it is noteworthy that the overpass AP of YOLO-DSD is higher than that of YOLOv4 in RSOD, whereas the opposite holds in DIOR. One possible reason is that 'bridge' and 'overpass' possess a significant inter-class similarity and thus interfere with the detection performance of YOLO-DSD in these two categories in DIOR. Therefore, overcoming the inter-class similarity between 'bridge' and 'overpass' for a better detection accuracy while maintaining deployability and inference speed is one of our future works.

Note: * k, c and s denote the kernel size, output channels and stride of the convolution layer, respectively.

Figure 1.
Figure 1. Three main differences between RSIs and NSIs. The first and second lines show instances from NSIs and RSIs, respectively.

Figure 4.
Figure 4. The structure comparison between CSP Block (a) and DenseRes Block (b).
sion by combining the local and global relationship between channels with a slight increase in computational cost and inference time. The structure of the DCA Block is shown in Figure 6.

Figure 6.
Figure 6. The structure of the proposed DCA Block. The DCA Block is composed of a 'Local Extraction Path' and a 'Global Extraction Path' in parallel. The 'Global Extraction Path' is used to learn the global relationship between channels, whereas the 'Local Extraction Path' is employed to extract the local channel relationship. Firstly, global average pooling was employed to obtain the integrated information in the space dimension of each channel, y_L1 ∈ ℝ^(1×1×C) and y_G1 ∈ ℝ^(1×1×C), where y_L1 and y_G1 indicate the input of the 'Local Extraction Path' and the 'Global Extraction Path', respectively, and y_L1 = y_G1. Secondly, y_L2 ∈ ℝ^(1×1×C) in the 'Local Extraction Path' could be computed by
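A minimal NumPy sketch of this two-path channel attention may help; note that the ECA-style 1D convolution for the local path, the SE-style bottleneck for the global path, and the additive fusion before the sigmoid are our assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dca_block(x, w_local, w1, w2):
    """Hypothetical dual channel-attention block.

    x: feature map of shape (C, H, W).
    w_local: odd-length 1D kernel for the local path (ECA-style, assumed).
    w1, w2: bottleneck weights (C, C//r) and (C//r, C) for the
            global path (SE-style, assumed).
    """
    C, H, W = x.shape
    # Global average pooling: one scalar per channel (y_L1 = y_G1).
    y = x.mean(axis=(1, 2))                            # (C,)
    # Local path: 1D convolution across neighbouring channels.
    k = len(w_local)
    y_pad = np.pad(y, k // 2, mode="edge")
    local = np.array([np.dot(y_pad[i:i + k], w_local) for i in range(C)])
    # Global path: fully connected squeeze-and-excitation bottleneck.
    glob = np.maximum(y @ w1, 0.0) @ w2                # (C,)
    # Fuse both paths and rescale the input channel-wise.
    att = sigmoid(local + glob)                        # values in (0, 1)
    return x * att[:, None, None]
```

Applied to a (C, H, W) feature map, the block returns a tensor of the same shape with each channel rescaled by its attention weight.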

Figure 7.
Figure 7. The structure of the improved PANet.

and three feature maps, P_1 ∈ ℝ^(52×52×3(5+n)), P_2 ∈ ℝ^(26×26×3(5+n)) and P_3 ∈ ℝ^(13×13×3(5+n)), were generated. Then, as shown in Figure 8, P_1, P_2 and P_3 were mapped back to the original image and the image was divided into 52 × 52, 26 × 26 and 13 × 13 sizes of grids. Each grid corresponding to a feature map contains the information of three anchors. In each anchor, (x, y) and (w, h) are the offset coefficient and size coefficient, respectively, C_f is the confidence of the grid containing the object and C_1, C_2, C_3, …, C_n are the confidences of each object class, respectively.
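For concreteness, the head output shapes implied by the description above can be computed as follows; the function name and the strides 8/16/32 for a 416 × 416 input are our assumptions, with each anchor carrying the (x, y, w, h, C_f) values plus one confidence per class.

```python
def head_shapes(input_size=416, n_classes=20, n_anchors=3):
    """Output shapes of the three detection heads.

    Each anchor predicts (x, y, w, h), the confidence Cf and one
    confidence per class, i.e. 5 + n_classes values; strides 8, 16
    and 32 (assumed) yield the 52 x 52, 26 x 26 and 13 x 13 grids.
    """
    per_anchor = 5 + n_classes
    return [(input_size // s, input_size // s, n_anchors * per_anchor)
            for s in (8, 16, 32)]
```

For the 20-category DIOR setting this gives 3 × (5 + 20) = 75 channels per head.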

Figure 8.
Figure 8. The image is divided into 52 × 52, 26 × 26 and 13 × 13 grids by P_1, P_2 and P_3, respectively. Then, each grid generated three bounding boxes according to the information combined with the anchors, and the process of converting an anchor to a bounding box is illustrated in Figure 9. (C_x, C_y) is the upper-left corner position of the current grid and the center of each grid anchor. (σ(x), σ(y)) is the offset of the bounding box relative to the anchor. The width b_w and height b_h of the bounding box were obtained by multiplying the width p_w and height p_h of the anchor by the scaling factors e^w and e^h, respectively. Finally, the detection results were obtained after redundant bounding boxes were removed through NMS.
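The anchor-to-box conversion described above follows the standard YOLO decoding; a minimal sketch is given below, with the raw-output names (tx, ty, tw, th) chosen by us rather than taken from the paper.

```python
import math

def decode_anchor(tx, ty, tw, th, cx, cy, pw, ph):
    """Convert one raw anchor prediction into a bounding box.

    (cx, cy): upper-left corner of the grid cell (grid units);
    (pw, ph): anchor width and height;
    (tx, ty, tw, th): raw network outputs for this anchor.
    """
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sig(tx)        # box center x, offset kept inside the cell
    by = cy + sig(ty)        # box center y
    bw = pw * math.exp(tw)   # anchor width scaled by the factor e^tw
    bh = ph * math.exp(th)   # anchor height scaled by the factor e^th
    return bx, by, bw, bh
```

With zero raw outputs the box simply sits at the cell center with the anchor's own size.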

Figure 9.
Figure 9. The process of converting an anchor to a bounding box.
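Since the pipeline ends with NMS, a compact greedy NMS sketch follows; the corner-format boxes (x1, y1, x2, y2) and the 0.5 IoU threshold are our assumptions for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

The function returns the indices of the surviving boxes in descending score order.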

4.1. Datasets
4.1.1. DIOR Dataset
DIOR [2] is a large ORSIs dataset that was established in 2020 to develop and validate data-driven methods. It contains 23,463 images and 192,472 objects in total, covering 20 categories in the optical remote sensing field. Images in this benchmark dataset have been clipped into 800 × 800 pixels. There are vast scale variations across objects in DIOR because it contains images with spatial resolutions ranging from 0.5 m to 30 m. According to the definition of COCO

Figure 10.
Figure 10. Each category and the size distributions of objects in DIOR.

4.1.2. RSOD Dataset
RSOD [50] contains 976 images that have been clipped into approximately 1000 × 1000 pixels, and the spatial resolution of these images ranges from 0.3 m to 3 m. There are 6950 object instances in this dataset in total, covered by 4 common classes in ORSIs, including 4993 aircraft, 1586 oil tanks, 180 overpasses and 191 playgrounds. Each instance of these classes is shown in Figure 11.

Figure 11.
Figure 11. Each category of objects in RSOD.

Figure 12.
Figure 12. The relationship between learning rate and epoch.

4.4. Experiment Results and Discussion in DIOR Dataset
4.4.1. Ablation Experiment
Ablation experiments were conducted to verify the effectiveness of each improved module in YOLO-DSD, and the results are shown in Table 2. The detector improved with the DenseRes Block reduces Params by 23.9% (


Figure 13.
Figure 13. The structure of the DenseRes Block in each detector in Table 3.

Figure 14.
Figure 14. The structure of the DCA Block in each detector in Table 5.

Figure 15.
Figure 15. The detection result of a small-sized instance.

Figure 16.
Figure 16. The detection result of an instance in a complex suburban background.

Figure 17.
Figure 17. The detection result of an instance in a complex urban background.

Figure 18.
Figure 18. The precision-recall curves and AP (IOU = 0.5) of YOLOv4 and YOLO-DSD in each category.

Table 1 .
The evaluation indicators.

Table 2 .
The ablation results of YOLO-DSD.

Table 3 .
The ablation results of DenseRes Block.

Table 7 .
Results of comparative experiment for different necks.

Table 8 .
Results of comparative experiment for different attention mechanisms.

Table 9 .
Results of comparative experiment for different detectors.

Table 10 .
Results of comparative experiment for YOLOv4 and YOLO-DSD in RSOD.

Table A1 .
The architecture and complexity of DarkNet-DenseRes.