Fast and Accurate Object Detection in Remote Sensing Images Based on Lightweight Deep Neural Network

Deep learning-based object detection in remote sensing images is an important yet challenging task due to a series of difficulties, such as complex geometric scenes, dense target distributions, and large variations in object distributions and scales. Moreover, algorithm designers also have to make a trade-off between model complexity and accuracy to meet real-world deployment requirements. To deal with these challenges, we propose a lightweight YOLO-like object detector that can detect objects in remote sensing images with high speed and high accuracy. The detector is constructed with efficient channel attention layers to improve its sensitivity to channel information. An anchor configuration scheme based on differential evolution was also developed to automatically find the optimal anchor configurations and address the large variation in object scales. Comprehensive experimental results show that the proposed network outperforms state-of-the-art lightweight models by 5.13% and 3.58% in accuracy on the RSOD and DIOR datasets, respectively. Deployed on an NVIDIA Jetson Xavier NX embedded board, the model achieves a detection speed of 58 FPS with less than 10 W power consumption, which makes the proposed detector well suited to low-cost, low-power remote sensing application scenarios.


Introduction
With the rapid development of satellite and imaging technology, optical remote sensing images with high spatial resolution are obtained more conveniently than ever before [1]. Studies on analyzing and understanding remote sensing images have drawn wide attention. Image classification, segmentation, object detection, and tracking have become the hot topics in the field of remote sensing [2][3][4]. Among them, object detection has presented a broader application prospect in real-world applications and is sought after by researchers in recent years [5].
For object detection tasks, deep neural network-based schemes have shown superior performance over traditional approaches [6,7]. In general, these schemes can be divided into two major categories: (1) one-stage neural network which adopts a fully convolutional architecture that outputs a fixed number of predictions on the grid, such as SSD [8], YOLO [9], and M2Det [10], and (2) two-stage network that leverages a proposal network to find regions of interest that have a high probability to contain an object and a second network to get the classification score and spatial offsets, such as FPN [11] and Faster R-CNN [12]. These detectors have been successfully utilized in many applications, such as robotics, autonomous vehicles, and surveillance systems.
However, applying generic detectors directly to remote sensing images usually does not deliver satisfactory results. The major reason is that remote sensing images have many distinct features that differ from natural images, such as very complex geometric backgrounds, dense object distributions, and a variety of objects with large variations in shape and scale [13,14]. To address these new design challenges faced in object detection in remote sensing images, many approaches have been proposed in the literature [15][16][17][18][19][20]. For instance, Wang et al. [15] designed a new pyramid structure to optimize the Faster-RCNN detector by adding a feature-reflowing pathway from the lower level for each scale to enrich the feature expression. Huang et al. [16] proposed a cross-scale fusion module based on the M2Det detector to extract sufficient comprehensive semantic information from features for performing multi-scale fusion. Zhao et al. [17] improved the SSD detector by adding a channel attention module to strengthen the long-term semantic dependence between objects to improve the discriminative ability of the deep features.
In this paper, we focus on the issue of the large variation in object distributions and scales in remote sensing images. Traditional anchor-based detectors [21,22] match regions of possible objects by a set of pre-allocated anchors with pre-defined aspect ratios; therefore, the final accuracy of the trained neural network model relies heavily on the anchor configuration. However, in remote sensing images, the scale and distribution of the target objects vary over a very wide range. For instance, Figure 1 compares the sizes of the four classes of objects in the RSOD [23] dataset. Because remote sensing images are often acquired by sensors with the same hardware settings and a fixed shooting angle [24], the sizes of the target objects in remote sensing images are directly related to the real-world object scale. As a result, there is a large gap in the size of the ground truth bounding boxes among the objects within one training dataset. In Figure 1, the aircraft and oiltanks have much smaller scales but larger quantities in the image when compared to the overpass and playground. Current anchor selection schemes, such as K-means clustering [25][26][27], tend to allocate more anchors to objects with larger quantities. This makes generic detectors perform very poorly on objects that do not have sufficient corresponding samples during anchor configuration. To demonstrate this phenomenon, we compare some representative detection results of the YOLOv4-Tiny network on the RSOD dataset in Figure 1. For the object classes of overpass and playground, the detection results are very inaccurate.
Besides the above problems, this study also deals with the efficiency of the deep neural network, which makes up most of the detector. In real-world applications, remote sensing object detection tasks are commonly used in rescue, military, and other scenarios [28,29]. This requires the detectors to be as lightweight as possible and to be deployable efficiently on low-cost, low-power embedded devices. Regardless of the type of detection framework used, optimizations of the detailed algorithm, such as the backbone neural network, multi-scale feature fusion, and adaptive anchor setting, are also important in trading off between the detection accuracy and speed to meet the requirements of the target application.
The contributions of this paper are the following. (1) We propose a lightweight backbone deep neural network design, which balances model size and detection accuracy for fast processing on low-cost, low-power embedded hardware platforms. (2) We propose an automatic anchor configuration scheme based on differential evolution (DE), which minimizes the average distance between ground truth bounding boxes and selected anchors and improves the accuracy of object matching. (3) Comprehensive experiments on multiple datasets are conducted, and the results show that the proposed lightweight detector outperforms state-of-the-art detection approaches by 5.13% and 3.58% on the RSOD and DIOR datasets, respectively. We have also deployed the proposed detector on an embedded hardware platform, i.e., an NVIDIA Jetson Xavier NX board [30], achieving a real-time detection speed of 58 FPS with less than 10 W power consumption.
The remainder of this paper is organized as follows. Related work on deep neural network design for remote sensing object detection tasks is summarized in Section 2. The proposed neural network design and automatic anchor configuration scheme are introduced in Section 3. The experimental setup is described in Section 4. Section 5 describes the experimental results in detail, whereas Section 6 shows the performance of the deployment on the embedded platform. Finally, Section 7 concludes this paper.

Related Work
The identification of objects in remote sensing images is a subset of a wider research field in object detection. Taking into account the needs of neural network deployment, the existing studies can be categorized into three main groups: lightweight object detection network, visual attention mechanism, and optimized anchor configuration.

Lightweight Object Detection Network
State-of-the-art approaches can be divided into region-based and single-shot detectors. The former, such as R-CNN [31], Fast R-CNN [32], and Faster R-CNN, use a set of Regions-of-Interest (ROIs) to extract sub-regions that may contain objects; fine-grained detection and classification modules then analyze each ROI. Although region-based schemes have high detection accuracy, their complex network structures often result in a high computation workload and low processing speed. Single-shot detectors, such as SSD, YOLO, and FCOS [33], process the input image in a single pipeline. Thanks to the end-to-end structure, these detectors can achieve real-time detection speed, which makes them very suitable as lightweight object detection networks. Designers can further balance the accuracy and model size according to the requirements of their own target application.
As the most typical end-to-end network structure, YOLO [9,27,34,35] is favored by many application scenarios [36][37][38]. YOLO-Tiny is the lightweight version of the YOLO detector. YOLO detector is generally composed of three basic parts: backbone, neck, and head [35]. The backbone neural network, as an important part of the network to extract deep image information, generally performs as the image classification network, such as Darknet53, ResNet [39], Vgg [40], and Mobilenet [41]. The key information of object is extracted in the neck, and used to predict the position and category in the head. Although YOLO-Tiny meets the actual deployment requirements, it could not meet the accuracy requirements on remote sensing images due to the following drawbacks:

• The shallow backbone and prediction network structures are not sufficient to extract deep semantic information, which limits the performance of the network in complex scenarios, for instance, very small objects or complicated backgrounds [42]. In addition, the simple organization of prediction layers cannot effectively cover objects of various proportions, especially when remote sensing images with dense object distributions are considered.
• The performance of the detector is especially sensitive to anchor configurations, which affect not only the speed of training but also the robustness of the network. As explained in the previous section, the small number of anchors with fixed scales in YOLO-Tiny delivers poor detection results due to the large variation in object scales.

Visual Attention Mechanism
Aiming at highlighting the salient features of images, the attention module has been widely used in various types of detectors. It has become the mainstream approach to improving network accuracy. For instance, Wang et al. [43] proposed the residual attention network, which improved the expressive ability of the network. A channel-level attention mechanism based on feature recalibration was proposed in SENet [44], which could improve a neural network's performance in classification, detection, and other tasks. CBAM [45], which combined channel attention and spatial attention, was also proposed to further improve the model's accuracy. The CA module [46], which integrated the channel and the spatial direction, not only reduced the number of parameters, but also had stronger capabilities. In general, a typical attention module can be divided into two parts: a channel attention module and a spatial attention module, which have their own advantages and disadvantages for different applications; their specific effects are often evaluated through experiments on the target dataset.
Although the attention module can deliver increased detection accuracy, the improvements often come from objects of large scales, which means that, due to the special calculation utilized, some unobvious features are ignored, and the attention model usually focuses on feature-rich large objects, while small objects [46] are often ignored. For remote sensing images with a large number of small objects, the effectiveness of attention modules still needs to be further investigated.

Optimized Anchors
Anchors are used by the YOLO framework in the detection pipeline and are extremely important in remote sensing image-based object detection tasks. In previous studies, there are two commonly used methods to allocate the anchors for the detection network:
1. Manual configuration. Anchors selected manually are more straightforward and robust. However, this requires the designer to have rich experience in the application field and to perform comprehensive manual experiments before determining the best setting.
2. Automatic configuration based on optimization. According to the distribution of the dataset, this type of scheme can automatically find the best anchor positions, which greatly reduces the effort of searching for the optimal configuration and also delivers higher accuracy and faster training speed.
K-means clustering is widely used as the optimization scheme for automatic anchor selection in YOLO. Given a specific number of anchors, the best anchor positions are selected by clustering the bounding boxes in the datasets. The IOU score is further used as the relative distance for anchor configurations in YOLOv4 [35]. Junos et al. [47] used a similar scheme to optimize the anchors in YOLOv3 and applied the network to a crop harvesting system. Zlocha et al. [48] optimized the anchor configuration based on a differential evolution search algorithm in RetinaNet. However, in previous studies, all objects were unified into one big category, which was not reasonable for datasets with unevenly distributed object scales. As we have pointed out in the introduction section, traditional schemes tend to shift the anchor selection toward the category with the maximum number of ground truth samples, resulting in poor detection performance for other objects.

Methodology
In this paper, we aim to design a lightweight deep neural network to accurately detect objects with large variations in scale and quantity in remote sensing images. The proposed detector framework is illustrated in Figure 2. In the proposed detector, we have designed a YOLOv4-like backbone network followed by three prediction layers to capture and combine rich contextual features while minimizing the network's computational cost. Moreover, efficient channel attention is used to form a downsampling module, namely, the Cross Stage Partial connections with Attention (CSPA), to efficiently extract high-level feature information for object classification and localization. Finally, an automatic optimal anchor selection approach based on differential evolution (DE) is proposed to address the problem of biased anchor allocation caused by the large variation of object scale and quantity in remote sensing images.

Lightweight Neural Network
In this section, the details of the proposed neural network architecture are presented. Our network design emphasizes a lightweight structure and computational efficiency at minimum cost in detection accuracy. The overall neural network architecture is illustrated in Figure 2. The proposed network is organized in two major components: a downsampling backbone and a prediction pipeline. The input images are first resized to an appropriate size (for instance, 416 × 416 pixels), and then the downsampling backbone network, which has a YOLOv4-like architecture consisting of two convolutional layers and three CSPA modules, extracts the salient features from the input image at different layers. To enhance the network's ability to capture important features within one channel and to fuse information among different input channels, we propose to add an efficient channel attention layer after the third convolution layer in each of the CSPA modules.
In the proposed network, a prediction pipeline consisting of feature pyramids and three-scale prediction layers is designed to transform features from multiple branches into prediction results. The feature pyramid can effectively fuse meaningful semantic information obtained from salient feature maps of low resolution and finer-grained information extracted from the earlier branches. The three-scale prediction networks are used to predict encoding parameters of bounding box and class predictions. The final inference results are obtained after filtering the prediction box via non-maximal suppression (NMS).
In the original YOLOv4-Tiny structure, only two prediction layers were used to reduce the computational complexity of the excessively large network in YOLOv4. However, this shallow network structure is poorly suited to remote sensing object detection. For instance, given an input image with a resolution of 416 × 416 pixels, the output feature map of the second prediction layer in the original YOLOv4-Tiny structure has dimensions of 26 × 26. According to the principle of the YOLO algorithm, the input image is divided into grid cells of equal dimension, each of which detects objects that appear within it. With two prediction layers, the image is divided into at most 26 × 26 grid cells, which means that each cell is 16 × 16 pixels. Generally, when the target object is too small relative to the grid cells, it cannot be precisely located by the YOLO algorithm. Remote sensing images contain a large number of small objects that cannot be precisely located by grid cells of 16 × 16 pixels or larger. Therefore, in this work, we propose to add one extra prediction layer with an output feature map of 52 × 52, which corresponds to grid cells of 8 × 8 pixels on the input image. From Figure 3, it can be seen that, in the proposed network, the corresponding grid cells mapped on the input images fit the scales of the small objects well. Moreover, based on quantitative analysis (please refer to the detailed results on RSOD in Section 5.1), we have found that the proposed three-layer prediction network offers the best trade-off: introducing extra layers enlarges the model considerably for very little improvement in detection accuracy.
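The relation between input resolution, prediction-layer stride, and grid cell size described above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names are hypothetical, and the strides of 16 and 8 are inferred from the 16 × 16 and 8 × 8 pixel cells mentioned in the text.

```python
def feature_map_size(input_size: int, stride: int) -> int:
    """Number of grid cells per side for a prediction layer with the given stride."""
    return input_size // stride


def grid_cell_pixels(input_size: int, fmap_size: int) -> float:
    """Side length, in input pixels, of one grid cell."""
    return input_size / fmap_size


INPUT_SIZE = 416  # example input resolution used in the text

for stride in (16, 8):  # strides assumed from the 16 x 16 and 8 x 8 pixel cells
    fmap = feature_map_size(INPUT_SIZE, stride)
    cell = grid_cell_pixels(INPUT_SIZE, fmap)
    print(f"stride {stride}: {fmap} x {fmap} cells, each {cell:.0f} x {cell:.0f} pixels")
```

For a 416 × 416 input, a stride of 8 gives 416 / 8 = 52 cells per side, so an object of roughly 10 pixels spans about one cell at this scale but less than one cell at stride 16.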

Efficient Channel Attention
Instead of utilizing both channel and spatial level attentions, we propose to use a single channel attention module to enhance the visual attention of the backbone network. It is based on two design considerations: First, channel attention modules normally have a relatively small amount of parameters compared to spatial ones, which can facilitate rapid training convergence and deliver a faster prediction speed during deployment. Second, we have conducted experiments showing that using channel attention modules was sufficient to improve the detection accuracy to the desired level for remote sensing images. In this work, an efficient channel attention module was designed and utilized in the backbone network by following the basic network structure presented in [49] as shown by Figure 4.
The detailed architecture of the Efficient Channel Attention (ECA) module is summarized as follows. Given the input feature map $I \in \mathbb{R}^{C \times W \times H}$, an average pooling operation first compresses $I$ along the spatial dimension $W \times H$ to obtain the channel descriptor ($\mathbb{R}^{C \times 1 \times 1}$). In the following layer, it is converted into another channel descriptor ($\mathbb{R}^{1 \times C}$) through transpose and squeeze operations. Then, the local cross-channel interaction is completed through a fast 1D convolution with kernel size $k_{size}$, where $k_{size}$ also represents the coverage of the cross-channel interaction. Next, transpose and squeeze operations are applied again to convert the channel descriptor back to the original channel dimensions ($\mathbb{R}^{C \times 1 \times 1}$). Finally, the channel weight is generated through the Sigmoid function and multiplied by $I$.
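The data flow of the ECA module can be illustrated with a dependency-free sketch operating on nested lists. This is only an illustration under stated assumptions: the function name `eca` is hypothetical, the 1D kernel is passed in by hand (in a trained network its values would be learned), and zero padding is assumed so that the output keeps C channels. The transpose and squeeze steps are implicit because the descriptor is stored as a flat list.

```python
import math


def eca(feature_map, kernel):
    """Apply ECA-style channel attention to a C x H x W nested list.

    kernel: 1D list of odd length k_size (hypothetical values; learned in practice).
    Returns the reweighted feature map and the per-channel weights.
    """
    channels = len(feature_map)
    # 1) Global average pooling: one scalar descriptor per channel.
    desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # 2) 1D convolution across neighbouring channels (zero padding keeps length C).
    pad = len(kernel) // 2
    padded = [0.0] * pad + desc + [0.0] * pad
    mixed = [sum(w * padded[c + j] for j, w in enumerate(kernel)) for c in range(channels)]
    # 3) Sigmoid maps each mixed descriptor to a weight in (0, 1).
    weights = [1.0 / (1.0 + math.exp(-m)) for m in mixed]
    # 4) Rescale every value of a channel by that channel's weight.
    out = [[[v * weights[c] for v in row] for row in ch] for c, ch in enumerate(feature_map)]
    return out, weights


# Three constant 2 x 2 channels with values 1, 2 and 3; k_size = 3.
fmap = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(3)]
_, channel_weights = eca(fmap, kernel=[1.0, 1.0, 1.0])
print(channel_weights)
```

Note that the only parameters introduced by the module are the $k_{size}$ kernel values, independent of the number of channels, which is why the parameter overhead stays very small.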
To compare the performance of the classic channel attention module (Squeeze-and-Excitation (SE) [44]) with the proposed ECA module in terms of feature extraction capability and operation efficiency, we conducted experiments to quantitatively measure the increase in model size and the improvement in detection accuracy for both schemes when adopted in the proposed backbone network. The results are reported in Table 1. It can be observed that the SE module outperforms the baseline model by 3.10%, at the cost of 2688 additional parameters, while the ECA module outperforms the SE module by 0.22% and adds 74% fewer parameters. It can be concluded that the ECA module captures sufficient cross-channel interaction in an efficient way, improving the detection performance with minimal cost in model size and computational complexity.

Optimal Anchor Configuration Based on Differential Evolution
The anchor configuration, a hyperparameter for the training of the network model, affects the performance of the model to a great degree. An optimal anchor configuration can improve the network accuracy without additional computational cost. However, when facing the large variation of object scales in remote sensing images, the anchor configuration scheme used in YOLO, i.e., K-means clustering, leads to a biased anchor allocation. The improvement in accuracy tends to be concentrated in the categories with a larger number of objects. One major reason is that the key evaluation metric widely used by previous studies is the mAP score, which is the average of the AP scores of all categories. Excessively increasing the AP score of a certain category can increase the overall accuracy of the model to a certain extent. However, further improvement is not possible due to the limited detection ability for the other categories with very low AP scores, which means that the trained model is biased. In general, a more reasonable approach is to consider the distribution of both the quantity and the size of the objects in all categories, and to develop a scheme that improves the overall performance of the neural network while balancing the accuracy across all target categories.
To better capture the relationship between object scale and quantity, we propose an improved anchor configuration scheme based on differential evolution (DE). This method takes the height and width of the anchors as variables, and the sum of the nearest distances from the ground truth bounding boxes to the anchors as a fitness function. In addition, a weight value is also added to the distance calculation to avoid possible biased training of the neural network. Finally, the minimum value of the fitness function is solved by using DE [50], reaching the goal of minimizing the distance.
More specifically, for a given dataset, the distance from one ground truth bounding box to one anchor can be formulated as follows:

$$dis(truth, anchor) = 1 - IOU(truth, anchor) \tag{1}$$

where $IOU(truth, anchor)$ represents the intersection over union of the ground truth and the anchor box, both centered at the origin. The corresponding calculation formula is

$$IOU(truth, anchor) = \frac{S_{overlap}}{S_{union}} \tag{2}$$

where $S_{overlap}$ refers to the overlap area between the ground truth and the anchor box, and $S_{union}$ refers to the union area between them.
By denoting $x_{ij}$ as the $j$-th ground truth of the $i$-th category, and $\theta_k$ as the $k$-th anchor box, the distance between $x_{ij}$ and the anchor boxes can be expressed in the following form:

$$dis(x_{ij}, \theta) = \min_{k} \, dis(x_{ij}, \theta_k) \tag{3}$$

which means that we choose the anchor with the smallest distance as its best match. Therefore, the distance between all samples ($X$) and anchors ($\theta$) can be calculated by

$$G(X, \theta) = \sum_{i=1}^{n} \frac{1}{m_i} \sum_{j=1}^{m_i} dis(x_{ij}, \theta) \tag{4}$$

where $m_1, m_2, \ldots, m_n$ represent the numbers of ground truth samples in the $n$ categories.
In Equation (4), the weights for the different categories in the fitness function are set to be inversely proportional to the number of objects in each category, which helps to prevent categories with a large number of objects from attracting a disproportionate share of the anchors.
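The class-weighted fitness function can be sketched in a few lines. This is an illustrative sketch rather than the implementation used in the experiments: the function names are hypothetical, boxes are represented as (width, height) pairs scaled to [0, 1] and centered at the origin, and the per-class weight is assumed to be exactly $1/m_i$, which is one way to make it inversely proportional to the number of objects.

```python
def iou_wh(box, anchor):
    """IoU of two (width, height) boxes that share the same centre."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def fitness(boxes_by_class, anchors):
    """Class-weighted sum of nearest-anchor distances, with distance = 1 - IoU."""
    total = 0.0
    for boxes in boxes_by_class.values():
        weight = 1.0 / len(boxes)  # assumed weight: inverse of the class sample count
        total += weight * sum(min(1.0 - iou_wh(b, a) for a in anchors) for b in boxes)
    return total


# Toy data: many small boxes in one class, a single large box in another.
toy = {
    "aircraft": [(0.1, 0.1)] * 4,
    "playground": [(0.4, 0.2)],
}
print(fitness(toy, anchors=[(0.1, 0.1)]))
print(fitness(toy, anchors=[(0.1, 0.1), (0.4, 0.2)]))
```

With a single small anchor, the lone playground box contributes a distance of about 0.875 at full weight, even though it is outnumbered four to one; adding a matching anchor drives the fitness to zero. Without the per-class weight, the four aircraft boxes would dominate the sum.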
Here, we use single-objective DE [51] to solve the optimization problem of Equation (4). In Equation (4), the real values in $x_{ij}$ and $\theta_k$ are all scaled to the range [0, 1]. To avoid conflict with the aforementioned variables, the decision variables in DE are denoted by $P_i$, i.e., $P_i = \theta \in \mathbb{R}^{k \times 2}$, and $G$ is used as the fitness function of DE. Then, DE searches for the anchors that minimize $G$ through repeated iterations. The pseudocode of the proposed optimal anchor selection scheme is described by Algorithm 1, where $C_r$ denotes the crossover rate, $F_s$ represents the scaling factor, $N_p$ is the population size, and $t$ is the iteration number.

Algorithm 1 Anchor configuration algorithm based on DE
Input: parameters $C_r$, $F_s$, $N_p$
Output: $\arg\min_{P_i^t} G(P_i^t)$ and $P_i^t$
Initialize population $P = (P_1^t, P_2^t, \ldots, P_{N_p}^t)$
Counter $t \leftarrow 0$
while stop condition not met do
    for $i \in (1, 2, \ldots, N_p)$ do
        $\nu_i \leftarrow$ differential mutation ($F_s$; $i$, $P$)
        …
    end for
    $t \leftarrow t + 1$
end while
return $\arg\min_{P_i^t} G(P_i^t)$ and $P_i^t$

In the first step, the proposed algorithm performs an initialization operation on the population $P$ consisting of decision variables $P_i^t$, where $P_i^t$ represents the $i$-th individual in the $t$-th iteration. The initialized population is distributed in a defined area according to the population size $N_p$, and each individual represents a candidate solution. Generally, the initial population should cover the whole search space. As the population size $N_p$ increases, the probability of obtaining the global optimal solution also increases. In this work, each individual is defined as a specific anchor configuration.
In the following iterations, the procedures of differential mutation, crossover, and selection are performed repeatedly. The goal of differential mutation is to create groups of new individuals that have a certain probability of being the optimal solution. The difference introduced by mutation between the parent and the children is quantitatively controlled by a scaling factor $F_s$. Next, the crossover operation randomly swaps elements $\theta_j$ ($j = 1, 2, \cdots, k$) between the individuals of the current iteration and their differentially mutated counterparts. This procedure promotes population diversity, and the crossover probability is controlled by $C_r$. Then, the newly generated individuals are compared with the current ones, and the better individuals are selected and passed on to the next iteration. The iteration continues until the stop condition is met, and the best individual and its objective function value are returned. Through the above algorithm, the minimum value of the fitness function can be obtained, which corresponds to the best anchor configuration.
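The mutation, crossover, and selection loop described above can be sketched as follows. This is a minimal sketch that assumes the common rand/1/bin variant of DE, which the text does not specify; the function name, default parameter values, and the fixed iteration budget used as the stop condition are all hypothetical choices for illustration. The fitness callable is generic, so an anchor configuration with $k$ anchors would be flattened into a vector of length $2k$ before being passed in.

```python
import random


def differential_evolution(fitness, dim, n_pop=30, f_s=0.5, c_r=0.7, iters=200, seed=0):
    """Minimise fitness over [0, 1]^dim with a rand/1/bin DE variant (assumed)."""
    rng = random.Random(seed)
    # Initialisation: individuals drawn uniformly over the whole search space.
    pop = [[rng.random() for _ in range(dim)] for _ in range(n_pop)]
    scores = [fitness(ind) for ind in pop]
    for _ in range(iters):  # stop condition: fixed iteration budget
        new_pop, new_scores = [], []
        for i in range(n_pop):
            # Differential mutation: combine three distinct other individuals.
            a, b, c = rng.sample([j for j in range(n_pop) if j != i], 3)
            mutant = [min(1.0, max(0.0, pop[a][d] + f_s * (pop[b][d] - pop[c][d])))
                      for d in range(dim)]
            # Binomial crossover: take each element from the mutant with probability c_r.
            forced = rng.randrange(dim)  # guarantees at least one mutant element
            trial = [mutant[d] if (rng.random() < c_r or d == forced) else pop[i][d]
                     for d in range(dim)]
            # Selection: the trial replaces its parent only if it is at least as good.
            trial_score = fitness(trial)
            if trial_score <= scores[i]:
                new_pop.append(trial)
                new_scores.append(trial_score)
            else:
                new_pop.append(pop[i])
                new_scores.append(scores[i])
        pop, scores = new_pop, new_scores
    best = min(range(n_pop), key=scores.__getitem__)
    return pop[best], scores[best]


# Toy objective standing in for G: squared distance to a known optimum.
target = [0.2, 0.4, 0.6, 0.8]
best, value = differential_evolution(
    lambda x: sum((xi - ti) ** 2 for xi, ti in zip(x, target)), dim=4)
print(best, value)
```

Clipping the mutant to [0, 1] keeps candidate anchors inside the normalized range used for the bounding boxes; other boundary-handling strategies are equally possible.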

Hardware Platforms
The proposed lightweight model is intended to run on low-power embedded devices. However, to demonstrate the advantage of the proposed optimization schemes and the performance of the proposed detector more comprehensively, experiments were conducted on two different hardware platforms: an NVIDIA GeForce RTX 2080 Ti desktop GPU and an NVIDIA Jetson Xavier NX embedded board. The hardware specifications of the two platforms are summarized in Table 2. For deployment on the embedded board, all tested neural network models were quantized into the FP16 data format to relieve the pressure on external memory bandwidth.

Datasets and Training Parameters
In this work, experiments on two public remote sensing datasets were conducted to verify the effectiveness of the proposed detector and further evaluate its accuracy and speed. The RSOD remote sensing dataset was selected to measure the key performance metrics of the detector. RSOD includes 4993 aircraft in 446 images, 1586 oiltanks in 165 images, 191 playgrounds in 189 images, and 180 overpasses in 176 images. The dataset was randomly divided into the training and test set according to a 7.5:2.5 ratio.
Considering the relatively limited number of samples in RSOD, we also chose DIOR [52] to further evaluate the performance of the proposed model. DIOR covers a wider range of 20 object categories and contains a total of 23,463 images with 192,472 labeled examples; 11,725 images were used for training and 11,738 images were used for testing.
The training parameters of the final network model were set as follows: the initial weights were pre-trained on the COCO dataset; Adam was used as the optimizer; the initial learning rate was $1 \times 10^{-4}$; and the maximum number of training epochs was set to 100.

Evaluation Metrics
We used a total number of four metrics to evaluate the performance of the proposed method: (1) mAP; (2) FLOPs (floating-point operations); (3) number of parameters; (4) FPS (frames per second). The mean average precision (mAP), which is widely used by previous studies, is still used as the key accuracy score in this paper to compare with the state-of-the-art. FLOPs and parameters are used to evaluate the computational complexity and memory footprint of the neural network model, respectively. FPS is used to evaluate the processing speed of the model on target hardware devices.
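As an illustration of how the mAP score aggregates per-class results, the following sketch computes the AP of each class and averages them. The exact AP protocol is not stated in this section, so all-point interpolation of the precision-recall curve is assumed here; the function names and the toy detections are hypothetical.

```python
def average_precision(scored_hits, num_gt):
    """AP for one class from (confidence, is_true_positive) pairs.

    Assumes all-point interpolation of the precision-recall curve.
    """
    ranked = sorted(scored_hits, key=lambda s: s[0], reverse=True)
    tp = 0
    precisions, recalls = [], []
    for rank, (_, hit) in enumerate(ranked, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    # Make precision non-increasing as recall grows (interpolation step).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the interpolated curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class maps class name -> (scored_hits, num_gt); returns the mean AP."""
    aps = [average_precision(hits, n) for hits, n in per_class.values()]
    return sum(aps) / len(aps)


toy = {
    "aircraft": ([(0.9, True), (0.8, True)], 2),
    "overpass": ([(0.9, True), (0.8, False), (0.7, True)], 2),
}
print(mean_average_precision(toy))
```

Because every class contributes equally to the mean, a class with few samples and a low AP lowers the mAP as much as a poorly detected frequent class would.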

Improvements by Network Structure
In this paper, we have proposed two optimizations on the lightweight neural network: multi-scale prediction layers and attention modules. The performance gains of these two schemes were evaluated independently and the results are shown in the following sections.
As discussed in Section 3.1, a deeper prediction network corresponds to a finer grid on the input images, which can locate small objects more accurately. However, detection speed is often sacrificed due to the increased model size and computational complexity. To balance processing speed and accuracy, we conducted a series of experiments to explore the optimal structure of the prediction network. The obtained results are compared in Table 3. Compared to using only two prediction layers, adopting a three-layer prediction network improves the detection accuracy in terms of mAP score by 2%, while the increase in FLOPs is approximately 48%. In particular, the detection accuracy of overpass and aircraft, which often have smaller scales in the images, is improved by 2.27% and 7.74%, respectively. Although adding a fourth layer to the prediction network can further raise the average accuracy by around 1% (the improvement is mainly on the small object class of aircraft), the computational complexity doubles compared to using three layers and is three times that of using only two layers. In addition, the extra prediction layer also creates a large number of unnecessary bounding boxes, which reduces the execution speed of post-processing procedures such as NMS. Therefore, based on the comparative experimental results, we conclude that the three-layer prediction network architecture is the most cost-effective design.
For visual attention enhancement, the proposed ECA design was compared with several state-of-the-art attention modules, including SE, CBAM, and CA, as listed in Table 4. The parameter count only includes the parameters introduced by the attention module, while the other metrics correspond to the whole neural network. The proposed ECA module delivers a 3.32% boost in mAP with almost no loss in processing speed. Although the CBAM and CA modules have both channel and spatial attention, they failed to improve the detection results. The main reason is that these two attention modules, which were added at a position close to the input, introduced a larger number of parameters (four times that of the proposed scheme) to the original model. This made the training process very sensitive to the initialization state of the backbone model, and the original loss function failed to generate sufficient backpropagation information to update the new parameters. The SE module has an inference performance very close to that of the proposed ECA module, but its processing speed is 2% slower on the desktop GPU. This cost in speed will be further enlarged when deployed on embedded platforms.
The proposed ECA module also has a hyperparameter $k_{size}$, i.e., the filter size of the 1D convolution. Table 5 shows the experimental results with different values of $k_{size}$, including fixed values and adaptive ones in all convolution layers. The ECA module achieves the best performance when $k_{size} = 3$ and $k_{size} = 7$. Furthermore, note that the adaptive approach does not outperform the fixed ones. We conjecture that the main reason is that one layer of convolution with a small filter size is sufficient to capture enough spatial feature information within one channel, and that a larger perception filter and more layers are redundant [49].

Improvements by Anchor Configuration
Population size (N_p) and maximum iteration number are two key initialization parameters for the DE solver utilized in the proposed anchor configuration algorithm. To determine the most reasonable initialization parameters, the following experiments were carried out.
First, we set P_i = θ ∈ (6, 2) for the DE optimizer, in which only six anchors were configured. The other parameters were set as follows: C_r = 0.7, F_s = 0.5, maximum iteration = 500. The measured average values of the fitness function under different population sizes are shown in Figure 5, and the detailed convergence times are reported in Table 6. When N_p = 300, the objective function reaches the lowest value among all configurations, so N_p = 300 was selected as the optimal parameter setting. In addition, it was found that 500 iterations did not make the DE solver fully converge, so the maximum iteration number was increased to 1000. The final performance of DE is shown in Figure 6. As the iteration number increases, the minimum and average values of the fitness function gradually coincide. The best anchor settings obtained for the RSOD dataset are summarized in Table 7. The obtained anchors, except for the fourth one, tend to have a larger scale in the height dimension. This indicates that our scheme has optimized the anchor settings to best match the objects at various scales. To show the advantage of the proposed algorithm more clearly, we have also visualized the distribution of the ground truth bounding boxes and the obtained anchor boxes in the scatter plot of Figure 7, in which the results obtained by K-means clustering are compared. The figures show that the anchors obtained by the proposed algorithm are more evenly distributed across the entire data set, i.e., anchor allocation fully takes into account the distribution of samples in each category. For instance, in the scale range of 0.2 to 0.5, the K-means clustering scheme allocated only a single anchor to capture all the objects with large variation in scales, which will inevitably degrade detection accuracy. 
In contrast, the proposed scheme has allocated four anchors in this range, each of which also corresponds to the clustering center of a specific category of object. Therefore, it can be concluded that the proposed algorithm can locate the target objects more accurately than the original YOLOv4-Tiny network, as the example in Figure 8 shows. Note that, in this experiment, both networks have the same number of prediction layers.
Besides the accuracy improvements, the proposed anchor configuration scheme also delivers a faster training speed than traditional approaches. The loss functions obtained by adopting the aforementioned two anchor selection algorithms in training the neural network models are compared in Figure 9. From the curves, we can see that the training process that adopted the proposed anchor selection scheme converges more quickly, and the final loss drops by about 50% relative to the K-means clustering scheme in the case of using six anchors. Because the anchors selected by the K-means clustering scheme are confined to a small region, it is difficult for the detector to capture information from samples outside this region and achieve better matching results. The anchors obtained by the proposed algorithm are distributed more evenly over the dataset, which greatly improves the overall learning efficiency of the neural network.
After obtaining the optimal anchor setting, the detection accuracy achieved with three prediction layers is shown in Table 8. Compared with K-means clustering, the proposed algorithm can improve the detection accuracy by 1.13% in terms of mAP when using 9 anchors.
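The anchor search discussed in this section can be illustrated with a small differential evolution loop. The sketch below is a generic DE/rand/1/bin solver, not the authors' implementation. The values C_r = 0.7, F_s = 0.5, N_p = 300, and 1000 iterations come from the text; everything else is an assumption made for illustration: the fitness function (one minus the mean best IoU between each ground truth box and its closest anchor), box sizes normalized to (0, 1], and the hypothetical names `iou_wh`, `fitness`, and `de_anchors`.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (n, 2) box sizes and (k, 2) anchor sizes, corners aligned."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    box_area = boxes[:, 0:1] * boxes[:, 1:2]                  # (n, 1)
    anchor_area = (anchors[:, 0] * anchors[:, 1])[None, :]    # (1, k)
    return inter / (box_area + anchor_area - inter)

def fitness(theta, boxes):
    """Assumed objective: 1 - mean best IoU (lower is better)."""
    anchors = theta.reshape(-1, 2)
    return 1.0 - iou_wh(boxes, anchors).max(axis=1).mean()

def de_anchors(boxes, n_anchors=6, n_pop=300, c_r=0.7, f_s=0.5,
               max_iter=1000, seed=0):
    """DE/rand/1/bin search over flattened (width, height) anchor vectors."""
    rng = np.random.default_rng(seed)
    dim = n_anchors * 2
    lo, hi = 1e-3, 1.0                       # normalized size bounds (assumed)
    pop = rng.uniform(lo, hi, size=(n_pop, dim))
    cost = np.array([fitness(p, boxes) for p in pop])
    for _ in range(max_iter):
        for i in range(n_pop):
            others = [j for j in range(n_pop) if j != i]
            a, b, c = pop[rng.choice(others, 3, replace=False)]
            mutant = np.clip(a + f_s * (b - c), lo, hi)       # mutation
            cross = rng.random(dim) < c_r                     # binomial crossover
            cross[rng.integers(dim)] = True                   # keep >= 1 mutant gene
            trial = np.where(cross, mutant, pop[i])
            trial_cost = fitness(trial, boxes)
            if trial_cost <= cost[i]:                         # greedy selection
                pop[i], cost[i] = trial, trial_cost
    return pop[cost.argmin()].reshape(-1, 2), cost.min()

if __name__ == "__main__":
    demo_boxes = np.random.default_rng(1).uniform(0.02, 0.6, size=(200, 2))
    anchors, best = de_anchors(demo_boxes, n_pop=30, max_iter=50)
    print(anchors, best)
```

The demo uses a reduced population and iteration count because this pure-Python loop is slow at the full settings; a vectorized version would be needed to run N_p = 300 for 1000 iterations in reasonable time.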

Comparison with the State-of-the-Art
We first compare the performance of the proposed detector with three generic detectors, namely SSD, YOLOv4, and the lightweight YOLOv4-Tiny network, on both the RSOD and DIOR datasets. Table 10 shows the performance of different networks on RSOD. Detection speed was measured on the same desktop GPU. Compared with YOLOv4-Tiny, the proposed scheme achieves a considerable 5.13% improvement in mAP, while the cost in speed is a decline of approximately 50 FPS on desktop GPUs. However, the difference in speed becomes negligible when deployed on embedded platforms. It can also be observed that the proposed network achieves a slightly higher accuracy than the SSD model, while the processing speed is 4× faster. In addition, the proposed model was also trained and evaluated on the DIOR dataset, and the experimental results are reported in Tables 11 and 12. Compared with YOLOv4-Tiny, our method improved the detection accuracy by 3.58%.
Table 13 compares the proposed scheme with state-of-the-art detectors that have been optimized for remote sensing images. Among all the detectors, CSFF [53] and CF2PN [16] have the highest accuracy but the most complex network structures. For instance, CSFF adopted ResNet-101 as the backbone and an FPN as the prediction network, which greatly improved the accuracy of remote sensing object detection. However, the network model is 8× larger than that of the proposed scheme, resulting in a 15× slower processing speed on desktop GPUs. The extremely large neural network models and high computational workload have prevented schemes like CSFF and CF2PN from being deployed on embedded hardware platforms. In contrast, Simple-CNN [54] and ASSD-lite [55] use simpler backbone structures and more compact network designs. These two methods can therefore achieve real-time processing speed (around 60 FPS) on desktop GPUs. 
However, the computational workload of these two detectors is still too large for the capacity of our target embedded device (i.e., to achieve a real-time speed of 60 FPS, the network model of the detector should have a workload of less than 8 GFLOPs). The only lightweight detector that can compete with the proposed scheme in terms of the number of parameters is LO-Det [56]. However, in LO-Det, the authors designed a very complex FPN network, in which channel shuffle and split operations are repeatedly used in each layer. These operations can improve the network's accuracy but are very unfriendly to parallel processing on GPUs. As a result, LO-Det runs 4× slower than our scheme. At 227.9 FPS, the proposed detector is considerably faster than all the reference schemes on the desktop GPU, which shows that our scheme is not only lightweight in model structure but also very efficient to execute on GPU devices for parallel processing.

Deployment on Embedded Platform
We have deployed the proposed object detection framework on an NVIDIA Jetson Xavier NX board installed on a UAV. The proposed lightweight neural network model was quantized into 16-bit floating-point numbers (FP16) by using the TensorRT toolkit. The final prediction accuracy and speed results are reported in Table 14, in which FP32 refers to the standard 32-bit floating-point precision. By quantizing the network model to a reduced precision, the Jetson NX platform can deliver twice the computational capacity of the standard FP32 data format (i.e., 500 GFLOPs vs. 250 GFLOPs peak performance). In addition, the detection accuracy of the quantized model was well preserved. Thanks to the lightweight yet efficient network structure, the system can perform high-accuracy real-time object detection at a speed of 58.17 FPS on captured remote sensing images with a power consumption of 8.5 W. The achieved computational performance is therefore 294 GFLOPs. Compared with the original YOLOv4-Tiny model, our scheme has a significant advantage of 5.1% in detection accuracy, while the sacrifice in speed is not noticeable in practical usage. Moreover, the proposed detector also presents a 15.6% higher computational efficiency than YOLOv4-Tiny. The main reason is that the parallelism of the convolutional units is optimized in the device, especially for 3 × 3 convolutions. A large share of the computation in the proposed network stems from the increase in 3 × 3 convolutions, which improves the utilization of device resources.
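The deployment figures above can be related to each other with a few lines of arithmetic. The sketch below uses only numbers quoted in this section (58.17 FPS, 8.5 W, 294 GFLOPs achieved, 500 GFLOPs FP16 peak) and treats the two throughput figures as per-second rates. The derived quantities (workload per frame, peak utilization, frames per watt) are not reported in the paper; they are back-of-the-envelope estimates that assume achieved throughput equals per-frame workload multiplied by frame rate.

```python
# Figures quoted in the text for the Jetson Xavier NX deployment (FP16).
fps = 58.17                 # measured detection speed, frames per second
power_w = 8.5               # measured power consumption, watts
achieved_gflops = 294.0     # achieved throughput, interpreted as GFLOP per second
peak_fp16_gflops = 500.0    # quoted FP16 peak, interpreted as GFLOP per second

# Derived estimates (assumption: throughput = workload per frame * FPS).
gflops_per_frame = achieved_gflops / fps
utilization = achieved_gflops / peak_fp16_gflops
fps_per_watt = fps / power_w

print(f"implied workload per frame: {gflops_per_frame:.2f} GFLOPs")
print(f"fraction of FP16 peak used: {utilization:.1%}")
print(f"energy efficiency:          {fps_per_watt:.2f} FPS/W")
```

Under these assumptions, the implied per-frame workload is roughly 5 GFLOPs, which is consistent with the stated requirement that a detector stay below 8 GFLOPs to run at about 60 FPS on this device.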

Conclusions
In this paper, we have proposed an efficient lightweight object detector for remote sensing images based on deep convolutional neural networks. To achieve the best balance between detection speed and accuracy, we first designed an improved YOLOv4-like backbone network with three prediction layers to alleviate the problem of multi-scale object detection. Combined with efficient channel attention to extract important features, the detector can detect small objects with improved accuracy and no significant overhead in computational workload. Then, an optimal anchor configuration scheme was proposed to solve the problem of biased anchors caused by the large variation in object scales in remote sensing images. Finally, evaluation was conducted on both the RSOD and DIOR datasets, and comparisons with state-of-the-art methods show that the proposed lightweight detector has a significant advantage in processing speed while maintaining a comparable detection accuracy. Furthermore, real-world deployment on the NVIDIA Jetson Xavier NX verified that our scheme is very suitable for low-cost, low-power, real-time remote sensing object detection tasks.

Conflicts of Interest:
The authors declare no conflict of interest.