Object Detection Based on Global-Local Saliency Constraint in Aerial Images

Abstract: Different from object detection in natural images, optical remote sensing object detection is a challenging task, due to the diverse meteorological conditions, complex backgrounds, varied orientations, scale variations, etc. In this paper, to address this issue, we propose a novel object detection network (the global-local saliency constraint network, GLS-Net) that can make full use of the global semantic information and achieve more accurate oriented bounding boxes. More precisely, to improve the quality of the region proposals and bounding boxes, we first propose a saliency pyramid, which combines a saliency algorithm with a feature pyramid network, to reduce the impact of the complex background. Based on the saliency pyramid, we then propose a global attention module branch to enhance the semantic connection between the target and the global scenario. A fast feature fusion strategy is also used to combine the local object information based on the saliency pyramid with the global semantic information optimized by the attention mechanism. Finally, we use an angle-sensitive intersection over union (IoU) method to obtain a more accurate five-parameter representation of the oriented bounding boxes. Experiments with a publicly available object detection dataset for aerial images demonstrate that the proposed GLS-Net achieves a state-of-the-art detection performance.


Introduction
Simultaneous localization and category recognition are the fundamental but challenging tasks of aerial image object detection. With the increasing number of aircraft and satellites, more and more aerial images are now becoming available. Object detection in aerial images has become one of the hot topics in the computer vision field, and is used in a wide range of applications, such as traffic control, airport surveillance, monitoring of oil storage facilities, inshore ship detection, and military target discovery [1][2][3][4]. The difficulties of aerial image object detection are mainly due to the varied weather conditions and the variation of the orientation and scale of the objects. In recent years, deep learning has shown its great potential in computer vision tasks [5][6][7][8][9][10], and significant progress has been made in the field of object detection. To avoid ambiguity, the objects in this article are defined as the categories that were selected by experts in aerial image interpretation, according to how common an object category is and its value for real-world applications [11].
Although natural image object detection based on deep learning has made great progress, some difficulties are still encountered when these methods are directly applied to remote sensing image object detection tasks, due to the special characteristics of aerial images (as shown in Figure 1):

1. Intricate background noise. The complex background information reduces the accuracy of the region proposals, meaning that the extracted features contain more noise.
2. Drastic scale changes. Since the flight trajectories and sensors of the different aircraft are not completely consistent, almost every aerial image product has unique resolution and imaging characteristics. In addition, although warships and fishing vessels are both classified as ships, for example, the number of pixels occupied by the different ships varies greatly.
3. Lack of consideration of the semantic information between the scene and objects. In remote sensing images, there are scene-object semantic relationships between, for example, aircraft and airports, cars and roads or parking lots, ships and water, and bridges and water. The R-CNN-based algorithms divide the image into different regions and then extract the features, discarding the scene information, which could be used as a constraint.
4. Arbitrary target orientation and dense arrangement. Because the image acquisition platform may fly over the target from any angle, the target may have an arbitrary orientation under the overhead view. In addition, scenarios such as ships berthing in sequence at a wharf and densely arranged vehicles driving on crowded urban roads greatly increase the difficulty of the target detection, and missed detections can easily occur.
5. Lack of aerial image datasets available for neural networks. Compared with the usual natural image datasets, the training datasets available for neural networks are relatively limited.
In this paper, in order to obtain improved object detection results with optical remote sensing images, we put forward an effective network structure, namely, the global-local saliency constraint network (GLS-Net). In GLS-Net, we first build a saliency pyramid, which combines a non-deep-learning saliency algorithm with the feature pyramid network (FPN) to reduce the impact of the complex background, without requiring extra saliency labels. We then propose a branch to enhance the semantic connection between the target and the global scenario, based on a global attention module called GA-Net. In addition, a fast feature fusion strategy is used to combine the local object information, based on the saliency pyramid, with the global semantic information optimized by the attention mechanism. To obtain more accurate bounding boxes, we use an angle-sensitive intersection over union (IoU) method to calculate the matching degree of the prediction and ground truth, to obtain the five parameters of the final oriented bounding boxes. The main contributions of this paper can be summarized as follows:

1. We propose the use of a saliency pyramid, which makes the target pixels more distinct from the background.
2. We propose a global attention mechanism, called GA-Net, to constrain the semantic information of the target in the global context, and a fast fusion strategy is used to combine the global information with the objects.
3. During the inference stage, we propose the use of an angle-sensitive IoU algorithm to obtain oriented bounding boxes that are as accurate as possible.
The rest of this paper is organized as follows. Section 2 gives a brief review of the related work on aerial image object detection based on deep learning, object detection based on saliency, and the attention mechanism. In Section 3, we introduce the proposed method in detail. The details of the dataset, the experiments conducted in this study, and the results of the experiments are presented in Section 4. Section 5 concludes this paper with a discussion of the results.

Object Detection of Aerial Images
Over the past decades, research in the field of remote sensing image object detection has made great progress, and many imaginative and effective algorithms have been proposed to solve the various problems in optical remote sensing image detection. According to the feature description method, the object detection algorithms can be mainly divided into two branches: hand-crafted feature based algorithms and deep learning based algorithms.
On the one hand, hand-crafted features have been widely used in the detection of various objects. For example, Liu et al. [12] proposed a method using a two-cascaded linear model followed by binary linear programming to detect ships, which can reduce the search space by constructing a nearly closed-form rotated bounding box space for the ships. Yang et al. [13] proposed a detection algorithm based on saliency segmentation and the local binary pattern (LBP) descriptor combined with the ship structure. Xu et al. [44] improved the adaptability of the invariant Hough transform with an iterative training method to solve the problem of detecting inshore ships in high-resolution remote sensing imagery. Cheng et al. [45] proposed a framework composed of linear support vector machines (SVMs), called the collection of part detectors (COPD), to complete the task of multi-category object detection in remote sensing images.
On the other hand, deep learning based methods have shown their potential in feature extraction and description for aerial images. For example, Tayara et al. [46] proposed a one-stage framework for object detection in aerial images based on a densely connected convolutional network [10]. Based on YOLOv2 [41], the you only look twice (YOLT) network [47] is a two-branch network that can predict the scene and target at the same time, realizing rapid detection in aerial images with a one-stage framework. However, although the one-stage approach has advantages in detection speed, the detection accuracy is still a problem. Therefore, most of the object detection methods for aerial images focus on a region-based object detection pipeline, i.e., they are two-stage detectors. For example, Xu et al. [48] and Ren et al. [49] adopted the idea of a deformable convolutional network [50] and proposed two-stage networks that can predict categories and rectangular bounding boxes in aerial images. Instead of using an end-to-end deep learning detector, the method developed by Xiao et al. [51] can be used to detect an airport scene in remote sensing images. This method uses three-scale sliding windows to generate scene candidates, target candidates, and local target candidates. It then extracts the features with a neural network, and finally uses an SVM to generate rectangular bounding boxes. Deng et al. [52] also constructed a two-stage detector, which performs data augmentation by dividing the large-scale remote sensing data into grids and rotating them, and then fuses the same-scale features from three levels to generate a new feature map.
The objects in aerial images are mostly small targets. To address this problem, Ren et al. [53] used a densely connected structure to replace the commonly used FPN [37], and modified the default size of the anchors, making the network pay more attention to the many small areas in the feature map. All of the above methods use rectangular bounding boxes as the coordinate representation of the target. In fact, due to the uncertainty of the course of aircraft and space platforms, the target can be oriented in any direction when looking down from an overhead perspective. Many methods have been explored to deal with this issue, with the rotated rectangle bounding box methods being found to show a superior performance. Based on the idea of the rotational region CNN (R2CNN) [54], many effective networks have been proposed. For example, Yang et al. [55] constructed a remote sensing image object detector called SCRDet for predicting rotated rectangular coordinates using an attention mechanism. SCRDet first uses an inception module [56] to fuse two layers of features from the feature pyramid, and then a spatial attention mechanism and a channel attention mechanism [57] are added to the network. Finally, the algorithm combines RoIAlign [58] with the fully connected network to achieve class determination and coordinate prediction. The experiments undertaken by Yang et al. [55] proved that the attention mechanism can reduce the influence of a complex background, to some extent, but SCRDet requires additional mask labels. Other studies [59][60][61] have focused on using the multi-scale feature map fusion method to solve the problem of the sharp change of the target scale in remote sensing images, and have predicted the rotated rectangle bounding boxes based on two-stage pipelines. All of the above works have attempted to circumvent the problem of the rectangular region proposals generated based on the anchor mechanism in a two-stage network containing both the object and the background, which causes the features extracted based on the R-CNN mechanism to also contain background noise. Using these features increases the difficulty of predicting a rotated bounding box that just surrounds the target, as shown in Figure 2. To solve this problem, Lei et al. [62] and Zhang et al. [63] used anchors with angles to extract the region proposals. In addition, Liu et al. [64] used an affine transformation to transform the target's minimum rotated rectangle area in a region proposal predicted by the normal anchor mechanism (computer screen coordinates). Even more subtly, Ding et al. [65] proposed a framework with a rotated RoI (RRoI) learner and a rotated-position sensitive RoIAlign (RPS-RoI-Align) module to transform a horizontal region of interest (HRoI) into a rotated region of interest (RRoI) and extract rotation-invariant features.

Saliency Detection
The purpose of saliency detection, which is based on cognitive studies of visual attention, is to obtain the typical objects or regions in an image which are different from most of the background. The early saliency detection methods were mainly based on hand-crafted features. These methods only require prior knowledge in the form of the existing datasets, and they do not require extensive manual labeling of samples. For example, Xie et al. [66] and Qi et al. [67] expressed the saliency based on local features. An alternative approach is to establish the rarity of image regions relative to their local surroundings by the use of a local contrast method. Correspondingly, some methods emphasize the relationship between the global pixels to predict the saliency map [68][69][70].
Among the different methods, the histogram-based contrast (HC) method developed by Cheng et al. [70] focuses on bottom-up data-driven saliency detection using the histogram contrast of the imagery. Inspired by biological vision, the HC method defines saliency values for the image pixels using the color statistics of the whole input image.
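To make the mechanism concrete, the following is a minimal NumPy sketch of histogram-based contrast saliency in the spirit of the HC method: every pixel's saliency is the frequency-weighted distance between its color and all other colors in the image. The RGB quantization (rather than the Lab space used in [70]) and the bin count are our own illustrative simplifications, not the authors' configuration.

```python
import numpy as np

def hc_saliency(image, bins=12):
    """Illustrative HC-style saliency: each pixel's saliency is the
    frequency-weighted color distance of its quantized color to all
    other colors in the image (cf. [70])."""
    # Quantize each channel into `bins` levels and flatten to a color index.
    q = (image.astype(np.float32) / 256.0 * bins).astype(np.int32)
    idx = q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]
    # Frequency of each quantized color.
    n_colors = bins ** 3
    freq = np.bincount(idx.ravel(), minlength=n_colors).astype(np.float32)
    freq /= freq.sum()
    # Representative color of each bin (bin indices used as coordinates).
    centers = np.stack(np.meshgrid(*[np.arange(bins)] * 3, indexing="ij"),
                       axis=-1).reshape(-1, 3).astype(np.float32)
    # Per-color saliency: frequency-weighted distance to every other color.
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    color_sal = (dist * freq[None, :]).sum(axis=1)
    # Map per-color saliency back to pixels and normalize to [0, 1].
    sal = color_sal[idx]
    return (sal - sal.min()) / (np.ptp(sal) + 1e-8)
```

Note that the original method additionally smooths the per-color saliency values across similar colors, which this sketch omits.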
Beyond the hand-crafted feature based methods, deep learning has also shown its great potential in saliency representation. For example, Hou et al. [71] optimized a skip-layer structure by introducing short connections to it within the holistically nested edge detector architecture based on fully convolutional networks (FCNs) [72], and Sun et al. [73] proposed an object detector for aerial images based on saliency detection, namely SBL-RetinaNet, which consists of a codec branch and a saliency prediction branch, and is optimized by the focal loss [9]. Although the saliency detection methods based on deep learning have shown great promise, this kind of method needs to introduce new labels (not just rectangular bounding boxes) and cannot be quickly applied to fields without large-scale open-access datasets, such as Gaofen-2 imagery, Gaofen-3 imagery, and unmanned aerial vehicle (UAV)-borne data.

Attention Mechanism
Similar to saliency detection, the attention mechanism originates from the human visual system [57,74,75]. One typical characteristic of the human visual system is that humans tend to pay more attention to certain prominent local areas, rather than the whole scene. This attention mechanism improves the efficiency of data processing. In fact, there are still some differences between the attention mechanism and saliency detection. Saliency detection involves obtaining certain regions or features in the image from the characteristics of the image itself, so lower-level feature descriptors such as the histogram of oriented gradients (HOG) are often used. However, the attention mechanism is more flexible, and can consider the spectral, structural, and channel information, plus other aspects [76].
Recently, there have been several attempts to incorporate attention processing to improve the performance of neural networks in large-scale classification and detection tasks. For example, Wang et al. [77] proposed an attention module, namely the residual attention network, which can be incorporated in state-of-the-art feed-forward network architectures in an end-to-end training fashion. By combining a spatial attention module and a channel attention module through the idea of an encoder and decoder, the mixed attention mechanism not only performs well, but is also robust to noisy inputs. Each attention module is made up of a mask branch and a trunk branch. The mask branch contains fast feed-forward sweep and top-down feedback steps, while the trunk branch can be any one of the state-of-the-art pipelines. Inspired by the classical non-local means method in computer vision, Wang et al. [78] proposed a method called the non-local neural network, which can be plugged into many computer vision architectures. As for the object detection task, the use of a CNN can result in an improvement in accuracy. However, all of the architectures based on CNNs have a crucial but not easily solved problem, in that a CNN unit only processes the values within its kernel size on the feature map. This mechanism causes each operation to be carried out in a small neighborhood, without considering the influence of pixels in other regions. The idea of the non-local neural network is depicted in Figure 3. In addition, Du et al. [79] designed a new loss mechanism based on principal component analysis (PCA), on the basis of a non-local block. Furthermore, Hu et al. [80] obtained excellent results on ImageNet [81] with a kind of channel attention mechanism, called the squeeze-and-excitation module, and other methods have been improved on this basis [82,83].

Figure 3. The idea of the non-local neural network [78]. If we suppose that the size of the input feature map is T × H × W × 1024, where 1024 is the number of channels, then "⊗" denotes matrix multiplication and "⊕" denotes element-wise summation. The green boxes denote 1 × 1 × 1 convolutions.
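For reference, the following is a minimal PyTorch sketch of an embedded-Gaussian non-local block in the 1024-channel setting of Figure 3; the class name and the choice of 512 intermediate channels are illustrative assumptions, not the exact configuration of [78].

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Sketch of an embedded-Gaussian non-local block (cf. [78])."""

    def __init__(self, channels=1024, inter_channels=512):
        super().__init__()
        self.theta = nn.Conv3d(channels, inter_channels, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter_channels, kernel_size=1)
        self.g = nn.Conv3d(channels, inter_channels, kernel_size=1)
        self.out = nn.Conv3d(inter_channels, channels, kernel_size=1)

    def forward(self, x):                                  # x: [N, 1024, T, H, W]
        n, c, t, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)   # [N, THW, C']
        phi = self.phi(x).flatten(2)                       # [N, C', THW]
        g = self.g(x).flatten(2).transpose(1, 2)           # [N, THW, C']
        # Pairwise similarity between all positions ("⊗" in Figure 3).
        attn = torch.softmax(theta @ phi, dim=-1)          # [N, THW, THW]
        y = (attn @ g).transpose(1, 2).reshape(n, -1, t, h, w)
        return x + self.out(y)                             # residual "⊕"
```

Because the attention is computed between every pair of positions, each output value can depend on the whole feature map, which is exactly the long-range dependency that a fixed-kernel convolution cannot capture.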

Proposed Framework
The flowchart of the proposed object detection method is shown in Figure 4. The framework is based on the popular Faster R-CNN framework [36] with the FPN structure [37]. In order to better adapt to multi-scale object detection tasks, the RPN and Fast R-CNN network obtain the features {S2, S3, S4, S5} at four scales from the saliency pyramid, instead of the single-scale feature in Faster R-CNN [36]. The features {S2, S3, S4, S5} of the saliency pyramid are obtained by fusing the classical FPN layers {P2, P3, P4, P5} with the saliency map, as explained in Section 3.1. This saliency pyramid mechanism can make the network concentrate more on the target area, while maintaining the advantage of deep features at a lower computational cost, and it is robust to diverse scenes, while reducing the influence of complex backgrounds. We adopt ResNet101 [8] as the backbone of our framework. Then, to adapt to remote sensing images, which contain relationships between objects and scenes, we extract both the object-scene contextual information and the object-object contextual information with GA-Net, and we fuse them with the features from RoIAlign by a lightweight structure (Section 3.2). As shown in Figure 4, GA-Net first obtains the features of C6 from the last layer of the feature pyramid. The features from the deeper layers contain the relationships between targets, and represent the unique scene distribution information. In order to make full use of this, we only use the channel attention mechanism for the data dimensionality reduction, instead of the spatial attention mechanism, which reduces the computational difficulty while not losing the distribution information.
In addition, we use five parameters (Figure 5) to represent the predicted coordinates of the rotated rectangular bounding box, which is different from the standard rectangular bounding box. As Figure 1 shows, objects in remote sensing imagery are characterized by a small size and a dense distribution. Due to these traits, the commonly used horizontal rectangle coordinates can lead to the omission of objects (Figure 6). This is because the IoU of two tightly arranged oriented bounding boxes belonging to different targets is large, so we adopt an angle-sensitive IoU calculation method, which is introduced in detail in Section 3.3.

Figure 4. The flowchart of the proposed framework, which consists of: the Faster R-CNN [36] and feature pyramid network (FPN) [37] backbone; the saliency pyramid, which makes the objects more prominent in the region proposals; the global context network, with a channel attention mechanism and a lightweight feature fusion structure; and the angle-sensitive non-maximum suppression in R-CNN.

Saliency Pyramid
A typical feature of remote sensing imagery is the complex background. For example, waves on the water can affect the accuracy of the ship detection task, and complex urban scenes can reduce the detection efficiency for vehicles, etc. In order to reduce the influence of the background, we use a kind of saliency map to construct a saliency pyramid, to reduce the influence of noise. To build the saliency pyramid, we first use a method called the region contrast (RC) method [70], which is based on the HC algorithm, to process the remote sensing images. The RC algorithm first uses an effective segmentation algorithm [84] to initialize the image, and considers the influence of other regions when calculating the saliency value. By using the RC method, the influence of the background noise can be effectively reduced. The construction process is shown in Figure 8, and, as a general rule, the feature pyramid can be built by Equation (1):

$$P_i = \mathrm{Conv}(C_i) \oplus g(P_{i+1}),\ i = 2, 3, 4, \qquad P_5 = \mathrm{Conv}(C_5), \qquad P_6 = \mathrm{Pool}(P_5), \tag{1}$$

where Conv(·) denotes the convolution operation with kernel size [1, 1], Pool(·) denotes the pooling operation with stride 2, g(·) denotes the upsampling operation with a factor of 2, and ⊕ is the matrix addition operation. We then combine each pyramid level with the saliency map and use Equation (2) to implement this process:

$$S_i = \mathrm{Conv}\big((P_i;\ h(s))\big),\ i = 2, 3, 4, 5, \qquad S_6 = \mathrm{Pool}(S_5), \tag{2}$$

where Conv(·) denotes the convolution operation with kernel size [1, 1], Pool(·) denotes the pooling operation with stride 2, g(·) denotes the upsampling operation with a factor of 2, ⊕ is the matrix addition operation, s is the saliency map generated using the RC algorithm, h(·) denotes the sampling operation, and (·; ·) denotes the concatenation. After these operations, we can obtain a saliency pyramid with five feature maps, named {S2, S3, S4, S5, S6}.
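As a concrete illustration, the following PyTorch sketch implements Equation (2) as reconstructed above: each FPN level P_i is concatenated with a resampled saliency map h(s) and fused by a 1 × 1 convolution, with an extra strided level as S6. The 256-channel width and the use of bilinear interpolation for h(·) are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyPyramid(nn.Module):
    """Sketch of the saliency pyramid of Equation (2): fuse each FPN level
    with a resampled saliency map via a 1x1 convolution."""

    def __init__(self, channels=256):
        super().__init__()
        # One fusion convolution per level P2..P5; input has one extra
        # channel for the saliency map.
        self.fuse = nn.ModuleList(
            nn.Conv2d(channels + 1, channels, kernel_size=1) for _ in range(4))

    def forward(self, pyramid, saliency):
        # pyramid: [P2, P3, P4, P5]; saliency: [N, 1, H, W] from the RC method.
        outs = []
        for conv, p in zip(self.fuse, pyramid):
            s = F.interpolate(saliency, size=p.shape[-2:], mode="bilinear",
                              align_corners=False)        # h(.): resampling
            outs.append(conv(torch.cat([p, s], dim=1)))   # (P_i ; h(s)) -> Conv
        outs.append(F.max_pool2d(outs[-1], kernel_size=1, stride=2))  # S6
        return outs                                       # [S2, ..., S6]
```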

Global Attention Network
Region-based CNNs usually adopt a two-stage strategy to achieve the purpose of object detection. The first step is to extract region proposals with a high recall rate, and the second step is to use a high-accuracy algorithm to classify and predict the bounding box. As a result, each target is predicted using only its own pixel information and a very limited area of pixel information nearby. In fact, remote sensing images usually capture a large area that carries strong semantic information characterizing the captured scene. In addition, objects and scenarios are often closely correlated, such as planes being found in airports, ships being associated with water, and vehicles being closely related to roads or parking areas, etc. However, region-based CNNs do not fully exploit this potential relationship. Based on these observations, we propose a global attention network (GA-Net) that learns the global scene semantics with less computation (see Figure 9). More specifically, inspired by the convolutional block attention module (CBAM) [82], GA-Net first learns the correlation between scenes and the objects in the scene, and injects the learned features as a specific global context to compensate for the loss of object features. Instead of using the feature map from the top of the feature pyramid directly, we optimize the feature map of C5 in the channel dimension, which is similar to the band optimization in hyperspectral image processing. Equation (3) shows the mathematical process:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) \oplus \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \otimes F, \tag{3}$$

where σ(·) denotes the sigmoid function, and F is an input tensor with size [n, n, 2048], where the width = height = n and the channels = 2048. After an average-pooling layer and a max-pooling layer, we can obtain two tensors with size [1, 1, 2048], where ⊕ is the matrix addition operation and ⊗ is the matrix multiplication operation. In addition, Equation (4) shows the details of the MLP:

$$\mathrm{MLP}(\cdot) = f_{decoder}\big(f_{coder}(\cdot)\big), \tag{4}$$

where f_coder(·) denotes a convolutional layer with 2048 input channels and 16 output channels, and f_decoder(·) denotes a convolutional layer with 16 input channels and 2048 output channels. To reduce the number of parameters, the two convolution layers are used to eliminate the insignificant channels and make the feature map more inclined to describe the distribution of the scene in the spatial dimension. With GA-Net, we can make the useful information more significant, while maintaining the spatial distribution of the scene as much as possible. We do not use the spatial attention module here, because this operation would lose the spatial distribution information of the background.
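A minimal PyTorch sketch of this channel attention branch, following Equations (3) and (4) as given above, is shown below; the class name is illustrative, and the ReLU between f_coder and f_decoder is an assumption carried over from CBAM [82].

```python
import torch
import torch.nn as nn

class GlobalChannelAttention(nn.Module):
    """Sketch of Equations (3)-(4): channel attention with a
    2048 -> 16 -> 2048 bottleneck MLP built from 1x1 convolutions."""

    def __init__(self, channels=2048, reduced=16):
        super().__init__()
        self.f_coder = nn.Conv2d(channels, reduced, kernel_size=1)
        self.f_decoder = nn.Conv2d(reduced, channels, kernel_size=1)

    def mlp(self, x):                                      # Equation (4)
        return self.f_decoder(torch.relu(self.f_coder(x)))

    def forward(self, f):                                  # f: [N, 2048, n, n]
        avg = torch.mean(f, dim=(2, 3), keepdim=True)      # [N, 2048, 1, 1]
        mx = torch.amax(f, dim=(2, 3), keepdim=True)       # [N, 2048, 1, 1]
        weights = torch.sigmoid(self.mlp(avg) + self.mlp(mx))   # Equation (3)
        return weights * f                   # channel-reweighted feature map
```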
To improve the network efficiency, we use convolutional layers to build the RoIHead, instead of the usual two fully connected layers, where M_c(F) is the feature map from GA-Net (Equation (3)) and S(F′) is a feature map of the region proposals processed by the RPN and RoIAlign. We obtain a tensor with 272 channels by concatenating M_c(F) with S(F′) in the channel dimension. Finally, the tensor with 272 channels is fed into the two convolutional layers to generate a tensor with shape [1024, 1], which is used to predict the categories and bounding boxes (see Figure 10).
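The following sketch shows one plausible realization of this lightweight fusion head. The 7 × 7 RoI size and the split of the 272 channels into a 256-channel RoI feature plus a 16-channel global descriptor are our assumptions; the text above only fixes the 272-channel concatenation and the 1024-dimensional output.

```python
import torch
import torch.nn as nn

class FusionRoIHead(nn.Module):
    """Sketch of the convolutional RoIHead: concatenate the RoI feature
    with the global descriptor, then reduce to a 1024-d vector with two
    convolutions (channel split 256 + 16 = 272 is an assumption)."""

    def __init__(self, roi_size=7):
        super().__init__()
        self.conv1 = nn.Conv2d(272, 1024, kernel_size=roi_size)  # -> [N,1024,1,1]
        self.conv2 = nn.Conv2d(1024, 1024, kernel_size=1)

    def forward(self, roi_feat, global_feat):
        # roi_feat: [N, 256, 7, 7] from RoIAlign; global_feat: [N, 16, 7, 7]
        # resampled from the GA-Net branch.
        x = torch.cat([roi_feat, global_feat], dim=1)             # 272 channels
        x = torch.relu(self.conv2(torch.relu(self.conv1(x))))
        return x.flatten(1)              # [N, 1024], fed to cls/bbox predictors
```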

Angle-Sensitive Non-Maximum Suppression (NMS)
Non-maximum suppression (NMS) is widely used to reduce redundant bounding boxes in object detection. The usual calculation process is as follows:

• Step 1: All the bounding boxes are arranged in descending score order, and are referred to as set A.
• Step 2: Put the first bounding box a1 ∈ A into set B, and calculate the IoU of a1 and the other bounding boxes ai ∈ A, i ≠ 1, in order. If the IoU is greater than the threshold (usually 0.5), ai is excluded from set A; otherwise, it is skipped.
• Step 3: Select the next bounding box in order, put it into set B, and continue from Step 2 until set A is empty.
Generally speaking, the IoU can be described as the ratio of the intersection and union of two rectangular areas. For oriented bounding boxes, its mathematical expression is as shown in Equation (5), where area_inter denotes the intersection of area1 and area2:

$$\mathrm{rotated\_IoU} = \frac{\mathrm{area}_{inter}}{\mathrm{area1} + \mathrm{area2} - \mathrm{area}_{inter}}. \tag{5}$$

As Figure 11 shows, the usual rotated IoU may give a misleading value. In order to reduce the impact of this situation, we add a penalty factor $e^{|\theta_i - \theta_j|/90} = e^{|\Delta\theta|/90}$, $\theta \in [-90, 0)$, to punish the angle difference, as shown in Equation (6), where λ is a parameter used to limit the value of the IoU, to prevent over-suppression of the bounding boxes when the objects are densely arranged.

Figure 11. For this ship, the red rectangle indicates a better inference result, with a score of 0.9. The green rectangle has the same area as the blue one, and they are also inference results before the non-maximum suppression (NMS) processing, with scores of 0.8 and 0.6. When using the five-parameter coordinates, ∆θ_left represents the rotation angle between the green bounding box and the red rectangle, and ∆θ_right represents the rotation angle between the blue bounding box and the red rectangle. Using Equation (5), we can obtain a result where rotated_IoU_left < 0.5 ≤ rotated_IoU_right, even though both the blue and green boxes should be removed in the same iteration.
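The sketch below illustrates the angle-sensitive NMS procedure under one plausible reading of Equation (6), in which the rotated IoU of Equation (5) is scaled by the penalty factor e^{|∆θ|/90} and capped by λ; the helper rotated_iou, which computes Equation (5) for two oriented boxes, is assumed to be provided (e.g., by a rotated-geometry library).

```python
import math

def angle_sensitive_nms(boxes, scores, thresh=0.5, lam=1.0):
    """Angle-sensitive NMS sketch. Each box is (cx, cy, w, h, theta) with
    theta in [-90, 0). `rotated_iou(b1, b2)` (Equation (5)) is assumed."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)                 # highest-scoring remaining box
        keep.append(i)
        survivors = []
        for j in order:
            dtheta = abs(boxes[i][4] - boxes[j][4])
            # Plausible reading of Equation (6): penalize angle differences,
            # with lam capping the value for densely arranged objects.
            iou = min(lam, rotated_iou(boxes[i], boxes[j])
                      * math.exp(dtheta / 90.0))
            if iou <= thresh:            # keep boxes with little overlap
                survivors.append(j)
        order = survivors
    return keep                          # indices of the retained boxes
```

In the scenario of Figure 11, the penalty factor inflates the IoU of the blue box against the red one, so both the green and blue boxes are removed in the same iteration.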

Experiments and Results
The widely used "A Large-Scale Dataset for Object Detection in Aerial Images" (DOTA) [11] dataset was used in the experiments on oriented object detection in aerial images. In the following, we describe the DOTA dataset, the implementation details, and the ablation studies conducted with the proposed method.

Dataset
The DOTA-v1.0 release has the characteristics of diverse categories, scales, and sensor sources, so it is a very challenging task to detect objects in this dataset. Figure 12 shows the spatial resolution information of the DOTA dataset. In the experiments, the training and validation sets were both used for the training, and the test set was used for the testing. In order to reduce the memory requirement, we cropped a series of 1024 × 1024 patches with a stride of 500 from the original images. During the training, we eliminated the patches with no annotations. With these processes, we obtained 38,504 image patches. In the inference stage, the image patches were cropped from the test images with an overlap of 500 pixels between the neighboring patches. Zero padding was applied if an image was smaller than the cropped image patches. With these parameters, we obtained 20,012 patches for the test task, and the final results with NMS were submitted to the DOTA dataset evaluation system.
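The patch extraction can be sketched as follows; window placement at the image borders and the handling of partially cut annotations are simplified relative to a full pipeline.

```python
import numpy as np

def crop_patches(image, size=1024, stride=500):
    """Crop size x size windows with the given stride, zero-padding images
    smaller than one window (sketch of the DOTA preprocessing above)."""
    h, w = image.shape[:2]
    pad_h, pad_w = max(size - h, 0), max(size - w, 0)
    if pad_h or pad_w:
        image = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))
        h, w = image.shape[:2]
    patches = []
    for y0 in range(0, h - size + 1, stride):
        for x0 in range(0, w - size + 1, stride):
            patches.append((x0, y0, image[y0:y0 + size, x0:x0 + size]))
    return patches  # each entry: (x-offset, y-offset, patch array)
```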

Evaluation Metrics
In this paper, a standard and widely used measure is used for evaluating the performance of the object detection algorithms, i.e., the mean average precision (mAP). The mAP is the average of the average precision (AP) values over all the categories. In brief, an AP value can be regarded as the area enclosed by the precision-recall polyline, the x-axis, and the y-axis in a single-category precision-recall chart. In other words, the larger the area, the higher the AP value, and the better the algorithm's effect.
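As a reference, a minimal implementation of this area-under-the-polyline AP (in the common VOC style, which we assume here) is shown below; the mAP is then the mean of the per-category APs.

```python
import numpy as np

def average_precision(recalls, precisions):
    """VOC-style AP sketch: make precision monotonically non-increasing,
    then integrate the area under the precision-recall polyline.
    Inputs are assumed sorted by increasing recall."""
    # Pad the curve so it spans recall 0 -> 1.
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Precision envelope: each point takes the max precision to its right.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

For example, average_precision([0.5, 1.0], [1.0, 0.5]) returns 0.75, i.e., the area under the polyline through (0.5, 1.0) and (1.0, 0.5).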

Implementation Details
For the experiments, we built the baseline network based on Faster R-CNN [35], inspired by the FPN [37], with ResNet101 [8] pretrained on ImageNet [81]. In addition, we changed the fully connected layers of the R-CNN to convolutional layers, as in the FCN [72]. To be fair, we used (x, y, w, h, θ) instead of the two-point coordinate representation of the upper-left and lower-right points in Faster R-CNN [35]. This was done to uniformly use oriented bounding box (OBB) coordinates for the training and testing. Therefore, the difference between the baseline used in the comparative experiments described in this paper and the original Faster R-CNN [35] network lies only in the modification of the coordinate form, the FPN, and the R-CNN head.
A series of experiments were designed to better evaluate the effects of the saliency pyramid, the global attention network, and the angle-sensitive NMS proposed in this paper. The environment used was a single NVIDIA Tesla V100 GPU with 16 GB memory, along with the PyTorch 1.1.0 deep learning framework and Python 3.7. The initial learning rate was 0.0025, the batch size of the input data was 2, the value of the momentum was 0.9, the value of the weight decay was 0.0001, and mini-batch stochastic gradient descent (SGD) was used for the optimization. The maximum number of proposals after the RPN procedure was set to 2000 during both the training and testing.
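In PyTorch, this optimizer configuration corresponds to the following sketch, where the placeholder module stands in for the actual detector:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder for the GLS-Net detector
# Hyperparameters as stated above: lr = 0.0025, momentum = 0.9,
# weight decay = 0.0001, mini-batch SGD.
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                            momentum=0.9, weight_decay=0.0001)
```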

Saliency Pyramid
To evaluate the efficiency of the saliency pyramid, we simply compared the saliency pyramid + baseline with the baseline alone.
With this operation, the sizes of the feature maps we obtained from ResNet-101 were [256, 256], [128, 128], [64, 64], [32, 32], and [16, 16], and these feature maps were sent to the RPN and the RoI extractor. Compared with the baseline in Table 1, the saliency pyramid achieves a higher mAP of 69.03% on the DOTA dataset. Figure 13 shows the saliency maps from the RC method [70], which were used in the saliency pyramid. In fact, in a very small area around an object (region proposal), the edges of the objects are more prominently represented, and the water, grass, and roads around targets such as boats, sports grounds, and vehicles are weakened. Here, we simply concatenate the feature maps; matrix multiplication can also achieve a similar effect. Figure 14 shows the detection results from the baseline, the saliency pyramid, and the global attention network. From columns (a), (b), and (e), we can see that, due to the use of the saliency features, the difference between the objects and the background in the local area increases, which improves the AP values. The AP values of the saliency pyramid for the small vehicle, the large vehicle, and the storage tank outperform those of the baseline by 3.33%, 8.54%, and 2.79%, respectively. As for column (h), because the outline of the plane is relatively clear on the saliency feature map, a false detection in the upper part of the image is successfully eliminated, and a false detection of a baseball diamond in column (b) is removed for the same reason. However, the images in column (c) show the limitations of the saliency pyramid method, namely, a false detection labeled as a vehicle in the lower part of the image. Due to the lack of semantic information between the scene and the object, the salient object on the water was erroneously detected, and a similar error occurred in the middle area of the image in column (f).

Table 1. Comparisons with the state-of-the-art detectors on the DOTA test dataset [11]. The baseline is the Faster R-CNN detector with the FPN and the convolutional head. GA means a network with only the global attention network based on the baseline. SP means a network with only the saliency pyramid based on the baseline. GA+SP means that the global attention network, the saliency pyramid, and the baseline were used at the same time. Finally, GLS-Net is a network adding the angle-sensitive NMS to GA+SP. In addition, the specific meanings of the following abbreviations are: plane, baseball diamond (BD), bridge, ground track field (GTF), small vehicle (SV), large vehicle (LV), ship, tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), harbor, swimming pool (SP), and helicopter (HC). The bold numbers denote the highest values in each column.

Figure 13. The saliency maps based on the RC algorithm [70]. The images in the top row come from the "A Large-Scale Dataset for Object Detection in Aerial Images" (DOTA) dataset [11], and the images in the bottom row are the results of the RC algorithm. In addition, (a-h) are eight different scenarios. It can be seen that the saliency algorithm can reduce the interference of the background information to a certain extent. In the local area, the difference between the object and the background increases, and the structural information of the object is highlighted.

Global Attention Network
In order to demonstrate the effectiveness of the global attention network, we simply compared GA-Net + baseline with the baseline. From Table 1, GA-Net obtains a mAP of 69.14%. Compared with the saliency pyramid + baseline (mAP = 69.03%), GA-Net performs better in situations where the scene is closely related to the target. As Figure 14 shows, the results of GA-Net in columns (a), (c), (d), (e), and (g) are better than the results of the baseline. On the one hand, the increase in the number of parameters makes the network more representational. On the other hand, the use of the scenario information allows the network to find targets that should normally be closely associated with the scenario. Therefore, the number of missed objects is reduced. Moreover, compared to the saliency pyramid, GA-Net is better at reducing false detections through the scene semantic constraints. For example, in Figure 14, a false bounding box produced by the saliency pyramid in the lower part of the image in column (c) is removed by GA-Net, because, in normal scenes, a vehicle is unlikely to appear on the water. In addition, in column (c) of the baseline, the guardrail along the river is mistakenly detected as a harbor and a ship, which is effectively eliminated due to the use of the scene information by GA-Net. Compared to the results of the saliency pyramid in columns (g) and (h), neither the false basketball court bounding box nor the false plane bounding box could be removed by the scene semantic constraints. Therefore, in local areas, the saliency pyramid has a better effect. In addition, when the objects are densely arranged or the background is very complex, the use of the contextual information increases the difficulty of the object detection, compared to the saliency pyramid. This explains why the saliency pyramid, which is used to remove the effect of the background, performs better in the scene of swimming pools in the villa area, the scene of densely arranged cars, and the scene of oil pipelines in the complex building area. Even though there are problems in some scenarios, GA-Net still achieves a good performance relative to the baseline (mAP = 66.88%). Figure 15 shows the encoded scenario-object semantic information.

Figure 15. Visualization of the encoded scenario-object semantic information from the global attention network (GA-Net). The four feature maps from different channels behind each image visualize the responses of the different objects, and these feature maps collectively reflect the semantic relationship between the scene and the objects. The more biased to blue, the lower the response value. In this way, the GA-Net branch obtains the distribution information of each part of the scene and encodes the information between the scene and the objects.

Angle-Sensitive IoU
To evaluate the efficiency of the angle-sensitive IoU algorithm, we trained GA + saliency pyramid and GLS-Net (GA + saliency pyramid + angle-sensitive IoU), respectively, for the comparative experiments. The difference between the two networks is that GLS-Net uses the angle-sensitive IoU (Equation (6)) and GA + saliency pyramid uses the rotated IoU (Equation (5)) during the inference. λ = 1 was set for GLS-Net. From Table 1, it can be seen that GLS-Net with the angle-sensitive IoU achieves a mAP of 72.96%, which is slightly lower than the mAP of 72.99% obtained without the angle-sensitive IoU. In addition, the angle-sensitive IoU algorithm has a greater impact on objects with large differences in length and width, such as aircraft, ships, vehicles, and bridges. Although this experiment shows that the use of the angle-sensitive IoU causes the mAP value to decrease slightly, compared with the use of the horizontal IoU, the algorithm is more in line with the idea of the rotated rectangle angle difference constraint. Furthermore, most of the time, compared with a detector using the horizontal IoU, the mAP of a detector with the rotated IoU may be about 1.5% lower [85]. Table 1 shows the experimental results obtained with the DOTA test dataset and the comparisons with the state-of-the-art algorithms using the official evaluation website (https://captain-whu.github.io/DOTA/evaluation.html), because the annotated labels of the test set are not publicly available. To be fair, ResNet-101 was used as the backbone network for all the methods in Table 1. Among them, because the multi-scale feature integration attention rotation network (MFIAR-Net) [89] does not report detailed AP values for each category with the ResNet101 backbone, only the mAP value is used to express its performance in Table 1. In addition, Figure 16 shows the detection results of GLS-Net.

Comparison with the State-of-the-Art
Nine state-of-the-art detectors were used for comparison, including the Faster R-CNN detector (FR-O) provided by DOTA [11] and the rotation region proposal networks (RRPN) [86] with a rotated anchor design. R2CNN indicates the rotational region CNN [54], which was built with a horizontal anchor extractor and horizontal RoIPooling, but R2CNN uses three pooling sizes before the R-CNN head, and it can predict the oriented bounding boxes and horizontal bounding boxes at the same time. R-DFPN is the rotation dense feature pyramid network, and the method of Yang et al. [87] also belongs to the R-DFPN category. The RoITransformer method is an HRoI-based method with the RRoI warping operation. The method of Azimi et al. [61] is a kind of cascade structure with a multi-scale RRoI warping operation. MFIAR-Net [89] is a kind of feature pyramid network with an attention mechanism and a multi-scale feature fusion method. SCRDet [55] uses an inception module to fuse multi-scale feature maps, and uses an MDA-Net to achieve pixel attention and channel attention at the same time. As can be observed in Table 1, the proposed detector, i.e., GLS-Net, outperforms FR-O [36,85], RRPN [85,86], R2CNN [54,85], the method of Yang et al. [87], R-DFPN [85,88], RoITransformer [85], the method of Azimi et al. [61], MFIAR-Net [89], and SCRDet [55] by 18.86%, 11.98%, 12.32%, 10.7%, 15.05%, 3.43%, 4.83%, 0.48%, and 0.38%, respectively. In addition, the proposed method also achieves good results in each category. The experimental results fully prove that the proposed GLS-Net detector shows a superior performance, compared to the existing state-of-the-art methods.
In conclusion, GLS-Net builds the saliency pyramid to make the object area and structural information more prominent in the local area of the image, reducing the interference of the complex background around the objects. Furthermore, GLS-Net uses the global attention network to obtain the scene semantic information of the input image, and associates the objects with the semantic information of the surrounding environment. The use of the scene semantic information can eliminate obvious false detections in some scenes, such as vehicles on the water and the guardrail along the coast. It also improves the detection accuracy for the objects closely related to the scenes, such as airports and airplanes. Finally, the angle-sensitive IoU makes a certain contribution to the calculation of the oriented bounding boxes of objects with large differences in length and width, such as ships, large vehicles, and bridges.

Conclusions
In this paper, we have proposed an arbitrary-oriented object detection algorithm, namely GLS-Net, which is effective for detecting objects oriented in different directions in aerial images. The framework combines a saliency pyramid, a global attention sub-network that can capture the semantic information from scene to object, and an angle-sensitive NMS method to obtain more accurate oriented bounding boxes. The experiments undertaken with the public DOTA dataset confirmed the remarkable performance of the proposed method. Despite this, GLS-Net still suffers from missed detections and inaccurate bounding boxes. In addition, although the background noise information can be suppressed by the saliency pyramid, whether this step can be replaced by a network structure remains to be studied. In our future work, we will pay more attention to the expression of the high- and low-frequency information in aerial images. In practical applications, noise greatly affects the detection accuracy and, at the same time, causes the failure of some network elements, resulting in a waste of computing resources. Therefore, we will consider encoding the high-frequency information and the low-frequency information in the convolution process, and we will attempt to reduce the influence of noise through an encoder-decoder structure.