Eagle-Eye-Inspired Attention for Object Detection in Remote Sensing

Abstract: Object detection has highly significant applications in the field of optical remote sensing images. A great many works have achieved remarkable results in this task. However, some common problems, such as scale, illumination, and image quality, remain unresolved. Inspired by the cascade attention mechanism of the eagle-eye foveae, we propose a new attention network named the eagle-eye fovea network (EFNet), which contains two foveae for remote sensing object detection. The EFNet consists of two eagle-eye fovea modules: front central fovea (FCF) and rear central fovea (RCF). The FCF is mainly used to learn the candidate object knowledge based on the channel attention and the spatial attention, while the RCF mainly aims to predict the refined objects with two subnetworks without anchors. Three remote sensing object-detection datasets, namely DIOR, HRRSD, and AIBD, are utilized in the comparative experiments. The best results of the proposed EFNet are obtained on the HRRSD with a 0.622 AP score and a 0.907 AP50 score. The experimental results demonstrate the effectiveness of the proposed EFNet for both multi-category datasets and single-category datasets.


Introduction
Optical remote sensing images contain a large amount of scene information and intuitively reflect the shape, color, and texture of objects. Referring to specific algorithms, object detection of optical remote sensing images aims to search for and locate the objects of interest, such as aircraft, tanks, ships, and vehicles. Typical applications are urban planning, land use, disaster survey, military monitoring, and so on [1,2]. With the rapid development of observation technologies, the resolutions of acquired remote sensing images are becoming higher and higher. These high-resolution remote sensing images can provide detailed high-quality information that offers great opportunities to develop object-level applications. The characteristics and challenges of remote sensing images are summarized as follows: large scale, diverse direction, various shapes, and complex background. A multitude of works have aimed to theoretically and practically solve these problems [3].
The early object-detection algorithms for optical remote sensing images were mostly based on manually designed features [4][5][6][7][8][9]. Usually, candidate regions were first extracted, and then the features were manually designed for the objects. Finally, the object categories were determined by certain classifiers. Typical strategies included the use of prior regions, template matching, feature classification, selective search, etc. Drawing on human perception of object locations, some methods learned prior knowledge of candidate regions. This strategy is widely used in some representative applications, including the segmentation of ocean and land for ship detection and airport detection for aircraft detection. To separate the sea surface, Antelo et al. [4] utilized the active contour method by constructing and minimizing the energy function. Some methods adopted the idea of template matching and matched the candidate features against a template library of objects. Liu et al. [7] proposed a coarse-to-fine aircraft detection method. First, template matching is used to find the candidate areas of aircraft, and then principal component analysis (PCA) and a kernel density function are used to identify each area. Xu et al. [6] generated a ship shape library based on the Hough transform and used the sliding window method to calculate the feature similarity between each window region and the shape library. The feature classification-based methods [10,11] usually extract the sliding window features first, and then certain classifiers are designed to predict the sliding image patches. Zhang et al. [12] used a sliding window to generate windows of different sizes and aspect ratios and extracted the visual features for each window. A cascaded support vector machine (SVM) is then applied to complete the extraction of candidate regions.
The frequently used tool of selective search-based methods is segmentation, which applies a similarity-merging strategy to obtain large areas. Aiming to capture possible object locations, Uijlings et al. [13] applied the appearance structure to guide the sampling process for the selective search. To reduce the search space, Liu et al. [14] analyzed the possibility of covering ships by rotated bounding boxes. In addition, a small number of potential candidates with high scores are found by a multi-cascaded linear model.
The methods mentioned above mostly adopted traversal search, which involves redundant computation, and they cannot cope with the complex and changeable environment of remote sensing images. Therefore, a great many algorithms have also tried to address the aspect of feature extraction. Feature extraction is the most critical step that directly affects the performance and efficiency of a detection algorithm. The commonly used features in object detection of remote sensing images include the color feature, the texture feature, the edge shape, and the context feature. To overcome the variable characteristics of the sea environment, Morillas et al. [15] proposed using block color and texture features for ship detection. In order to detect buildings, Konstantinidis et al. [16] combined a first module of enhanced HOG-LBP features with a second module of region refinement processes. The texture feature is a visual feature that describes the homogeneity of the image, reflecting the slow change or periodic change of the object surface structure. Brekke et al. [17] conducted oil-spill detection based on the different texture characteristics between the sea surface area and the sea surface oil-slick area. In addition, the edge features reflect the object edge and shape information. To facilitate object detection, edge shape features are usually required to be invariant in scale, translation, and rotation. Sun et al. [18] extracted SIFT features from the sliding window and used the bag of words (BoW) model for classification. Cheng et al. [19] extracted binarized normed gradients (BING) for each window and used weighted SVM classifiers to improve the calculating speed. Tong et al. [20] also used SIFT features for the ship candidate areas. After extracting candidate ships, Shi et al. [21] extracted HOG (histograms of oriented gradients) features for each region. Then an AdaBoost classifier was adopted to screen and classify candidate regions.
To improve the rotation invariance of the HOG feature, Zhang et al. [22] utilized part models to generate rotation-invariant features. Moreover, the context feature, which mainly represents the spatial position relation of sequential topology adjacency between different instances, is also worthwhile [23][24][25]. On the basis of active contour segmentation, Liu et al. [23] introduced an energy function method to complete the separation of the sea. The ships are detected using context analyses and shape description. Using Markov random fields (MRF), Gu et al. [24] modeled the spatial position relations of objects to discriminate the object categories.
However, the adaptation range and robustness of traditional object-detection algorithms are limited, making them difficult to apply in complex environments of remote sensing images. With the thriving development of deep learning, the deep features extracted by a neural network have a stronger semantic representation ability and discrimination [26,27]. To handle diverse object orientations, some object-detection methods augment the training image samples [28,29]. Cheng et al. [30] optimized a new objective function by introducing regularized constraints to achieve rotation invariance. Later on, Cheng et al. [31] also added a rotation-invariant regularizer to convolutional neural network (CNN) features by an objective function that can force tight mapping of feature representations to achieve rotation invariance. The ORSIm detector [32] adopted a novel space-frequency channel feature (SFCF) to deal with the rotation problem. This method comprehensively considers the rotation-invariant features from both the frequency domain and the spatial domain. To handle small-scale objects [33,34], Zhang et al. [35] up-sampled candidate regions that were extracted in the previous stage. Replacing the standard convolution, Liu et al. [36] used dilated convolution to reduce parameters for the same receptive field. However, dilated convolution could cause the loss of local information. Wang et al. [37] improved the loss function to increase the training weight of small objects by combining with shallow information. The R3Det [38] improved the positioning accuracy of dense objects by adding fine-tuning modules to ensure the alignment of object features and object centers. Some works also aim to improve the adaptation to various object scales [39][40][41][42][43]. Based on the Faster R-CNN [44], Zhang et al. [41] introduced a candidate region extraction network to detect objects of different scales.
A full-scale object-detection network (FSD-NET) was proposed in [42], and this network contained a backbone with a multi-scale enhanced network. In [43], a global component to a local network (GLNet) was also proposed, and the spatial contextual correlations were encoded by the long short-term memory with a clip. Given that horizontal bounding boxes are not friendly to oriented objects, a large number of works adopted oriented quadrangles to surround the objects [45][46][47][48][49]. Zhu et al. [46] proposed an adaptive-period-embedding (APE) method to represent oriented objects of aerial images. Instead of regressing the four vertices of oriented objects, an effective and simple framework was proposed in [48]. In this framework, the vertex of horizontal bounding boxes on each corresponding side is glided to the oriented object. Different remote-sensing sensors possess the benefits of complementary information; hence, the works [50,51] are based on deep neural networks and integrate several features to obtain an overall performance improvement.
The human visual mechanism possesses the ability to focus on a salient region with obvious visual features, ignoring irrelevant background. Therefore, the attention mechanism is the most frequently used technique to improve the semantic representation [52][53][54]. To reduce the detection area, Song et al. [55] utilized color, direction, and gradient information to extract visual features and extracted ship regions according to saliency characteristics. To determine a potential airport, Yao et al. [8] adopted saliency regions to extract scale invariant feature transform (SIFT) features. In [56], the authors proposed a convolutional block attention module which consists of a channel attention module and a spatial attention module. Wang et al. [57] used a multi-scale attention structure with a residual connection to meet the scale change. For multi-category detection, Wang et al. [45] also adopted a semantic attention-based network to extract the semantic representation of the oriented bounding box. In light of the densely distributed objects, the SCRDet [58] added a pixel attention mechanism and a channel attention mechanism. With respect to the loss function, Sun et al. [59] proposed an adaptive saliency-biased loss (ASBL) for both the image level and the anchor level. In addition, the SCRDet++ [60] indirectly used the attention mechanism to improve the boundary differentiation of dense objects. Similarly, the work [61] used the density saliency attention to detect clustered buildings.
Although there are many good attention-based approaches for the object detection of remote sensing images, the robustness problem is not yet completely solved. Therefore, in this paper, we propose a novel structure aiming to learn more robust and accurate object classification and positioning for remote sensing images. This framework is inspired by the eagle eye, which has a complementary and exchangeable mechanism between its two foveae. The main contributions are as follows: (1) We propose a new architecture named the eagle-eye fovea network (EFNet) to detect objects in remote sensing images. This architecture is inspired by the vision attention mechanism and the cascade attention mechanism of eagle eyes. (2) Two eagle-eye fovea modules, front central fovea (FCF) and rear central fovea (RCF), are included in the EFNet. The FCF mainly aims to learn the candidate-object knowledge based on the channel attention and the spatial attention, while the RCF mainly aims to predict the refined objects with two subnetworks without anchors. (3) The two central foveae possess a complementary mechanism. The experimental results on three public datasets for object detection in remote sensing images demonstrate the effectiveness of the proposed architecture and method.
The remaining sections of this paper are organized as follows. Some related works are reviewed in Section 2. The proposed methodology is introduced in Section 3. Section 4 shows the experimental results. A discussion follows in Section 5. Finally, Section 6 includes our conclusion.

The Mechanism of Eagle Eye
The eagle possesses extremely keen vision which can be used to locate prey. Once the prey is found, the eagle will quickly track the prey until it is captured [62][63][64][65]. The eagle's keen vision is inseparable from its foveae. The density of photoreceptors in an eagle's foveae is several times higher than that of human eyes. The resolution of an eagle's eyes is positively correlated with the density of photoreceptors [66].
An eagle has two foveae in each eye, one deep and one shallow. The deep fovea has higher visual acuity than the shallow fovea. Figure 1 shows the structure of an eagle's eyes and the two foveae. Since each eagle eye has two central foveae and their observation directions are different, the field of vision (FOV) of the eagle eye is very large. The FOV of the eagle eye in the horizontal direction (excluding the blind area) can reach 260 degrees. In the vertical direction, the FOV of the eagle eye can also reach 80 degrees. In the process of predation, the flight path of the eagle is generally not straight because the eagle is usually far away from the prey during predation, which requires the eagle's side vision [67]. Therefore, the eagle can easily observe rabbits on the ground from thousands of meters in the air. Inspired by the eagle's eyes, Abimael et al. [68] designed a parallel structure with two CNN submodules to detect moving objects. The authors claimed that one CNN was used to perceive the context from videos, and the other CNN was used to focus on small objects or details.
However, the foveae of eagles cannot observe objects at the same time. Rather, they constantly switch from one to another, and the deep foveae observe objects on the side, while the shallow foveae observe objects on the front.
The switch mechanism of FOV is more like a cascade structure, not a parallel structure. Moreover, the eagle's viewpoint is similar to that of remote sensing observation, such as from aircraft or satellites. Hence, the eagle-eye mechanism can give us some inspiration to explore a possible method of cascade structure for remote sensing object detection.

The Attention Module of CBAM
The attention mechanism is the most frequently used technique to improve semantic object saliency [53]. To suppress the characteristics of a complex background, an improved attention region proposal network (A-RPN) was used to predict the object's location. As shown in Figure 2, the feature maps are fed into the convolutional block attention module (CBAM) network [56]. The CBAM is composed of two complementary modules, including a channel attention module and a spatial attention module. These modules can suppress the features of a complex background and highlight the features of objects. Among them, the channel attention module focuses on what the object is by assigning greater weight to channels containing more object information and smaller weight to channels containing more background information.
In Figure 2, the input feature maps are denoted as $F \in \mathbb{R}^{C \times H \times W}$. After the channel attention module, the channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ is obtained, and the input feature $F$ is weighted by $M_c$ to obtain the refined feature $F'$. The spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$ is then obtained through the spatial attention module. The final output $F''$ is calculated by multiplying $M_s(F')$ and the feature $F'$. These formulas are as follows:
$$F' = M_c(F) \otimes F,$$
$$F'' = M_s(F') \otimes F',$$
where $\otimes$ represents element-wise multiplication, $C$ represents the number of channels of the input feature, and $W$ and $H$ represent the width and height of the feature map.
The channels with useful object feature information are selected, while the spatial attention module tells the network where the objects are and helps it locate objects in the feature maps. First, the feature $F$ is obtained after a 3 × 3 convolution of the input feature map. Next, the refined feature $F''$ is obtained by the CBAM. Therefore, the A-RPN can carry out more accurate object classification and position regression. The CBAM is regarded as a universal module and can easily be connected to the convolutional blocks.
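The two-step refinement described above (channel weighting first, spatial weighting second) can be sketched in a few lines of plain Python. This is only an illustration of the broadcasting pattern: the function names, the toy 2 × 2 × 2 feature, and the attention values are assumptions chosen for readability, not values produced by a trained CBAM.

```python
def apply_channel_attention(feature, channel_weights):
    """F' = M_c(F) * F: every value in channel c is scaled by channel_weights[c]."""
    return [[[channel_weights[c] * v for v in row] for row in feature[c]]
            for c in range(len(feature))]


def apply_spatial_attention(feature, spatial_weights):
    """F'' = M_s(F') * F': position (h, w) of every channel is scaled by spatial_weights[h][w]."""
    return [[[spatial_weights[h][w] * channel[h][w] for w in range(len(channel[h]))]
             for h in range(len(channel))]
            for channel in feature]


# Toy feature map with C = 2, H = 2, W = 2 (hypothetical values).
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]
M_c = [0.5, 1.0]                  # one weight per channel, shape C x 1 x 1
M_s = [[1.0, 0.0], [0.5, 1.0]]    # one weight per position, shape 1 x H x W

F_prime = apply_channel_attention(F, M_c)
F_double_prime = apply_spatial_attention(F_prime, M_s)
print(F_double_prime)
```

The order matters for the cascade: the spatial map is applied to the channel-refined feature, not to the raw input.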

The Proposed Methodology
In this section, the proposed eagle-eye fovea network (EFNet) will be introduced in detail. The architecture of the EFNet is shown in Figure 3. First, the whole network architecture is introduced in Section 3.

Network Architecture
Inspired by the two central foveae of eagle eyes and the precision conversion mechanism, we developed a similar vision network for object detection in remote sensing images. The proposed EFNet consists of two eagle-eye central foveae: front central fovea (FCF) and rear central fovea (RCF). The framework of the proposed methodology is shown in Figure 3. For an image, the feature maps are obtained through a backbone network augmented with the CBAM attention module. This module, as the FCF, is used to improve the saliency of the candidate objects in the feature pyramid networks (FPN). The FoveaBox, as the RCF, is used to propose the most probable object areas, which are then used for classification and box prediction.
It was introduced in Section 2.1 that eagles cannot use both foveae to simultaneously observe objects, but they can switch between the two foveae at any time. The deep fovea is used to observe objects on the side, and the shallow fovea is used to observe objects on the front. This mechanism can be regarded as a cascade mechanism. Inspired by this mechanism, the FCF and RCF are also arranged in cascade. Therefore, these two central foveae are designed to possess complementary abilities for object detection in remote sensing images.

Front Central Fovea
In this subsection, we introduce the structure of the front central fovea (FCF). CBAM has been shown to be applicable across different architectures and different tasks, and it can be seamlessly integrated into other CNN architectures to enhance the network. Therefore, the CBAM is integrated into the ResNet block [69] in our structure, as shown in Figure 4.
The channel attention module focuses on what the object is. Average pooling (AvgPool) and maximum pooling (MaxPool) are utilized to extract two kinds of features, denoted as $F^c_{avg}$ and $F^c_{max}$. Both features are fed into the same shared network, and the two outputs are summed element-wise. Then, the channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ is obtained by the sigmoid activation function as follows:
$$M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) = \sigma(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))),$$
where $\sigma$ represents the sigmoid function, $W_0 \in \mathbb{R}^{C/r \times C}$, $W_1 \in \mathbb{R}^{C \times C/r}$, and $r$ is the reduction ratio. The MLP represents the multi-layer perceptron of the shared network. The features between $W_0$ and $W_1$ are processed by the ReLU. Finally, $M_c(F)$ is multiplied by its input features to obtain a refined feature map $F'$ adjusted by channel attention.
The spatial attention module focuses on where the object is, i.e., the spatial location of the object on the input feature map. The input of spatial attention is the output $F'$ of the channel attention module, and the feature maps $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$ are obtained through average pooling and maximum pooling along the channel axis. Using a 7 × 7 convolution kernel and the sigmoid function, the new spatial attention map $M_s$ is obtained as follows:
$$M_s(F') = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')])) = \sigma(f^{7 \times 7}([F^s_{avg}; F^s_{max}])),$$
where $\sigma$ denotes the sigmoid activation function, and $f^{7 \times 7}$ is the 7 × 7 convolution kernel.
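To make the two attention maps concrete, the following is a minimal pure-Python sketch of how they could be computed on a tiny feature map, following the standard CBAM formulation [56]. Everything numeric here is an assumption for illustration: the weights `w0`, `w1`, and `kernel` are arbitrary rather than learned, the reduction ratio is taken as r = 2, bias terms are omitted, and the 7 × 7 filter is applied with zero padding as a cross-correlation, which is how deep learning frameworks usually implement convolution.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def shared_mlp(vector, w0, w1):
    """Shared two-layer perceptron: W1(ReLU(W0 v))."""
    hidden = [max(0.0, h) for h in matvec(w0, vector)]
    return matvec(w1, hidden)


def channel_attention(feature, w0, w1):
    """M_c(F): one weight per channel from average- and max-pooled descriptors."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    mx = [max(map(max, ch)) for ch in feature]
    a, m = shared_mlp(avg, w0, w1), shared_mlp(mx, w0, w1)
    return [sigmoid(x + y) for x, y in zip(a, m)]


def spatial_attention(feature, kernel, k=7):
    """M_s(F'): one weight per position; kernel[0] filters the channel-wise
    average map and kernel[1] the channel-wise maximum map."""
    C, H, W = len(feature), len(feature[0]), len(feature[0][0])
    avg = [[sum(feature[c][h][w] for c in range(C)) / C for w in range(W)] for h in range(H)]
    mx = [[max(feature[c][h][w] for c in range(C)) for w in range(W)] for h in range(H)]
    pad = k // 2
    out = []
    for h in range(H):
        row = []
        for w in range(W):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    hh, ww = h + i - pad, w + j - pad
                    if 0 <= hh < H and 0 <= ww < W:  # zero padding outside the map
                        acc += kernel[0][i][j] * avg[hh][ww] + kernel[1][i][j] * mx[hh][ww]
            row.append(sigmoid(acc))
        out.append(row)
    return out


# Toy input with C = 2, H = W = 2 and hypothetical, untrained weights.
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]
w0 = [[0.5, 0.5]]        # shape (C / r) x C with r = 2
w1 = [[0.1], [-0.1]]     # shape C x (C / r)
kernel = [[[0.01] * 7 for _ in range(7)] for _ in range(2)]

M_c = channel_attention(F, w0, w1)
F_prime = [[[M_c[c] * v for v in row] for row in F[c]] for c in range(len(F))]
M_s = spatial_attention(F_prime, kernel)
print(M_c, M_s)
```

In a real network these operations would run on tensors inside each ResNet block; the nested lists only serve to expose the pooling axes (spatial for the channel map, channel for the spatial map).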

Rear Central Fovea
The rear central fovea is described in this subsection, and this module mainly refers to the FoveaBox [70]. The FoveaBox is an accurate, flexible, completely anchor-free object-detection framework. Unlike previous anchor-based methods, the FoveaBox directly learns the possibility of an object's existence and the bounding box coordinates without reference to anchors. This is achieved by: (a) predicting class-sensitive semantic maps for the object possibility; (b) generating bounding boxes for each location that might contain an object. As a result, the rear central fovea of the framework mainly utilizes the setting of the FoveaBox. The FoveaBox has five feature pyramid levels $P_l$ ($l = 3, 4, \ldots, 7$), each feeding its own subnets, and the output feature map of level $l$ has a scale of $1/2^l$. Due to the wide range of object scales, the FoveaBox adopts different levels to predict objects of different sizes. The basic areas of the five levels are set as $S_l = 4^l S_0$ with $S_0 = 16$ ($l = 3, 4, \ldots, 7$), which range from $32^2$ to $512^2$. To control the overlap between different levels, parameter $\eta$ is added to adjust the scale range of each level. With the valid range $[S_l / \eta^2, S_l \cdot \eta^2]$, one object may be detected at multiple levels.
The object prediction is performed at each single FPN level. Two branch networks of the object prediction network are shown in Figure 5 and are adopted at every level. One is for predicting categories, and the other is for predicting bounding boxes. The output of the classification subnet is W × H × C (C is the count of the feature level channels), and the output of the box prediction subnet is W × H × 4. Next, non-maximum suppression (NMS) is applied to each category with a threshold of 0.5. Finally, the 100 predictions with the highest scores are selected for each image.
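The level assignment described above can be checked with a short sketch. It assumes the FoveaBox convention that a box is assigned to level l when its area falls inside [S_l / η², S_l · η²]; the value η = 2 and the function names are illustrative assumptions, not settings reported in this paper.

```python
S0 = 16  # basic area constant stated in the text


def basic_area(level, s0=S0):
    """S_l = 4**l * S_0, e.g. 32**2 for l = 3 and 512**2 for l = 7."""
    return (4 ** level) * s0


def valid_area_range(level, eta=2.0):
    """Area interval [S_l / eta**2, S_l * eta**2] handled by pyramid level l."""
    s = basic_area(level)
    return s / eta ** 2, s * eta ** 2


def assigned_levels(box, eta=2.0, levels=range(3, 8)):
    """Return every pyramid level whose valid range contains the box area."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    selected = []
    for level in levels:
        low, high = valid_area_range(level, eta)
        if low <= area <= high:
            selected.append(level)
    return selected


print([basic_area(l) for l in range(3, 8)])
print(assigned_levels((0, 0, 100, 100)))  # a 100 x 100 box falls into two levels
```

A larger η widens each interval, so neighbouring levels overlap more and a single object is trained on more levels; η = 1 would give each level only its exact basic area.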
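The post-processing step (per-category NMS at an IoU threshold of 0.5, then keeping the 100 highest-scoring predictions) can be sketched as follows. This is a plain greedy NMS written for illustration; the tuple layout `(box, score, category)` and the function names are assumptions, and production code would normally rely on a vectorized library routine instead.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_per_category(detections, iou_threshold=0.5, max_per_image=100):
    """Greedy NMS: visit detections by descending score and drop any box that
    overlaps an already kept box of the same category by more than the threshold."""
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        box, _, category = det
        if all(k[2] != category or iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept[:max_per_image]  # kept is already sorted by score


# Hypothetical detections: (box, score, category).
detections = [
    ((0, 0, 10, 10), 0.9, "ship"),
    ((1, 1, 11, 11), 0.8, "ship"),      # overlaps the first ship, suppressed
    ((1, 1, 11, 11), 0.7, "vehicle"),   # same box, other category, kept
    ((20, 20, 30, 30), 0.6, "ship"),
]
kept = nms_per_category(detections)
print(kept)
```

Because suppression is applied per category, two overlapping boxes with different labels both survive, which is relevant for ambiguous instances such as an overpass that also resembles a bridge.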

Object Classification
It is difficult to allocate positive and negative samples when the method is anchor-free, so multi-level prediction is used to solve, or at least effectively reduce, the problem of object overlap. The anchor-based methods need to calculate the Intersection over Union (IoU) to assign the positive and negative samples. As an anchor-free method, the FoveaBox does not need to calculate IoU. The FoveaBox directly maps the ground-truth to the feature maps of the corresponding level. The formulas are as follows:
$$x_1' = \frac{x_1}{2^l}, \quad y_1' = \frac{y_1}{2^l}, \quad x_2' = \frac{x_2}{2^l}, \quad y_2' = \frac{y_2}{2^l},$$
$$c_x = x_1' + 0.5(x_2' - x_1'), \quad c_y = y_1' + 0.5(y_2' - y_1'),$$
where $(x_1, y_1, x_2, y_2)$ is a valid box of the ground-truth, and $2^l$ is the down-sampling factor. $(x_1', y_1', x_2', y_2')$ is the mapped box on the target feature pyramid level $P_l$, and $(c_x, c_y)$ is the center position of the mapped box.
In addition, not all regions corresponding to the ground-truth are positive samples, as shown in Figure 5. Although the ship is large, the real positive sample is the red area in the middle, which is also the essence of FoveaBox. As a result, a shrunk factor $\sigma$ is introduced, which can dynamically set the positive sample areas according to the parameters as follows:
$$x^{pos}_1 = c_x - 0.5(x_2' - x_1')\sigma, \quad y^{pos}_1 = c_y - 0.5(y_2' - y_1')\sigma,$$
$$x^{pos}_2 = c_x + 0.5(x_2' - x_1')\sigma, \quad y^{pos}_2 = c_y + 0.5(y_2' - y_1')\sigma. \tag{9}$$
where (x
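The mapping of a ground-truth box to a pyramid level and the shrinking of its positive area can be sketched together. The sketch follows the FoveaBox formulation [70] that this subsection refers to; the function names and the value of the shrunk factor used in the example (σ = 0.5) are assumptions chosen so that the numbers are easy to follow.

```python
def map_box_to_level(box, level):
    """Divide ground-truth coordinates by the down-sampling factor 2**l."""
    stride = 2 ** level
    return tuple(v / stride for v in box)


def positive_area(box, level, sigma):
    """Shrink the mapped box around its center by the factor sigma.

    Returns (x1_pos, y1_pos, x2_pos, y2_pos) on the feature map of level l.
    """
    x1, y1, x2, y2 = map_box_to_level(box, level)
    cx = x1 + 0.5 * (x2 - x1)
    cy = y1 + 0.5 * (y2 - y1)
    half_w = 0.5 * (x2 - x1) * sigma
    half_h = 0.5 * (y2 - y1) * sigma
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# Hypothetical 128 x 64 ground-truth box projected onto level P3 (stride 8).
gt_box = (64, 32, 192, 96)
print(map_box_to_level(gt_box, 3))
print(positive_area(gt_box, 3, sigma=0.5))
```

With σ = 1 the positive area equals the whole mapped box; smaller values keep only the central region, which is less likely to contain background or neighbouring objects.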

Box Prediction
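In the box prediction, a transformation function is utilized to carry out the coordinate transformation for each positive location $(x, y)$ on the feature map of level $l$ as follows:
$$t_{x_1} = \log\frac{2^l(x + 0.5) - x_1}{z}, \quad t_{y_1} = \log\frac{2^l(y + 0.5) - y_1}{z},$$
$$t_{x_2} = \log\frac{x_2 - 2^l(x + 0.5)}{z}, \quad t_{y_2} = \log\frac{y_2 - 2^l(y + 0.5)}{z},$$
where $z = \sqrt{S_l}$. The $(x_1, y_1, x_2, y_2)$ are the ground-truth coordinates, and $(t_{x_1}, t_{y_1}, t_{x_2}, t_{y_2})$ stands for the prediction output. The smooth L1 loss is used for the box prediction.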
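The coordinate transformation and its inverse can be sketched as an encode/decode pair. The formula used here is the one published for FoveaBox [70], taken as an assumption about what this section applies: a feature-map location (x, y) on level l is projected back to the image as 2^l · (x + 0.5), and the distances to the four box sides are normalized by z = sqrt(S_l) before the logarithm. The function names and the `beta` parameter of the smooth L1 loss are illustrative.

```python
import math


def _center_and_norm(x, y, level, s0=16):
    stride = 2 ** level
    z = math.sqrt((4 ** level) * s0)  # z = sqrt(S_l)
    return stride * (x + 0.5), stride * (y + 0.5), z


def encode_box(x, y, gt_box, level):
    """Regression targets (t_x1, t_y1, t_x2, t_y2) for a location inside gt_box."""
    x1, y1, x2, y2 = gt_box
    px, py, z = _center_and_norm(x, y, level)
    return (math.log((px - x1) / z), math.log((py - y1) / z),
            math.log((x2 - px) / z), math.log((y2 - py) / z))


def decode_box(x, y, t, level):
    """Invert encode_box: recover (x1, y1, x2, y2) from predicted offsets."""
    px, py, z = _center_and_norm(x, y, level)
    return (px - z * math.exp(t[0]), py - z * math.exp(t[1]),
            px + z * math.exp(t[2]), py + z * math.exp(t[3]))


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for one coordinate: quadratic near zero, linear elsewhere."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


# Hypothetical ground-truth box and a positive location on level P3.
gt_box = (64.0, 32.0, 192.0, 96.0)
targets = encode_box(16, 8, gt_box, level=3)
recovered = decode_box(16, 8, targets, level=3)
print(targets, recovered)
```

The logarithm keeps the targets in a compact range across object sizes, and it is only defined for locations strictly inside the box, which holds for the shrunk positive area.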

Dataset
Three publicly available object-detection datasets of remote sensing images are used to evaluate the proposed methods in the experiments. Some examples of the DIOR, HRRSD, and AIBD are shown in Figure 6.
The first dataset is DIOR [71] which is a large-scale benchmark dataset for remote sensing object detection. The DIOR is sampled from Google Earth and released by the Northwestern Polytechnical University, China. The dataset contains 23,463 images and 20 object classes with 192,472 instances. The 20 object categories are airplane, baseball field, basketball court, airport, bridge, chimney, expressway service area, dam, expressway toll station, ground track field, harbor, golf course, overpass, stadium, storage tank, ship, tennis court, vehicle, train station, and windmill. The spatial resolutions of the images range from 0.5 m to 30 m, and the image scale is 800 × 800 pixels. This dataset possesses four characteristics: (1) large number of object instances and images; (2) various object scales; (3) different weathers, imaging conditions, seasons, etc.; (4) high intra-class diversity and inter-class similarity.
The second dataset is HRRSD [72], which was released by the University of Chinese Academy of Sciences in 2019. The HRRSD contains 21,761 image samples obtained from Google Earth and Baidu Map, with spatial resolution ranging from 0.15 m to 1.2 m. The count of the object instances is 55,740, covering 13 object categories. The categories are airplane, baseball diamond, crossroad, ground track field, basketball court, bridge, ship, storage tank, harbor, parking lot, tennis court, T junction, and vehicle. The highlight of the dataset is the balanced samples across categories, with nearly 4000 for each category. In addition, the sample count of the train subset is 5401, and those of the validation subset and the test subset are 5417 and 10,943. The 'train-val' subset is the union of the train subset and the validation subset.
The third dataset is AIBD, which is specially self-annotated for the task of building detection. The AIBD, which was first introduced in [73], contains a single object category: building. The image size of the samples is 500 × 500, and the total count of the samples is 11,571, with the same number of annotation files. Based on the COCO metric, the building instances are divided into large-scale instances, medium-scale instances, and small-scale instances. The counts of the large-scale instances, medium-scale instances, and small-scale instances are 16,824, 121,515, and 51,977, respectively. The color characteristics are distinct from each other with tremendously different backgrounds. The pixel number of the buildings ranges from tens to hundreds of thousands. The geometric shapes of the instances are diverse and include some irregular shapes, such as U-shape, T-shape, and L-shape. The original images of AIBD are from the Inria Aerial Image Data (https://project.inria.fr/aerialimagelabeling/, accessed on 1 August 2020), which are mainly used for semantic segmentation.
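The COCO-style split of instances into small, medium, and large can be expressed as a small helper. The area thresholds 32² and 96² are the ones used by the COCO metrics listed in the Evaluation Metrics section; how boundary values are treated below is an assumption, since an area exactly equal to a threshold is rare in practice.

```python
SMALL_MAX_AREA = 32 ** 2   # areas below this are "small"
LARGE_MIN_AREA = 96 ** 2   # areas above this are "large"


def coco_size_category(width, height):
    """Classify an instance by bounding-box area in pixels."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area <= LARGE_MIN_AREA:
        return "medium"
    return "large"


# Hypothetical building sizes in pixels.
for w, h in [(10, 10), (50, 50), (200, 100)]:
    print((w, h), coco_size_category(w, h))
```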

Evaluation Metrics
The Average Precision (AP) and its derivative metrics are adopted to quantitatively evaluate the proposed method. The AP is a comprehensive metric in the task of object detection and is based on the precision and recall defined in Equations (14) and (15):
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{14}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{15}$$
where the terms TP, FP, and FN are true positives, false positives, and false negatives, respectively. The terms TP, FP, and FN are calculated from the Intersection over Union (IoU) between the bounding boxes of ground-truth and the bounding boxes of prediction as follows:
$$\mathrm{IoU} = \frac{\mathrm{area}(B_{pred} \cap B_{gt})}{\mathrm{area}(B_{pred} \cup B_{gt})},$$
where $B_{pred}$ denotes the bounding box of prediction, and $B_{gt}$ is the bounding box of ground-truth. The standard COCO metrics, including $AP$, $AP_{50}$, $AP_{75}$, $AP_s$, $AP_m$, and $AP_l$, are briefly described in Table 1. For the detection of multi-category objects, the AP usually denotes the mean average precision (mAP), which is obtained by averaging the APs of the different categories.
Table 1 lists the following metrics:
$AP_{50}$: AP at IoU = 0.50 (equivalent to the PASCAL VOC metric).
$AP_{75}$: AP at IoU = 0.75 (a stricter metric).
$AP_s$: AP for small objects, whose areas are smaller than $32^2$.
$AP_m$: AP for medium objects, whose areas are between $32^2$ and $96^2$.
$AP_l$: AP for large objects, whose areas are bigger than $96^2$.
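As a worked illustration of how TP, FP, and FN follow from the IoU, the sketch below matches predictions to ground-truth boxes greedily and then evaluates precision and recall. The one-to-one matching rule, the assumption that predictions arrive sorted by descending score, and the IoU threshold of 0.5 (the AP50 setting) are simplifications; a full COCO evaluation additionally sweeps score thresholds and several IoU thresholds.

```python
def iou(pred, gt):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union if union > 0 else 0.0


def count_matches(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Return (TP, FP, FN); each ground-truth box can be matched at most once."""
    matched = set()
    tp = fp = 0
    for pred in pred_boxes:  # assumed sorted by descending confidence
        best_iou, best_index = 0.0, None
        for index, gt in enumerate(gt_boxes):
            if index in matched:
                continue
            value = iou(pred, gt)
            if value > best_iou:
                best_iou, best_index = value, index
        if best_index is not None and best_iou >= iou_threshold:
            matched.add(best_index)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(gt_boxes) - len(matched)


def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical single-category example: one hit, one false alarm, one miss.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(1, 1, 11, 11), (50, 50, 60, 60)]
tp, fp, fn = count_matches(pred_boxes, gt_boxes)
print(tp, fp, fn, precision_recall(tp, fp, fn))
```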

Experimental Setup
The comparative algorithms include general object-detection algorithms and domain algorithms of remote sensing. Some general object-detection algorithms are Faster R-CNN [44], SSD [74], YOLO [75], RetinaNet [76], and FoveaBox [70]. Among them, the Faster R-CNN [44] is the typical representative of the two-stage method, while the SSD [74], YOLO [75], RetinaNet [76] are the representatives of the single-stage method. In addition, the FoveaBox [70] is an anchor-free method. Some domain algorithms of remote sensing are RICAOD [77], RIFD-CNN [31], RICNN-finetuning [30], HRCNN-regression [72], and FRCNN TC [73]. The general object-detection algorithms are performed on all of the testing datasets. Because it is difficult to obtain the released codes, some experimental results of the domain algorithms are mainly cited from existing references.
The percentages of the train set, validation set, and test set of DIOR are 0.25, 0.25, and 0.5, respectively. However, the train set and validation set of HRRSD and AIBD are jointly used to train models. The main comparison experiments are based on the mmdetection platform (https://github.com/open-mmlab/mmdetection, accessed on 13 June 2021). The experiments are run on a machine with four Nvidia GeForce RTX 2080 GPUs. The setting of hyper-parameters for the comparative methods in the mmdetection is summarized in Table 2.

Results and Analysis
In this section, the experimental results are shown in detail. Some qualitative examples of the EFNet on the three datasets are presented in Figures 7-9. The TPs, FPs, and FNs are indicated by green, red, and yellow boxes, respectively. Object instances with small size and dense appearance, such as vehicles and ships, could be falsely detected, whereas objects that possess relatively fixed appearance characteristics, such as airplanes or storage tanks, are rarely misdetected or missed. For example, the top instance with a red rectangle in Figure 7 is misdetected as a bridge; it is actually an overpass. Although the AIBD contains a single category, the variation within the class is huge. We can see that the detected results are mostly satisfying, not only for common rectangular buildings, but also for buildings with irregular shapes. The scales of the building instances also change tremendously.
Although the best results are mostly achieved by the Faster R-CNN among the general object-detection methods, the results show that the proposed EFNet is a promising method with a strong ability to perform object detection in remote sensing images. The EFNet performs considerably better than FoveaBox and is superior to most domain algorithms of remote sensing.
The PR curves of the comparative methods are shown in Figures 10-12. Only one representative category was selected from each dataset. These curves reveal that the AP performance of a single category could be significantly different from the AP performance of the whole dataset. That is to say, the AP scores of one method can be better in some categories but worse in other categories. The DIOR and HRRSD are multi-category datasets, so the detailed APs of different categories by EFNet on DIOR and HRRSD are summarized in Tables 6 and 7. On the DIOR, the APs of different categories are diverse.
The categories of airplane, tennis court, baseball field, and chimney have comparatively higher AP scores, above 0.600, while those of dam, bridge, harbor, and train station have comparatively lower AP scores, below 0.200. However, the distribution of APs on HRRSD is better than on DIOR. The comparatively higher AP scores on HRRSD are achieved by airplane, ground track field, storage tank, and tennis court, and all of these APs are above 0.700. Correspondingly, the comparatively lower AP scores on HRRSD belong to the categories of bridge, parking lot, T junction, and basketball court. In addition, we selected one category from each dataset to show the PR curves under different IoU thresholds between the predicted boxes and the ground-truths in Figures 13-15. These PR curves are calculated by the EFNet. The category basketball court is selected for the DIOR, while the category bridge is selected for the HRRSD. The AIBD contains only the single category building, so the PR curves on AIBD are based on building instances.

Discussion
In this section, we discuss some open questions and possible future improvements. The experimental results above show that the proposed framework works well for both multi-category datasets and a single-category dataset. The results also reveal that the vision attention mechanism with two fovea modules is beneficial to object detection and can promote the interpretation of remote sensing images.

Effects of the Data Complexity
From the experimental results, we find that comparatively higher AP scores are usually obtained for instances whose apparent shapes closely match their ground-truth bounding boxes. In contrast, instances with large length-to-width ratios usually receive lower AP scores. The bounding boxes of such instances contain large areas of background, which reduces the feature discrimination of the objects. On the whole, the quantitative results on HRRSD are the highest among the three testing datasets, while those on DIOR are the lowest. The larger number of categories in DIOR makes detection much more difficult than on HRRSD and AIBD. Although the AIBD has only one class, its instance count and within-class scatter are large. Therefore, the quantitative results on AIBD are lower than those on HRRSD.
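The effect of elongated instances can be illustrated with a small geometric sketch. It assumes, purely for illustration, that the object is an ideal rectangle of size `length` × `width` rotated by `angle_deg` and that it is annotated with a horizontal (axis-aligned) bounding box; the function name `object_fill_ratio` is hypothetical and not part of the proposed method.

```python
import math


def object_fill_ratio(length, width, angle_deg):
    """Fraction of the axis-aligned bounding box covered by a rotated rectangle.

    Assumes an ideal rectangular object; the remaining fraction of the
    box (1 - ratio) is background.
    """
    theta = math.radians(angle_deg)
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    box_w = length * c + width * s  # horizontal extent of the rotated rectangle
    box_h = length * s + width * c  # vertical extent of the rotated rectangle
    return (length * width) / (box_w * box_h)


# An elongated object (10:1) versus a compact one (2:1), both rotated by 45 degrees.
elongated = object_fill_ratio(10.0, 1.0, 45.0)
compact = object_fill_ratio(2.0, 1.0, 45.0)
```

Under these assumptions, the 10:1 object rotated by 45 degrees covers only about 17% of its bounding box, compared with about 44% for the 2:1 object, so most of the box content is background in the elongated case.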

Effects of the Data Annotations
The annotations of the testing datasets are not always accurate, which may affect the quantitative scores. For example, the top instance with the red rectangle in Figure 7 is detected as a bridge, while its ground-truth label is overpass. In fact, it would be more reasonable to assign multiple labels to this object instance, because an overpass can also be regarded as a bridge. Therefore, these shortcomings caused by the manual annotation strategy should be noted.

Limitations and Future Improvements
The experimental results demonstrate that the proposed method is effective in remote sensing object detection. However, the proposed method fails to surpass the general Faster R-CNN [44] in the quantitative comparison, so one of its biggest limitations is the lack of optimal quantitative scores. In addition, the internal interpretability of the interaction between the FCF and the RCF is another issue, one that is common in the field of deep learning. Moreover, the Faster R-CNN is a two-stage method, and its calculation cost is relatively high. Therefore, improving the internal network interpretability and the real-time processing ability of the proposed framework are two topics for future research.

Conclusions
In this paper, we propose an eagle-eye fovea network (EFNet) for remote sensing object detection. It is inspired by the vision attention mechanism and the cascade attention mechanism of eagle eyes. The core modules of the EFNet are the front central fovea (FCF) and the rear central fovea (RCF). These two foveae have complementary characteristics. The FCF mainly aims to learn the candidate object knowledge based on the channel attention and the spatial attention, while the RCF mainly aims to predict the refined objects with two subnetworks without anchors. The results reveal that the vision attention mechanism with two fovea modules is beneficial to object detection. The EFNet can be used for both multi-category datasets and a single-category dataset, which is qualitatively and quantitatively demonstrated by the experimental results on the three datasets.

Conflicts of Interest:
The authors declare no conflict of interest.