A Fast Aircraft Detection Method for SAR Images Based on Efficient Bidirectional Path Aggregated Attention Network

Abstract: In aircraft detection from synthetic aperture radar (SAR) images, there are three major challenges: the shattered features of the aircraft, size heterogeneity and the interference of a complex background. To address these problems, an Efficient Bidirectional Path Aggregation Attention Network (EBPA2N) is proposed. In EBPA2N, YOLOv5s is used as the base network, and the Involution Enhanced Path Aggregation (IEPA) module and the Effective Residual Shuffle Attention (ERSA) module are then proposed and systematically integrated to improve aircraft detection accuracy. The IEPA module aims to effectively extract advanced semantic and spatial information to better capture the multi-scale scattering features of aircraft. The lightweight ERSA module then further enhances the extracted features to overcome the interference of complex backgrounds and speckle noise, thereby reducing false alarms. To verify the effectiveness of the proposed network, Gaofen-3 airport SAR data with 1 m resolution were used in the experiments. The detection rate and false alarm rate of our EBPA2N algorithm are 93.05% and 4.49%, respectively, which is superior to the recent EfficientDet-D0 and YOLOv5s networks, and it also has an advantage in detection speed.


Introduction
Synthetic aperture radar (SAR) can provide continuous and stable observation day and night, and has been widely used in various fields [1]. With the rapid development of SAR techniques, a large amount of high-resolution spaceborne and airborne data has been acquired, which provides new opportunities for SAR target detection. Nowadays, timely aircraft detection plays a pivotal role in airport management and military activities [2], because the scheduling and placement of aircraft are spatio-temporally sensitive.
Currently, there are three major challenges in aircraft detection: the shattered image features of aircraft, their size heterogeneity, and the interference of complex backgrounds. In SAR imaging, an aircraft appears as a series of discrete scattering points, unlike the clearly visible fuselage and wings in optical images, which makes visual interpretation difficult. In addition, the size heterogeneity of aircraft cannot be ignored: sizes can differ significantly between small and large aircraft, and small aircraft are more likely to be missed, lowering the detection rate of an algorithm. Moreover, facilities around an aircraft can produce scattering features similar to those of the aircraft itself [3], which further increases the difficulty of aircraft detection in SAR images. Therefore, it is essential for detection algorithms to recognize the effective features of aircraft.
(PADN) to enhance the learning of aircraft scattering characteristics, and the detection rate of the network was 85.68% on Gaofen-3 airport data. Subsequently, Guo et al. [21] proposed a scattering information enhancement module that preprocesses the input data to highlight the scattering characteristics of the target; the enhanced data were then fed into an attention pyramid network for aircraft detection. In their experiment, the average precision (AP) was 83.25%, but the complexity of the whole network should also be noted.
YOLOv5s is the lightest model of the YOLOv5 family in depth and width, offering excellent speed and precision in automatic object detection. Therefore, we choose it as the backbone and extend it into our Efficient Bidirectional Path Aggregation Attention Network (EBPA2N) for efficient aircraft detection. The main contributions of this paper are summarized as follows: (1) An effective and efficient aircraft detection network, EBPA2N, is proposed for SAR image analytics. Combined with a sliding window detection method, an end-to-end aircraft detection framework based on EBPA2N is established, which offers accurate and real-time aircraft detection from large-scale SAR images. (2) As far as we know, we are the first to apply involution to SAR image analytics. We introduce the Involution Enhanced Path Aggregation (IEPA) and Effective Residual Shuffle Attention (ERSA) modules within an independent efficient Bidirectional Path Aggregation Attention Module (BPA2M). The IEPA module is proposed to capture the relationships among an aircraft's backscattering features to better encode multi-scale geospatial information. As the basic building block of the IEPA module, involution redefines feature extraction: in contrast to standard convolution, it uses different involution kernels at different spatial positions (i.e., spatial specificity) to integrate pixel spatial information, which is more conducive to establishing correlations between aircraft scattering features in SAR images. The ERSA module, on the other hand, focuses on the scattering feature information of the target and suppresses background clutter, thereby reducing the influence of speckle noise in SAR images. (3) Our experiments demonstrate the outstanding performance of EBPA2N, which indicates the success of implementing multi-scale SAR image analytics as geospatial attention within deep neural networks.
This paper has paved the path for further integration of SAR domain knowledge and advanced deep learning algorithms.
The rest of this paper is organized as follows. In Section 2, the aircraft detection framework of SAR images proposed in this paper is introduced in detail. Section 3 presents the experimental results and corresponding analysis of aircraft detection on three airport SAR data with different networks. In Section 4, the experimental results are discussed and the future research direction is proposed. Section 5 briefly summarizes the results of this paper.

Overall Detection Framework
In this paper, an Efficient Bidirectional Path Aggregation and Attention Network (EBPA2N) is proposed for automatic aircraft detection in SAR images, as shown in Figure 1. Based on the trade-off between speed and precision, the YOLOv5s backbone network was selected for feature extraction to achieve substantial representation. Then, the last three output feature maps C3, C4, and C5 of the backbone network were fed into the bidirectional path aggregation and attention module (BPA2M) to enrich the expression of features. BPA2M is a new object detection module proposed in this paper, consisting of the IEPA module and three parallel ERSA modules. The IEPA module fuses the three feature maps of different sizes output from the backbone network to learn multi-scale geospatial information. Then, three parallel attention mechanisms refine the multi-scale features of the aircraft. Furthermore, a classification and box prediction network produces preliminary prediction results, and the Non-Maximum Suppression (NMS) [22] method removes overlapping prediction boxes to obtain the detection results. Last but not least, a new sliding window algorithm is proposed to improve the detection efficiency for large-scale SAR images.
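The NMS step used to remove overlapping prediction boxes can be sketched as a simple greedy procedure. This is a minimal illustration, not the implementation used in the paper; the 0.5 IoU threshold is an assumption.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    all remaining boxes that overlap it above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```

In practice a vectorized version (e.g., torchvision's `nms`) would be used, but the logic is the same.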


YOLOv5s Backbone
The input data with a size of 512 × 512 were sliced to generate the feature map C1 ∈ R^(32×256×256). In this paper, a three-dimensional tensor is expressed as X ∈ R^(C×H×W), where C, H and W represent the channel dimension, height and width of the feature map. Then, through four feature extraction modules with downsampling, rich image features are extracted to form the feature maps C2 ∈ R^(64×128×128), C3 ∈ R^(128×64×64), C4 ∈ R^(256×32×32), and C5 ∈ R^(512×16×16). In particular, the spatial pyramid pooling [23] module is embedded in the top feature extraction block; it uses multi-scale pooling to construct features over multiple receptive fields and thus learn the multi-scale characteristics of aircraft. The detailed internal structure of the YOLOv5s backbone network can be found in [24,25]. In the multi-level feature maps formed by the backbone network, the low-level feature maps provide rich spatial details of the target, which helps target localization, while high-level feature maps with abundant semantic information play an important role in distinguishing foreground from background. Therefore, effective fusion of feature maps at different scales helps improve detection accuracy.
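The downsampling pyramid described above can be sketched with a hypothetical stand-in for the backbone: each stage halves the spatial size and doubles the channels, reproducing the stated shapes of C1 through C5 for a 512 × 512 input. The single-channel input and the plain strided convolutions are illustrative assumptions, not the actual YOLOv5s blocks.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the backbone's downsampling path: five
# stride-2 convolutions producing the C1..C5 shapes stated in the text.
stages = nn.ModuleList()
in_ch = 1  # single-channel SAR amplitude image (assumption)
for out_ch in (32, 64, 128, 256, 512):
    stages.append(nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1))
    in_ch = out_ch

x = torch.randn(1, 1, 512, 512)
shapes = []
for stage in stages:
    x = stage(x)
    shapes.append(tuple(x.shape[1:]))  # (C, H, W) of C1..C5
```

The last three entries of `shapes` correspond to the C3, C4 and C5 maps fed into BPA2M.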
In this paper, the IEPA module is proposed to adequately fuse the feature maps C3, C4 and C5 output from the backbone network, as shown in Figure 2a. In the IEPA module, the 1 × 1 convolution module is used to adjust the number of channels. The stacking of up fusion (UF) and down fusion (DF) modules forms a bidirectional path, which effectively fuses shallow detail features and deep semantic information so that they complement each other, improving the detection rate of targets at different scales. The Cross Stage Feature Refinement (CSFR) module is used in both the UF and DF modules. The CSFR module is a new module proposed in this paper, inspired by involution [26]. It aims to learn features over a large receptive field to establish relationships between the discrete features of the aircraft, and thereby improve the aircraft detection performance of the network. As shown in Figure 2b, in the CSFR module, the input feature map is processed by two respective branches. In both branches, a 1 × 1 convolution first halves the feature dimension to reduce the amount of computation.
In branch B, the input feature map after dimension reduction is passed to a 1 × 1 convolution module to learn cross-channel information interaction. Then, a 7 × 7 involution is used to capture relationships between the scattering features of aircraft over a large range, yielding a rich feature representation. In the involution, to speed up the network, the input feature map X ∈ R^(C×H×W) is divided into several groups of sub-feature maps by feature grouping, which are then processed in parallel. Two 1 × 1 convolutions serve as the involution kernel generation function, and the corresponding involution kernels are adaptively generated from each pixel of the feature map. Therefore, each group of sub-feature maps with the size of H × W generates a corresponding involution group containing H × W involution kernels (different involution kernels are represented by different colors in Figure 2b), so the involution kernels and the input feature map are automatically aligned in the spatial dimension. Each group of feature maps is then multiplied with the involution kernels of the corresponding group: each group shares the same involution kernels along the channel dimension, while different involution kernels are used at different spatial locations to learn the visual patterns of different spatial regions. Subsequently, all group feature maps are aggregated to obtain the output of the involution module. Finally, branches A and B are concatenated, and the channel information is fused by a 1 × 1 convolution to produce the output of the CSFR module. Importantly, the CSFR module has fewer parameters than a standard 3 × 3 convolution, which also helps keep the network lightweight.
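The involution operation described above can be sketched in PyTorch as follows, following Li et al. [26]: two 1 × 1 convolutions generate a K × K kernel per spatial position, shared across the channels of each group. The class name, group count and reduction ratio here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Involution2d(nn.Module):
    """Minimal involution sketch: kernels are generated per spatial
    position and shared across channel groups -- the opposite of
    convolution's spatial sharing and channel specificity."""
    def __init__(self, channels, kernel_size=7, groups=4, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        # Two 1x1 convolutions act as the kernel-generation function.
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, groups * kernel_size**2, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # One K*K kernel per pixel and per group, generated from the input.
        kernels = self.span(self.reduce(x)).view(b, self.g, self.k**2, h, w)
        # Unfold the input into K*K neighbourhoods, grouped along channels.
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k**2, h, w)
        # Multiply-accumulate: kernels broadcast over the channels of a group.
        out = (kernels.unsqueeze(2) * patches).sum(dim=3)
        return out.view(b, c, h, w)
```

Note that the output preserves the input shape, so the module can replace a same-padding convolution inside the CSFR branch.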

Effective Residual Shuffle Attention (ERSA) Module
Remote Sens. 2021, 13, 2940

The feature map processed by the IEPA module contains abundant image information. The scattering characteristics of aircraft should be highlighted promptly and effectively, which benefits the detection performance for aircraft. The attention mechanism is a signal processing mechanism similar to that of the human brain, which can select high-value information relevant to the task at hand from a large amount of information. Chen et al. [27] demonstrated the effectiveness of the attention mechanism in SAR image detection.
In this paper, the ERSA module is proposed to filter the features so that the network pays more attention to the channel features and the spatial regions containing the target information. It is an ultra-lightweight dual attention mechanism module that is inspired by the ideas of residual [28] and shuffle attention [29]. In order to lighten the weight as much as possible, the F c (·) function and sigmoid function are used to form gating mechanism in channel attention and spatial attention branches to adaptively learn the correlation of the channel features and the importance of the spatial features.
The channel attention module is defined as follows:

E1 = F_GAP(X1)

X1' = δ(F_c(E1)) · X1 = δ(W1 · E1 + b1) · X1

where X1 ∈ R^(C×H×W) is the input feature map, F_GAP is global average pooling, and E1 ∈ R^(C×1×1) is the channel vector with a global receptive field. W1 and b1 are a pair of learnable parameters used to scale and translate the channel vector, respectively, and to learn the different importance of the channel dimensions. δ is the sigmoid function used for normalization. Finally, the normalized channel attention weight vector δ(F_c(E1)) ∈ R^(C×1×1) is multiplied by the input feature map X1 ∈ R^(C×H×W) to realize feature recalibration. Similarly, in spatial attention, group normalization (GN) is performed on the feature map X2 to capture spatial information, and a similar gating mechanism is used to adaptively adjust the input feature map X2 to highlight the essential spatial information.
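The channel-attention gating just described (global average pooling, an affine transform playing the role of F_c with W1 and b1, a sigmoid, then rescaling) can be sketched as:

```python
import torch

def channel_attention(x, w, b):
    """Channel-attention gating: GAP -> affine F_c -> sigmoid -> rescale.
    w and b play the roles of W1 and b1 in the text."""
    e = x.mean(dim=(2, 3), keepdim=True)   # F_GAP: (B, C, 1, 1) channel vector
    gate = torch.sigmoid(w * e + b)        # delta(F_c(E1)), one weight per channel
    return gate * x                        # feature recalibration
```

With `w` and `b` initialized to zero the gate is 0.5 everywhere, so the module starts as a uniform scaling and learns per-channel importance during training.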
The overall structure of the ERSA module is shown in Figure 3. To improve the efficiency of the network, the input feature maps are divided into 32 groups of independent sub-features along the channel dimension for parallel processing. Within each group, the sub-features are split into two parts along the channel dimension, which are fed into the channel attention branch and the spatial attention branch, respectively, to learn the important features of the target. The two branches are then fused to obtain the attention-enhanced sub-features. All attention-enhanced sub-feature groups are aggregated, and the channel shuffle operator is applied to enhance the information flow between feature channels, yielding the refined, attention-enhanced features. Finally, a skip connection effectively retains the coarse-grained features of the initial input feature map and makes the training process more robust.
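The channel shuffle operator used to mix information across the grouped sub-features is the reshape-transpose-flatten trick from ShuffleNet; a minimal sketch:

```python
import torch

def channel_shuffle(x, groups):
    """Channel shuffle: interleave channels across groups so that
    information flows between the independently processed sub-features."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap group and channel axes
    return x.view(b, c, h, w)                 # flatten back: channels interleaved
```

For example, with 4 channels and 2 groups, channel order [0, 1, 2, 3] becomes [0, 2, 1, 3], so each group's output now sees channels from every other group.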



Classification and Box Prediction Network
After extracting the features of the aircraft, the BPA2M module outputs effective prediction feature maps at three scales, whose grid areas are divided into 64 × 64, 32 × 32 and 16 × 16 cells, respectively. The schematic diagram of classification and box prediction is shown in Figure 4. For each grid cell, the classification and box prediction network generates three anchor boxes with different sizes and aspect ratios (represented by orange boxes in Figure 4). The sizes and shapes of the anchor boxes are obtained by a clustering algorithm based on the sizes of the target boxes in the dataset. Then, a 1 × 1 convolution is directly used to predict the location, confidence and category of each bounding box through classification and regression. After the bounding boxes are obtained, NMS is used to remove overlapping boxes and produce the final detection result, as shown in Figure 1. In the training stage, the network loss is calculated, and the network is optimized by adjusting its parameters to minimize this loss.
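The prediction head described above can be sketched as a single 1 × 1 convolution that maps each grid cell's features to 3 anchors × (4 box offsets + 1 objectness score + class scores). The 128 input channels and single aircraft class are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_anchors, num_classes = 3, 1  # one "aircraft" class (assumption)

# A 1x1 convolution predicts, per grid cell and per anchor:
# 4 box offsets + 1 confidence + num_classes class scores.
head = nn.Conv2d(128, num_anchors * (5 + num_classes), kernel_size=1)

feat = torch.randn(1, 128, 64, 64)               # the 64x64 prediction scale
pred = head(feat)                                # (1, 18, 64, 64)
pred = pred.view(1, num_anchors, 5 + num_classes, 64, 64)
```

Each of the three BPA2M output scales gets its own such head, so large grids (64 × 64) handle small aircraft and coarse grids (16 × 16) handle large ones.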


Detection by Sliding Window
In this paper, a sliding window detection method is proposed to perform aircraft detection on large-scale SAR images, which improves the efficiency of the network [30], as shown in Figure 5. First, the input large-scale SAR image is cropped by a sliding window with a window size of 512 × 512 and a step size of 450. In this way, the aircraft are relatively larger within the test samples, which is conducive to aircraft detection [31]. Then, the test samples are fed into EBPA2N to detect the aircraft. After the detection results are obtained, coordinate mapping and NMS are used as post-processing modules to achieve coordinate aggregation. The final aircraft detection results for the large-scale SAR image are then produced.
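Generating the crop positions for the 512 × 512 window with step 450 (a 62-pixel overlap so aircraft cut by one window edge appear whole in the neighbouring window) can be sketched as follows; the clamping of the final window to the image border is an assumption about the boundary handling:

```python
def sliding_windows(height, width, win=512, step=450):
    """Top-left (y, x) corners of win x win crops with the given step,
    with a final window clamped so no crop extends past the image."""
    ys = list(range(0, max(height - win, 0) + 1, step))
    xs = list(range(0, max(width - win, 0) + 1, step))
    if ys[-1] + win < height:          # cover the bottom border
        ys.append(height - win)
    if xs[-1] + win < width:           # cover the right border
        xs.append(width - win)
    return [(y, x) for y in ys for x in xs]
```

After per-crop detection, each box's coordinates are offset by its crop's (y, x) corner (coordinate mapping), and NMS over the merged boxes removes duplicates from the overlap regions.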

Data and Evaluation Metrics
The data used in this experiment were more than 10 large-scale SAR images containing airports, acquired by the Gaofen-3 system at 1 m resolution. To compensate for the insufficiency of manually annotated aircraft data, rotation, translation (shifts along the width and height directions), flipping and mirroring were used to augment the data. Finally, 4396 aircraft samples with a size of 512 × 512 pixels were obtained, and the training and validation sets were split at a ratio of 8:2. To evaluate the performance of the network objectively and efficiently, four evaluation indexes were used in this paper: detection rate (DR) [32], false alarm rate (FAR) [32], training time and testing time. DR and FAR are calculated as follows:

DR = N_DT / N_GT × 100%

FAR = N_DF / (N_DT + N_DF) × 100%    (6)

where N_DT, N_DF and N_GT are the numbers of correctly detected aircraft, falsely detected aircraft, and real aircraft targets, respectively. We use N to represent numbers, and the subscripts DT, DF and GT stand for Detected True, Detected False and Ground Truth, respectively.
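Assuming DR = N_DT / N_GT and FAR = N_DF / (N_DT + N_DF), definitions consistent with the quantities defined here (the FAR denominator is our reconstruction, following [32]), the metrics can be computed as:

```python
def detection_rate(n_dt, n_gt):
    """DR: correctly detected aircraft over ground-truth aircraft, in %."""
    return n_dt / n_gt * 100

def false_alarm_rate(n_dt, n_df):
    """FAR: false detections over all detections, in % (assumed
    denominator N_DT + N_DF, consistent with [32])."""
    return n_df / (n_dt + n_df) * 100
```

For example, 93 correct detections out of 100 ground-truth aircraft give DR = 93%, and 5 false alarms alongside 95 correct detections give FAR = 5%.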


Implementation Details
The experimental environment was a single NVIDIA RTX 2080 Ti GPU with 12 GB of memory running Ubuntu 18.04. All networks were implemented in the PyTorch framework and trained for 100 epochs on the same dataset, with the training time recorded. The batch size was 16. The learning rates for EfficientDet-D0 [33], YOLOv5s [12] and our network (EBPA2N) were 3 × 10^-4, 1 × 10^-3 and 1 × 10^-3, respectively. In particular, none of these networks was initialized with pretrained models. In addition, multi-scale training and extra test-time enhancement techniques (e.g., Test Time Augmentation, TTA) were not used.

Analysis of the Experimental Results
In order to evaluate the performance of the proposed framework, three large-scale SAR images from the Gaofen-3 system with 1 m resolution are used as independent tests: airport I (Hongqiao Airport) with 12,000 × 14,400 pixels, airport II (Capital Airport) with 14,400 × 16,800 pixels and airport III (Military Airport) with 9600 × 9600 pixels.
Analysis of Aircraft Detection for Airport I
Figure 6 shows the detection results of aircraft by different networks for Airport I. Correct detections, false alarms and missed detections are indicated by green, red and yellow boxes, respectively. This airport is a large civil airport (Hongqiao Airport in Shanghai, China) with heavy traffic. As shown in Figure 6a, the distribution of aircraft is relatively dense, and the distance between adjacent aircraft is small. The backscattering characteristics of the aircraft at the bend of the airport are complex and diverse, which increases the difficulty of detection. According to Figure 6b, EfficientDet-D0 produces many false alarms (red boxes) outside the airport, whereas there are no obvious false alarms for YOLOv5s (Figure 6c) or EBPA2N (Figure 6d), showing that the robustness of YOLOv5s and EBPA2N is much better. In the enlarged detail of the bend area of the airport, EfficientDet-D0 misses fewer aircraft than YOLOv5s (Figure 6c) but produces far more false alarms, while YOLOv5s misses many aircraft. Because the IEPA and ERSA modules in EBPA2N can effectively detect aircraft and eliminate false alarms, its detection performance is much better than that of YOLOv5s and EfficientDet-D0. Furthermore, this shows that EBPA2N has good feature learning ability and can fit the multi-scale and multi-directional features of aircraft well.

Analysis of Aircraft Detection for Airport II
Airport II is another large-scale civil airport (the Capital Airport in Beijing, China). As shown in Figure 7a, the background area covers large commercial and residential areas. There are also many strong scattering highlights, which are likely to cause false alarms. In addition, the mechanical and metal equipment around the aircraft exhibits a texture similar to that of the aircraft, which is also very prone to generating false alarms. The detection results of each network for Airport II are shown in Figure 7b-d. According to the results, EfficientDet-D0 has a serious false alarm problem (as shown in Figure 7b), especially at the edge of the airport. In contrast, the detection results of YOLOv5s and EBPA2N are more satisfactory (as shown in Figure 7c,d). According to the magnified view of the local details, EfficientDet-D0 has only two missed aircraft but five false alarms, while YOLOv5s has three missed detections and one false alarm. The result of EBPA2N (as shown in Figure 7d) is the closest to the ground truth (as shown in Figure 7a): only one aircraft with weak brightness information is missed. Remarkably, this aircraft is also missed by both of the other networks. This demonstrates the advantage of the parallel ERSA modules used in our network, which focus more on learning the characteristics of the aircraft and thus improve the detection rate.

Analysis of Aircraft Detection for Airport III
Airport III is a military airport with 33 aircraft. As shown in Figure 8a, the overall background area is relatively clean; only a few scattered buildings with strong scattering points in the left area of the airport are prone to causing interference. Although the aircraft at this airport are smaller than civil aircraft, their scattering characteristics are obvious, so all three networks achieve high detection completeness. Figure 8a-d show the ground truth and the detection results of the different networks. Only one aircraft is missed by EfficientDet-D0, marked with a yellow box in Figure 8b, but there are 10 false alarms, which shows that EfficientDet-D0 has a weak ability to distinguish the effective features of aircraft. There are only two false alarms for YOLOv5s (as shown in Figure 8c), one on the airport runway and the other outside the airport area; however, it misses three aircraft. For the proposed EBPA2N network, there is only one missed detection and one false alarm, located at the edge of the equipment inside the airport (as shown in Figure 8d). Compared with the original YOLOv5s, the number of missed detections by EBPA2N is reduced, which indicates that the addition of the IEPA module enhances the network's ability to capture the scattering characteristics of aircraft.

Performance Evaluation
For a more intuitive performance comparison, Tables 1 and 2 list the evaluation indexes of the two airports for the different networks. In terms of Detection Rate (DR) and False Alarm Rate (FAR), EfficientDet-D0 has the worst performance: the average DR is 85.90% and the average FAR is 34.99%, which indicates that the robustness of the network for aircraft detection is poor. YOLOv5s has a more balanced detection performance than EfficientDet-D0, with an average FAR of 6.63% and an average DR of 87.32%. For the proposed EBPA2N, the DR is at least 5% higher than those of YOLOv5s and EfficientDet-D0, and the FAR is only 4.49%, so EBPA2N has a much more reliable detection performance. In terms of efficiency, as shown in Tables 1 and 2 and Figure 9, EBPA2N is close to YOLOv5s in both training and testing times and far better than EfficientDet-D0. These results show that the proposed EBPA2N achieves the best detection performance with a satisfying computation speed.
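The two evaluation indexes above can be computed directly from the per-airport detection tallies. The sketch below assumes the common definitions DR = TP / (TP + FN) and FAR = FP / (TP + FP); the counts used are illustrative only, not the experimental values from Tables 1 and 2.

```python
# Minimal sketch of the DR / FAR evaluation indexes, assuming the
# standard definitions (the paper does not spell out its formulas).

def detection_rate(tp: int, fn: int) -> float:
    """Fraction of ground-truth aircraft that were detected."""
    return tp / (tp + fn)

def false_alarm_rate(tp: int, fp: int) -> float:
    """Fraction of reported detections that are not real aircraft."""
    return fp / (tp + fp)

# Hypothetical tally for one test airport: 30 true positives,
# 2 missed aircraft, 1 false alarm.
tp, fn, fp = 30, 2, 1
print(f"DR  = {detection_rate(tp, fn):.2%}")   # DR  = 93.75%
print(f"FAR = {false_alarm_rate(tp, fp):.2%}")
```

Averaging these per-airport values over the test airports then gives the aggregate DR and FAR figures reported above.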

Discussion
At present, most mainstream target detection networks are designed for optical images, and research on SAR images is relatively immature. This paper proposes an effective aircraft detection network for large-scale SAR images, which greatly improves the accuracy of aircraft detection while providing fast detection.
In the design and development of the detection network, the main consideration lies in the trade-off between detection accuracy and speed, because complex deep neural networks that achieve high accuracy usually suffer from huge computation intensity. The YOLOv5s network is light in computation, so it can be deployed on front-end devices such as mobile terminals. This is the major reason why we adopt YOLOv5s as the base of our deep neural network for SAR image analysis. The IEPA module is designed to capture the correlation between the discrete features of aircraft over a larger scope, and the lightweight ERSA attention module is then proposed to adaptively select important features of aircraft, systematically integrating the attention mechanism with a residual structure. According to Tables 1 and 2, the overall FAR of the EfficientDet-D0 network is 34.99%, and its training and testing times are the longest, which means that EfficientDet-D0 has a poor ability to suppress the interference of a complex background when detecting small man-made targets in SAR images. The computation efficiency of YOLOv5s is very satisfying, but its detection accuracy needs to be improved. The proposed EBPA2N achieves a good balance of accuracy and speed in aircraft detection.
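Shuffle-attention designs of the kind the ERSA module's name points to split the channels into groups, apply attention within each group, and then rely on a channel-shuffle step to mix information back across groups. As the paper does not give the ERSA internals here, the sketch below shows only that generic channel-shuffle operation in NumPy; the group count and tensor shape are illustrative assumptions, not the paper's configuration.

```python
# Generic channel shuffle for an (N, C, H, W) feature map, as used by
# group-wise (shuffle) attention blocks. Illustrative sketch only.
import numpy as np

def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """Interleave channels across groups so later layers see all groups."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    # (N, groups, C//groups, H, W) -> swap the two channel axes -> flatten
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))

x = np.arange(8).reshape(1, 8, 1, 1)       # channels labeled 0..7
y = channel_shuffle(x, groups=2)
print(y.ravel().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]
```

After the shuffle, each contiguous channel slice contains channels from every group, which is what lets a residual shuffle-attention block propagate group-wise attention results through the whole feature map.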
Moreover, we have only conducted experiments on single-band SAR images with 1 m resolution from the Gaofen-3 system. For SAR images acquired by other systems (e.g., with a different band or resolution), the performance may vary. In the future, we will apply transfer learning [34] to EBPA2N to realize rapid aircraft detection in multi-resolution and multi-band SAR images. Furthermore, the network proposed in this paper has only been tested on aircraft detection, with satisfying detection accuracy; in follow-up research, it will be extended to other man-made small targets (e.g., ships, buildings and vehicles). In addition, the motion error in SAR images [35] is also worth considering, because it indeed brings an additional impact on SAR target detection.

Conclusions
In this paper, a high-precision and efficient automatic aircraft detection network for SAR images, namely EBPA2N, is proposed. The performance improvement of EBPA2N mainly results from the two innovative modules proposed in this paper: the IEPA module and the ERSA module work in series, effectively integrating the multi-scale context information of aircraft and greatly improving detection accuracy. The experimental results on three different types of airports from Gaofen-3 SAR images indicate that EBPA2N has strong and robust capabilities in extracting multi-scale and multi-direction aircraft image features, and that it can also suppress the interference of complex backgrounds very well.
By combining deep neural networks with geospatial analytics, EBPA2N can also be applied to detect other small man-made targets, such as ships, buildings and vehicles. We plan to pursue these topics in our future study. EBPA2N will promote a closer fusion of deep learning techniques and SAR image analytics, so as to speed up the research of intelligent target detection in SAR imagery.
All authors contributed extensively to this manuscript. All authors have read and agreed to the published version of the manuscript.