SEAN: A Simple and Efficient Attention Network for Aircraft Detection in SAR Images

Abstract: Due to the unique imaging mechanism of synthetic aperture radar (SAR), which leaves aircraft targets in a discrete state in images, detection performance is vulnerable to interference from complex ground objects. Although existing deep learning detection algorithms show good performance, they generally use a feature pyramid neck design and a large backbone network, which reduces detection efficiency to some extent. To address these problems, we propose a simple and efficient attention network (SEAN) in this paper, which takes YOLOv5s as the baseline. First, we reduce the depth of the backbone network and introduce a structural re-parameterization technique to increase its feature extraction capability. Second, the neck is designed using a residual dilated module (RDM), a low-level semantic enhancement module (LSEM), and a localization attention module (LAM), substantially reducing the number of parameters and the computation of the network. Results on the Gaofen-3 aircraft target dataset show that this method achieves 97.7% AP at a speed of 83.3 FPS on a Tesla M60, exceeding YOLOv5s by 1.3% AP and 8.7 FPS with 40.51% of the parameters and 86.25% of the FLOPs.


Introduction
Synthetic aperture radar (SAR) is an active microwave imaging sensor with all-day and all-weather capabilities; it is widely used in natural disaster monitoring, military reconnaissance, and urban planning [1,2]. As a typical target, aircraft have essential value in both military and civilian fields. Therefore, SAR aircraft interpretation has always been one of the research hotspots. With the continuous improvement of SAR imaging resolution, there are higher requirements for the accuracy and speed of aircraft target detection.
Traditional SAR target detection methods mainly include the constant false alarm rate (CFAR) algorithm, based on the statistical distribution of background clutter [3,4], and algorithms based on manually extracted image features [5,6]. The CFAR algorithm is primarily influenced by the statistical characteristics of background clutter, and it is difficult to accurately model scenes with complex objects, which generates many false alarms. Algorithms based on manual feature extraction divide detection into two separate processes: feature extraction and target classification. However, manually extracting features is complex and highly dependent on parameter settings. Thus, robustness and generalization are difficult to guarantee.
With the rapid development of deep learning methods and the increase in high-resolution SAR data, the emergence of convolutional neural networks (CNNs) has brought many breakthroughs in tasks such as SAR image segmentation [7], land-cover classification [8], and target detection [9]. Data-driven models have good robustness and generalization. At the same time, such methods can actively extract high-level features, avoiding the complicated work of manual feature extraction. The main contributions of this paper are as follows: (1) The proposed SEAN is a SAR aircraft target detection network with a simple structure, high accuracy, and high speed. Compared with typical target detection algorithms from recent years, SEAN has clear advantages in detection accuracy and speed on the Gaofen-3 dataset. (2) An appropriate network size is selected to balance detection accuracy and speed.
The backbone network's depth is then explored, showing that the C4 feature of the backbone network is more suitable for aircraft target detection. Furthermore, this paper applies a structural re-parameterization technique to the shallowed backbone to effectively enhance its feature extraction capability. (3) A simple and efficient neck is designed, discarding the complex feature pyramid network design. It mainly consists of three modules: a residual dilated module (RDM) that integrates multi-scale receptive fields, a low-level semantic enhancement module (LSEM) that enhances the scattering information of SAR images, and a localization attention module (LAM) that refines the fused multi-feature information.

Deep Learning-Based Object Detection Algorithm
Object detection algorithms based on deep learning can be divided into two-stage [28,29] and one-stage [14,30,31] according to the number of detection stages. A two-stage detection algorithm has two detection heads: the first stage detects candidate regions that may contain objects, and the second stage predicts the category and position of the object in those candidate regions. Two-stage detection algorithms represented by Faster-RCNN [29] usually have high detection accuracy, but their more complex structures make their detection speed poorer than that of one-stage methods. One-stage detection algorithms represented by YOLO [32] directly predict the category and position of the object; this end-to-end approach usually has faster detection speed and good detection accuracy. To obtain better feature extraction ability, a large backbone network such as ResNet101, pre-trained on large benchmark datasets such as ImageNet, is usually used. Recently, the Vision Transformer (ViT) [33] has shown better feature extraction ability than CNNs as a backbone, performing well on various visual downstream tasks [34]. However, ViT exceeds CNNs only on the premise that hundreds of millions of samples are available for pre-training; on datasets below a million samples, its performance is not as good as that of CNNs. Therefore, in SAR aircraft detection, where data are scarce, CNN-based detection algorithms are mainly used [10][11][12]. Luo et al. [20] showed that the CSPDarknet backbone of the YOLO series is less affected by background interference and is more suitable for SAR aircraft target detection tasks than backbones such as ResNet.
For a long time, the YOLO series of algorithms [30][31][32][35][36][37] has pursued the best balance between accuracy and speed, and the method in this paper is an improvement based on YOLOv5 6.0 [37]. YOLOv5 provides five model sizes (n, s, m, l, and x) according to the width and depth of the network. As shown in Figure 1, the network architecture of YOLOv5 6.0 can be divided into three parts: backbone, neck, and head. The backbone is an optimized CSPDarknet with five stages for feature extraction, generating features at five scales; the C3, C4, and C5 feature maps output by the last three stages are sent to the neck. The neck is an optimized PANet [38] with two branches for top-down and bottom-up multi-scale feature fusion. The head follows the coupled head of YOLOv3 [30] and detects objects at large, medium, and small scales based on nine preset anchor boxes. The loss function is composed of BCE for the classification and objectness losses and CIoU [39] as the regression loss for bounding box prediction.
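As a concrete illustration of the regression loss, the following is a minimal single-pair sketch of the CIoU metric (the training loss is 1 - CIoU); the corner-coordinate box format and the helper function are our own assumptions for illustration, not code from YOLOv5:

```python
import math

def ciou(b1, b2):
    """Complete IoU between two boxes in (x1, y1, x2, y2) format (illustrative)."""
    # Intersection area and IoU.
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    iou = inter / (a1 + a2 - inter)
    # Squared center distance over the squared diagonal of the enclosing box.
    rho2 = ((b1[0] + b1[2] - b2[0] - b2[2]) ** 2
            + (b1[1] + b1[3] - b2[1] - b2[3]) ** 2) / 4.0
    c2 = ((max(b1[2], b2[2]) - min(b1[0], b2[0])) ** 2
          + (max(b1[3], b2[3]) - min(b1[1], b2[1])) ** 2)
    # Aspect-ratio consistency term and its trade-off weight alpha.
    v = (4.0 / math.pi ** 2) * (math.atan((b1[2] - b1[0]) / (b1[3] - b1[1]))
                                - math.atan((b2[2] - b2[0]) / (b2[3] - b2[1]))) ** 2
    alpha = 0.0 if v == 0.0 else v / ((1.0 - iou) + v)
    return iou - rho2 / c2 - alpha * v
```

Identical boxes give a CIoU of 1 (zero loss), while well-separated boxes give negative values, so the loss penalizes both center distance and aspect-ratio mismatch even when the IoU is zero.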

Structure of the Detector Neck
In the typical object detection algorithms of recent years, the feature pyramid network (FPN) [13] has been an indispensable neck design. Early CNNs [32] used only the C5 feature for detection, and their performance on small targets was generally poor. The top-down pathway proposed by FPN, which passes high-level semantic information downward, effectively addressed this low performance on small-target detection. FPN usually brings two benefits: (1) it performs multi-scale feature fusion, and (2) it assigns objects of different scales to features with different receptive fields for detection. However, its effectiveness is often attributed solely to multi-scale feature fusion. With the success of FPN, effective improvements have appeared one after another. PANet [38] adds a bottom-up path to FPN, which passes low-level semantic information useful for spatial localization upward for fusion. Subsequently, feature pyramids with more complex lateral connections, such as BiFPN [40] and NAS-FPN [41], have aimed to obtain good detection accuracy through better multi-scale feature fusion. Recent SAR aircraft target detection algorithms [9,11,12,14,[18][19][20][21][22] use an FPN as the neck, usually introducing attention mechanisms [9,11,12,18] in the lateral connections to enhance the learning of channel and spatial information.
However, these FPN methods impose a large memory and computational burden, resulting in low detection efficiency. A recent work, YOLOF [42], showed that the significant benefit brought by FPN comes mainly from solving the optimization problem of multi-scale object detection in a divide-and-conquer way, not from multi-scale feature fusion. It further proposed a simple neck design with a label-matching mechanism that achieves excellent detection accuracy even with only the top C5 feature, although its detection performance on small objects is poor compared with that of YOLOv4 [35]. Such a finding is not accidental; recent work on object detection with ViT has also shown that complex lateral connections are non-essential. ViTDet [34] achieved good performance without lateral connections, using only the topmost feature and simply constructing a pyramid structure by down-sampling and up-sampling. In addition, AdaMixer [43] designed an adaptive sampling decoder to replace the FPN that converges quickly and has good detection performance. It is worth mentioning that in AdaMixer's backbone ablation experiment, ResNet detection using only the C4 feature was more accurate than using the C3 or C5 features, which is similar to the conclusion of the work on the shallowed backbone in this paper.

Attention Mechanism
In recent years, attention mechanisms have emerged to allow computer vision systems to mimic the human visual system and find salient regions in complex scenes naturally and efficiently. The attention mechanism in computer vision can be regarded as a process of dynamically adjusting weights based on the features of the input image [44]. Used as a plug-and-play module, it trades a small amount of computational overhead for higher performance and is now widely used in various computer vision tasks.
Detection networks usually add attention mechanisms such as SE [45] and CBAM [46] to enhance performance. However, SE considers only the connections among channels and ignores the importance of spatial information, while CBAM introduces local pooling of spatial information into the channels but cannot capture long-range dependencies. The coordinate attention mechanism (CoordAtt) [47] alleviates these problems by embedding large-scale location information into channel attention. The structure of CoordAtt is shown in Figure 2 and consists of two main steps: coordinate information embedding and coordinate attention generation. In the first step, the input feature map X is aggregated into two separate orientation-aware feature maps along the horizontal and vertical directions through two one-dimensional global pooling operations. In the second step, the two feature maps with embedded orientation information are first concatenated, and a 1 × 1 convolution followed by a non-linear activation outputs an intermediate feature map encoding both horizontal and vertical spatial information. The intermediate feature map is then split into two separate vectors along the spatial dimension, the channels are raised to the same dimension as the input feature map through two 1 × 1 convolutions, and the attention maps g_h and g_w are output by the sigmoid function. Each attention map captures long-range dependencies of the input feature map along one spatial direction, so location information is preserved in the generated attention maps. Formula (1) calculates the output weight y_c of the c-th channel. The attention regions are emphasized by applying both attention maps to the input feature map via multiplication.
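The two steps above can be sketched in NumPy. This is a simplified illustration, not the reference implementation: biases, BN, and the intermediate non-linearity are omitted, and the weight matrices `w_reduce`, `w_h`, and `w_w` are hypothetical stand-ins for the 1 × 1 convolutions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coord_att(x, w_reduce, w_h, w_w):
    """Minimal sketch of coordinate attention on one feature map.

    x: input feature map of shape (C, H, W).
    w_reduce: (C_mid, C) weights of the shared reducing 1x1 conv.
    w_h, w_w: (C, C_mid) weights of the two expanding 1x1 convs.
    """
    C, H, W = x.shape
    # Step 1: direction-aware 1-D global pooling.
    pool_h = x.mean(axis=2)                        # (C, H): average over width
    pool_w = x.mean(axis=1)                        # (C, W): average over height
    # Step 2: concatenate along the spatial axis and reduce channels.
    y = np.concatenate([pool_h, pool_w], axis=1)   # (C, H + W)
    y = w_reduce @ y                               # (C_mid, H + W)
    # Split back into the two directions, expand channels, apply sigmoid.
    y_h, y_w = y[:, :H], y[:, H:]
    g_h = sigmoid(w_h @ y_h)                       # (C, H) attention along height
    g_w = sigmoid(w_w @ y_w)                       # (C, W) attention along width
    # Reweight the input with both attention maps (broadcast multiply).
    return x * g_h[:, :, None] * g_w[:, None, :]
```

Because both attention maps lie in (0, 1), the output never amplifies a response; it only suppresses positions whose row or column statistics look uninformative.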

Overview of the Architecture of the Proposed SEAN
In order to detect aircraft accurately and quickly in SAR images with complex background interference, we propose a simple and efficient attention network (SEAN) architecture, as shown in Figure 3. Considering the trade-off between speed and accuracy, this paper selects YOLOv5s version 6.0 [37] as our baseline and makes a series of improvements for the SAR aircraft target detection task. The detailed network configuration of SEAN is shown in Table 1. In the table, n represents the stacked times of the module, and arguments represent the parameter information of the module, including input channel, output channel, kernel size, stride, and padding. The backbone's primary function is to obtain good feature extraction ability. In this paper, the depth of the backbone network is preliminarily explored for the SAR aircraft target detection task, and structural re-parameterization technology [27,48] is introduced to enhance its feature extraction ability. For the neck, different from the usual complex lateral connection design to achieve multi-scale detection and multi-feature fusion, this paper designs a simple neck consisting of three parts: a residual dilated module (RDM), a low-level semantic enhancement module (LSEM), and a localization attention module (LAM). The following is a detailed introduction to the method in this paper.

Optimization of Backbone Network
SAR images are single-channel grayscale images, which contain far less semantic information than optical benchmark datasets. Moreover, aircraft mainly appear in the images as discrete strong scattering points, so their information is concentrated in low-level semantics. From the perspective of the receptive field, aircraft are usually small targets [49] in SAR images, and a deeper backbone with a larger receptive field is unnecessary. In accordance with the above data characteristics, this section explores the network depth settings and introduces a structural re-parameterization technique to obtain better feature extraction capability.

Shallow Backbone
The backbone usually has five stages for hierarchical feature extraction. At each successive stage, the resolution of the feature map is downsampled by a factor of two, and the number of channels is doubled. The last three stages of the backbone output the C3, C4, and C5 features, which are used for multi-scale object detection. Since this paper does not use a complex FPN neck for target detection, it is necessary to explore which stage's output feature map is most suitable for the aircraft detection task. Therefore, we make a preliminary exploration of the depth of the backbone; note that depth here refers to the number of stages selected. We use the C3, C4, and C5 output features, each with only one detection head, for aircraft detection at a single resolution. We find that only the C4 feature map gives good detection results, which may be attributed to its higher resolution being more suitable for medium and small aircraft detection. Therefore, this paper uses the backbone without the fifth stage as the basis for subsequent work.
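For intuition, the halving rule above fixes the stride of each stage output; the tiny helper below (our own, for illustration) gives the spatial side length of each C_k map for a 640 × 640 input: C3 at 80 × 80, C4 at 40 × 40, and C5 at 20 × 20.

```python
def feature_shape(img_size, stage):
    """Spatial side length of the C{stage} feature map: each backbone stage
    halves resolution, so C_k has stride 2**k relative to the input."""
    return img_size // (2 ** stage)
```

The C4 map thus keeps four times as many spatial positions as C5, which is why it retains the detail needed for small and medium aircraft.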

Backbone Re-Parameterization
Before each stage of the backbone performs feature extraction, the input feature map is downsampled to reduce the computational load of the subsequent network. However, downsampling reduces the resolution and inevitably loses some detailed information, which is not conducive to localizing aircraft targets. As shown in Figure 4a, the basic Conv block downsamples with a stride-2, kernel-3 convolution, followed by a BN layer and SiLU activation to maintain better gradient flow. To better retain feature information during downsampling, this paper introduces the structural re-parameterization technique of RepVGG [27] to optimize the Conv block. Its core idea is that the network structure can be converted equivalently by converting the parameters equivalently. Specifically, a branch with a convolution kernel of size 1 is added to the original Conv block during training, forming the RepConv block and increasing the diversity of the downsampling. During inference, the parameters of the added branch are fused back into the Conv block by re-parameterization, so the inference structure is identical to the Conv block; this enhances the model's capability without affecting the detection speed. Convolution in deep learning is a cross-correlation operation; since this operation is linear, convolution kernel parameters can be combined additively. Figure 4b shows the simplified RepConv re-parameterization process. During inference, the BN layer is just a linear mapping, and its parameters can be folded directly into the convolution kernel to improve inference speed. Assume that the input feature map X has 1 channel and that RepConv downsampling produces 2 channels.
Then, let A ∈ R^(1×2×3×3) and B ∈ R^(1×2×1×1) represent the weight parameters of the 3 × 3 and 1 × 1 convolutional layers, respectively. A 1 × 1 convolution kernel can be equated to a special 3 × 3 convolution kernel with a non-zero value only at the center. Thus, the weights of the two branches' convolution layers can be summed channel-wise, and the summed weights are denoted as C ∈ R^(1×2×3×3). Using "*" to represent the convolution operation, this process can be represented by Formula (2):
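The channel-wise summation can be checked numerically. The sketch below (single input channel, one output channel per branch, BN already folded, stride 1 for brevity) verifies that the two-branch RepConv and the fused single 3 × 3 convolution produce identical outputs; `conv2d` here is a naive illustrative implementation, not a library routine:

```python
import numpy as np

def conv2d(x, w, pad):
    """Naive stride-1 cross-correlation on a single-channel map (illustration only)."""
    k = w.shape[0]
    xp = np.pad(x, pad)
    rows = xp.shape[0] - k + 1
    cols = xp.shape[1] - k + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def fuse_branches(w3, w1):
    """Re-parameterization: embed the 1x1 kernel at the center of the 3x3 kernel."""
    fused = w3.copy()
    fused[1, 1] += w1
    return fused
```

Running the 3 × 3 branch with "same" padding and the 1 × 1 branch without padding, then comparing against the fused kernel, confirms the equivalence exploited at inference time.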

Neck Design of SEAN
FPN [13] uses a divide-and-conquer approach for multi-scale target detection, which alleviates the difficult problem of detecting small targets. However, this complex lateral connection approach brings a significant memory and computation burden, which reduces detection efficiency. In this section, we explore a simple and efficient neck to ensure accurate and fast SAR aircraft detection. We use the coordinate attention mechanism (CoordAtt) [47] as the base component of the neck to better capture the spatial localization information of aircraft. The following subsections describe the three modules that make up the neck.

Residual Dilated Module (RDM)
Although deeper downsampling brings a larger receptive field, it also loses many details required for small-target detection. Since we hope to perform the detection task well on a feature map of a single resolution, the dilated encoder approach of [42] is adopted to alleviate these problems. We apply dilated convolution after the optimized backbone to increase the receptive field while maintaining the original feature map size. As shown in Figure 5, we use the Conv block to design a residual dilated module (RDM), which is formed by stacking two residual blocks with different dilation rates. The C4 feature map is the input to this module; it first passes through a 1 × 1 convolution to reduce the channel dimension, then through dilated convolution to expand the receptive field, and is finally restored to the original channel number and fused with the input feature, which has the smaller receptive field. Figure 6 indicates the size of the aircraft targets covered by the receptive field in the feature map: (a) the receptive field of the C4 feature can cover most small- and medium-sized targets; (b) the receptive field after dilated convolution can cover large- and medium-sized targets; and (c) through the RDM, the features of the two receptive fields are fused to cover almost all target sizes.
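The effect of stacking dilated convolutions can be quantified with the standard receptive-field recurrence; the helper below is illustrative, and the dilation rates in the usage example are assumptions rather than the RDM's actual configuration:

```python
def receptive_field(layers):
    """Receptive field of a stack of convolutions, given as
    (kernel_size, stride, dilation) tuples from input to output."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1      # dilation enlarges the effective kernel
        rf += (k_eff - 1) * jump     # growth is scaled by the accumulated stride
        jump *= s
    return rf
```

For example, a single 3 × 3 convolution with dilation 2 already matches the receptive field of a 5 × 5 kernel, and stacking dilations 2 and 4 reaches a 13-pixel receptive field with no loss of resolution, which is the behavior the RDM relies on.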

Low-Level Semantic Enhancement Module (LSEM)
In SAR images, aircraft usually appear near complex objects with strong scattering points, such as covered bridges and buildings, which greatly interfere with aircraft detection. We hope to better utilize low-level semantic information to suppress this interference. Unfortunately, existing CNN algorithms mainly extract high-level semantic features of objects, and the high-level semantic features of SAR images are far less abundant than those of optical images. In order to better learn the spatial location information and scattering characteristics of SAR aircraft targets, this paper draws on the idea of Inception [50] and introduces the coordinate attention mechanism (CoordAtt) [47] to design a low-level semantic enhancement module (LSEM). The specific structure of the LSEM is shown in Figure 7, with the Conv block as the fundamental component. First, the C3 feature map is used as the input, and its dimension is reduced through a Conv block with a 1 × 1 kernel on each of the four branches; feature extraction is then performed in a different way on each branch. Finally, the output features of the four branches are concatenated to achieve multi-feature fusion of low-level semantic information.
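The reduce-process-concatenate pattern can be sketched as follows. The per-branch operators here (simple rolling averages) are placeholders for the module's actual branch operations, which differ; the sketch only demonstrates the 1 × 1 reduction, per-branch processing, and channel concatenation that restore the input channel count:

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is channel mixing: (C_out, C_in) @ (C_in, H*W).
    c, h, wd = x.shape
    return (w @ x.reshape(c, h * wd)).reshape(-1, h, wd)

def smooth(x, axis):
    # Cheap stand-in for a branch-specific operator (the paper's exact ops differ).
    return (x + np.roll(x, 1, axis=axis) + np.roll(x, -1, axis=axis)) / 3.0

def lsem(x, reduce_ws):
    """Inception-style four-branch fusion: 1x1 reduction per branch,
    branch-specific processing, then channel concatenation."""
    b0 = conv1x1(x, reduce_ws[0])                         # identity branch
    b1 = smooth(conv1x1(x, reduce_ws[1]), axis=1)         # vertical context
    b2 = smooth(conv1x1(x, reduce_ws[2]), axis=2)         # horizontal context
    b3 = smooth(smooth(conv1x1(x, reduce_ws[3]), 1), 2)   # 2-D context
    return np.concatenate([b0, b1, b2, b3], axis=0)
```

With each branch reducing C channels to C/4, the concatenated output keeps the input's channel count and spatial size, so the module drops into the network without reshaping its neighbors.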

Localization Attention Module (LAM)
The fused feature map of the RDM and the LSEM already carries enough information for aircraft target detection in complex environments. However, the feature information is redundant and easily confused with surrounding objects. To this end, we design a localization attention module (LAM) to refine the fused feature map, aiming to make aircraft targets differ significantly from surrounding features on the different feature channels. As shown in Figure 8, the module takes the fused output feature X as input through two branches. One branch first reduces the channel dimension through a 1 × 1 Conv and then refines the semantic features through a bottleneck block; the result is concatenated with the channel-reduced feature from the other branch's 1 × 1 Conv. The bottleneck block is a residual block formed by stacking two Conv blocks. The output features are then enhanced with spatial localization information by CoordAtt. Finally, the channel dimension and the nonlinearity of the network are maintained by a Conv block. In this way, the LAM refines the features, strengthens the target's response on the feature map, and suppresses interference from surrounding objects.

Dataset Description
The Gaofen-3 aircraft target dataset [51,52] used in this paper was collected by the Gaofen-3 satellite and consists of single-polarization C-band SAR images with a resolution of 1 m, including multi-temporal maps of multiple airports. The dataset contains 2000 image slices, with image sizes ranging from 600 to 2048 pixels. It mainly includes seven types of civil aircraft, such as the Boeing 737, with a total of 6556 aircraft samples. The histogram of the bounding-box distribution in Figure 9 shows that many images contain only one aircraft, while a single image contains at most 35 aircraft. Figure 10 shows some dataset slices, with the areas marked by green boxes being aircraft targets; strong scattering points such as covered bridges and buildings surround the aircraft. Combining Figures 9-11, it can be seen that this dataset exhibits uneven image quality, significant differences in aircraft target sizes, dense target arrangements, and complex surrounding ground objects. This paper randomly divides the dataset into training, validation, and test sets at a ratio of 6:2:2.

Experimental Parameter Settings
The training image size is 640 × 640 pixels, and simple data augmentations such as panning, scaling, cropping, and flipping are applied to the samples before training to enhance the model's generalization. Four groups of anchor boxes are preset: (16, 14), (27, 25), (52, 50), and (90, 83). Because of the changes to the backbone network in this paper, the network weights in all ablation experiments are initialized randomly. The optimizer for model training is SGD, with a momentum factor of 0.937, an initial learning rate of 0.01, and a weight decay of 0.0005. We warm up the learning rate over the first three epochs to maintain better gradients. Each experiment runs for 300 epochs, and the model with the best evaluation on the validation set is kept as the final training result. All experiments are carried out under the PyTorch 1.9 deep learning framework on an Ubuntu 16.04 system with two Tesla M60 (16 GB) GPUs.
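The warm-up can be sketched as a simple linear ramp; the per-step granularity and the constant post-warm-up rate are simplifying assumptions (the actual schedule also decays after warm-up):

```python
def lr_at(step, warmup_steps, base_lr=0.01):
    """Linear learning-rate warm-up over the first warmup_steps optimizer steps,
    then a constant base rate (the real schedule also decays afterwards)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

Ramping from near zero keeps early gradient updates small while BN statistics and randomly initialized weights settle, which matters here because no pre-trained weights are used.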

Evaluation Metrics
In order to better evaluate the performance of each algorithm on aircraft target detection, we adopt common evaluation metrics for object-detection tasks [9,53]: precision (P), recall (R), F1 score, and average precision (AP) to measure detection performance; and model parameters (Params), floating-point operations (FLOPs), and frames per second (FPS) to measure model complexity and inference speed. For the comparison algorithms, we also draw precision-recall (PR) curves to show detection performance. IoU = 0.5 is used as the threshold for dividing positive and negative samples in the evaluation. Precision (P) and recall (R) are defined in Formulas (3) and (4), respectively, where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives.
The F1 score is based on the harmonic mean of P and R; the larger its value, the better the model's performance. It is defined in Formula (5).
Average precision (AP) is the area enclosed by the PR curve and the coordinate axes. It comprehensively considers precision (P) and recall (R) to reflect the quality of the model; the larger the AP value, the better the model performance. It is defined in Formula (6).
Model parameters (Params) measure the space complexity of the model and its corresponding memory footprint. For example, if the current convolutional layer uses K, M, and N to represent the convolution kernel size, the number of input channels, and the number of output channels, respectively, then the parameter count of this convolutional layer is defined in Formula (7).
Floating-point operations (FLOPs) measure the time complexity of the model and serve as a reference indicator of computation time. If the current convolutional layer is represented by K, M, N, H, and W for the convolution kernel size, input channels, output channels, and the height and width of the output feature map, respectively, the FLOPs of this convolutional layer are defined in Formula (8).
Frames per second (FPS) measures the overall detection speed of the algorithm and is defined in Formula (9) from the average detection time of an image. A larger value represents faster detection.
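For reference, the metrics above can be written out directly; the snippet below follows the definitions in this section, with the caveats that the AP integration is a plain step sum over a given PR curve and that the Params/FLOPs counts ignore bias terms, as conventions vary:

```python
def precision_recall_f1(tp, fp, fn):
    """Formulas (3)-(5): precision, recall, and their harmonic mean."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def average_precision(recalls, precisions):
    """Formula (6) as a step sum: area under a monotone PR curve."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def conv_params(k, m, n):
    """Formula (7): weight count of a KxK conv with M in- and N out-channels."""
    return k * k * m * n

def conv_flops(k, m, n, h, w):
    """Formula (8): multiply-accumulate count over an HxW output map."""
    return k * k * m * n * h * w
```

For instance, a 3 × 3 convolution with 64 input and 128 output channels carries 73,728 weights, and evaluating it on a 20 × 20 output map costs about 29.5 M multiply-accumulates.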

Selection of Preset Anchor Boxes
Since the method in this paper does not use an FPN neck and detects only at one resolution with 16-times downsampling, using the original preset anchor-box scales yields low anchor-box recall on the dataset and many missed detections. Therefore, it is necessary to redesign a set of anchor boxes based on the dataset's labeling information to optimize the model's performance. Figure 11 shows the aspect ratio of the aircraft bounding boxes relative to the images; the aspect ratio of the aircraft targets is close to 1:1, and most aircraft are small relative to the image. From a relative-scale perspective, a target is usually defined as small when the ratio of its bounding-box area to the image area is less than 0.58% [49]. Data analysis shows that small targets account for 48.8% of the SAR aircraft dataset. For this reason, we determine the bounding-box sizes of the training samples by K-means clustering and finally obtain four groups of anchor-box presets suitable for aircraft samples: (16, 14), (27, 25), (52, 50), and (90, 83).
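The anchor-selection step can be sketched with plain K-means over the (width, height) pairs of the training boxes; note that YOLO-style pipelines often cluster with an IoU-based distance, so the Euclidean distance used in this illustration is an assumption:

```python
import numpy as np

def kmeans_anchors(wh, k, iters=50, seed=0):
    """Lloyd's k-means on (width, height) pairs, smallest anchors first.

    Euclidean distance is an assumption made here for brevity; YOLO-style
    pipelines often use 1 - IoU between a box and a cluster center instead.
    """
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every box to its nearest center, then recompute the means.
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = wh[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]
```

On well-separated box clusters, the returned centers converge to the per-cluster mean sizes, which is the behavior the anchor presets above rely on.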

Ablation Experiments and Analysis
This section is a series of ablation experiments on the SEAN network architecture proposed in this paper to evaluate the effectiveness of each module based on evaluation metrics. The experiments are divided into three parts: choice of model size and depth, module effectiveness, and comparison between different backbone networks and necks on YOLOv5s.
(1) Selection of Model Size and Depth: This paper conducts experiments with each of the five model sizes provided by YOLOv5 version 6.0. As shown in Table 2, YOLOv5s has the best trade-off between accuracy and speed, and a short training time (T-time). Secondly, this paper explores the depth of the model, that is, which stage's output features are most suitable for SAR aircraft target detection. As shown in Table 3, detection using only the C4 feature reaches a sufficiently high AP of 95.4%, which is 14.3% and 1.1% higher than that of the C3 and C5 features, indicating that the C4 feature has a good trade-off between resolution and high-level semantic information. On the other hand, the FPS is low when only the C3 feature is used, although the FLOPs are minimal: the C3 feature map lacks the high-level semantic information needed to distinguish ground objects from aircraft targets, so although network inference is fast, many false alarms are generated and too much time is spent on NMS post-processing of the anchor boxes. Based on the above analysis, we select the C4 feature map of YOLOv5s plus the head as the basis for subsequent experiments.
(2) Effectiveness of Each Module: In order to clearly express the contribution of each module of SEAN, we conduct experiments based on the C4 feature plus the head. The results are shown in Table 4. Firstly, the structural re-parameterization technique is introduced into the backbone, i.e., RepConv is used instead of simple convolutional downsampling. The detection FPS is unaffected, and the AP improves by 0.6%. Then, for the neck design, we use the RDM to fuse features with different receptive-field scales, achieving a 0.3% AP improvement, and the LSEM to enhance the learning of low-level semantic information, achieving a 0.4% AP improvement. Since the semantic information obtained by directly splicing the output features of the RDM and the LSEM is confusing and redundant, the LAM is added to refine it, improving the AP by 1% through enhanced aircraft positioning information. Moreover, replacing the regression loss with the recently proposed SIoU [54] decreases the AP by 0.4%. Figure 12 shows (a) the actual detection map of the method in this paper; (b) that without the LAM, the aircraft target is easily confused with the surrounding features on the grayscale channel maps; and (c) that with the LAM, the response of the aircraft target on the grayscale channel maps differs significantly from that of the surrounding features. It can be seen that the LAM refines the semantic features and effectively suppresses the interference of complex features with aircraft target detection. The attention mechanism can compensate for the strong locality and insufficient globality of CNNs by obtaining global context information. Table 5 shows the effects of different attention mechanisms on the performance of the LAM. We added different attention modules at the same position in an LAM without an attention mechanism to verify their effect.
The results show that almost all of these attention mechanisms improve the performance of the network, and the overall performance of our adopted CoordAtt is the best. In general, the algorithm in this paper significantly reduces the parameters and FLOPs compared with YOLOv5s and improves the AP by 1.3% and the speed by 8.7 FPS on the test set. Figure 13 shows how the AP on the validation set changes with increasing epochs during model training. Since SEAN uses only one scale of detection head, the model fits more slowly than YOLOv5s, but the final AP value is better. To better understand the points of interest of the SEAN model under complex ground conditions, this paper uses the Grad-CAM [56] technique for visualization. Grad-CAM back-propagates the gradient of the information from the predicted anchor boxes and represents what the model has learned as a heat map of the responses of the last convolutional layer's parameters. In Figure 14, (a-c) represent three cases that are difficult to detect due to complex ground objects, dense aircraft arrangement, and small aircraft targets, with the actual detection results in the upper half and the corresponding Grad-CAM visualizations in the lower half. The figure shows that the model accurately detects the aircraft targets in all these scenarios. Furthermore, the heat maps show the areas containing aircraft targets in red, while other objects such as corridors, buildings, and runways appear in cool colors, which demonstrates that our algorithm has strong anti-interference ability and high detection accuracy in complex scenes.

(3) Different Necks and Backbones: To further demonstrate the performance advantage of the proposed method, we substitute the backbone and neck of YOLOv5s and compare the results with SEAN. For the backbone, we use ConvNeXt [57], training its tiny version with the number of channels per module controlled to match YOLOv5s. For the neck, we substitute FPN [13] and BiFPN [40]. Table 6 shows that our proposed algorithm achieves better accuracy and speed than YOLOv5s equipped with these different backbones and necks.

Comparative Experiments and Analysis
To further demonstrate the effectiveness of SEAN, we compare it with seven typical detection algorithms implemented in MMDetection [58]. As shown in Table 7, algorithms with relatively large backbone networks such as ResNet, especially two-stage algorithms, require long training times and are prone to over-fitting on small-scale SAR datasets; we therefore initialize the backbones of these networks with pre-trained models. In terms of detection speed, the table shows that models using ResNet50 as the backbone and FPN as the neck, such as Cascade R-CNN, involve massive computation, resulting in significantly lower detection speeds. Detection speed improves when YOLOF adopts the dilated encoder (D-en) as its neck. While YOLOv3, YOLOX-s, and SEAN (our method) all use lightweight backbones, SEAN additionally uses a simple neck, which significantly improves detection speed. In terms of detection accuracy, YOLOv3, YOLOX-s, and SEAN, which do not use ResNet50 as the backbone, are generally better than the other methods. As shown in Figure 15, the area under the PR curve of SEAN is significantly larger than that of the other comparison algorithms, indicating a clear accuracy advantage in actual detection. According to the above experiments, on the SAR aircraft dataset the proposed SEAN achieves a higher AP of 97.7%, a higher F1 score of 94.9%, and a faster detection speed of 83.3 FPS with fewer parameters and FLOPs than the typical algorithms.
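The AP values compared above correspond to the area under the precision-recall curve in Figure 15. A minimal sketch of a VOC-style AP computation with the interpolated precision envelope (names are illustrative and this is not the paper's evaluation code, which follows the MMDetection toolchain):

```python
import numpy as np

def average_precision(precision, recall):
    """AP as the area under the PR curve with interpolated precision.

    precision, recall: 1-D arrays ordered by increasing recall.
    """
    # Add sentinel points at recall 0 and 1.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Make precision monotonically non-increasing (precision envelope).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas over the recall steps.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# A detector with perfect precision at every recall level scores AP = 1.0.
ap = average_precision(np.array([1.0, 1.0, 1.0]), np.array([0.3, 0.6, 1.0]))
assert ap == 1.0
```

A larger area under the curve means the detector sustains high precision as recall increases, which is exactly the advantage visible for SEAN in Figure 15.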
To visualize the effectiveness of SEAN, we select some challenging detection scenarios from the test set to compare the algorithms. Figure 16 shows a small-target scene, in which SEAN accurately detects the small aircraft target. Figures 17 and 18 show two scenes with complex environments; SEAN has significantly fewer missed targets and false alarms than the other models and effectively avoids the influence of strong scattering points caused by complex objects. Moreover, the bounding boxes predicted by SEAN completely enclose the whole aircraft, with no cases in which only local components are detected.

Conclusions
In this paper, we propose a simple and efficient attention network (SEAN) for aircraft detection in SAR images, which avoids the deep backbone and complex, laterally connected FPN neck of previous designs and improves detection accuracy and speed while significantly reducing the network's parameters and FLOPs. In our experiments, SEAN achieves 97.7% AP at a speed of 83.3 FPS on the Gaofen-3 aircraft target dataset. The results show that, compared with other typical detection algorithms, our method has clear advantages in accuracy and speed for SAR aircraft targets against complex backgrounds, and that a model trained on a large amount of labeled SAR aircraft data can reach sufficiently high detection accuracy. However, since manual labeling of SAR aircraft targets is difficult and SAR data are scarce, small-sample SAR aircraft detection and self-supervised learning methods for SAR are promising research directions. Furthermore, we find that the C4 feature of the backbone is particularly suitable for aircraft detection and can support future research on lightweight SAR networks.