Article

Azimuth-Sensitive Object Detection of High-Resolution SAR Images in Complex Scenes by Using a Spatial Orientation Attention Enhancement Network

1 Key Laboratory of Digital Earth Science, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 International Research Center of Big Data for Sustainable Development Goals, Beijing 100094, China
3 College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(9), 2198; https://doi.org/10.3390/rs14092198
Submission received: 17 April 2022 / Revised: 28 April 2022 / Accepted: 1 May 2022 / Published: 4 May 2022

Abstract:
The scattering features of objects in synthetic aperture radar (SAR) imagery are highly sensitive to the azimuth angle, so detecting azimuth-sensitive objects in complex scenes is a challenging task. To address this issue, we propose a novel framework called the spatial orientation attention enhancement network (SOAEN), using aircraft detection in complex scenes of SAR imagery as a case study. Taking YOLOX as the basic framework, the SOAEN introduces the inverted pyramid ConvMixer network (IPCN), the spatial-orientation-enhanced path aggregation feature pyramid network (SOEPAFPN), and the anchor-free decoupled head (AFDH) to achieve performance improvements. A spatial orientation attention module is proposed and introduced into the path aggregation feature pyramid network to form a new structure, the SOEPAFPN, which captures feature transformations in different directions, highlights object features, and suppresses background effects; the IPCN replaces the backbone network of YOLOX to enhance multiscale feature extraction and reduce computational complexity, while the AFDH decouples object localization and classification to improve their efficiency and accuracy. Experimental results on multiple real complex scenes from Gaofen-3 1 m images show that, compared with YOLO-series networks, the proposed method achieves the highest detection accuracy, with an average detection rate of 91.22%.


1. Introduction

With the improving resolution of synthetic aperture radar (SAR) images, the detection of small and medium-sized objects in SAR images has gradually become a research hotspot. Due to the special imaging mechanism of SAR, azimuth sensitivity is an important characteristic of SAR images [1,2]. The scattering features of azimuth-sensitive objects, such as aircraft, differ greatly at different azimuth angles, so even the same object shows different geometric structures and texture features in SAR images acquired under varied imaging conditions, which creates great challenges for object detection.
In recent years, with the development of big data and artificial intelligence, deep learning has achieved excellent performance in object detection [3,4,5,6,7,8,9,10,11,12]. Many scholars have applied deep learning algorithms to SAR aircraft detection and achieved good results [13,14,15,16,17,18,19,20,21,22,23,24]. Wang et al. [13] and Guo et al. [14] used convolutional neural network (CNN) methods to detect aircraft candidates within suspicious areas. Diao et al. [15] combined the constant false alarm rate algorithm and the fast region-based CNN (Fast R-CNN) to detect aircraft in the apron areas of high-resolution SAR images. Zhang et al. [16] proposed a cascaded three-look network based on Faster R-CNN to locate aircraft in SAR images. He et al. [17] built a component-based multilayer parallel network for aircraft detection in SAR imagery, and an experiment on TerraSAR-X data proved the feasibility of this method. Guo et al. [18] proposed an attention pyramid network based on the feature pyramid network to explore the enhancement of aircraft scattering information. Li et al. [19] proposed a YOLOv3-based lightweight detection model that suppresses redundant information in complex environments by extracting grayscale features and enhancing spatial information. To capture the multiscale scattering features of aircraft, Luo et al. [20] proposed a YOLOv5-based bidirectional path aggregation attention network and then proposed an explainable artificial intelligence framework [21] to experimentally verify aircraft detection in small scenes on Gaofen-3 images. Wang et al. [22] proposed an automatic aircraft detection method for SAR images that integrates weighted feature fusion and a spatial attention module with a convolutional neural network, which effectively reduced the interference of negative samples when detection was limited to runway regions. To extract refined aircraft features, Zhao et al. [23] introduced a pyramid attention dilated network based on RetinaNet by designing a multibranch dilated convolutional module in 2021, followed by an attentional feature refinement and alignment network [24] for SAR aircraft detection based on a single-shot detector in 2022, and applied them to aircraft detection in runway areas. In summary, early deep learning aircraft detection algorithms lacked customized processing of SAR aircraft features [13,14,15,16], and many methods in the literature [14,15,16,17,18,20,21,22] rely on preprocessing and postprocessing steps that cannot balance accuracy and speed. Moreover, most algorithms limit detection to runway regions and lack experimental analysis of aircraft detection in complex scenes.
From the above studies, we can see that the azimuth sensitivity of aircraft features is an important factor limiting the accuracy of aircraft detection. As we know, an aircraft is composed of several components with different scattering mechanisms. In the process of SAR imaging, as the scattering conditions change, the imaging results of aircraft also change to different degrees. Moreover, due to the smooth overall body of the aircraft and the flat ground surface, multi-bounce scattering frequently occurs in SAR images, bringing redundant scattering outside the aircraft region. A large number of strong background scattering points from boarding corridors or associated equipment are usually distributed around the aircraft, which are easily confused with the object scattering and make accurate detection difficult. At present, the number of SAR aircraft samples is small and their scale variation is large; how to extract effective features from limited samples is a major challenge. Current studies tend to use complex deep networks as the backbone to extract features and have achieved certain results, but the large number of parameters brings the risk of overfitting, reduced computational efficiency, and weaker transferability and deployability. In terms of object localization, anchor-based strategies are widely used in the above research [13,14,15,16,17,18,19,20,21,22,23,24], which require many preset anchor-box parameters and reduce model generalization ability.
To address the abovementioned challenges, we propose an aircraft detection algorithm for SAR imagery based on a spatial orientation attention enhancement network (SOAEN). The main contributions of this paper are summarized as follows:
(1) Focusing on the large feature differences of azimuth-sensitive objects and the serious interference of complex backgrounds in SAR images, a spatial orientation attention module (SOAM) based on coordinate attention and spatial attention is proposed to enhance the extraction of spatial azimuth features and suppress background interference; it is combined with the path aggregation feature pyramid network (PAFPN) [25] to improve multiscale feature fusion.
(2) Considering the small number and varied scales of samples in the SAR dataset, the structure of the lightweight convolutional neural network ConvMixer [26] is redesigned to meet the needs of multiscale feature extraction, yielding the inverted pyramid ConvMixer network (IPCN) as the backbone of this framework. The patch embedding operation enables the network to focus on small objects from the beginning, facilitating small-object detection, and the inverted pyramid structure extracts features at different levels to meet the needs of multiscale object detection.
(3) To address the many preset parameters and insufficient generalization ability of mainstream anchor-based algorithms in SAR object detection, a novel lightweight anchor-free object detection network, the SOAEN, is proposed, taking the recent anchor-free model YOLOX [27] as the basic model. The SOAEN combines the proposed SOAMs and IPCN with an anchor-free detection head and can efficiently detect azimuth-sensitive objects in large-scale, high-resolution SAR images.
The rest of this paper is organized as follows. In Section 2, the proposed method of azimuth-sensitive object detection in high-resolution SAR images is introduced in detail. Section 3 presents the experimental results and corresponding analysis of ablation experiments and aircraft detection in real scenes with different networks. In Section 4, the experimental results are discussed and the future research direction is proposed. Section 5 briefly summarizes the results of this paper.

2. Methodology

2.1. Overall Architecture

Aiming at the key difficulties of SAR azimuth-sensitive object detection, this paper proposes a novel SAR azimuth-sensitive object detection network, the SOAEN, using YOLOX as the basic framework. The overall structure of the SOAEN is shown in Figure 1; its main components are the IPCN, the spatial-orientation-enhanced PAFPN (SOEPAFPN), and an anchor-free decoupled head (AFDH) [27]. Considering the particular difficulty of SAR interpretation and the small number and varied scales of samples, a lightweight backbone network, the IPCN, is designed to extract features from images while avoiding redundancy and overfitting; it outputs feature maps of three scales to the SOEPAFPN for further feature aggregation. The PAFPN can transfer abundant semantic features of images from top to bottom, transfer localization features of objects from bottom to top, and fuse features of different dimensions. During feature transfer and fusion in the PAFPN, we add SOAMs to enhance the extraction of spatial orientation and multiscale information of the object and to suppress background clutter interference. Finally, the outputs of the SOEPAFPN are divided into three branches and fed into the AFDH to identify the class and location of objects, thus avoiding the presetting of anchor parameters related to SAR objects.
In the detection head, classification and localization are two separate tasks. Classification focuses more on the class information in the feature map, whereas localization focuses more on the positional information of the bounding box; therefore, decoupled feature information is required to complete the classification and localization of objects. In the early models of the YOLO series [28,29,30,31], the detection head is coupled; that is, the classification and bounding box regression information come from the same feature map, thus limiting accuracy. Therefore, the AFDH is used in our method to decouple object classification and bounding box regression, achieving faster convergence and improving the accuracy of classification and localization.
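To make the decoupling concrete, the following PyTorch sketch shows one possible single-scale anchor-free decoupled head with a shared stem and separate classification and regression/objectness branches. The module name, channel width, and activation choices are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DecoupledHeadSketch(nn.Module):
    """Minimal sketch of an anchor-free decoupled head for one feature scale:
    a shared 1x1 stem, then separate classification and regression branches."""
    def __init__(self, in_ch: int, num_classes: int = 1, width: int = 256):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, width, kernel_size=1)
        self.cls_branch = nn.Sequential(nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
                                        nn.Conv2d(width, num_classes, 1))
        self.reg_branch = nn.Sequential(nn.Conv2d(width, width, 3, padding=1), nn.SiLU())
        self.box_pred = nn.Conv2d(width, 4, 1)   # per-location box offsets, no anchor presets
        self.obj_pred = nn.Conv2d(width, 1, 1)   # objectness score

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        cls = self.cls_branch(x)                 # class information
        reg = self.reg_branch(x)                 # shared features for box/objectness
        return cls, self.box_pred(reg), self.obj_pred(reg)

# usage: cls, box, obj = DecoupledHeadSketch(256)(torch.randn(1, 256, 64, 64))
```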

2.2. Spatial-Orientation-Enhanced PAFPN (SOEPAFPN) with Spatial Orientation Attention Modules (SOAM)

The aircraft shown in Figure 2 is a common SAR azimuth-sensitive object. An aircraft is composed of a cockpit, a fuselage, wings, engines, and a tail, each with its own scattering mechanism. The scattering magnitude of each component fluctuates as the aircraft's orientation changes relative to the SAR line of sight. Thus, the scattering centers and geometric outlines of the same object at different azimuth angles differ in SAR images, and it is difficult to extract robust features of these objects. In addition, the airport background is complex, and some other objects in the background, such as associated equipment and boarding corridors, even exhibit scattering features and geometric shapes similar to those of the objects to be detected. To enhance the extraction of the spatial orientation features of azimuth-sensitive objects and suppress the background influence, a spatial orientation attention module (SOAM) based on attention mechanisms is designed. In this structure, a coordinate attention mechanism (CAM) [32] and a multireceptive field spatial attention mechanism (MFSAM) are concatenated to guide the model toward the information and locations that deserve more attention.
Considering the differences in the scattering and geometric features of the same azimuth-sensitive object when imaged at different azimuths, the CAM is used to extract information in different directions to eliminate the influence of azimuth. Global average pooling is decomposed into two parallel one-dimensional coding processes along the X-direction and the Y-direction; that is, the horizontal and vertical features are aggregated into two independent azimuth-aware feature maps to capture the differences in azimuth-sensitive objects in different directions and to integrate the spatial orientation information into the channel attention feature map. The specific structure is shown in Figure 3.
First, the input feature tensor $X \in \mathbb{R}^{C \times H \times W}$ is subjected to one-dimensional adaptive average pooling along the horizontal and vertical directions to obtain two one-dimensional azimuth-aware tensors of the $c$-th channel, $t_c^h$ and $t_c^w$. After features with precise encoding information in the X-direction and the Y-direction are obtained, they are concatenated together and then fed into a $1 \times 1$ convolution to obtain the intermediate feature $k$ representing the mixed direction information; $k$ is then split into $k^h$ and $k^w$ along the spatial dimension, and the attention weights $f^h$ and $f^w$ are obtained by $1 \times 1$ convolutions with the same number of channels as $X$, respectively. After that, the attention maps $f^h$ and $f^w$ are correspondingly multiplied with the input feature map $X$ to obtain the attention over azimuth and channels. Finally, the azimuth-channel attention feature map $y_c$ is obtained. The working principle of the CAM can be summarized as follows:
$$t_c^h = \frac{1}{W}\sum_{i=0}^{W} x_c(h, i), \qquad t_c^w = \frac{1}{H}\sum_{j=0}^{H} x_c(j, w)$$
$$k = \delta\left(F_{1 \times 1}\left(\left[t_c^h, t_c^w\right]\right)\right)$$
$$f^h = \sigma\left(F^h_{1 \times 1}\left(k^h\right)\right), \qquad f^w = \sigma\left(F^w_{1 \times 1}\left(k^w\right)\right)$$
$$y_c = x_c \times f^h \times f^w$$
where $F_{1 \times 1}$ denotes the $1 \times 1$ convolution function; $[\cdot, \cdot]$ denotes the concatenation operation; $\delta$ is the nonlinear activation function; and $\sigma$ is the sigmoid function.
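A minimal PyTorch sketch of the CAM computation above (directional pooling, mixing through a shared $1 \times 1$ convolution, and re-weighting of the input) is given below; the class name, reduction ratio, and the choice of Hardswish for the nonlinearity $\delta$ are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of the CAM step: pool along H and W separately, mix the two
    directions with a shared 1x1 conv, then re-weight the input feature map."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        mid = max(8, channels // r)                      # assumed reduction ratio
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))    # -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))    # -> (B, C, 1, W)
        self.conv_mix = nn.Conv2d(channels, mid, kernel_size=1)
        self.act = nn.Hardswish()                        # stands in for delta
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        t_h = self.pool_h(x)                             # azimuth-aware tensor along H
        t_w = self.pool_w(x).permute(0, 1, 3, 2)         # align to (B, C, W, 1) for concat
        k = self.act(self.conv_mix(torch.cat([t_h, t_w], dim=2)))
        k_h, k_w = torch.split(k, [h, w], dim=2)         # split the mixed directions
        f_h = torch.sigmoid(self.conv_h(k_h))            # (B, C, H, 1)
        f_w = torch.sigmoid(self.conv_w(k_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * f_h * f_w                             # azimuth-channel attention map y_c

# usage: y = CoordinateAttention(256)(torch.randn(1, 256, 64, 64))
```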
To enhance the ability to distinguish the object from the background, the MFSAM is added after the coordinate channel attention to enhance the spatial information. The MFSAM first splits the input feature map into n groups (n = 4) for multiscale feature extraction. After splitting, to improve the expression ability at a more refined level, each group uses convolution layers with different receptive fields (k = 3, 5, 7, 9) and numbers of groups (g = 2, 4, 8, 16) to extract multiscale spatial feature maps. Then, the groups of feature maps are concatenated to obtain a new spatial-level multiscale feature map. Finally, the average and maximum feature maps are computed along the channel direction and concatenated; a $1 \times 1$ convolution reduces the number of channels to 1, and the sigmoid activation function remaps the attention weights to values between 0 and 1, forming a mask of refined multiscale spatial features. The attention weight $s$ is multiplied with the feature map $y_c$ output by the CAM to obtain the spatial-orientation-enhanced feature map $z_c$, which highlights the spatially salient features of azimuth-sensitive objects. The working principle of the MFSAM can be summarized as follows:
$$[y_1, y_2, \ldots, y_n] = \mathrm{split}(y_c)$$
$$p = F_{GC}(y_i), \quad i \in [1, n]$$
$$s = \sigma\left(\left[F_{\max}(p), F_{\mathrm{avg}}(p)\right]\right)$$
$$z_c = s \times y_c$$
where $y_n$ denotes the tensor after splitting, $F_{GC}$ denotes the grouped convolution function, $p$ is the intermediate feature tensor after concatenation, and $F_{\max}$ and $F_{\mathrm{avg}}$ denote the operations of calculating the maximum and average values, respectively.
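The following sketch illustrates the MFSAM as described in the text (four channel groups, kernel sizes 3/5/7/9, grouped convolutions, and a max/avg-based spatial mask fused by the $1 \times 1$ convolution mentioned above); the class name and the gcd-based group handling are assumptions added so the sketch runs for common channel widths.

```python
import math
import torch
import torch.nn as nn

class MFSAM(nn.Module):
    """Sketch of the multireceptive-field spatial attention: split channels into
    4 groups, convolve each with a different receptive field, then build a
    single-channel spatial mask from the max/avg maps."""
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0
        c = channels // 4
        ks = (3, 5, 7, 9)        # receptive fields per split
        gs = (2, 4, 8, 16)       # grouped-convolution groups per split
        self.branches = nn.ModuleList(
            [nn.Conv2d(c, c, k, padding=k // 2, groups=math.gcd(g, c))
             for k, g in zip(ks, gs)])
        self.mask_conv = nn.Conv2d(2, 1, kernel_size=1)  # 1x1 fusion of max/avg maps

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        splits = torch.chunk(y, 4, dim=1)                       # [y1, ..., y4]
        p = torch.cat([b(s) for b, s in zip(self.branches, splits)], dim=1)
        p_max, _ = p.max(dim=1, keepdim=True)                   # F_max
        p_avg = p.mean(dim=1, keepdim=True)                     # F_avg
        s = torch.sigmoid(self.mask_conv(torch.cat([p_max, p_avg], dim=1)))
        return s * y                                            # enhanced map z_c

# usage: z = MFSAM(256)(torch.randn(1, 256, 64, 64))
```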
As shown in Figure 1, the SOAMs and the PAFPN are combined and a new neck network is obtained, called the SOEPAFPN, which is able to eliminate the conflict of features between various scales and better integrate multiscale features. The feature pyramid network (FPN) [33] adopts a top–down mode to transfer high-level features downward, but the path between high-level features and low-level features is long, thus increasing the difficulty and burden of accessing accurate locating information. The PAFPN introduces the path aggregation (PA) [25] strategy to the FPN, thus creating a bottom–up path enhancement based on the FPN to shorten the information path, reduce the number of layers required to transmit features with various scales, reduce the cost of computing, and improve the locating accuracy. In particular, the SOAMs are connected to the concatenating process of upsampling and downsampling to reduce the loss of spatial orientation information after sampling.
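To illustrate where the SOAMs sit in this neck, the sketch below shows a two-level slice of a PAFPN-style fusion with attention attached at the concatenation points of the top-down and bottom-up paths. The placeholder attention class, channel handling, and interpolation mode are simplifying assumptions, not the exact SOEPAFPN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SOAMPlaceholder(nn.Module):
    """Stand-in for the SOAM (CAM followed by MFSAM); kept as identity so this
    fusion sketch stays self-contained."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x

class SOEPAFPNSketch(nn.Module):
    """Two-level illustration: attention modules are placed after the top-down
    (upsampling) and bottom-up (downsampling) concatenations."""
    def __init__(self, ch: int = 256):
        super().__init__()
        self.soam_td = SOAMPlaceholder()             # after top-down concat
        self.soam_bu = SOAMPlaceholder()             # after bottom-up concat
        self.reduce_td = nn.Conv2d(2 * ch, ch, 1)
        self.reduce_bu = nn.Conv2d(2 * ch, ch, 1)
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)

    def forward(self, p_low: torch.Tensor, p_high: torch.Tensor):
        # top-down: upsample high-level features, concatenate with low-level, enhance
        td = torch.cat([p_low, F.interpolate(p_high, scale_factor=2, mode="nearest")], dim=1)
        n_low = self.reduce_td(self.soam_td(td))
        # bottom-up: downsample the enhanced low-level features, concatenate, enhance
        bu = torch.cat([p_high, self.down(n_low)], dim=1)
        n_high = self.reduce_bu(self.soam_bu(bu))
        return n_low, n_high

# usage: n3, n4 = SOEPAFPNSketch()(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32))
```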

2.3. Inverted Pyramid ConvMixer Net (IPCN)

The backbone network of the classic YOLOX is a large-depth network combining DarkNet53 [30] and the cross-stage partial network (CSPNet) [34], which is slow in computing speed and involves a risk of overfitting on a small number of samples. To balance accuracy and speed, this paper introduces the lightweight convolution model ConvMixer [26], which can achieve an accuracy comparable to that of large deep networks through a lightweight structure. As shown in Figure 4a, ConvMixer first embeds the input image into blocks and divides the image into fixed-size patches to retain local information. Then, the ConvMixer layer with a convolution structure can independently mix the spatial and channel information of the data after patch embedding.
Considering the characteristics of aircraft samples in SAR images, the structure of ConvMixer is redesigned as the inverted pyramid ConvMixer network (IPCN), making it more suitable for detecting small SAR objects, as shown in Figure 4b. First, we use a convolution layer whose kernel size and stride are equal to the patch size (patch size = 4) to patch-embed the image, thereby enabling the network to focus on the information of small-scale objects from the beginning. After patch embedding, the ConvMixer block (CMB) is improved to obtain multiscale feature maps. Before entering the CMB, the feature map first passes through a convolution layer with a stride of 2; this layer expands the channels and reduces the spatial dimension of the features while avoiding the information loss of a pooling operation, which is important for retaining feature information. Then, depthwise convolution (DWC) [35] with a large receptive field (a kernel size of 9) is used to mix spatial information, and pointwise convolution (PWC) [35] is used to mix the channel information of the DWC output and the original input. Because the internal information-mixing part of the CMB does not change the feature dimensions, its structure can be stacked to deepen the network and fully extract deep features at different scales. We stack CMBs according to the different output scales and eliminate the deformation influence of fixed-size image input through the spatial pyramid pooling (SPP) [36] module. Finally, the inverted pyramid ConvMixer network is obtained to extract the multiscale features of SAR objects.
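A compact sketch of this inverted-pyramid layout is given below; the stage widths, block depths, and the omission of the SPP module are simplifying assumptions, so it should be read as an illustration of the patch-embedding and CMB stacking described above rather than the exact IPCN.

```python
import torch
import torch.nn as nn

class CMB(nn.Module):
    """Sketch of one ConvMixer block: depthwise conv (k = 9) mixes spatial
    information with a residual connection, pointwise conv mixes channels."""
    def __init__(self, dim: int, k: int = 9):
        super().__init__()
        self.dw = nn.Sequential(nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim),
                                nn.GELU(), nn.BatchNorm2d(dim))
        self.pw = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.GELU(), nn.BatchNorm2d(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw(self.dw(x) + x)       # residual over the spatial-mixing step

class IPCNSketch(nn.Module):
    """Illustrative inverted-pyramid layout: patch embedding (stride = patch size 4),
    then stride-2 channel-expanding convs followed by stacked CMBs, emitting
    three feature scales for the neck. Widths/depths are assumed values."""
    def __init__(self, dims=(64, 128, 256, 512), depths=(1, 2, 3)):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)
        self.stages = nn.ModuleList()
        for i, d in enumerate(depths):
            self.stages.append(nn.Sequential(
                nn.Conv2d(dims[i], dims[i + 1], kernel_size=3, stride=2, padding=1),
                *[CMB(dims[i + 1]) for _ in range(d)]))

    def forward(self, x: torch.Tensor):
        x = self.patch_embed(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                  # three scales for the SOEPAFPN
        return feats

# usage: p3, p4, p5 = IPCNSketch()(torch.randn(1, 3, 512, 512))
```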

3. Experiments

To evaluate the performance of the proposed model and its ability to detect SAR azimuth-sensitive objects in real scenes, a series of experiments is carried out with aircraft as the detection object, including ablation experiments and real-scene tests. Four ablation experiments are employed to verify the effectiveness of the proposed method, and the performance of the SOAEN in complex scenes of SAR images is tested and verified on multiple real scenes with the related detection indicators.

3.1. Dataset and Experiment Details

A total of 53 scenes of Gaofen-3 images with 1 m resolution containing airports are used in this experiment. To address the insufficiency of manually labeled aircraft data, rotation, translation, flipping, and mirroring operations are used to expand the dataset. Finally, 3216 aircraft samples of 512 × 512 pixels are obtained. The training set, the validation set, and the test set are divided according to the proportion of 8:1:1.
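For illustration, a minimal sketch of an 8:1:1 random split over the 3216 chips is shown below; the file names, seed, and shuffling strategy are hypothetical and not taken from the authors' pipeline.

```python
import random

def split_dataset(samples, seed: int = 0):
    """Shuffle the sample list and return (train, val, test) in an 8:1:1 ratio."""
    rng = random.Random(seed)
    samples = samples[:]                              # copy before shuffling
    rng.shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (samples[:n_train],                        # training set
            samples[n_train:n_train + n_val],         # validation set
            samples[n_train + n_val:])                # test set

train, val, test = split_dataset([f"chip_{i:04d}.png" for i in range(3216)])
print(len(train), len(val), len(test))                # 2572 321 323
```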
Our experiments are based on a single NVIDIA RTX 3090 GPU. In the experiment, the batch size is 8 and 300 epochs are trained. The optimizer is stochastic gradient descent (SGD) [37], the initial learning rate is set to 0.01, the weight decay is 0.0005, and the momentum is 0.9.
In the training phase, mosaic [30] and mixup [38] enhancement strategies are used to improve the robustness and detection performance of the model. The main idea of mosaic is to randomly crop four images and stitch them into a single training image; the mixup operation mixes two random images in proportion to generate a new image. However, the training images generated after data enhancement do not follow the real distribution of samples, and the crop operation of mosaic introduces a large number of incomplete annotation boxes. Data augmentation is therefore disabled in the last 15 epochs to avoid the influence of incomplete detection boxes and to complete the final convergence under the real data distribution [27].
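The training schedule described above can be summarized in a short sketch; `model`, `train_loader`, and `train_one_epoch` are hypothetical placeholders, and only the optimizer settings and the point at which mosaic/mixup are disabled follow the text.

```python
import torch

def train(model, train_loader, train_one_epoch, epochs: int = 300, no_aug_epochs: int = 15):
    """Sketch: SGD with lr 0.01, momentum 0.9, weight decay 5e-4 (batch size 8 is
    assumed to be handled by train_loader); augmentation is off for the last 15 epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=5e-4)
    for epoch in range(epochs):
        use_mosaic_mixup = epoch < epochs - no_aug_epochs   # disabled in the final epochs
        train_one_epoch(model, train_loader, optimizer,
                        mosaic=use_mosaic_mixup, mixup=use_mosaic_mixup)
```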
The loss curves of the SOAEN and YOLOX during training are shown in Figure 5. The loss convergence speed is significantly accelerated after the data augmentation is turned off. In general, the SOAEN converges faster than YOLOX, and its training is more efficient.

3.2. Evaluation Index

To quantitatively analyze the detection performance, the two most commonly used evaluation indexes in SAR object detection are used: the detection rate (DR) [39] and the false alarm rate (FAR) [39].
The detection rate is the ratio of the number of correctly detected objects to the number of ground truths; this rate measures the ability of the model to correctly detect objects. The detection rate is defined as follows:
$$DR = \frac{N_{DT}}{N_{GT}}$$
where $N_{DT}$ represents the number of correctly detected aircraft and $N_{GT}$ represents the number of true aircraft. The subscripts DT and GT stand for detected true and ground truth, respectively.
The false alarm rate refers to the ratio of the number of objects with detection errors to the total number of model predictions; this rate is used to measure the robustness of the model. The false alarm rate is defined as follows:
$$FAR = \frac{N_{DF}}{N_{DT} + N_{DF}}$$
where $N_{DF}$ represents the number of aircraft detected incorrectly and the subscript DF stands for detected false.
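Both indexes can be computed directly from the detection counts; the small sketch below uses the SOAEN counts for Hongqiao Airport from Table 3 as a worked example.

```python
def detection_rate(n_dt: int, n_gt: int) -> float:
    """DR = correctly detected objects / ground-truth objects."""
    return n_dt / n_gt

def false_alarm_rate(n_df: int, n_dt: int) -> float:
    """FAR = false detections / all detections (true + false)."""
    return n_df / (n_dt + n_df)

# SOAEN at Hongqiao Airport in Table 3: N_GT = 78, N_DT = 72, N_DF = 9
print(f"DR = {detection_rate(72, 78):.2%}, FAR = {false_alarm_rate(9, 72):.2%}")
# -> DR = 92.31%, FAR = 11.11%
```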
In the ablation experiments, to evaluate the classification and localization performance and the computational complexity of the designed modules, we adopt the metrics shown in Table 1, following the definitions of the VOC2007 and COCO metrics [40,41]:

3.3. Ablation Experiments

Ablation experiments are commonly used to reveal the effects of different modules on the model [18,19,23,24,27,31]. To verify the effectiveness of each designed module, four sets of ablation experiments are conducted on the test dataset, namely (1) using the baseline YOLOX; (2) replacing the backbone network with the IPCN only; (3) replacing the PAFPN with the SOEPAFPN only; and (4) using the SOAEN.
The experimental results are shown in Table 2. In Experiment 2, the IPCN is used to replace CSPDarknet, the backbone network of the baseline YOLOX. Compared with CSPDarknet, the AP obtained by the IPCN increases by 0.41% and the FLOPs decrease significantly from 99.40 G to 79.02 G, which shows that our proposed IPCN slightly outperforms CSPDarknet while remaining lightweight. Experiment 3 uses the SOEPAFPN, which combines the SOAMs, to aggregate the spatial orientation information of objects. Compared with the baseline, the AP of YOLOX with the SOEPAFPN increases by 0.84%, while the FLOPs increase by only 0.04 G; the increase in computation is small but the performance is greatly improved, showing that the proposed SOAM can effectively capture the spatial orientation features of objects. In Experiment 4, compared with the baseline YOLOX, the combined use of the IPCN and the SOEPAFPN increases AP, AP50, and AP75 by 1.80%, 1.16%, and 1.35%, respectively. The results of the ablation experiments fully demonstrate that our proposed method improves the detection performance compared with the YOLOX network.

3.4. Experimental Results and Analysis in Real Scenes

To evaluate the performance of the SOAEN in real scenes, we selected two typical airport areas from Shanghai Hongqiao Airport and Beijing Capital Airport as the test areas, and the data source is the 1 m resolution SAR images from the Gaofen-3 system. In particular, none of these images were included in the dataset, thereby verifying the generalization ability of models. The models used in the comparative experiment are anchor-based YOLOv5 and anchor-free YOLOX. The training parameters of the three models are consistent.

3.4.1. Analysis of Hongqiao Airport

Hongqiao Airport, Shanghai, China, is a large-scale civil airport with many types of aircraft and small parking distances. Affected by corridor bridges and goods, the backscattering characteristics of aircraft near the terminals are complex and diverse, thus increasing the difficulty of detection. Figure 6 and Figure 7 show the aircraft detection results in the apron and the terminal of Hongqiao Airport, respectively.
The area shown in Figure 6 is dominated by the apron, which is relatively open and contains less background clutter, making it a relatively simple scene for the detectors. The experimental results show that YOLOv5 exhibits 1 missed detection and 1 false alarm, YOLOX exhibits 2 missed detections and 3 false alarms, and the SOAEN exhibits no missed detections or false alarms. In this scenario, all three models were able to detect all aircraft in the open area, but both YOLOX and YOLOv5 exhibited missed detections near the terminals. In addition, YOLOv5 produces 3 inaccurate boxes affected by buildings, while the situation is much better in YOLOX and the SOAEN. Three building areas are mistakenly detected as aircraft in YOLOX's results, which indicates that the anchor-free model does not distinguish the background well even though it improves the generalization ability. The SOAEN displays the best detection performance and detects all aircraft without false alarms; its only shortcoming is that the positioning of 1 detection box is not accurate enough, leaving the object incompletely enclosed. As can be seen in Figure 6, the fuselage structure of this aircraft is unclear and its texture is similar to the ground, but the SOAEN result still contains the main part of this aircraft. The results show that the SOAEN possesses good feature extraction and object discrimination abilities.
The areas shown in Figure 7 are terminals with a denser aircraft arrangement and a more complex background, which poses a great challenge for the detectors. The experimental results show that YOLOv5 exhibits 6 missed detections and no false alarms; YOLOX exhibits 3 missed detections and 2 false alarms; and the SOAEN exhibits only 1 missed detection and no false alarms. In this scenario, the missed detections of YOLOv5 and YOLOX are serious: YOLOv5 exhibits poor generalization ability and the most missed detections, while YOLOX, although with fewer missed detections, tends to mistakenly detect strongly scattering background areas as aircraft. There is only 1 missed detection in the result of the SOAEN. Affected by the SAR imaging azimuth, this missed aircraft lacks a clear geometric structure compared with the other aircraft and appears as discrete scattering points, so the miss is acceptable. In particular, our SOAEN is able to detect objects missed by both YOLOv5 and YOLOX, which indicates that the combination of the IPCN and SOAMs greatly improves the feature extraction capability of YOLOX and that the SOAMs enable the network to focus more on objects.

3.4.2. Analysis of Capital Airport

Compared with the terminal structure of Hongqiao Airport, the terminal structure of Beijing Capital Airport is more complex, with much mechanical or metal equipment around the aircraft that shows a texture similar to that of aircraft and is prone to being falsely detected. In addition, there are many strong scattering points in the open areas away from buildings, which also tend to cause false alarms. In this airport, two terminal areas with completely different shapes are selected as test areas to verify the robustness of our model.
Figure 8 shows the test results in front of a terminal. The terminal building in this scenario is small, but the scattering clutter from the building is prominent, making it difficult for the aircraft signatures to maintain their integrity. As can be seen from Figure 8, YOLOv5 exhibits 2 missed detections; YOLOX exhibits no missed detections but 1 false alarm, because a small object was incorrectly detected as an aircraft, which shows that feature scale transformation alone cannot effectively distinguish small objects. The SOAEN's results are extremely close to the ground truth, with no false alarms or missed detections, indicating that the patch-embedding operation of the IPCN can enhance feature extraction for small objects.
Figure 9 shows the aircraft detection results in complex scenes. In this scene, the terminal area is large and there is considerable background clutter in the open area. The aircraft are of different scales, and their scattering characteristics vary greatly due to the different SAR azimuth angles, which makes detection a big challenge. The results show that YOLOv5 exhibits 4 missed detections and 1 false alarm; YOLOX exhibits 2 missed detections and 3 false alarms; and the SOAEN exhibits only 1 missed detection and 1 false alarm. In this scenario, YOLOv5 exhibits the most missed detections and misidentifies a strongly scattering background area as an aircraft; in the blue box, the tail of an aircraft is incorrectly identified as a complete object. In the YOLOX result, all 3 false alarms are near buildings and show strong scattering points. In contrast, the result of the SOAEN is much better, with aircraft of different scales and shapes well detected. The only missed aircraft is also not detected by the other two models because it is heavily affected by background clutter and its geometric and scattering features are not prominent. The results show that the SOAEN can comprehensively capture the complex scattering features exhibited by azimuth-sensitive objects and displays good robustness in complex scenes.

3.5. Performance of Different SAR Object Detection Algorithms

For a more intuitive performance comparison, Table 3 gives the overall evaluation indexes of each network for the different airports. In terms of the DR, YOLOv5 displays the worst performance; its average DR is only 86.21%, which indicates weak feature learning and generalization ability for aircraft. With an average DR of 87.44%, YOLOX outperforms YOLOv5 in detection. However, the average FAR of YOLOX is 17.63%, which is 2.26% higher than that of YOLOv5. The detection and false alarm rates of the proposed SOAEN are more balanced: the average DR reaches 91.22%, which is 3.78% and 5.01% higher than those of YOLOX and YOLOv5, respectively, and the average FAR is only 13.60%. Therefore, the SOAEN displays more reliable detection performance.

4. Discussion

In designing a framework for azimuth-sensitive object detection in SAR images, the generalization advantage of anchor-free algorithms and a design that requires fewer preset parameters are the characteristics we want to exploit; YOLOX, which meets these conditions, is selected as the basic framework. The experimental results (Figure 6, Figure 7, Figure 8 and Figure 9) show that the generalization ability and the bounding box accuracy of YOLOX are slightly better than those of the anchor-based YOLOv5. However, due to the lack of modules customized for SAR, the stronger generalization ability of YOLOX also leads to more false alarms. Considering the feature heterogeneity of SAR azimuth-sensitive objects under different azimuths and the interference of complex backgrounds, the SOAM is designed to eliminate the influence of the azimuth angle, suppress background interference, and strengthen feature expression. This module integrates coordinate channel attention and spatial attention, and as Table 2 shows, it provides a certain improvement over the original YOLOX. In addition, an overly complex feature extraction network is not what we want: complex deep networks have high computational cost and repeatedly extract features, which makes them prone to overfitting when dealing with a limited number of SAR samples, resulting in a decline in accuracy. In view of the feature extraction of small objects and the balance between accuracy and computational cost, we designed a lightweight feature extraction network, the IPCN, and the experiments show that its feature extraction ability is sufficient. In this paper, the proposed framework is applied to aircraft detection in Gaofen-3 images and satisfactory detection accuracy is achieved. In future research, the network will be extended to further experimental analysis of other azimuth-sensitive objects (such as ships and vehicles).

5. Conclusions

In this paper, the SOAEN is proposed to detect azimuth-sensitive objects in SAR images with high efficiency and good robustness. The SOAEN consists of three main components: a lightweight backbone, the IPCN; a feature fusion neck, the SOEPAFPN; and a detection head, the AFDH. The performance improvement of the SOAEN mainly results from the two innovative algorithms, the IPCN and the SOAM, proposed in this paper. The IPCN is responsible for extracting effective multiscale features at less computational cost. The SOAM in the SOEPAFPN is implemented based on two key mechanisms, the CAM and the MFSAM. The CAM can capture the feature changes of objects in different directions, which is suitable for the characteristics of azimuth-sensitive objects; the MFSAM can highlight the features of objects and suppress interference. Thus, the SOAM can help the PAFPN effectively integrate the spatial orientation features of objects and reduce the interference of background clutter. The ablation experiments on the Gaofen-3 test dataset have confirmed the effectiveness of the proposed modules. Four typical areas of Shanghai Hongqiao Airport and Beijing Capital Airport were selected to carry out aircraft detection experiments to verify the robustness of our method in real complex scenes. The accuracy of our method is 3.78% and 5.01% higher than that of anchor-free YOLOX and anchor-based YOLOv5, respectively.
In future research, we will continue to explore the application of more algorithms in SAR azimuth-sensitive object detection; for example, transformer series algorithms display great potential in image processing.

Author Contributions

Conceptualization, J.G. and B.Z.; methodology, J.G.; software, J.G. and C.X.; validation, J.G., X.W., and C.W.; formal analysis, J.G.; investigation, C.X., B.Z., and C.W.; resources, C.W.; data curation, C.W.; writing—original draft preparation, J.G.; writing—review and editing, B.Z., X.W., and C.W.; visualization, J.G.; supervision, B.Z.; project administration, C.W.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China through Grant No. 41930110.

Data Availability Statement

Not applicable.

Acknowledgments

We sincerely thank the China Centre for Resources Satellite Data and Application for providing data for this study and the anonymous reviewers for their critical comments and suggestions for improving the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Ding, B.; Wen, G.; Huang, X.; Ma, C.; Yang, X. Target recognition in SAR images by exploiting the azimuth sensitivity. Remote Sens. Lett. 2017, 8, 821–830.
2. Chen, J.; Zhang, B.; Wang, C. Backscattering feature analysis and recognition of civilian aircraft in TerraSAR-X images. IEEE Geosci. Remote Sens. Lett. 2014, 12, 796–800.
3. Zhao, Z.; Zheng, P.; Xu, S.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232.
4. Zhang, X.; Liu, G.; Zhang, C.; Atkinson, P.M.; Tan, X.; Jian, X.; Zhou, X.; Li, Y. Two-phase object-based deep learning for multi-temporal SAR image change detection. Remote Sens. 2020, 12, 548.
5. Chen, C.; He, C.; Hu, C.; Pei, H.; Jiao, L. A deep neural network based on an attention mechanism for SAR ship detection in multiscale and complex scenarios. IEEE Access 2019, 7, 104848–104863.
6. Li, J.; Qu, C.; Shao, J. Ship detection in SAR images based on an improved faster R-CNN. In Proceedings of the 2017 SAR in Big Data Era: Models, Methods and Applications (BIGSARDATA), Beijing, China, 13–14 November 2017; pp. 1–6.
7. Kang, M.; Leng, X.; Lin, Z.; Ji, K. A modified faster R-CNN based on CFAR algorithm for SAR ship detection. In Proceedings of the 2017 International Workshop on Remote Sensing with Intelligent Processing (RSIP), Shanghai, China, 19–21 May 2017; pp. 1–4.
8. Tang, G.; Zhuge, Y.; Claramunt, C.; Men, S. N-Yolo: A SAR ship detection using noise-classifying and complete-target extraction. Remote Sens. 2021, 13, 871.
9. Wang, J.; Lin, Y.; Guo, J.; Zhuang, L. SSS-YOLO: Towards more accurate detection for small ships in SAR image. Remote Sens. Lett. 2021, 12, 93–102.
10. Wang, Z.; Du, L.; Mao, J.; Liu, B.; Yang, D. SAR target detection based on SSD with data augmentation and transfer learning. IEEE Geosci. Remote Sens. Lett. 2018, 16, 150–154.
11. Chang, Y.-L.; Anagaw, A.; Chang, L.; Wang, Y.C.; Hsiao, C.-Y.; Lee, W.-H. Ship Detection Based on YOLOv2 for SAR Imagery. Remote Sens. 2019, 11, 786.
12. Wu, Z.; Hou, B.; Ren, B.; Ren, Z.; Wang, S.; Jiao, L. A deep detection network based on interaction of instance segmentation and object detection for SAR images. Remote Sens. 2021, 13, 2582.
13. Wang, S.; Gao, X.; Sun, H.; Zheng, X.; Sun, X. An aircraft detection method based on convolutional neural networks in high-resolution SAR images. J. Radars 2017, 6, 195–203.
14. Guo, Q.; Wang, H.; Kang, L.; Li, Z.; Xu, F. Aircraft target detection from spaceborne SAR image. In Proceedings of the 2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1168–1171.
15. Diao, W.; Dou, F.; Fu, K.; Sun, X. Aircraft detection in SAR images using saliency based location regression network. In Proceedings of the 2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 23–27 July 2018; pp. 2334–2337.
16. Zhang, L.; Li, C.; Zhao, L.; Xiong, B.; Quan, S.; Kuang, G. A cascaded three-look network for aircraft detection in SAR images. Remote Sens. Lett. 2020, 11, 57–65.
17. He, C.; Tu, M.; Xiong, D.; Tu, F.; Liao, M. Adaptive component selection-based discriminative model for object detection in high-resolution SAR imagery. ISPRS Int. J. Geo-Inf. 2018, 7, 72.
18. Guo, Q.; Wang, H.; Xu, F. Scattering Enhanced Attention Pyramid Network for Aircraft Detection in SAR Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7570–7587.
19. Li, M.; Wen, G.; Huang, X.; Li, K.; Lin, S. A Lightweight Detection Model for SAR Aircraft in a Complex Environment. Remote Sens. 2021, 13, 5020.
20. Luo, R.; Chen, L.; Xing, J.; Yuan, Z.; Tan, S.; Cai, X.; Wang, J. A Fast Aircraft Detection Method for SAR Images Based on Efficient Bidirectional Path Aggregated Attention Network. Remote Sens. 2021, 13, 2940.
21. Luo, R.; Xing, J.; Chen, L.; Pan, Z.; Cai, X.; Li, Z.; Wang, J.; Ford, A. Glassboxing Deep Learning to Enhance Aircraft Detection from SAR Imagery. Remote Sens. 2021, 13, 3650.
22. Wang, J.; Xiao, H.; Chen, L.; Xing, J.; Pan, Z.; Luo, R.; Cai, X. Integrating Weighted Feature Fusion and the Spatial Attention Module with Convolutional Neural Networks for Automatic Aircraft Detection from SAR Images. Remote Sens. 2021, 13, 910.
23. Zhao, Y.; Zhao, L.; Li, C.; Kuang, G. Pyramid attention dilated network for aircraft detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 662–666.
24. Zhao, Y.; Zhao, L.; Liu, Z.; Hu, D.; Kuang, G.; Liu, L. Attentional Feature Refinement and Alignment Network for Aircraft Detection in SAR Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16.
25. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
26. Trockman, A.; Kolter, J.Z. Patches Are All You Need? arXiv 2022, arXiv:2201.09792.
27. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
28. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
29. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
30. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
31. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
32. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
33. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
34. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
35. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
36. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
37. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
38. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
39. Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2018, 145, 3–22.
40. Everingham, M.; Zisserman, A.; Williams, C.K.; Van Gool, L.; Allan, M.; Bishop, C.M.; Chapelle, O.; Dalal, N.; Deselaers, T.; Dorkó, G. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. 2008. Available online: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html (accessed on 1 March 2022).
41. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
Figure 1. The overall architecture of the spatial orientation attention enhancement network (SOAEN).
Figure 2. Aircraft with different azimuth angles in Gaofen-3 1 m images. (a) Aircraft exhibiting discrete strong scattering points. (b) Aircraft exhibiting geometric contours. (c) Aircraft exhibiting fuzzy contours. (d) Aircraft exhibiting fuzzy contours and isolated strong scattering points.
Figure 3. The structure of the spatial orientation attention module (SOAM).
Figure 4. (a) The structure of ConvMixer. (b) The structure of the inverted pyramid ConvMixer net (IPCN).
Figure 5. Loss curves of the SOAEN and YOLOX during sample training.
Figure 6. The results for scene I of Shanghai Hongqiao Airport. (a) The ground truth from Gaofen-3. (bd) The detection results by YOLOv5, YOLOX, and SOAEN, respectively. The correct detection, the missed detection, false alarms, and inaccurate bounding boxes are indicated by red, green, yellow, and blue boxes, respectively.
Figure 7. The results for scene II of Shanghai Hongqiao Airport. (a) The ground truth from Gaofen-3. (bd) The detection results by YOLOv5, YOLOX, and SOAEN, respectively. The correct detection, the missed detection, the false alarms, and the inaccurate bounding boxes are indicated by red, green, yellow, and blue boxes, respectively.
Figure 8. The results for scene I of Beijing Capital Airport. (a) The ground truth from Gaofen-3. (bd) The detection results by YOLOv5, YOLOX, and SOAEN, respectively. The correct detection, the missed detection, the false alarms, and the inaccurate bounding boxes are indicated by red, green, yellow, and blue boxes, respectively.
Figure 9. The results for scene II of Beijing Capital Airport. (a) The ground truth from Gaofen-3. (bd) The detection results by YOLOv5, YOLOX, and SOAEN, respectively. The correct detection, the missed detection, the false alarms, and the inaccurate bounding boxes are indicated by red, green, yellow, and blue boxes, respectively.
Table 1. Metrics of ablation experiments.

Metrics | Meaning
AP *    | IoU = 0.5:0.05:0.95
AP50    | IoU = 0.5
AP75    | IoU = 0.75
FLOPs   | Floating point operations

* Average precision (AP): a metric calculated at various Intersection over Union (IoU) thresholds.
Table 2. Results of ablation experiments.

ID           | Backbone   | Neck     | AP (%) | AP50 (%) | AP75 (%) | FLOPs (G)
Experiment 1 | CSPDarknet | PAFPN    | 59.06  | 89.12    | 68.49    | 99.40
Experiment 2 | IPCN       | PAFPN    | 59.47  | 89.39    | 68.61    | 79.02
Experiment 3 | CSPDarknet | SOEPAFPN | 59.90  | 89.60    | 68.98    | 99.44
Experiment 4 | IPCN       | SOEPAFPN | 60.86  | 90.28    | 69.84    | 79.06
Table 3. Comparison of test performance between different models.

Models | Regions          | N_GT | N_DT | N_DF | DR (%) | FAR (%)
SOAEN  | Capital Airport  | 81   | 73   | 14   | 90.12  | 16.09
       | Hongqiao Airport | 78   | 72   | 9    | 92.31  | 11.11
       | Mean             |      |      |      | 91.22  | 13.60
YOLOX  | Capital Airport  | 81   | 70   | 18   | 86.42  | 20.45
       | Hongqiao Airport | 78   | 69   | 12   | 88.46  | 14.81
       | Mean             |      |      |      | 87.44  | 17.63
YOLOv5 | Capital Airport  | 81   | 68   | 15   | 83.95  | 18.07
       | Hongqiao Airport | 78   | 69   | 10   | 88.46  | 12.66
       | Mean             |      |      |      | 86.21  | 15.37
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Ge, J.; Wang, C.; Zhang, B.; Xu, C.; Wen, X. Azimuth-Sensitive Object Detection of High-Resolution SAR Images in Complex Scenes by Using a Spatial Orientation Attention Enhancement Network. Remote Sens. 2022, 14, 2198. https://doi.org/10.3390/rs14092198
