Ship Detection in SAR Images Based on Feature Enhancement Swin Transformer and Adjacent Feature Fusion

Abstract: Convolutional neural networks (CNNs) have achieved milestones in object detection in synthetic aperture radar (SAR) images. Recently, vision transformers and their variants have shown great promise in detection tasks. However, ship detection in SAR images remains a substantial challenge because ship objects exhibit strong scattering, appear at multiple scales, and sit against complex backgrounds. To address these problems, this paper proposes an enhancement Swin transformer detection network, named ESTDNet, for ship detection in SAR images. We adopt Cascade R-CNN with a Swin transformer backbone (Cascade R-CNN Swin) as the benchmark model for ESTDNet. On this basis, we build two modules: the feature enhancement Swin transformer (FESwin) module, which improves feature extraction capability, and the adjacent feature fusion (AFF) module, which optimizes the feature pyramid. First, the FESwin module is employed as the backbone network; it uses a CNN to aggregate contextual information from before and after each Swin transformer block. Building on the visual dependencies captured through self-attention, it fuses features with single-point channel information interaction as the primary path and local spatial information interaction as the secondary path, which improves spatial-to-channel feature expression and increases the utilization of ship information in SAR images. Second, the AFF module performs a weighted selection fusion of each high-level feature in the feature pyramid with its adjacent shallow-level feature using learnable adaptive weights, which allows ship information to be attended to on feature maps at more scales and improves the recognition and localization of ships in SAR images. Finally, an ablation study conducted on the SSDD dataset validates the effectiveness of the two components proposed in the ESTDNet detector. Moreover, experiments on two public datasets, SSDD and SARShip, demonstrate that the ESTDNet detector outperforms state-of-the-art methods, providing a new idea for ship detection in SAR images.
This paper describes a ship detection network that combines a transformer and a CNN to achieve excellent detection performance in SAR images. We first introduce a CNN into the Swin transformer [21] to build a feature enhancement Swin transformer (FESwin) module that better extracts ship features. FESwin has not only the global context perception and spatial information extraction capabilities of a transformer, but also the local feature extraction and channel feature aggregation capabilities of a CNN. In addition, to overcome the shortcomings of feature fusion, we construct a feature pyramid network architecture for ship detection, namely the adjacent feature fusion (AFF) module. AFF performs a weighted selection fusion of high-level features and adjacent shallow features, with fusion proportions given by adaptive weights learned during training of the ship detection network, which gives AFF a powerful feature fusion capability. Furthermore, we introduce Cascade R-CNN [16] as the detection head to improve detection accuracy. Finally, we combine FESwin, AFF, and Cascade R-CNN to build an enhancement Swin transformer detection network (ESTDNet). We experimentally validate the design of ESTDNet on the SSDD and SARShip datasets. The results show that ESTDNet significantly improves multi-scale detection performance in complex backgrounds. This paper focuses on the design of the backbone and neck of the object detection framework. Therefore, we use the Cascade R-CNN Swin framework as the baseline model, and our modules can be flexibly embedded in other object detection frameworks as functional components.
We summarize the contributions of this work below.


Introduction
Synthetic aperture radar (SAR) offers all-weather, day-and-night imaging, resistance to jamming, long detection range, and high concealment [1]. As a remote sensing data source, SAR images are widely used in scientific research, military reconnaissance, disaster monitoring, resource planning, and natural environment protection [2]. SAR ship detection is a fundamental maritime task with essential value in maritime resource management and maritime emergency rescue. Therefore, ship detection in SAR images is an attractive research topic.
In recent years, deep learning technology has presented a diversified development trend in the field of image processing [3][4][5]. Deep learning-based object detection techniques have been developing rapidly, and many common detection methods are classified

1. A FESwin module is proposed as the backbone network to extract ship feature information. The module not only has the excellent spatial feature processing capability of the Swin transformer but also uses a CNN to enhance the association among feature map channels. It mitigates the insufficient feature extraction caused by the strong scattering of SAR objects, obtains more salient feature information at different scales, and enhances the transmission of feature information.
2. We construct an AFF module that allows shallow feature information in the feature pyramid to be selectively and adaptively fused into the adjacent higher-level feature information. The combination of learnable weights and adjacent fusion reduces the large information gap between the bottom and higher-level features and alleviates the problem of attention dispersion in feature maps.
3. A ship detector for SAR images is constructed by combining the FESwin module with the AFF module. The contributions of FESwin and AFF to ESTDNet were verified separately, and each improves performance. Experiments on the SSDD and SARShip datasets show that ESTDNet detects ships in SAR images with higher accuracy.
The rest of this paper is arranged as follows: Section 2 describes the proposed network. Section 3 analyses the experimental results of the network proposed in this paper and compares them with other algorithms. Section 4 discusses some phenomena according to the experimental results. Finally, Section 5 gives conclusions about this paper.

The Proposed Method
This paper proposes a SAR ship detection algorithm called ESTDNet, based on a feature enhancement Swin transformer and adjacent feature fusion, which uses Cascade R-CNN Swin as the benchmark model. First, combining the advantages of the CNN structure and the Swin transformer, FESwin is proposed as a new backbone network. Second, the FPN is replaced with AFF to reconstruct the multi-scale feature pyramid. In the testing phase, we use the common objects in context (COCO) metrics as the evaluation standard. The overall framework is shown in Figure 1. The algorithm proposed in this paper will be explained in detail from three aspects.
Figure 1. The overview of ESTDNet. Compared to Cascade R-CNN Swin, we mainly improve the feature extraction capability by using FESwin as the backbone network and employing AFF to replace the original FPN for feature fusion.

FESwin Backbone Network
To obtain feature maps with richer information about ships, the feature information can be transferred to deeper levels of the model. In this paper, we propose the feature enhancement Swin transformer, called FESwin, as the backbone network for feature extraction, as shown in Figure 2. Our method builds on the transformer idea, a hierarchical structure design, and the window attention mechanism to establish associations between image features. First, the contextual information before and after the Swin transformer block of each stage is fused using a skip connection. Second, we enhance the inter-channel interaction of information after fusion. This optimizes the feature extraction capability and yields better ship detection accuracy in SAR images.
As shown in Figure 2, the FESwin backbone network is composed of a Swin transformer and a feature enhancement module. Relying on the hierarchical design of the Swin transformer, the feature enhancement module is introduced in each feature extraction stage. The feature maps of each stage are enhanced again, and the more expressive and informative feature maps are used as the output of the current stage. Meanwhile, the output of each stage of FESwin is used as the input of the next stage to obtain more advanced semantic information, enhancing the whole backbone network. In our study, we found that the Swin transformer performs attention operations at each stage with relatively independent information between dimensions and weak interaction among feature channels. Although it integrates spatial and channel information, the correlation between feature channels within a single stage is weak. Therefore, the feature enhancement module is proposed in the backbone to aggregate the contextual information of different perceptions before and after the Swin transformer block and to take advantage of the CNN to enhance channel information interaction. It enables further integration of channel and spatial information to enhance model representation.
Remote Sens. 2022, 14, 3186
We propose a feature enhancement module, as shown in Figure 3. The feature maps before and after feature extraction are fused using a skip connection. After that, the fusion is carried out in equal proportion using single-point channel information interaction as the primary method and local spatial information interaction as the supplement. Local spatial information interaction uses convolutional layers and activation functions to enhance local perception and obtain a larger receptive field. Channel information interaction is performed by point-wise convolution at each spatial location, and cross-channel information aggregation is performed for each patch. Finally, the two are fused by weighting to enhance the feature extraction ability, which gives the model better expressive power.
We denote the feature maps before and after the Swin transformer block as X and Y, and perform channel information integration and spatial information integration after fusing the two feature maps. We use point-wise convolution (pwconv) for channel information integration, allowing point-by-point channel information at each spatial location to interact. We denote the output by $C(X) \in \mathbb{R}^{C \times H \times W}$, defined in Equation (1). Besides the channel information, in terms of spatial information, we integrate each point in a single channel with its neighboring locations. We denote the output by $S(X) \in \mathbb{R}^{C \times H \times W}$, defined in Equation (2).
$$C(X) = \mathrm{LN}(\mathrm{pwconv}_2(\delta(\mathrm{LN}(\mathrm{pwconv}_1(X + Y))))) \qquad (1)$$
$$S(X) = \mathrm{LN}(\mathrm{conv}_2(\delta(\mathrm{LN}(\mathrm{conv}_1(X + Y))))) \qquad (2)$$
where $\mathrm{pwconv}_1(X) \in \mathbb{R}^{C/4 \times H \times W}$ (4 is the channel compression rate) performs the dimensionality reduction and $\mathrm{pwconv}_2(X) \in \mathbb{R}^{C \times H \times W}$ the dimensionality increase, $\delta$ is ReLU, and LN indicates layer normalization. The convolutional kernel sizes of $\mathrm{conv}_1$ and $\mathrm{conv}_2$ are (C/4) × C × 3 × 3 and C × (C/4) × 3 × 3, respectively. Because C(X) and S(X) have the same shape as the input features, both can retain the fine details of the original features to different degrees. We use Z to denote the feature map resulting from the weighted fusion of spatial and channel features. The fusion weights are obtained by applying the sigmoid activation function to (X + Y), denoted $m(X + Y) \in \mathbb{R}^{C \times H \times W}$. In the weighted fusion, the sum of the weights of the feature mappings is restricted to 1, and ⊗ denotes element-wise multiplication; the calculation is expressed in Equation (3).
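The weighted fusion that produces Z can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the feature maps are flattened to short lists, the inputs standing in for C(X) and S(X) are made-up numbers, and the function name `gated_fusion` is hypothetical. The text only states that the two weights sum to 1, so assigning the gate m to the channel branch and (1 - m) to the spatial branch is an assumption.

```python
import math


def sigmoid(v: float) -> float:
    """Logistic function used to turn (X + Y) into a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def gated_fusion(x, y, channel_branch, spatial_branch):
    """Fuse two branch outputs element by element with complementary weights.

    x, y: flat lists of feature values before and after the Swin block.
    channel_branch, spatial_branch: flat lists standing in for C(X) and S(X).
    Assumption: m weights the channel branch and (1 - m) the spatial branch.
    """
    fused = []
    for xi, yi, ci, si in zip(x, y, channel_branch, spatial_branch):
        m = sigmoid(xi + yi)                   # m(X + Y), one weight per element
        fused.append(m * ci + (1.0 - m) * si)  # the two weights sum to 1
    return fused


# Toy values only; real inputs would be C x H x W tensors.
x = [0.5, -1.0, 2.0]
y = [-0.5, 1.0, 1.0]
c = [1.0, 2.0, 3.0]
s = [3.0, 2.0, 1.0]
z = gated_fusion(x, y, c, s)
```

Because the weights are complementary, every fused value lies between the corresponding channel and spatial values, which is one way to read the claim that both branches keep details of the original features to different degrees.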

AFF Module
The architecture of the proposed AFF is shown in Figure 4, with two main optimizations based on the FPN. The first is a bottom-up augmentation of adjacent layer features, which fuses feature information only between the current layer and its adjacent shallow layer. No association is made beyond adjacent feature layers, so the fused content is relatively independent. The second is a weighted selection fusion between layers using learnable adaptive weights to obtain better fusion results. The AFF module combines the advantages of FPN and PANet [42], while the upward fusion of adjacent layers avoids an excessive semantic gap between multiple layers. The AFF module alleviates the loss of feature information and the dispersion of attention over feature information. It enables ship information to gain attention in feature maps at different scales.
Figure 4 shows that the output of the initial feature pyramid can be expressed as {L1, L2, L3, L4}. Each of the high-level feature mappings L2, L3, and L4 is fused with its adjacent shallow feature mapping; the shallow-level features {L1, L2, L3} are downsampled to a unified scale and expressed as {L1′, L2′, L3′}. The L2, L3, and L4 layers are then used to generate the learnable weights α1, α2, and α3. Each weight is learned independently to form adaptive fusion parameters for each feature mapping. Finally, the two adjacent feature mappings are multiplied by complementary learnable weights, and the results are summed. The final output feature mappings are denoted {P1, P2, P3, P4}.
Two points should be noted when using the AFF module. First, the extension for adjacent fusion does not operate on L1. Second, the sum of the weights used for two adjacent feature mappings is constrained to be 1 to ensure stable model training.
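The adjacent fusion described above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions rather than the paper's code. Each level is a flat list, the shallower level is assumed to be already downsampled to the scale of the level above it, and the weights are fixed numbers instead of learned parameters. Which of the two maps receives α and which receives (1 - α) is not stated in the text, so the choice below is an assumption, as is the function name `adjacent_feature_fusion`.

```python
def adjacent_feature_fusion(levels, alphas):
    """Fuse each pyramid level with its (already downsampled) shallower neighbour.

    levels: [L1, L2, L3, L4], each a flat list of floats of equal length.
    alphas: [a1, a2, a3], weights in [0, 1] associated with L2, L3, L4.
    Assumption: alpha weights the current level, (1 - alpha) the shallower one.
    """
    outputs = [list(levels[0])]  # P1 = L1: no adjacent fusion on the first level
    for i in range(1, len(levels)):
        a = alphas[i - 1]
        fused = [a * cur + (1.0 - a) * low
                 for cur, low in zip(levels[i], levels[i - 1])]
        outputs.append(fused)
    return outputs


# Toy pyramid with two values per level.
levels = [[1.0, 1.0], [2.0, 4.0], [3.0, 3.0], [8.0, 0.0]]
alphas = [0.5, 0.75, 1.0]
p1, p2, p3, p4 = adjacent_feature_fusion(levels, alphas)
```

Each output depends only on the current level and the one directly below it, which mirrors the statement that no association is made beyond adjacent layers. With a weight of 1.0 the level passes through unchanged, as P4 does here.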

Architecture of ESTDNet
The detailed architecture of ESTDNet is shown in Figure 5, which is the application of FESwin and AFF modules in Cascade R-CNN Swin. Because FESwin is structurally complex, we show the network model characteristics in Table 1. FESwin is used as the backbone network for feature extraction, and the AFF module is used as the neck for feature fusion. The output of the FESwin module is the input of the AFF module, and the output of the AFF module is paired with Cascade R-CNN for category prediction and object position prediction to complete ship detection in SAR images.

Experiment Settings
Because ESTDNet is built on Cascade R-CNN Swin, it is an end-to-end network model. We use the experimental results of Cascade R-CNN Swin as our baseline. The initial learning rate is set to 0.001, the optimizer is SGD, and the NMS threshold is set to 0.5. After statistical analysis of the datasets, the input image size was set to 672 × 672 for the SSDD dataset and 256 × 256 for the SARShip dataset. Although the image sizes of the two datasets differ, ESTDNet adapts the size of its network layers to the input image size. We use image flipping to augment the samples during training, which improves the diversity of the training dataset. In addition, our model does not use a pre-trained model but is trained from scratch.
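To make the role of the 0.5 NMS threshold concrete, the following is a minimal sketch of greedy non-maximum suppression in pure Python. The text only gives the threshold value, so the greedy variant and the box format (x1, y1, x2, y2) are assumptions, and the boxes and scores are invented for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap any kept box above the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the boxes that survive
```

Here the second box overlaps the first with an IoU of 81/119, which exceeds 0.5, so it is suppressed, while the distant third box is kept.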

Experiment Datasets
We use two datasets, SSDD and SARShip, to verify the effectiveness of the proposed method. The SSDD dataset consists of 1160 images containing a total of 2540 ships. The SARShip dataset has 39,729 images containing a total of 50,885 ships; it is composed of 102 HSPA-3 images and 108 Sentinel-1 satellite images that have been processed and cropped to 256 × 256 pixels. The image resolutions are 3 m, 5 m, 8 m, and 10 m. The scenes covered include port terminals, offshore waters, and far seas. Ship types include tankers, cargo ships, large container ships, and small fishing boats. We randomly divide each dataset into training and test sets in a ratio of 8:2. Next, we convert the bounding boxes and label annotations to the COCO annotation format and store them as JSON files.
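The split and conversion steps can be sketched as follows. This is only an illustration: the in-memory sample layout, the file names, the fixed random seed, and the function names are all hypothetical, and only the core COCO fields are written. The COCO bounding-box convention of [x, y, width, height] is standard.

```python
import json
import random


def to_coco(samples):
    """Build a minimal COCO-style dict from samples with corner-format boxes."""
    images, annotations = [], []
    ann_id = 1
    for img_id, s in enumerate(samples, start=1):
        images.append({"id": img_id, "file_name": s["file_name"],
                       "width": s["width"], "height": s["height"]})
        for x1, y1, x2, y2 in s["boxes"]:
            w, h = x2 - x1, y2 - y1
            annotations.append({"id": ann_id, "image_id": img_id,
                                "category_id": 1, "bbox": [x1, y1, w, h],
                                "area": w * h, "iscrowd": 0})
            ann_id += 1
    return {"images": images, "annotations": annotations,
            "categories": [{"id": 1, "name": "ship"}]}


def split_and_export(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples and split them 8:2 into training and test sets."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return to_coco(shuffled[:cut]), to_coco(shuffled[cut:])


# Ten made-up 256 x 256 chips, each with one ship box.
samples = [{"file_name": f"img_{i}.jpg", "width": 256, "height": 256,
            "boxes": [(10, 20, 40, 60)]} for i in range(10)]
train, test = split_and_export(samples)
train_json = json.dumps(train)  # text that would be written to a JSON file
```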
In order to visualize the number of ships of different sizes in the dataset, this paper counts the ships of different sizes according to the definition of the COCO metric [47] and displays them in the form of histograms, as shown in Figure 6. Figure 6a,b shows the number of large, medium, and small ships in the SSDD and SARShip datasets.
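The size statistics can be reproduced with a short counting routine. The COCO convention classifies objects by pixel area, with thresholds at 32² and 96². The sketch below uses the bounding-box area as a stand-in for object area and assigns boundary values to the larger class, both of which are assumptions. The box sizes are invented.

```python
from collections import Counter

SMALL_MAX_AREA = 32 ** 2   # below this area an object counts as small
MEDIUM_MAX_AREA = 96 ** 2  # below this area (and not small) it counts as medium


def size_class(width, height):
    """Classify a box as small, medium, or large by its pixel area."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


# Made-up (width, height) pairs in pixels.
box_sizes = [(10, 10), (32, 32), (50, 50), (100, 100)]
counts = Counter(size_class(w, h) for w, h in box_sizes)
```

The resulting counter holds the per-class totals that a histogram of ship sizes would plot.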

Experiments on the SSDD Dataset
We conducted extensive experiments on the SSDD dataset with the parameter settings described above, and the results are displayed in Table 2. ESTDNet outperformed the baseline on every metric: AP, AP50, AP75, APS, APM, and APL increased by 2.8%, 2.3%, 4.4%, 1.9%, 3.2%, and 9%, respectively. Compared to Cascade R-CNN Swin, ESTDNet effectively improves the detection performance for ships in SAR images. The 2.8% and 4.4% improvements in AP and AP75 indicate that ESTDNet localizes ships more accurately. As shown in Table 3, we also evaluated several state-of-the-art object detection methods on the SSDD dataset. ESTDNet obtained the best AP, 59.4%, among all methods. YOLOF achieved better performance on medium ships, with APM reaching 67.4%, 0.6% higher than ESTDNet. However, its APS and APL for small and large ships are lower than those of ESTDNet by 3.6% and 3.9%. In addition, ESTDNet is 6.4% higher than YOLOF on AP75, indicating that ESTDNet obtains more accurate ship position information than YOLOF. Therefore, according to the results of multiple detection methods on the SSDD dataset, ESTDNet effectively detects ships in SAR images, obtains more accurate ship position information, and surpasses the other methods in overall detection performance.
Figure 7 presents the detection results of the proposed and compared methods. The ground truths, detection results, missed detections, and false detections are indicated with green, red, yellow, and blue boxes, respectively. To show the detection effect of the various methods more intuitively, we selected five images with complex backgrounds in the near-shore region.
Compared with far-sea areas, ships near the shore are sparsely distributed and appear highly similar to the coastal background in SAR images. Therefore, ship detection there is more complicated and better reflects the performance of a detection method. As shown in Figure 7, all methods suffer from some degree of missed detection. TOOD and ATSS have the most missed detections and the worst performance in the near-shore region. YOLOv3, YOLOF, and DETR miss many detections when small and medium ships are dense. PAA and Faster R-CNN produce more false detections. Because of the high similarity between ships and the coastal background, Faster R-CNN detected background as ships in some images. The remaining methods all have a small number of false detections caused by overlapping detections at dense ship locations. The visualization shows that detection effectiveness is somewhat compromised in near-shore areas, especially when ships are densely arranged; small and medium-sized ships near the coast are particularly difficult to identify accurately. Among all the algorithms, ESTDNet has the greatest detection coverage and the best performance, with missed detections in only individual images and no background falsely detected as a ship, which indicates that the overall detection performance of ESTDNet is better than that of the other methods.

Experiments on the SARShip Dataset
The results of the proposed method on SARShip are shown in Table 4. ESTDNet exceeds the Cascade R-CNN Swin baseline in all COCO metrics. APM and APS improve by 3.7% and 2.9%, and APL improves considerably, by 11.1%, showing that ESTDNet improves detection performance for ships of different scales at the same time. AP75 and AP50 improve by 6.7% and 1.6%, indicating that ESTDNet obtains more accurate information about ship positions. The AP is 3.5% higher than that of the Cascade R-CNN Swin benchmark model, indicating the excellent overall performance of ESTDNet. These results validate the FESwin and AFF modules proposed in this paper, which make an important contribution to the extraction, fusion, and transmission of feature information in SAR images.
We compare the detection performance of ESTDNet with other methods on the SARShip dataset in Table 5. Among all methods, ESTDNet achieves the best AP, 60.1%. The AP, AP50, and AP75 values show that ESTDNet obtains more accurate ship position information and detects more ships than other methods. The APS, APM, and APL values also show that ESTDNet has more robust detection performance for small, medium, and large ships. In conclusion, compared with other methods, ESTDNet performs in a relatively balanced way across the COCO metrics, detects most ships effectively, and obtains more accurate position information.

The detection results of the compared methods on the SARShip dataset are illustrated in Figure 8. Ground truths, detection results, missed detections, and false detections are indicated with green, red, yellow, and blue boxes, respectively. To make the results more representative, we selected five images containing objects at different scales against different backgrounds. For near-shore large and medium ships, most methods can detect the ships, but there are many false detections. Faster R-CNN, PAA, and ATSS have a large number of false detections. TOOD, YOLOv3, Deformable DETR, and Cascade R-CNN Swin have missed detections. DETR, YOLOF, and ESTDNet have excellent detection results. For small ships, the relatively dense distribution makes it difficult to detect every ship. All compared methods miss some ships, with TOOD, ATSS, and Deformable DETR having the most serious missed detections. Compared with the other methods, ESTDNet has relatively few false and missed detections, and its prediction accuracy per ship is higher than that of most methods, indicating the superiority of ESTDNet.
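The distinction between correct, false, and missed detections used in the visual comparison can be made concrete with a small matching routine. The sketch below uses a common greedy convention (highest-scoring detection first, each ground truth matched at most once, IoU threshold of 0.5); the paper does not state how the boxes in Figure 8 were assigned, so the procedure, the function names, and the example boxes are all assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(gt_boxes, detections, iou_thr=0.5):
    """Greedily match (box, score) detections to ground-truth boxes.

    Returns three lists of indices: matched (detection, ground truth)
    pairs, false detections, and missed ground truths.
    """
    order = sorted(range(len(detections)),
                   key=lambda i: detections[i][1], reverse=True)
    unmatched_gt = set(range(len(gt_boxes)))
    matched, false_dets = [], []
    for i in order:
        box = detections[i][0]
        best_j, best_iou = None, iou_thr
        for j in unmatched_gt:
            overlap = iou(box, gt_boxes[j])
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            false_dets.append(i)          # no ground truth overlaps enough
        else:
            matched.append((i, best_j))
            unmatched_gt.remove(best_j)   # each ship is matched only once
    return matched, false_dets, sorted(unmatched_gt)


# Hypothetical scene: two ships, one good detection, one on the background.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [((1, 1, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)]
print(match_detections(gts, dets))
```

In this example the first detection matches the first ship, the second detection is a false detection, and the second ship is missed.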

Ablation Experiments
The ablation experiments of ESTDNet, shown in Table 6, are performed on the SSDD dataset with Cascade R-CNN Swin as the baseline. We ablate the FESwin module and the AFF module. With the FESwin module alone, all detection metrics improve over Cascade R-CNN Swin, and the improvement is most pronounced for large ships. With the AFF module alone, detection of large ships improves significantly over Cascade R-CNN Swin, while the other metrics improve slightly. In addition, to ensure the stability of the experimental results, we ran ESTDNet 20 times and report the average detection accuracy in Table 6. The values in parentheses are the standard deviations over these runs. FESwin, the backbone network of ESTDNet, has superior feature extraction capability because it considers both global-local and spatial-channel characteristics. As shown in Table 6, the FESwin module improves the detection of large, medium, and small ships compared with Cascade R-CNN Swin. The most significant gain is for large ships, where APL increases by 7.6%; APS and APM for small and medium-sized ships increase by 1.1% and 3.1%, respectively. Localization is also more precise, with AP, AP50, and AP75 improving by 2.2%, 2.3%, and 3.6%, respectively. Compared with Cascade R-CNN Swin, the AFF module increases the information flow between the lower-level and higher-level features, improving AP, AP50, AP75, APS, and APM by 0.9%, 0.8%, 1.5%, 0.8%, and 0.8%, respectively, and APL for large ships by 2.9%.
The full ESTDNet improves AP, AP50, AP75, APS, APM, and APL by 2.8%, 2.3%, 4.4%, 1.9%, 3.2%, and 9%, respectively, compared with Cascade R-CNN Swin. These results show that the FESwin module contributes substantially to the accuracy improvement of ESTDNet, and that adding the AFF module improves accuracy further, to varying degrees, without diminishing the gains from FESwin. Therefore, both the FESwin module and the AFF module improve detection performance, and combining the two yields a more robust ship detection model for SAR images.
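The "mean (standard deviation)" reporting over repeated runs described above can be reproduced with a few lines of standard-library Python. This is only a sketch: the AP values below are hypothetical placeholders rather than results from Table 6, and the paper does not say whether the sample or the population standard deviation was used, so the sample version is an assumption.

```python
from statistics import mean, stdev


def summarize_runs(ap_values):
    """Return (mean, sample standard deviation) of per-run AP values."""
    return mean(ap_values), stdev(ap_values)


def format_cell(ap_values):
    """Format a table cell as 'mean (std)' with one decimal place."""
    m, s = summarize_runs(ap_values)
    return f"{m:.1f} ({s:.1f})"


# Hypothetical AP values (in percent) from three repeated training runs.
runs = [60.0, 61.0, 62.0]
print(format_cell(runs))  # prints "61.0 (1.0)"
```

Swapping `stdev` for `statistics.pstdev` gives the population standard deviation instead, which is slightly smaller for a small number of runs.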

Comparison of Inference Time
We compare the inference time of ESTDNet with other methods. Table 7 shows that ESTDNet is slower than the Cascade R-CNN Swin baseline: it needs about nine milliseconds more per image and processes about 1.5 fewer images per second. This is because ESTDNet adds computation modules to the Cascade R-CNN Swin baseline for better accuracy. In addition, ESTDNet is 13.5 milliseconds faster per image than Deformable DETR on the SSDD dataset and 11 milliseconds slower than Deformable DETR on the SARShip dataset. CNN-based methods are slightly faster than transformer-based methods, and the inference time of ESTDNet is comparable to that of the other transformer-based methods.
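The two quantities reported above, per-image latency and images per second, are reciprocals of each other, so a fixed latency increase translates into a throughput loss that depends on the starting latency. The sketch below illustrates the conversion; the latency values are hypothetical and are not taken from Table 7.

```python
def fps_from_latency(latency_ms):
    """Images per second for a given per-image latency in milliseconds."""
    return 1000.0 / latency_ms


def fps_drop(baseline_ms, extra_ms):
    """Throughput lost when extra_ms is added to a baseline latency."""
    return fps_from_latency(baseline_ms) - fps_from_latency(baseline_ms + extra_ms)


# Hypothetical baselines: the same 9 ms overhead costs more images per
# second when the baseline model is faster.
for baseline in (50.0, 100.0):
    print(baseline, round(fps_drop(baseline, 9.0), 2))
```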

Discussion
We verify the superiority of ESTDNet through several experiments on the SSDD and SARShip datasets. The ablation experiments on the SSDD dataset show that the FESwin and AFF modules each improve ship detection performance and that combining them achieves better detection results. To observe the enhancement effect of the two modules more intuitively, we visualize the output of each module with heat maps.

FESwin Module Effect Validation
In this paper, we compare the feature extraction of FESwin with that of the Swin transformer and use heat maps for verification. Figure 9 visualizes the feature extraction results at four different scales for the Swin transformer and FESwin, respectively. Brighter colors in the heat map indicate regions of the feature map that receive more attention. To illustrate the effect more clearly, we selected six images of large, medium, and small ships in both far-sea and near-shore scenes. As shown in Figure 9, for medium and small ships FESwin strengthens feature information that originally existed only in the first stage, which addresses the loss of feature information as the model deepens. With FESwin, medium and small ships retain ship feature information in at least stages one, two, and three, providing multi-scale information support for detection. For large ships, FESwin extends the feature focus that initially existed only in stages one and four to all four stages. Second, near-shore images are more difficult to detect than far-sea images. When the Swin transformer performs feature extraction, false attention increases as the model deepens because of the strong similarity between objects and the background. FESwin greatly alleviates this problem and reduces erroneous attention on the coastal background. In summary, the heat maps show that FESwin optimizes the performance of the feature extraction model, brings ships into focus in more of the multi-scale feature maps, enhances the utilization of ship information in SAR images, and improves the feature representation capability of the feature extraction model.

AFF Module Effect Validation
To examine the effect of AFF, this paper again uses heat maps for verification. Figure 10 visualizes the results of the five-scale feature fusion for FPN and AFF, respectively. Brighter colors in the heat map indicate regions of the feature map that receive more attention. To illustrate the effect more clearly, we selected six images of large, medium, and small ships in both far-sea and near-shore areas. The Swin transformer is used as the backbone network in both cases to ensure a fair comparison. From the results in Figure 10, when FPN is applied to far-sea images, useful feature attention is present only in the fourth and fifth layers, and this feature information mainly comes from the fourth stage of feature extraction. FPN suffers from long feature transmission paths, excessive information gaps between high and low layers, and scattered attention in the first, second, and third layers after feature fusion, so these layers cannot provide effective ship feature information. AFF uses adjacent lower-level features to complement the higher-level features, so that attention to ship feature information appears in the second, third, fourth, and fifth-level feature maps, effectively alleviating the problem of attention dispersion. Secondly, the degree of attention received by a ship is consistent with the scale distribution of the ship, which is beneficial to ship detection. Because the near-shore background is complex, near-shore ship detection is difficult for the feature extraction model. When FPN is used for multi-scale feature fusion, attention falls on many incorrect locations, and the useful feature attention is again concentrated in the fourth and fifth layers.
AFF fuses adjacent feature maps from bottom to top, reducing erroneous attention in each layer, avoiding attention on coastal background features, and concentrating attention effectively. Taken together, AFF enhances the information flow between feature maps at each scale, reduces the information difference between the bottom features and the top features, and alleviates the problem of attention scattering across scales. AFF thus yields multi-scale feature maps with richer feature information, allowing ship information in SAR images to be used effectively.
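The idea of complementing a higher-level feature map with its adjacent lower-level neighbor through learnable weights can be sketched in a few lines. The version below is a deliberately simplified, assumed form and not the authors' implementation: it works on single-channel maps stored as nested lists, uses 2x2 average pooling to bring the lower-level map to the higher level's resolution, and uses two scalar logits passed through a softmax as the fusion weights. The actual AFF module may use a different resampling operator and spatially varying weights, and all names here are hypothetical.

```python
import math


def avg_pool_2x2(fmap):
    """Downsample a 2D map by averaging non-overlapping 2x2 windows."""
    h, w = len(fmap), len(fmap[0])
    return [[(fmap[r][c] + fmap[r][c + 1]
              + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4.0
             for c in range(0, w - 1, 2)]
            for r in range(0, h - 1, 2)]


def softmax(logits):
    """Numerically stable softmax so the fusion weights sum to one."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adjacent_fuse(high, low, logits):
    """Fuse a higher-level map with its adjacent lower-level map.

    `logits` stands in for two learnable parameters; in training they
    would be updated by backpropagation (not shown here).
    """
    w_high, w_low = softmax(logits)
    low_down = avg_pool_2x2(low)  # match the higher level's resolution
    return [[w_high * hv + w_low * lv for hv, lv in zip(h_row, l_row)]
            for h_row, l_row in zip(high, low_down)]


# Hypothetical maps: a 2x2 higher-level map and a 4x4 lower-level map.
high = [[1.0, 1.0], [1.0, 1.0]]
low = [[3.0] * 4 for _ in range(4)]
print(adjacent_fuse(high, low, [0.0, 0.0]))  # equal weights of 0.5 each
```

With equal logits, each fused value is the average of the two inputs; raising the second logit shifts the result toward the lower-level map, which is the behavior a learned weighting would exploit.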
We can see from the results in Figures 9 and 10 that increasing background complexity and ship density affect the detection results. In the future, we will consider incorporating geometric features to build a ship detection network with stronger feature representation to address the background complexity problem. In addition, ship objects in SAR images are distributed at multiple angles and do not overlap one another, so we will try a rotated-anchor detection method to better address this problem.

Conclusions
In this paper, we propose ESTDNet for ship detection in SAR images. FESwin and AFF are the essential components of ESTDNet: the FESwin module extracts features from the images to obtain more feature information, and the AFF module fuses the extracted ship feature information more effectively. We use ablation experiments to confirm the effectiveness of these two modules. ESTDNet, built on FESwin and AFF, improves the accuracy of ship detection in SAR images. Moreover, we conduct experiments on the SSDD and SARShip datasets. The results show that ESTDNet achieves higher detection performance than other methods and is a superior ship detection method for SAR images. This is of great importance in aviation, aerospace, military, and civil fields.
ESTDNet combines transformer and CNN detection methods. In the future, our research will consider reducing the computational complexity introduced by the transformer model. In addition, we will investigate lightweight versions of the transformer model.

Data Availability Statement: Public datasets are used in this study; no new data were created or analyzed. Data sharing is not applicable to this article.