SII-Net: Spatial Information Integration Network for Small Target Detection in SAR Images

Abstract: Ship detection based on synthetic aperture radar (SAR) images has made a breakthrough in recent years. However, small ships, which may be mistaken for speckle noise, pose enormous challenges to accurate detection in SAR images. To enhance the detection performance for small ships in SAR images, a novel detection method named the spatial information integration network (SII-Net) is proposed in this paper. First, a channel-location attention mechanism (CLAM) module, which extracts position information along two spatial directions, is proposed to enhance the detection ability of the backbone network. Second, a high-level features enhancement module (HLEM) is customized to reduce the loss of small target location information in high-level features by using multiple pooling layers. Third, in the feature fusion stage, a refined branch is presented to distinguish the location information between the target and the surrounding region by highlighting the feature representation of the target. The public datasets LS-SSDD-v1.0, SSDD and SAR-Ship-Dataset are used to conduct ship detection tests. Extensive experiments show that the SII-Net outperforms state-of-the-art small target detectors and achieves the highest detection accuracy, especially when the target size is less than 30 pixels by 30 pixels.


Introduction
Synthetic aperture radar (SAR), which can work all day and in all weather, has broad application prospects in both the military and civilian fields. Among the SAR applications, ship detection plays an important role in maritime management and monitoring. However, compared with optical images, SAR images are more difficult to process due to their lower resolution. Therefore, the accurate location of SAR ships that occupy relatively few pixels remains a significant challenge.
In the early detection of SAR images, Leng et al. [1] proposed the constant false alarm rate (CFAR) algorithm, which could adaptively determine the detection threshold. In addition, Liu et al. [2] proposed a method to separate the sea and land in order to select the detection area automatically. Nevertheless, these traditional methods have cumbersome calculation processes and poor migration capabilities. In recent years, a series of SAR ship detection methods based on deep learning (DL) [3][4][5][6] have achieved good performance. DL-based object detection methods are mainly divided into two patterns, i.e., one-stage detectors and two-stage detectors. To ensure real-time recognition, some one-stage methods [7][8][9][10][11][12] are chosen to detect SAR ships. However, studies [13] have shown that one-stage methods are more likely to produce localization errors in small target detection.
For the two-stage methods [14][15][16], although they can obtain high detection accuracy in the offshore area, they are prone to many missed detections and false alarms in the inshore area. Therefore, some scholars choose the attention mechanism as the starting point for research, aiming to reinforce the feature extraction ability of the backbone network. For example, the squeeze-and-excitation (SE) [17] mechanism was adopted by Lin et al. [18] to improve the feature extraction capability of the network. However, the SE module only considered the correlation between channels and ignored the importance of location information. The ARPN network [19] proposed by Zhao et al. combined a convolutional block attention module (CBAM) [20] to suppress the adverse effects of the surrounding environment. However, CBAM modules could only capture local correlations and could not build long-range dependence of the feature. In addition, Fu et al. [21] proposed the feature balancing and refinement network (FBR-Net), which used an attention-guided balanced pyramid to effectively improve the detection performance of small ships in complex scenes. However, the FBR-Net was designed to semantically balance multiple features at different levels, without considering location information. The inconsistency across different levels makes the networks pay more attention to objects with sufficiently obvious semantic features, which also causes the detection problem of small targets. Many scholars use multiscale feature extraction methods [22][23][24] to address this imbalance issue. For instance, the Lite Faster R-CNN proposed by Li et al. [25] extracts feature information by a parallel multi-scale convolution operation. A lightweight SAR ship detector proposed by Zhang et al. [26] shares the features of different detection scales through up-sampling and down-sampling methods.
A multi-domain fusion network proposed by Li et al. [27] was used to attain excellent results in multi-scale target detection under complex backgrounds. A dense connection module presented by Deng et al. [28] was adopted to obtain the feature information of each layer in the network and to predict multiscale ship proposals from several intermediate layers.
These improved methods based on feature imbalance effectively improve the detection accuracy of small ship targets in SAR images. However, on the one hand, these methods did not notice the location information loss of high-level small targets. Therefore, the small object has little information for location refinement when it is mapped to the final feature map. On the other hand, these methods did not focus on reducing the similarity between the target and the surrounding background. However, some local areas of the ship targets in the SAR image have a similar scattering mechanism to the surrounding areas, and lots of false alarms may be generated in the inshore area. Some scholars [29,30] choose multi-modal detection methods to improve the detection accuracy of ships in SAR images. Although the multi-modal methods can improve the stability of the prediction results, the design of these methods is too complicated, and there is not much improvement in small target detection. This still causes a huge obstacle to the precise positioning of the small ship target.
To tackle the above-mentioned problems, a small ship detection method based on a spatial information integration network (SII-Net) is proposed in this paper. At present, the improvement of small target detection in SAR images mostly comes from two directions: the use of an attention mechanism in the backbone network, or enhancement in the feature fusion stage. The SII-Net is also designed from these two perspectives while noting the high-level feature loss during feature extraction. SII-Net includes three key components: the channel-location attention mechanism (CLAM), the high-level features enhancement module (HLEM), and a refined branch. Taking into account the fusion of location and channel information, the CLAM module is designed to extract feature information from two spatial directions. At the same time, we put forward the HLEM module to remedy the location information loss of small targets at a high level. Additionally, we design a feature refinement path to enhance the difference between the target and the background, aiming to better reduce the interference of background noise. The detection results illustrate the superiority of SII-Net in comparison with other state-of-the-art CNN-based methods on the small target dataset LS-SSDD-v1.0 [31] and other large public datasets (i.e., SSDD [32], SAR-Ship-Dataset [33]). The main contributions of this paper are as follows:
(1) We propose a channel-location attention mechanism (CLAM) that embeds positional information into channel information along two spatial directions. By modeling inter-channel similarity and spatial correlation, the CLAM yields impressive performance in enhancing the feature extraction ability of the backbone.
(2) To address the location information loss of small targets at the high level, a well-designed module called the high-level features enhancement module (HLEM) is customized to upgrade the performance of the high feature layer of the backbone network by a multiscale pooling operation.
(3) Considering the fact that inshore ship targets are susceptible to interference from surrounding objects, a new refined branch is proposed to optimize the features after fusing each feature layer. The refined branch enhances the difference between target and background so that the network can effectively distinguish the target from the background.
The rest of this paper is organized as follows. In Section 2, we describe the overall architecture of SII-Net and its improvements in detail. In Section 3, the experiment results are introduced. Ablation experiments are presented in Section 4. Finally, some conclusions are drawn in Section 5.

The Motivation of the Proposed Method
Nowadays, ship detection in SAR images is prone to several problems due to noise interference. As shown in Figure 1, ground truths are marked by green boxes, and detection results are marked by red boxes. Figure 1a shows the problem of inaccurate localization; the pink circles mark ships with inaccurate bounding boxes. Figure 1b shows the problem of missed ships, which are marked by yellow circles. Figure 1c shows the problem of false detections, which are marked by purple circles. To solve these problems, current algorithms are usually improved in the backbone network or in the feature fusion stage.
To highlight significant information, attention mechanism modules are usually inserted into the backbone network [34][35][36]. For example, the once-popular channel-based attention module, SE, was applied to integrate the channel information by the global pooling [37] method. However, the SE module only paid attention to the channel information, ignoring the importance of the location information in detection. After that, the improved CBAM module calculated the channel attention while calculating the spatial attention, but the CBAM module used large-scale convolution kernels to extract spatial information locally. The CBAM module did not provide a good improvement on the long-distance dependence which is required for small ship detection.
In addition, although high-level features contain a great deal of semantic information, the location information of small targets is easily lost at a high level due to the continuous down-sampling operation.
In addition, a series of improved networks [7,38,39,40] based on FPN [41] were put forward to strengthen the detection ability of ship targets, but these improvements usually used fine-grained feature pyramids to consider multiscale features. These improved methods based on multiscale features can enhance the detection performance for offshore ships, but for targets whose characteristics are not particularly obvious in complex backgrounds, the improvement is not particularly prominent.
Inspired by these algorithms, we propose the SII-Net, which contains three innovative modules, i.e., CLAM, HLEM, and a refined branch. The SII-Net is mainly designed to improve the detection ability for small targets. The detailed implementation of the entire method is introduced in the next section.
Remote Sens. 2022, 14, x FOR PEER REVIEW

Overview of the Processing Scheme
As shown in Figure 2, the SII-Net consists of four parts, i.e., the preprocessing part, the backbone network, the feature fusion module, and the detection head. Our innovative work is reflected in three parts: the CLAM module, the HLEM module, and a refined branch. To effectively use the location information of the target, we design a new attention mechanism, CLAM, which can encode spatial information in the horizontal and vertical directions. After the extraction of spatial features, a carefully designed HLEM module is proposed to mitigate the loss of high-level location information and improve the ability to detect small targets by parallel multiscale pooling layers. To avoid the interference of inshore targets by the scattering mechanism of surrounding objects, we propose a refined path branch to highlight the target characteristics and enhance the difference between the target and the background. The SII-Net may serve as a useful reference for other researchers working on small target detection.
Next, the flowchart of the SII-Net is introduced in detail. At first, the input SAR images are preprocessed by adding false samples and using the Scale Match [42] method. These preprocessing methods improve the effect of the pre-trained network and the quality of feature extraction. After preprocessing, the SAR images are input to the backbone network, and the CLAM module is applied to enhance the feature extraction capability of the backbone network by collecting location information from two spatial directions. After this, the HLEM module is used to enhance the high-level features of the backbone network. The features are then input into PANet [43] for processing, and the refined path is adopted to optimize the output of the PANet to distinguish the target from the background. Later, the information collected by the above operations is sent to the classification sub-network and the anchor regression sub-network for the discrimination and localization tasks, respectively. Finally, the results of SII-Net are output through the non-maximum suppression (NMS) [44] operation.
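The final NMS step can be illustrated with a minimal sketch of the greedy algorithm. It is not the implementation used in the experiments; the box format `(x1, y1, x2, y2)`, the function names, and the IoU threshold of 0.5 are assumptions made for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of the kept boxes, highest score first.

    iou_threshold is an assumed value, not one reported in the paper.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections of one ship and one separate detection.
boxes = [(10, 10, 30, 30), (12, 12, 32, 32), (100, 100, 120, 120)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy input the second box overlaps the first with an IoU of about 0.68, so only the first and third boxes survive.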

Scale Match
The pre-training model of current algorithms is usually obtained from the ImageNet dataset [45] or the COCO dataset [46], ignoring the scale difference between the pre-training dataset and the training dataset. As a result, the benefit of the pre-trained model is extremely weak. Noting this shortcoming, we choose to use the Scale Match method. Before training, the image sizes of the pre-training dataset are changed so that the scales of all targets in the pre-training dataset align with the scale distribution of all targets in the training dataset.
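The idea of aligning scale distributions can be sketched as follows. This is a simplified illustration under stated assumptions, not the Scale Match procedure of [42] itself: object size is taken as a single number per box, a target size is drawn directly from a list of training-set sizes rather than from a histogram, and all names and values are hypothetical.

```python
import random


def scale_factor(pretrain_box_sizes, target_sizes, rng):
    """Return a resize factor for one pre-training image.

    pretrain_box_sizes: object sizes (in pixels) in the pre-training image.
    target_sizes: object sizes observed in the SAR training set.
    """
    mean_size = sum(pretrain_box_sizes) / len(pretrain_box_sizes)
    sampled = rng.choice(target_sizes)  # empirical draw from the target distribution
    return sampled / mean_size


rng = random.Random(0)
target_sizes = [8.0, 12.0, 16.0, 20.0]  # hypothetical ship sizes in the SAR training set
pretrain_boxes = [120.0, 200.0]         # hypothetical object sizes in one pre-training image
factor = scale_factor(pretrain_boxes, target_sizes, rng)
```

Resizing the pre-training image by `factor` shrinks its objects toward the sizes seen in the SAR data, which is the effect the paragraph above describes.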

Add False Samples
Several pure background pictures may lead to false alarms during the detection in the LS-SSDD-v1.0 dataset. To allow pure background pictures to participate in the training, we adopt the strategy of adding a false sample with 1 pixel in each SAR image to enhance the robustness of the network model.

Channel-Location Attention Mechanism (CLAM)
We use ResNet-50 as the backbone, which mainly consists of four residual modules. The residual modules increase the flow of information and effectively avoid the vanishing gradient problem and the degradation problem caused by networks that are too deep. The attention mechanism module CLAM is designed to improve the target location capability of the backbone network. The overall structure of CLAM is shown in Figure 3. As the figure shows, the CLAM module extracts features along the X direction and the Y direction separately and then aggregates them.
Specifically, in the first stage, pooling kernels with the two spatial extents (H, 1) and (1, W) are used to integrate the information of each channel along the horizontal direction and the vertical direction, respectively, and the feature is thereby decomposed into two parallel one-dimensional codes. Extracting features along two spatial directions can effectively capture location information and enhance the expressive ability of location features. Since SAR ship images contain a large amount of background information, an average pooling operation is used to retain the background information. In addition, an adaptive maximum pooling operation is performed to highlight the main information of the target and enhance the texture information of the feature. Together, the two pooling methods make full use of the feature information. After average pooling and maximum pooling are applied in each direction, the feature maps become $P_{avg\_h} \in \mathbb{R}^{C \times H \times 1}$, $P_{max\_h} \in \mathbb{R}^{C \times H \times 1}$, $P_{avg\_w} \in \mathbb{R}^{C \times 1 \times W}$ and $P_{max\_w} \in \mathbb{R}^{C \times 1 \times W}$. Here, $C$, $H$, and $W$ denote the number of input feature channels, the height of the input plane in pixels, and the width of the input plane in pixels, respectively; $P_{avg\_h}$ and $P_{max\_h}$ denote the results of the average pooling and maximum pooling operations along the vertical direction; and $P_{avg\_w}$ and $P_{max\_w}$ denote the results of the average pooling and maximum pooling operations along the horizontal direction. The output feature maps in the horizontal and vertical directions are then expressed in terms of the weights $A_1$, $A_2$, $A_3$, $A_4$ and the biases $b_1$, $b_2$, $b_3$, $b_4$ of the current step, with $F$ denoting the input feature map. Based on this method, accurate position information can be used to effectively capture spatial structures. After the pooling operations, a convolution layer combination module is used to encode local spatial information.
In the second stage, to make better use of the representations with global receptive fields and accurate location information generated in the first stage, the processed feature maps of the two directions are concatenated to perform the global integration of features. A convolution layer with kernel size 1 × 1 is then used to transform them and generate intermediate feature maps. The horizontal tensor and the vertical tensor are then split along the spatial dimension. The convolution operation is used again so that the number of output channels in each of the two directions is restored to that of the input $F$. Finally, the two coefficients are multiplied with the input $F$, which completes the channel information processing and the spatial information embedding. Moreover, the CLAM module has strong robustness and generalization ability, and it is easy to plug CLAM into any location of the network for feature enhancement. The final output of the CLAM module is thus the product $F \otimes F_{x2} \otimes F_{y2}$, where $F_{x2}$ and $F_{y2}$ denote the final calculation results in the x-direction and y-direction, respectively.
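The directional pooling and the final reweighting can be made concrete with a small single-channel sketch. It is an illustration only: the learned convolution layers of CLAM are replaced here by a fixed sigmoid gate over the sum of the average-pooled and max-pooled codes, which is an assumption and not the authors' design, and the function names are hypothetical.

```python
import math


def directional_pool(feature):
    """Pool one H x W channel into four one-dimensional codes.

    The *_h codes have length H (pooled across the width) and the *_w codes
    have length W (pooled across the height), matching the C x H x 1 and
    C x 1 x W shapes given in the text.
    """
    h, w = len(feature), len(feature[0])
    avg_h = [sum(row) / w for row in feature]
    max_h = [max(row) for row in feature]
    cols = list(zip(*feature))
    avg_w = [sum(col) / h for col in cols]
    max_w = [max(col) for col in cols]
    return avg_h, max_h, avg_w, max_w


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reweight(feature):
    """Multiply the input by one gate per row and one gate per column.

    The sigmoid gate stands in for the learned layers (assumption).
    """
    avg_h, max_h, avg_w, max_w = directional_pool(feature)
    gate_h = [sigmoid(a + m) for a, m in zip(avg_h, max_h)]
    gate_w = [sigmoid(a + m) for a, m in zip(avg_w, max_w)]
    return [[feature[i][j] * gate_h[i] * gate_w[j]
             for j in range(len(feature[0]))]
            for i in range(len(feature))]


# A 3 x 3 channel with a single bright pixel, e.g. a very small ship.
feature = [[0, 0, 0],
           [0, 4, 0],
           [0, 0, 0]]
avg_h, max_h, avg_w, max_w = directional_pool(feature)
out = reweight(feature)
```

Because each gate depends on a whole row or a whole column, the row and column containing the bright pixel receive gates close to 1 while the others receive 0.5, which shows how a directional code carries position information.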

High-Level Features Enhancement Module (HLEM)
Inspired by the SPP-Net [47], we designed the HLEM module to improve the ability to detect small targets by compensating for the loss of location information in the high-level feature map of the backbone. The SPP-Net is used to realize the extraction of multi-scale features. Based on the scale characteristics of small targets, smaller pooling kernels are adopted by the HLEM module to capture the location information of small targets more accurately.
In the HLEM module, features are not spliced laterally; instead, the features processed by multiple pooling layers are concatenated with the original features along the channel dimension, which compensates for the missing position features more fully. The specific operation process is shown in Figure 4. Firstly, a convolution layer with kernel size 3 × 3 is used for down-sampling. Secondly, the processed feature maps are sent to pooling layers of different scales; the multiscale pooling layers not only help the network extract location information but also extract features of different granularity. To avoid destroying global features, the HLEM module then directly concatenates the outputs of the parallel multi-level pooling layers and uses a 3 × 3 convolution to integrate the channel dimension. Lastly, the output of the HLEM module is fused with the highest-level output feature of the backbone network. In the expression for the HLEM output, $P_{max\_i}$ denotes the result of the maximum pooling operation, $i$ denotes the kernel size of the pooling, and '+' represents the activation function.
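The parallel pooling and channel-wise concatenation can be sketched on a single feature map. The sketch assumes stride-1 pooling that preserves the spatial size, so that the pooled maps can be stacked with the original; the kernel sizes 3 and 5 are hypothetical, and the two 3 × 3 convolutions of HLEM are omitted.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling with an odd k x k window; output keeps H x W.

    Windows are truncated at the borders, which is equivalent to padding
    with minus infinity.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [fmap[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def multiscale_pool_concat(fmap, kernel_sizes=(3, 5)):
    """Stack the original map and one pooled map per kernel size as channels."""
    return [fmap] + [max_pool_same(fmap, k) for k in kernel_sizes]


# A 5 x 5 map with one active position in the centre.
fmap = [[0] * 5 for _ in range(5)]
fmap[2][2] = 1
stacked = multiscale_pool_concat(fmap)
```

Each pooled channel spreads the single response over a neighbourhood whose size equals the kernel size, so small kernels keep the response close to the true position, which is consistent with the choice of smaller pooling kernels described above.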

A Refined Branch
At present, the network mainly relies on the feature pyramid to improve the recognition ability for small targets. SAR ships show different characteristics at different levels of the pyramid. The PANet improves on the FPN by adding a bottom-up pyramid after the FPN to deliver effective positioning information from the underlying features. However, for SAR images, if PANet is used alone in the detection process, the small targets are still easily submerged by surrounding noise. To further refine the output of the PANet and effectively enhance the ability of the network to distinguish targets from the background, we design a new refined branch. The specific operation of the refined branch is shown in Figure 5. The main process of feature refinement is as follows. The PANet has five layers of output, i.e., $C_1$, $C_2$, $C_3$, $C_4$ and $C_5$. At first, any feature layer of the output is selected (the middle layer $C_3$ is recommended). The maximum pooling operation is used to down-sample the low-level features ($C_1$ and $C_2$) below the selected layer, and the interpolation method is used to up-sample the high-level features ($C_4$ and $C_5$) above the selected layer. In this way, each feature layer is converted to the same scale as the selected feature layer. Next, the processed features are accumulated to achieve feature fusion, and the fused feature has rich location information and detailed information. In this fusion, $f_{input\_i}$ denotes the input feature, $i$ represents the index of the current feature layer, $r$ represents the index of the selected feature layer, $I_{near}$ denotes the interpolation operation in nearest mode, $P_{max}$ represents the maximum pooling operation, and $C$ denotes the total number of feature layers.
The non-local [48] method, which weights the features of all positions when calculating the feature response at a certain position, is then used to further refine the fused feature. After this, the adaptive maximum pooling operation is employed to enhance the texture information of the targets. In the end, the features processed by pooling are merged with the PANet output to highlight target features and reduce the interference of ambient background noise. After the refinement of the features, as shown in Figure 6, the network makes the features of the target more visible, focusing on the target area and greatly reducing the focus on the background. In the final output, $F_{PANet}$ denotes the output feature of the PANet, and $N$ represents the non-local operation.
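The rescale-and-accumulate step of the refined branch can be sketched with three pyramid levels instead of five. The sketch assumes that neighbouring levels differ by a factor of 2 in size and that the fused feature is a plain element-wise sum; whether the sum is normalised by the number of layers cannot be recovered from the text, so that choice is an assumption. The non-local refinement is not shown, and the function names are hypothetical.

```python
def max_pool_2x(fmap):
    """Halve H and W with a 2 x 2 max pooling window (H and W assumed even)."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]


def upsample_nearest_2x(fmap):
    """Double H and W by nearest-neighbour interpolation."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_levels(levels, ref):
    """Resize every level to the size of levels[ref] and sum element-wise.

    levels[0] is the highest-resolution map; each next level is half the size.
    """
    resized = []
    for idx, fmap in enumerate(levels):
        for _ in range(ref - idx):   # levels below the reference: shrink
            fmap = max_pool_2x(fmap)
        for _ in range(idx - ref):   # levels above the reference: enlarge
            fmap = upsample_nearest_2x(fmap)
        resized.append(fmap)
    h, w = len(levels[ref]), len(levels[ref][0])
    return [[sum(m[i][j] for m in resized) for j in range(w)] for i in range(h)]


levels = [
    [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 5, 0], [0, 0, 0, 6]],  # low level, 4 x 4
    [[1, 1], [1, 1]],                                          # selected level, 2 x 2
    [[10]],                                                    # high level, 1 x 1
]
fused = fuse_levels(levels, ref=1)
```

Max pooling keeps the strongest response of each 2 × 2 block of the low-level map, while nearest interpolation copies the high-level value to every position, so the fused map combines detailed and semantic responses at one common scale.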
LS-SSDD-v1.0 contains 15 large-scale images of 24,000 × 16,000 pixels from Sentinel-1 (the first 10 images are selected as the training set, and the remaining images are selected as the test set). The 15 large-scale images were cut into 9000 sub-images of 800 × 800 pixels by the publisher of the dataset. The dataset also contains a wealth of pure background images. SAR ships in LS-SSDD-v1.0 are provided at various resolutions of around 5 m, with VV and VH polarizations. According to the setting of the original report [31], the training set has 6000 images and the test set has 3000 images, i.e., the ratio of the training set to the test set is 6:3. This dataset is mainly used for small ship detection. For the SAR-Ship-Dataset, in the same way as the original report [33], we randomly set the ratio of the training set, validation set, and test set to 7:2:1. As shown in Figure 7, targets smaller than 30 pixels by 30 pixels account for a large proportion of the overall targets in LS-SSDD-v1.0. However, in SSDD and SAR-Ship-Dataset, small targets do not make up the main part.

Evaluation Criteria
We adopted precision (p), recall (r), and mean average precision (mAP) as evaluation indices to compare the detection performance of different methods:

$$p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN}$$

where TP represents the number of true positives, FP the number of false positives, and FN the number of false negatives. Because the mAP considers both precision and recall, it is used to measure the final detection accuracy:

$$mAP = \int_{0}^{1} p(r)\, dr$$

where p(r) denotes the precision-recall curve.
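As a rough illustration of these indices, the sketch below computes precision and recall from raw counts and approximates the area under the precision-recall curve from a ranked list of detections. It is a sketch under assumptions, not the evaluation code used in the paper: the function names are hypothetical, each detection is assumed to be already labeled as a true or false positive (the IoU matching step is omitted), and the integral is approximated with a simple rectangle rule rather than any interpolated variant a toolbox may use. With a single "ship" class, this average precision plays the role of the mAP.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


def average_precision(scores, is_tp, num_gt):
    """Approximate the area under p(r) with a rectangle rule.

    scores : confidence of each detection
    is_tp  : True if the detection matched a ground-truth ship (assumed given)
    num_gt : total number of ground-truth ships
    """
    if num_gt == 0:
        return 0.0
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for k in order:
        if is_tp[k]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle of width delta-recall
        prev_recall = recall
    return ap


p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision(scores=[0.9, 0.8, 0.7], is_tp=[True, False, True], num_gt=2)
```

In the toy call, the false positive ranked second lowers the precision at which the second ship is recovered, so the area falls below 1 even though both ships are eventually found.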

Implementation Details
For the experimental platform, we used an Intel® Xeon® Gold 5118 2.3 GHz twelve-core processor, 17.6 GiB of memory, and an NVIDIA GeForce GTX 1080 Ti graphics card (12 GB). The software environment was the Ubuntu 16.04 64-bit operating system with Python 3.7 as the programming language. The deep learning framework was PyTorch 1.2.0 with CUDA 10.1 and cuDNN 7.6. SII-Net uses the stochastic gradient descent (SGD) optimizer with an initial learning rate of 1 × 10^−2, a momentum of 0.9, and a weight decay of 1 × 10^−4. Moreover, the learning rate is reduced by a factor of 10 per epoch from epoch 8 to epoch 11 to ensure an adequate loss reduction. Images in LS-SSDD-v1.0, SSDD, and SAR-Ship-Dataset are resized to 1000 × 1000, 700 × 700, and 400 × 400 pixels, respectively, for training. SII-Net and the other SAR ship detectors are implemented with the MMDetection [49] toolbox to ensure a fair comparison. In the test stage, we set the score threshold to 0.5.
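The training hyperparameters and the test-time score threshold can be summarized in a short, framework-free sketch. Everything named here (`SGD_SETTINGS`, `lr_at_epoch`, `keep_detections`) is hypothetical and is not the MMDetection configuration used by the authors. The wording of the decay schedule admits two readings (one decay at epoch 8 and one at epoch 11, or one decay at every epoch from 8 to 11), so the decay epochs are left as a parameter. Epoch indexing and the inclusive comparison with the threshold are likewise assumptions.

```python
# Optimizer settings as reported in the text.
SGD_SETTINGS = {"lr": 1e-2, "momentum": 0.9, "weight_decay": 1e-4}


def lr_at_epoch(epoch, base_lr=SGD_SETTINGS["lr"], decay_epochs=(8, 11), factor=0.1):
    """Step schedule: multiply the learning rate by `factor` at each decay epoch.

    `decay_epochs` is an assumption; pass range(8, 12) for the
    "every epoch from 8 to 11" reading.
    """
    n_decays = sum(1 for e in decay_epochs if epoch >= e)
    return base_lr * factor ** n_decays


def keep_detections(detections, score_thr=0.5):
    """Keep (box, score) pairs whose score reaches the threshold (inclusive by assumption)."""
    return [(box, score) for box, score in detections if score >= score_thr]


# Two readings of the decay schedule at the last decay epoch.
lr_two_steps = lr_at_epoch(11)                               # decays at epochs 8 and 11
lr_every_epoch = lr_at_epoch(11, decay_epochs=range(8, 12))  # decays at epochs 8, 9, 10, 11

# Boxes are (x1, y1, x2, y2) in pixels; values are made up for illustration.
kept = keep_detections([((10, 10, 30, 30), 0.91), ((50, 60, 58, 66), 0.32)])
```

The two readings differ by two orders of magnitude at the end of training, which is why the schedule is worth stating unambiguously when reproducing the results.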

Results on Small Target Dataset LS-SSDD-v1.0
Since SII-Net is aimed at improving the detection of small targets, we first conducted experiments on the small target dataset LS-SSDD-v1.0 to demonstrate the effectiveness of our algorithm. Table 1 shows the quantitative comparison with 14 other competitive SAR ship detectors on LS-SSDD-v1.0. Compared with both the algorithms that use attention mechanism modules and the algorithms with published results on LS-SSDD-v1.0, SII-Net achieves the best result, 76.1% mAP on the entire scene. The second-best detector is SA Faster R-CNN+PBHT, the best-performing method among the published results at this stage, which is still lower than SII-Net by about 1.0% mAP. To further verify the superiority of our algorithm on small target detection, we used Double-Head R-CNN (which achieved the best performance among all approaches with open-source code) and SII-Net to evaluate the targets in LS-SSDD-v1.0 at four scales. Targets with a scale in (0,900) are defined as small targets, and targets with a scale in (0,100) as extremely small targets. The evaluation results are shown in Tables 2 and 3. Compared with Double-Head R-CNN, SII-Net improves the detection of targets with scales in (0,100), (100,400), and (400,900) by 9.6% mAP, 11.9% mAP, and 2.4% mAP, respectively. The gains are largest in the two smallest scale ranges, which shows that SII-Net is effective at integrating spatial feature information and improving the detection of small targets. In addition, we tested SII-Net on the inshore and offshore scenes of LS-SSDD-v1.0. As shown in Tables 4 and 5, SII-Net achieves the best detection performance in both scenes.
The inshore detection environment is much more complex than the offshore one, and the improvement in inshore target detection confirms that SII-Net is highly robust in complex scenarios and can successfully suppress the noise surrounding the target. Figure 8 shows the qualitative results on LS-SSDD-v1.0, where SII-Net is compared with Faster R-CNN, DCN, and Double-Head R-CNN in inshore scenes. Ground truths are marked by green boxes and detection results by red boxes. Correct detections obtained by SII-Net but missed by the other methods are marked by blue circles. Figure 8a,e,i,m,q show the results of Faster R-CNN, Figure 8b,f,j,n,r those of DCN, Figure 8c,g,k,o,s those of Double-Head R-CNN, and Figure 8d,h,l,p,t those of SII-Net. The images contain heavy noise interference, which makes it very difficult to locate small targets accurately. Nevertheless, SII-Net achieves the best detection results without generating a significant number of false alarms. This is mainly because the refined branch we designed can effectively highlight the target features and reduce the similarity between the target and the background.
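The scale ranges used for the per-scale evaluation in Tables 2 and 3 can be made concrete with a small helper. This is a sketch under assumptions rather than the evaluation code: "scale" is taken to be the bounding-box area in square pixels (so that 900 corresponds to 30 × 30 pixels), the fourth scale is assumed to cover everything at or above 900, each range is treated as closed on the left and open on the right, and the names `SCALE_BINS`, `scale_bin`, and `is_small` are hypothetical.

```python
# The three ranges named in the text, in square pixels (area is an assumption).
SCALE_BINS = [(0, 100), (100, 400), (400, 900)]


def scale_bin(width, height):
    """Return the (low, high) range containing the box area; high is None for the open-ended bin."""
    area = width * height
    for low, high in SCALE_BINS:
        if low <= area < high:   # boundary handling is an assumption
            return (low, high)
    return (900, None)           # assumed fourth scale: 900 and above


def is_small(width, height):
    """Small target: area below 900, i.e. smaller than 30 x 30 pixels."""
    return width * height < 900


bins = [scale_bin(8, 8), scale_bin(15, 20), scale_bin(25, 25), scale_bin(40, 30)]
```

Grouping the ground-truth boxes this way before computing the mAP per group gives one score per scale range, which is the kind of breakdown the per-scale comparison relies on.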

Results on Other Datasets
To verify the generalization of SII-Net on other mainstream SAR ship datasets, we also evaluated SII-Net on two other datasets. On SSDD, as shown in Table 6, our algorithm achieves the best result of 95.5% mAP (on the entire scenes), and the second-best is 95.2% mAP from Quad-FPN. On SAR-Ship-Dataset, as shown in Table 7, our algorithm achieves the third-best result of 93.2% mAP (on the entire scenes), and the best is 94.3% mAP from Quad-FPN. On these two datasets, the superiority of SII-Net is less pronounced. This is because the proportion of small targets in these two datasets is not large (as shown in Figure 7), whereas SII-Net is mainly aimed at small targets. For SAR-Ship-Dataset in particular, which is relatively simple, the detection results of the current methods are close to one another. On SSDD, the most studied dataset in the field of SAR image detection, SII-Net achieves the best detection results, which shows that SII-Net has good robustness and generalization ability.

Ablation Experiment
In this section, we take LS-SSDD-v1.0 as an example, where 'O' means that the module is not added, and 'P' means that the module is added. We first discuss the overall effectiveness of the algorithm. Table 8 shows that, as the modules are added one by one, the detection accuracy of the algorithm gradually increases from 71.9% mAP to 76.1% mAP. This shows the validity of our overall structural design.
To visually assess the effectiveness of the three proposed modules, we visualize their intermediate heatmaps. Figure 9a,f show the original images, Figure 9b,g the heatmaps of the baseline algorithm, Figure 9c,h the heatmaps of the CLAM module, Figure 9d,i the heatmaps of the HLEM module, and Figure 9e,j the heatmaps of the refined branch. In Figure 9, the brighter colors in the heatmaps represent the target area predicted by the network. Compared with the baseline algorithm, the three proposed modules not only reduce the background noise in the image but also locate the target accurately.
Tables 9 and 10 show the results of the CLAM module ablation experiment. Whether or not the other two modules are added, the CLAM module improves the overall test results. This shows that the CLAM module effectively enhances the extraction of location information and improves the detection ability of the backbone network.
Tables 11 and 12 show the results of the HLEM module ablation experiment. Whether or not the other two modules are added, the HLEM module improves the overall performance of the algorithm to a certain extent. This shows that the HLEM module can explicitly compensate for the loss of high-level location information in the detection process and improve the ability of the high-level feature layers to detect small targets.

Effectiveness of R-Branch
Tables 13 and 14 show the results of the R-branch ablation experiment. Whether or not the other two modules are added, the R-branch improves the ability of the network to detect small targets. The results demonstrate the anti-noise performance of the R-branch and its effectiveness in enhancing the difference between the target and the background.

Conclusions
To address the problems of inaccurate target location and interference from complex backgrounds in synthetic aperture radar (SAR) image detection, a novel target detector based on spatial information integration, SII-Net, is proposed in this paper. Specifically, the channel-location attention mechanism (CLAM) module is proposed to help the backbone network locate the target more accurately. Furthermore, the high-level features enhancement module (HLEM) is customized to compensate for the loss of location features of small targets in the high-level features. The ablation experiments show that adding the CLAM module and the HLEM module yields improvements of 2.4% mAP and 3.3% mAP over the baseline model, respectively. Moreover, a new feature refinement branch is presented to distinguish the feature information of the target from that of the background by enhancing the difference between them, so that false alarms and missed detections can be effectively reduced. The heatmaps show that the feature refinement branch clearly separates the target from the background. Quantitative experiments demonstrate that SII-Net surpasses the compared SOTA algorithms by a large margin in small target detection, where detection accuracy has previously been low. Qualitative comparisons show that SII-Net also produces visually cleaner detection results.