A Deep Detection Network Based on Interaction of Instance Segmentation and Object Detection for SAR Images

Abstract: Ship detection is a challenging task in synthetic aperture radar (SAR) images. Ships have arbitrary orientations and multiple scales in SAR images, and there is considerable clutter near them. Traditional detection algorithms are not robust to these situations and easily produce redundant detection areas. As resolution continues to improve, traditional algorithms cannot achieve high-precision ship detection in SAR images, and an increasing number of deep learning algorithms have been applied to the task. In this study, a new ship detection network, known as the instance segmentation assisted ship detection network (ISASDNet), is presented. ISASDNet is a two-stage detection network with two branches. One branch, called the object branch, extracts object-level information to obtain positioning bounding boxes and classification results. The other branch, called the pixel branch, is utilized for instance segmentation. In the pixel branch, the designed global relational inference layers map features to an interaction space to learn the relationship between ship and background. The global reasoning module (GRM), built on these layers, can better extract the instance segmentation results of ships. A mask assisted ship detection module (MASDM) sits behind the two branches and improves the detection results by combining the outputs of both. In addition, a strategy is designed to extract the masks of SAR ships, which enables ISASDNet to perform object detection training and instance segmentation training at the same time. Experiments carried out on two different datasets demonstrate the superiority of ISASDNet over other networks.


Introduction
Synthetic aperture radar (SAR) is an active microwave imaging system that can work all day and in all weather [1]. As the performance of SAR systems gradually improves, more and more high-resolution, high-quality SAR images can be acquired. Many countries have developed their own SAR systems, such as TerraSAR-X, COSMO-SkyMed, RADARSAT-2, ALOS-PALSAR, Sentinel-1, and Gaofen-3 [2]. The application value of SAR is increasing in various fields. Ship detection is very meaningful, as it can provide basic information for ship traffic management [3,4], the fishing industry [5,6], and safe navigation [7][8][9]. The SAR system can continuously observe a sea area for a long time without interference from clouds, fog, rainfall, or snowfall. Therefore, SAR ship detection has attracted the attention of researchers in various countries.
There is a lot of noise in SAR images, which affects ship detection. Ships parked in port are affected by land clutter, which makes detection more difficult. Moreover, small ships are easy to miss, while densely packed ships often appear as a single bright spot in SAR images, making it difficult to identify individual ships. In most cases, the backscattered signal of a ship is much larger than that of the sea surface, so ships appear brighter than their surroundings. In the existing open SAR ship datasets, only the vertical bounding box is used as the label because the backscattered signal of a ship is much larger than that of the sea surface. With the help of this label information, the ship's pixels can be easily separated from the background through some image processing. Although a mask obtained in this way is not very accurate, it can roughly locate the outline of the ship. Hence, ship detection and ship instance segmentation can be trained simultaneously. In this work, an instance segmentation assisted ship detection network (ISASDNet) is presented. ISASDNet is based on Mask R-CNN and is a two-stage detection network with two branches. Pixel-level information can be extracted to promote ship detection in ISASDNet. The contributions of this work are summarized below.
1. A strategy to extract the mask is designed. The approximate ship contour is obtained by a threshold-based segmentation method. Through this strategy, the mask does not exceed the bounding boxes that serve as the labels of the dataset. Ship detection and ship instance segmentation can thus be carried out synchronously when training the network.
2. A global reasoning module is designed to improve the accuracy of predicting the ship mask. Features are mapped to an interaction space, in which the relationship between ship and background is regarded as a two-node graph, so that the global relationship can be learned.
3. A module that uses the results of instance segmentation to enhance object detection is designed. Locating each coordinate of the bounding box can be regarded as a classification task. After obtaining rough detection results and instance segmentation results, the posterior probability of each coordinate is calculated to obtain the final result. This module mines the relationship between pixel-level and object-level information.
The rest of this paper is arranged as follows. Some related work is introduced in Section 2. The proposed network is described in Section 3. Then, the proposed network is compared with other algorithms in Section 4. The results of some analytical experiments are also presented in Section 4. Finally, Section 5 concludes the paper.

Traditional Ship Detection Methods
Due to the different radar reflection characteristics of ships and sea water, ships appear bright in SAR images while the water is dark. However, the contrast between ships and the sea background changes constantly, which requires a ship detection algorithm to be adaptive and maintain a constant false alarm rate. The CFAR algorithm, based on a statistical model, is widely employed by researchers and has the advantages of fast speed, an adaptive threshold, and simple design. Wang et al. used the internal Hermitian product to obtain a new detector [21], which applies a threshold to discriminate SAR ships. A data-driven Parzen window kernel function was utilized to approximate the histograms of SAR images in [22], and SAR ship objects were then filtered by a given CFAR threshold. An et al. proposed a modified iterative truncation algorithm for the CFAR [23]; this method searches target pixels and their four-connected neighborhood pixels to estimate local sea clutter distributions. Li et al. used weighted information entropy to describe the statistical characteristics of superpixels [24] and separated ship targets from background superpixels by changing the threshold value. A novel decomposition approach was presented to analyze scattering between ships and the sea background in [25] and was combined with CFAR to detect ships. Lang et al. extracted pixel representations using spatial and intensity information, which significantly improved the separability of ship and background [26]. In addition to statistical methods, traditional SAR ship detection includes multi-scale-based methods [27], template matching methods [28], and full-polarization-based methods [29,30], among others. Pastina et al. proposed a processing chain based on cell averaging and the generalized likelihood ratio test for ship detection [27]. Ouchi et al. proposed that the multilook images of ships have higher coherence than that of the surrounding sea surface [31]; they used a small moving window to calculate the cross-correlation between two images, which can extract the ships. Tello et al. proposed an approach for ship detection based on analyzing SAR images with the discrete wavelet transform [32]. Although these traditional methods have achieved good results, their robustness decreases as SAR image resolution increases.

Object Detection Using CNNs
With the development of deep learning, CNNs have shown strong performance in object detection. Detectors based on CNNs are composed of two parts: a backbone and a head [33]. The backbone is usually VGG [34], ResNet [35], or DenseNet [36]. Heads are classified into two types: one-stage and two-stage algorithms. R-CNN [16], fast R-CNN [17], and faster R-CNN [37] are the most typical two-stage algorithms; they produce region proposals that may contain objects and then classify and calibrate the proposals to produce the final detection results. The most representative one-stage algorithms are the YOLO series [14], SSD [15], and RetinaNet [38]; unlike two-stage algorithms, they do not generate proposal boxes. In addition, many embedded network modules have been designed. For example, Lin et al. exploited the inherent multi-scale structure of convolutional networks, called a feature pyramid network (FPN), to improve the accuracy of object detection [39]. Furthermore, a path aggregation network can boost prediction quality by accelerating information flow and integrating features from different levels [40].

Ship Detection for Optical Remote Sensing Image
The imaging mechanism of optical images differs from that of SAR images. Optical images are obtained by visible light sensors and usually contain gray-level information in multiple bands, including abundant color, shape, and texture information. Researchers have designed a large number of algorithms for ship detection in optical remote sensing images. A network that searches for the head of a ship globally and generates smaller proposal boxes was proposed for inshore ship detection [41]. Yan et al. presented a data enhancement strategy that uses simulated ship images to augment the positive training samples [42] and improves the training accuracy of faster R-CNN. For rotated ship detection, a dual-branch regression network was designed that can extract features with different aspect ratios and integrate multi-scale features [43]. Liu et al. designed a multiregion feature-fusion module to improve faster R-CNN and used multitask learning to classify, locate, and regress ships [44]. The network proposed by Ma et al. can generate rotated region proposals by constructing central region prediction and orientation classification [45]. Feng et al. proposed a ship detection and classification framework by introducing a new sequence local context module [46].

SAR Ship Detection with Deep Learning
SAR can observe the earth all day, unlike optical sensors, which cannot work at night. Ships in SAR images differ from those in optical images: SAR images lack color, texture, and shape information and contain a lot of noise. It is difficult for researchers without relevant domain knowledge to label SAR images, which results in a scarcity of labeled SAR ship data. Thus, ship detection in SAR images is more challenging. Many deep learning algorithms have been applied to ship detection in SAR images. Fan et al. embedded a multi-level feature extractor into Faster R-CNN for polarimetric SAR ship detection [47]. A dense attention pyramid network that densely connects an attention convolutional module to each feature map was presented for SAR ship detection [18]. Meanwhile, a fully convolutional network was designed for pixel-wise ship detection in polarimetric SAR images [48]. A spatial attention block and a split convolution block were embedded in a feature pyramid network that can accurately detect ship objects against complex backgrounds [49]. Wei et al. designed a high-resolution feature pyramid network that connects high-to-low resolution features for ship detection [50]. A multi-scale adaptive recalibration network was proposed to handle ships of different sizes and dense berthing [51]. Hou et al. proposed a one-stage SAR object detection method to address the low confidence of candidates and false positives [52]. Kang et al. proposed an algorithm combining CFAR with faster R-CNN [53]; this method uses the object proposals generated by faster R-CNN as the protection window of CFAR to extract small objects. Zou et al. designed a generative adversarial network with a multi-scale loss term and combined it with YOLOv3 to improve the accuracy of SAR ship detection [54].

Methodology
In this section, the two-branch instance segmentation assisted ship detection network (ISASDNet) is introduced first. In the existing open datasets for SAR ship detection, only bounding boxes are used as labels, so to train ISASDNet we designed a strategy that extracts the masks of ships. We then describe two modules that improve the performance of ISASDNet.

Architecture
ISASDNet can perform object detection and instance segmentation simultaneously. It has two branches, as shown in Figure 2. The backbone network of ISASDNet is a combination of ResNet, FPN, and a region proposal network (RPN) [37]. The ResNet and FPN extract multi-scale features of a SAR image, while the RPN generates region proposals for the ships. The object branch focuses on locations and object categories and is composed of fully connected layers and ROIAlign layers that fuse multi-scale features. The pixel branch is a global reasoning module (GRM) focused on pixel-level information to predict the masks of ships. Our network is expected to learn a graph-like global relationship between the ship region and the background region, so as to represent the large contrast between ships and the sea surface. The GRM maps features to an interaction space to learn the interaction between ship and background; in this space, the relationship between ship and background is regarded as a two-node graph, and the adjacency matrices learned from the graph describe this relationship well. The output of the object branch is a rough object detection result, while the pixel branch outputs the instance segmentation results of ships. The module behind the two branches of ISASDNet is called the mask assisted ship detection module (MASDM). In the MASDM, target localization is regarded as a classification task. The outputs from the two branches are fed into the MASDM, which, according to Bayes' theorem, adjusts the bounding boxes to obtain the final results.
Figure 2. Architecture of ISASDNet, which has a two-branch structure, i.e., an object branch and a pixel branch. The backbone network is a combination of ResNet, FPN, and RPN. The object branch obtains rough object detection results. The pixel branch is based on global relational inference and obtains the instance segmentation results.
MASDM can fuse the results from these two branches to obtain more accurate object detection results.

Mask Extraction Strategy
Vertical bounding boxes are usually available as labels instead of masks in SAR ship detection datasets. In fact, the pixels of SAR ships are quite different from the background. Through some well-designed strategies, it is easy to extract the ship's contour as a mask with the help of the label information from the datasets. The mask extraction strategy consists of four steps (described below), and the process is illustrated in Figure 3. Figure 3a is the original image, in which the green bounding box is the label. The length and width of the given label are each expanded by 50% to crop the original image. The sliced image containing the object is shown in Figure 3b; any area of the expanded window beyond the original image is filled with padding. In this way, more of the background around the ship is preserved, which is convenient for the subsequent thresholding, erosion, and dilation operations.
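As an illustration, the cropping step above can be sketched as follows; the box format (x0, y0, x1, y1) and zero-fill padding are assumptions for this sketch, not details taken from the paper:

```python
import numpy as np

def crop_with_padding(image, box, expand=0.5, fill=0):
    """Crop a window around a label box, expanding its width and height
    by `expand` (50% by default) and padding with `fill` wherever the
    expanded window falls outside the image."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    dx, dy = int(round(w * expand / 2)), int(round(h * expand / 2))
    X0, Y0, X1, Y1 = x0 - dx, y0 - dy, x1 + dx, y1 + dy
    H, W = image.shape[:2]
    out = np.full((Y1 - Y0, X1 - X0), fill, dtype=image.dtype)
    # Intersection of the expanded window with the image.
    sx0, sy0 = max(X0, 0), max(Y0, 0)
    sx1, sy1 = min(X1, W), min(Y1, H)
    out[sy0 - Y0:sy1 - Y0, sx0 - X0:sx1 - X0] = image[sy0:sy1, sx0:sx1]
    return out
```

A 4x4 box is thus turned into a 6x6 chip, with the extra border coming from the surrounding background (or padding near the image edge).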

Thresholding
We use an adaptive threshold segmentation method to generate binary images. Even if there are all kinds of clutter in a SAR image, the pixel value of a ship is larger than that of the background. All pixels in the sliced image are sorted by value, and we assume that the fraction β of pixels with the highest values belongs to ships while the rest comprise the background; β is a hyperparameter. The threshold T1 is therefore defined as the value above which exactly a fraction β of the pixels lie. In the formal definition, I(·) is the indicator function used for counting pixels, P represents the set of all pixel values in the sliced image, p represents each pixel value in P, and sgn(·) is the sign function. In the experimental part, we analyzed the training set and found that when β was between 0.35 and 0.4, the ship pixels could be extracted completely. To ensure the integrity of the ships, β was set to 0.4; in other words, the 40% of pixels with the highest values were regarded as ships. Ridler et al. proposed the IsoData threshold segmentation algorithm [55], a classical clustering method. It uses the variance within and between clusters to guide further clustering: when the number of samples in a cluster is too small or the distance between two clusters is too close, the clusters are merged; when the internal variance of a cluster is too large, the cluster is split. In the IsoData threshold segmentation algorithm, a random threshold is first used to segment the image into object and background. Then, the mean values of these two parts are calculated, and the threshold is iterated until it converges to the composite mean. The threshold T2 can be calculated quickly with skimage.filters.threshold_isodata in Python [56]. The final threshold T is obtained by combining T1 and T2, and the binary image (Figure 3c) is obtained by thresholding at T.
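The thresholding step can be sketched with NumPy alone. The quantile rule implements the β-fraction assumption directly, and the IsoData loop follows the classical iteration; since the exact rule for combining T1 and T2 is not given here, taking the larger (more conservative) of the two is an assumption of this sketch:

```python
import numpy as np

def quantile_threshold(chip, beta=0.4):
    # T1: the value above which a fraction `beta` of pixels lie,
    # i.e. the brightest 40% of pixels are assumed to be ship.
    return np.quantile(chip, 1.0 - beta)

def isodata_threshold(chip, tol=0.5):
    # T2: classic IsoData iteration -- start from the global mean and
    # repeat T = (mean(below T) + mean(above T)) / 2 until convergence.
    t = chip.mean()
    while True:
        lo, hi = chip[chip <= t], chip[chip > t]
        if lo.size == 0 or hi.size == 0:
            return t
        t_new = 0.5 * (lo.mean() + hi.mean())
        if abs(t_new - t) < tol:
            return t_new
        t = t_new

def final_threshold(chip, beta=0.4):
    # Assumption: combine T1 and T2 by taking the larger of the two.
    return max(quantile_threshold(chip, beta), isodata_threshold(chip))
```

On a chip with a dim background and a bright ship block, the IsoData term lands between the two modes, so the combined threshold cleanly separates ship pixels from background.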

Morphological Processing
In the binary image, there is still a lot of clutter and noise in the target contour area. This noise and clutter can be removed by applying average filtering twice. Then, erosion and dilation make the mask more complete (see Figure 3d).

Output Mask
Since it is still possible for the mask to extend beyond the label box, only the part of the mask inside the bounding box is retained. The final mask is shown in Figure 3e.
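A minimal sketch of steps 3 and 4 using scipy.ndimage; the filter size, erosion/dilation counts, and box format (x0, y0, x1, y1) are assumptions of this sketch:

```python
import numpy as np
from scipy import ndimage

def clean_mask(binary, box):
    """Smooth a binary chip with two passes of average filtering,
    apply erosion and dilation, and keep only the pixels inside the
    label bounding box (x0, y0, x1, y1)."""
    m = binary.astype(float)
    for _ in range(2):                            # twice average filtering
        m = ndimage.uniform_filter(m, size=3)
    m = m > 0.5                                   # back to a binary mask
    m = ndimage.binary_erosion(m)                 # remove thin clutter
    m = ndimage.binary_dilation(m, iterations=2)  # restore the ship body
    out = np.zeros_like(m)
    x0, y0, x1, y1 = box
    out[y0:y1, x0:x1] = m[y0:y1, x0:x1]           # clip mask to label box
    return out
```

Isolated noise pixels fall below the 0.5 level after two averaging passes, while the solid ship region survives filtering and morphology, and the final clipping guarantees the mask never exceeds the label box.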

Global Reasoning Module
Relational reasoning between distant regions of arbitrary shape is crucial for object detection [57]. The CNN has shown extraordinary ability in many computer vision tasks and is good at computing local relations, but it needs to stack multiple convolutions to capture global relations between remote areas. Humans can easily understand the relationship between different regions of an image, and a graph structure can describe such relationships better than a regular grid of pixels. A graph convolution network can project regions of interest into an interaction space to infer global relationships [58], a reasoning process similar to human cognition. Chen et al. proposed a graph-based unit that can transform features between coordinate space and interaction space for global reasoning [57]. This unit combines a graph convolution with an ordinary convolution and can be easily embedded in various networks. In SAR images, ships are sometimes very small, and too many convolution layers may reduce the accuracy and speed of inference. The task of SAR ship instance segmentation can be regarded as two-class segmentation, so the relationship between ship and background can be regarded as a two-node graph. The graph is formed by projecting features from the coordinate space to the interaction space, which better infers the global relationship between ship and background; the features in the interaction space are then mapped back to the coordinate space by back-projection. This global relationship is conducive to image segmentation. Therefore, a global reasoning module (GRM) is devised and embedded in ISASDNet to improve the accuracy of ship instance segmentation.
The global relational inference layers are based on the unit proposed in [57]. Figure 4 presents the structure of the global relational inference layer. An input feature F ∈ R^(L×C) is fed into a global relational inference layer, where L = W × H; W is the width of the feature, H is its height, and C is the number of channels. A projection function φ(·) is used to reduce the channel dimension to half of the original; to improve calculation speed and the capacity of the projection, φ(·) is implemented by convolution. Similarly, a projection weight B is obtained by convolution, and a projected feature V is obtained in the interaction space. Here, N = 2 is the number of nodes, representing the ship and the background. The nodes interact to learn the relationship between ship and background: G and A denote the N × N node adjacency matrices, and U represents the state update function that defines the graph interaction. This interaction between the nodes can be completed by two 1D convolution layers along the channel-wise and node-wise directions. Then, the node feature Z ∈ R^(N×C/2) is transformed back into a feature W ∈ R^(L×C/2) in the coordinate space. The reverse projection matrix D can be regarded as the transposition of B. As with the transformation from coordinate space to interaction space, the transformation back to coordinate space is also completed by convolution, and the channel dimension of W is finally restored to obtain a feature in R^(L×C). The proposed GRM is shown in Figure 5. It is composed of an ROIAlign layer, a convolution layer, four global relational inference layers, and four deconvolution layers. The ROIAlign layer fuses multi-scale features from the backbone network, the global relational inference layers capture the global relationship between ship and background, and the four deconvolution layers upsample the features to the same size as the original image.
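The data flow of one global relational inference layer can be sketched with NumPy, using random matrices as stand-ins for the learned 1×1 convolutions. The wiring follows the unit of [57] as described above (projection, two-node graph reasoning, back-projection, residual connection), but the weight shapes and the residual addition are assumptions of this sketch:

```python
import numpy as np

def global_inference_layer(F, rng, N=2):
    """Shape-level sketch of one global relational inference layer.
    F has shape (L, C); N = 2 nodes model ship vs. background."""
    L, C = F.shape
    half = C // 2
    W_phi = rng.standard_normal((C, half))    # phi: reduce channels to C/2
    W_b   = rng.standard_normal((C, N))       # produces projection weights B
    A     = rng.standard_normal((N, N))       # learned node adjacency
    W_g   = rng.standard_normal((half, half)) # channel-wise state update
    W_up  = rng.standard_normal((half, C))    # restore channel dimension

    Fp = F @ W_phi                  # (L, C/2) reduced features
    B  = (F @ W_b).T                # (N, L) projection to interaction space
    V  = B @ Fp                     # (N, C/2) node features
    Z  = (np.eye(N) - A) @ V @ W_g  # graph reasoning over the two nodes
    Wc = B.T @ Z                    # (L, C/2) back-projection (D = B^T)
    return F + Wc @ W_up            # residual output, shape (L, C)
```

Because the graph has only two nodes, the reasoning step is a tiny 2×2 matrix product regardless of feature resolution, which is what keeps the global interaction cheap.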

Mask Assisted Ship Detection Module
The object branch can provide a rough bounding box for the ship. From the mask predicted by the pixel branch, the vertical minimum circumscribed rectangle can also be obtained as a bounding box. The MASDM is designed to improve the detection results by using the bounding boxes from these two branches. The determination of coordinates can be regarded as a discrete classification task: a bounding box is reduced to the calculation of four coordinates (left, top, right, and bottom). Here, we only analyze the abscissa x of the left boundary; the ordinate and the other three coordinates are calculated in the same way. The location x is the argmax of the posterior probability of the coordinate, x = argmax_i P(X = i | X_O = j, X_P = k), where X is the random variable for the left coordinate, X_O = j means that the abscissa of the left boundary is j in the object branch, X_P = k means that the abscissa of the left boundary is k in the pixel branch, and P(X = i | X_O = j, X_P = k) denotes the posterior probability given the results of the two branches. According to Bayes' theorem, this posterior is proportional to P(X = i) P(X_O = j, X_P = k | X = i), where P(X = i) and P(X_O = j, X_P = k | X = i) are the prior and likelihood probabilities, respectively, and w is the width of the image. A Gaussian distribution with normalization coefficient α is used to model the prior P(X = i); this distribution is related to the image size and the results of the two branches through a weight factor γ, the width w_O of the proposal box from the object branch, and the width w_P of the proposal box from the pixel branch.
Assuming that X_O = j and X_P = k are independent, the likelihood factorizes as P(X_O = j, X_P = k | X = i) = P(X_O = j | X = i) P(X_P = k | X = i). It is hard to calculate P(X_O = j | X = i) and P(X_P = k | X = i) directly, so two 1D convolution kernels are learned to estimate them. First, we flatten the predicted mask to obtain a vector of length w. The flattening process, shown in Figure 6, is carried out for each ship. A 1D convolution kernel slides over the flattened vector to obtain the probability P(X_P = k | X = i) at each point. The bottom row of images in Figure 6 shows the flattening process for the object branch, which produces a detection box for each ship; this process also needs to be applied to each ship. All parts of the original image beyond the detection box are set to zero, then each column is maximized to produce a vector of length w, and another 1D convolution kernel slides over this vector to obtain P(X_O = j | X = i). The length m of the convolution kernels is a hyperparameter. The coordinate with the largest posterior probability is the final result, and the four coordinates of the bounding box are all calculated in this way.

Loss Function
The proposed ISASDNet is trained with the loss function L = L_cls + λ1 L_box + λ2 L_mask, where λ1 and λ2 are weight coefficients. The classification loss, bounding box loss, and mask loss are identical to those defined in [20]. Specifically, L_cls is the log loss L_cls = -log p_u, where p is the predicted class distribution and u is the true class. Suppose that v = (v_x, v_y, v_w, v_h) is the ground truth of the bounding box and t = (t_x, t_y, t_w, t_h) is the predicted result. For bounding box regression, L_box is the smooth L1 loss L_box = Σ_{i∈{x,y,w,h}} smooth_L1(t_i - v_i), in which smooth_L1(x) = 0.5x² if |x| < 1 and |x| - 0.5 otherwise. L_mask is defined as the average binary cross-entropy loss using a per-pixel sigmoid, where y_i is the ground truth of the ith pixel and ŷ_i is the segmentation result for the ith pixel.
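The three loss terms can be sketched directly, using the standard Mask R-CNN forms named above; λ1 = λ2 = 1 are placeholder weights in this sketch:

```python
import numpy as np

def smooth_l1(x):
    # smooth_L1(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def isasdnet_loss(p_u, t, v, y_true, y_pred, lam1=1.0, lam2=1.0):
    """Total loss sketch. p_u is the predicted probability of the true
    class, t/v the predicted and ground-truth boxes (x, y, w, h), and
    y_true/y_pred the per-pixel mask labels and sigmoid outputs."""
    l_cls = -np.log(p_u)                                   # log loss
    l_box = smooth_l1(np.asarray(t, float) - np.asarray(v, float)).sum()
    eps = 1e-7
    y_pred = np.clip(y_pred, eps, 1 - eps)
    l_mask = -np.mean(y_true * np.log(y_pred)              # per-pixel BCE
                      + (1 - y_true) * np.log(1 - y_pred))
    return l_cls + lam1 * l_box + lam2 * l_mask
```

A perfect prediction (correct class with probability 1, exact box, exact mask) drives all three terms to essentially zero.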

Experiments
In this section, experiments with the proposed ISASDNet are conducted on two datasets. Our network is compared with other state-of-the-art deep learning algorithms, using the Common Objects in Context (COCO) metrics to evaluate performance. Two traditional algorithms are also compared with our algorithm. Finally, the proposed modules are analyzed, and several groups of performance comparison experiments are carried out.

Datasets and Evaluation Metrics
In order to promote the development of object detection in SAR images, Wang et al. constructed a ship detection dataset called SAR-Ship-Dataset [59]. In this dataset, 102 Gaofen-3 images and 108 Sentinel-1 images are divided into 43,819 image chips with a length and width of 256 pixels. All image chips are labeled by SAR experts according to the Pascal Visual Object Classes (PASCAL VOC) standard. The resolutions of these images are 3 m, 5 m, 8 m, and 10 m under different imaging modes. Moreover, these images contain complex environments, such as ports, inshore waters, and islands, and the ships are distributed in many forms, including independent cruising and fleet navigation. The top row of Figure 7 shows some example images from this dataset. The other dataset used in our experiments is the SAR ship detection dataset (SSDD) [60], which also follows the PASCAL VOC standard. SSDD contains SAR images with different polarizations and sea conditions; there are 1160 images and 2540 ships, with resolutions of 1 m, 3 m, 5 m, 7 m, and 10 m. The bottom row of Figure 7 shows some example images from SSDD.
COCO [61] metrics constitute a classic evaluation standard for object detection and image segmentation and are often used to measure algorithm performance in object detection competitions. The intersection over union (IOU) is the core of the COCO metrics and is the ratio IOU = area(B_p ∩ B_g) / area(B_p ∪ B_g), where B_p is the predicted bounding box and B_g is the ground truth. Given a preset IOU threshold, precision and recall are calculated as precision = TP/(TP + FP) and recall = TP/(TP + FN), where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. For single-class ship detection, the average precision (AP) is defined as the area under the precision-recall curve, AP = ∫ p(r) dr, where r represents recall and p(r) denotes the precision at recall r [50].
In the COCO metrics, different IOU thresholds produce different AP values; Table 1 lists the metrics used. The result of a traditional algorithm is a segmentation image, in which the foreground comprises ship pixels and the rest is background. We use a pixel-level figure of merit (FoM) [26] to evaluate the traditional algorithms, FoM = tp/(tp + fp + fn), where the true positives (tp) are the ship pixels that are correctly detected, the false positives (fp) are false alarm pixels, and the false negatives (fn) are missed ship pixels. Both the ground truth and the output of ISASDNet are boxes; we regard the pixels inside the boxes as ship pixels and the pixels outside as background.
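Both metrics can be computed directly; for the FoM, the combination tp/(tp + fp + fn) is assumed from the quantities named above:

```python
import numpy as np

def iou(bp, bg):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(bp[0], bg[0]), max(bp[1], bg[1])
    ix1, iy1 = min(bp[2], bg[2]), min(bp[3], bg[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(bp) + area(bg) - inter
    return inter / union if union > 0 else 0.0

def fom(pred_mask, gt_mask):
    """Pixel-level figure of merit: tp / (tp + fp + fn)."""
    tp = np.logical_and(pred_mask, gt_mask).sum()
    fp = np.logical_and(pred_mask, ~gt_mask).sum()
    fn = np.logical_and(~pred_mask, gt_mask).sum()
    return tp / (tp + fp + fn)
```

For example, two unit-offset 2x2 boxes overlap in a single cell, giving IOU = 1/7, while a mask identical to the ground truth gives FoM = 1.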

Experiment Results
In order to verify the performance of ISASDNet, we conducted experiments on the two aforementioned datasets. We also compared ISASDNet with other algorithms: Faster R-CNN, Mask R-CNN, YOLOv3, YOLOv4 [33], SSD, M2Det [62], RefineDet [63], D2Det [64], cell averaging CFAR (CA-CFAR) [65], and the visual attention model (VAM) [66]. CA-CFAR is an improved CFAR algorithm based on statistics, and VAM uses saliency maps to cluster and extract ships. CA-CFAR and VAM are traditional algorithms; the others are deep learning algorithms. When training the networks, 70% of the images were randomly selected to constitute the training set, while the remaining 30% constituted the test set. All experiments were implemented in Python 3.7 with a Quadro P5200 GPU.
Since the backbone network has an important influence on the result of object detection, ISASDNet uses two different backbone networks to extract features, namely ResNet50 and ResNet101. Like ISASDNet, Faster R-CNN and Mask R-CNN use ResNet50 and ResNet101 as backbone networks, and their backbones also include RPN and FPN. The mask extracted through the designed strategy is introduced during the training of Mask R-CNN. VGG16 is the backbone network of M2Det and RefineDet. Different ship sizes, inshore scenes, and closely arranged ships increase the difficulty of detection. Still, Faster R-CNN performs well and can identify ships in different scenes; with a ResNet50 backbone, it has lower false detection and missed detection rates and higher ship detection confidence. When training Mask R-CNN, the instance segmentation loss was introduced, which affects ship detection: for nearshore and densely arranged ships, the detection results of Mask R-CNN are worse than those of Faster R-CNN. For example, Mask R-CNN identifies some nearshore buildings as ships. As classical single-stage detection algorithms, YOLOv3 and SSD transform target detection into a regression problem; their accuracies are lower than those of the two-stage algorithms, both have many missed detections, and their recognition confidence is lower than that of Mask R-CNN. YOLOv4, which performs better than YOLOv3, is a very advanced single-stage algorithm; its detection results are very good, with few false detections. RefineDet, a variant of SSD, recognizes ships in open waters well but easily misses nearshore ships. Although M2Det also performs well, its recognition results often fail to completely enclose the ships. D2Det is a new two-stage detection method that introduces dense local regression to improve ship detection accuracy.
However, there are still a few false detections in its results. The last two rows in Figure 9 present the results of our proposed ISASDNet. ISASDNet can detect ships well under various complicated conditions; moreover, it has the fewest false detections and missed detections of all the algorithms. The quantitative results are shown in Table 2. Overall, ISASDNet with the ResNet50 backbone has the best performance, with an AP value of 0.601. The AP of ISASDNet with the ResNet101 backbone is 0.596, only 0.005 lower than the highest value. YOLOv4 has the second-best detection performance, with an AP of 0.596. AP50 and AP75 are also very important evaluation indices. Although the AP50 of all algorithms is greater than 0.750, that of ISASDNet is the best: regardless of whether the backbone is ResNet50 or ResNet101, the AP50 values of ISASDNet are higher than 0.95. Meanwhile, D2Det and ISASDNet with the ResNet101 backbone have the highest AP75 of all compared algorithms; although their results are very close, ISASDNet performs slightly better than D2Det. Objects can be divided by size into small, medium, and large. In SAR-Ship-Dataset, small objects account for 60.0% of all objects, medium objects for 39.7%, and large objects for only 0.3%, so the accurate detection of small and medium objects is more important. As can be seen from Table 2, ISASDNet has the largest AP_S value: 0.615 with the ResNet50 backbone and 0.609 with the ResNet101 backbone. For medium objects, Faster R-CNN with the ResNet50 backbone has the best result, with an AP_M of 0.631; the proposed ISASDNet also performs well, with AP_M values higher than 0.58. Although ISASDNet is not as good as Faster R-CNN for large objects, it still achieves good results there.
In summary, Table 2 shows that ISASDNet has better performance than the other deep learning algorithms.
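The size split reported above (APS, APM, APL) can be sketched as follows, assuming the standard COCO convention (area below 32² pixels is small, below 96² is medium, otherwise large); the box format and thresholds are illustrative assumptions, not taken from this paper.

```python
# Sketch of COCO-style object-size buckets used by the AP_S / AP_M / AP_L
# metrics. Thresholds (32^2, 96^2) follow the common COCO convention and
# are an assumption here; boxes are (x1, y1, x2, y2) in pixels.
def size_bucket(box):
    """Return 'small', 'medium', or 'large' for a bounding box."""
    w = box[2] - box[0]
    h = box[3] - box[1]
    area = w * h
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

print(size_bucket((10, 10, 30, 25)))  # 20x15 = 300 px^2 -> small
```

Under this convention, most ships in the SAR-Ship-Dataset (60.0% small, 39.7% medium) fall into the first two buckets, which is why APS and APM dominate the comparison.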

Results on SAR-Ship-Dataset
The proposed ISASDNet is also compared with the traditional algorithms CA-CFAR and VAM. When calculating the FoM, the results of ISASDNet and the ground truth are transformed into binary segmentation images; in other words, pixels inside the predicted boxes and the ground-truth boxes are regarded as ship pixels, and pixels outside the boxes as background. When ISASDNet produces prediction results, the confidence threshold is set to 0.5. Figure 10 shows the results of CA-CFAR and VAM on the SAR-Ship-Dataset. The two traditional algorithms can extract ships far from the coast, but they are less effective at detecting ships near the shore: CA-CFAR and VAM mistakenly detect land as ships, and small ships cannot be extracted completely. Table 3 shows the FoM of each algorithm. The FoM of ISASDNet with the ResNet101 backbone is the highest, while the FoMs of CA-CFAR and VAM are only 0.1103 and 0.1691, respectively. The performance of ISASDNet is much better than that of the traditional algorithms.
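The box-to-binary conversion described above can be sketched as follows; the image size and the (x1, y1, x2, y2) box format are illustrative assumptions.

```python
import numpy as np

# Sketch of the box-to-binary-mask conversion used before computing FoM:
# pixels inside any box become ship pixels (1), everything else background (0).
def boxes_to_mask(boxes, shape):
    """Rasterize axis-aligned boxes into a binary mask over an image."""
    mask = np.zeros(shape, dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1
    return mask

pred = boxes_to_mask([(5, 5, 20, 15)], (64, 64))   # predicted box
gt = boxes_to_mask([(6, 6, 21, 16)], (64, 64))     # ground-truth box
overlap = np.logical_and(pred, gt).sum()           # shared ship pixels
```

Once both results are in this binary form, the FoM can be computed pixel-wise against the ground-truth mask, which is what makes the traditional detectors and ISASDNet directly comparable.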

Results on SSDD
We also compared the performance of ISASDNet and the other algorithms on the SSDD dataset. Figures 11 and 12 show the detection results of the deep learning algorithms, and Table 4 presents the quantitative analysis. From the experimental results, it can be seen that YOLOv4 and D2Det detect ships very well, while Faster R-CNN is slightly inferior to them. Compared with Faster R-CNN, Mask R-CNN recognizes closely arranged ships poorly and often produces false detections on closely arranged and inshore ships. YOLOv3 and SSD recognize small and nearshore ships poorly and often miss detections. While the detection results of RefineDet and M2Det are slightly better than those of YOLOv3 and SSD, they also often make mistakes and omissions. ISASDNet achieves the best results of the compared algorithms; even in complex offshore situations, it produces good detections. As can be seen from Table 4, all algorithms produce good results on the SSDD dataset: all AP values are higher than 0.480, all AP50 values are higher than 0.850, and all AP75 values are higher than 0.50. However, ISASDNet with the ResNet101 backbone has the best performance, with an AP value of 0.627; meanwhile, the AP value of ISASDNet with the ResNet50 backbone is 0.610. The AP values of Faster R-CNN are 0.587 and 0.579 with the ResNet50 and ResNet101 backbones, respectively. The APS of Mask R-CNN with ResNet50 and with ResNet101 are 0.557 and 0.563, respectively, while the APS of YOLOv3, SSD, RefineDet, and M2Det are 0.508, 0.481, 0.588, and 0.498, respectively. The performance of YOLOv4 and D2Det is only slightly worse than that of our algorithm, with AP values of 0.601 and 0.594. From these results, it can be concluded that ISASDNet has better and more robust detection performance. Figure 13 shows the results of CA-CFAR and VAM on the SSDD.
VAM can extract a more complete ship contour than CA-CFAR and reduces the false alarm rate. However, both CA-CFAR and VAM easily confuse coastal ships with land, and in complex scenes the detection rate of the traditional algorithms remains relatively low. Table 5 shows the results of the quantitative analysis on the SSDD. ISASDNet is much better than CA-CFAR and VAM; the FoM of ISASDNet with the ResNet101 backbone is 0.6632, which is the highest.

Ablation Experiment and Parameter Analysis
The proposed GRM and MASDM have a great impact on ISASDNet. We conducted an ablation experiment on the GRM and MASDM and analyzed the results obtained in various situations. The proposed ISASDNet is based on Mask R-CNN; thus, Mask R-CNN with a ResNet50 backbone was regarded as the baseline (Case 1). For the GRM analysis, the mask branch of Mask R-CNN was replaced by the GRM module (Case 2). To evaluate the contribution of the MASDM to ship detection, the pixel branch of ISASDNet was replaced by the mask branch of Mask R-CNN (Case 3); in other words, the network in Case 3 had an extra MASDM module compared with the network in Case 1. Case 4 is the complete ISASDNet. We carried out ship detection experiments on the SAR-Ship-Dataset; Table 6 shows the results obtained in the four cases. As can be seen from Table 6, the results of Case 1 and Case 2 are similar: the AP value in Case 2 is only 0.013 higher than that in Case 1. In Cases 1 and 2, the network has no MASDM and cannot fuse the features extracted by the segmentation task with those extracted by the detection task. Thus, there is a large gap between the results obtained in these cases and those obtained in Cases 3 and 4. The AP value in Case 3 is 0.106 higher than that in Case 2, because the MASDM lets the information from the object branch and the pixel branch interact to promote the final detection results. The best result is obtained in Case 4: here, ISASDNet extracts the segmentation result better than the mask branch in Case 3 does, which makes the AP in Case 4 higher than that in Case 3. Figure 14 shows the object detection and instance segmentation results obtained in Case 3 and Case 4. Although neither case extracts the ship contour very well, the segmentation results obtained in Case 4 are better than those obtained in Case 3. Thus, the GRM promotes the ship detection results to a certain extent.
In the MASDM, the length m of the two 1D convolution kernels is very important, since it affects the probability value of each point. We trained ISASDNet with different m values on the SAR-Ship-Dataset. Figure 15 shows the results: when the kernel length m is 3, the MASDM achieves the best ship detection.
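As a rough illustration of how a pair of length-m 1D convolutions can act on a 2D feature map to produce a per-pixel probability (the actual MASDM layout follows the method section of the paper; the uniform kernels, single-channel map, and sigmoid head below are assumptions for illustration only):

```python
import numpy as np

def conv1d_same(x, k, axis):
    """'Same'-size 1D convolution of a 2D map along one axis (zero padding)."""
    m = len(k)
    pad = m // 2
    x = np.moveaxis(x, axis, 0)
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for i in range(m):
        out += k[i] * xp[i:i + x.shape[0]]
    return np.moveaxis(out, 0, axis)

m = 3                       # kernel length studied in the parameter analysis
k_h = np.ones(m) / m        # toy kernels; the real module learns its weights
k_w = np.ones(m) / m
feat = np.random.rand(32, 32)                 # toy single-channel feature map
smoothed = conv1d_same(conv1d_same(feat, k_h, axis=0), k_w, axis=1)
prob = 1.0 / (1.0 + np.exp(-smoothed))        # per-pixel probability in (0, 1)
```

A larger m aggregates evidence from a wider neighborhood when forming each point's probability, which is consistent with m being the sensitive parameter in Figure 15.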

Performance Comparison
In this subsection, we compare the performance of ISASDNet and the other deep learning algorithms from various aspects. First, we measured the inference time of each algorithm. Then, we trained each algorithm with different amounts of data and compared their performance. Finally, we added noise to the test images to evaluate the robustness of these algorithms. Table 7 shows the inference time of each algorithm, measured on a Quadro P5200 GPU. As expected, the one-stage object detection algorithms are faster than the two-stage ones; the inference time of the one-stage algorithms is less than 0.1 s. Faster R-CNN, Mask R-CNN, D2Det, and the proposed ISASDNet are two-stage algorithms, and their inference times differ depending on the backbone network. The inference speed of ISASDNet is slower than that of the other algorithms because the GRM and MASDM modules increase the inference time. The experiments with different amounts of data were conducted on the SAR-Ship-Dataset: we used 55%, 60%, and 65% of the data to train the deep learning algorithms, respectively. Table 8 shows the AP of each algorithm under the different data volumes. In all three groups of experiments, ISASDNet has the best performance: whether the backbone network is ResNet101 or ResNet50, the AP value of ISASDNet is higher than that of the other algorithms. When the training set is 55% of the total, the AP value of ISASDNet is not less than 0.58. Faster R-CNN, YOLOv4, and D2Det are slightly worse than ISASDNet, with AP values higher than 0.50, while the performance of Mask R-CNN, YOLOv3, SSD, RefineDet, and M2Det is not satisfactory, with AP values lower than 0.50. As the amount of training data increases, the accuracy of each algorithm increases. When the training set is 60% of the total, ISASDNet produces the highest AP value of 0.59. The performance of D2Det is better than that of the other algorithms but worse than that of ISASDNet.
YOLOv4 and Faster R-CNN also perform well, with AP values higher than 0.55, while the AP values of Mask R-CNN, YOLOv3, SSD, RefineDet, and M2Det are lower than 0.50. When the training set is 65% of the total, ISASDNet still performs better than the other algorithms.
To verify the robustness of the deep learning algorithms, Gaussian noise was added to the test images. The three noise groups were zero-mean Gaussian noise with variances of 0.1, 0.2, and 0.3, respectively. The noise experiments were carried out on the SAR-Ship-Dataset, with 65% of the images selected as the training set. Figure 16 shows the detection results of ISASDNet with the ResNet101 backbone under the different noise levels. The noise mainly affects the detection of small targets: many small ships are missed, and the prediction boxes may be offset. Table 9 shows the AP values of each algorithm for images with different noise levels. In general, the greater the variance of the added noise, the lower the detection accuracy. When the noise variance is 0.1, the AP value of ISASDNet with the ResNet101 backbone is 0.559; when the variance is 0.3, it is reduced to 0.518. The AP value of D2Det decreases from 0.546 to 0.508 with increasing noise, and that of YOLOv4 decreases from 0.546 to 0.503, while the accuracy of the other algorithms decreases even more. From Table 9, we can conclude that the AP value of ISASDNet is higher than that of the other algorithms under different noise levels, which proves that ISASDNet is more robust.
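The noise protocol above can be sketched as follows; scaling the image to [0, 1] and clipping the noisy result back into range are assumptions for illustration.

```python
import numpy as np

# Sketch of the robustness test: add zero-mean Gaussian noise with a chosen
# variance to an image in [0, 1], then clip back to the valid range.
def add_gaussian_noise(img, variance, seed=0):
    """Return a copy of img corrupted by N(0, variance) noise."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, np.sqrt(variance), size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

img = np.full((8, 8), 0.5)            # toy uniform test image
for var in (0.1, 0.2, 0.3):           # the three noise groups in the experiment
    noisy = add_gaussian_noise(img, var)
```

Note that a variance of 0.3 corresponds to a standard deviation of about 0.55 on a [0, 1] image, which explains why small, low-contrast ships are the first targets to be lost.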

Conclusions
In this study, ISASDNet is proposed for SAR ship detection. ISASDNet, which has a two-branch structure, uses instance segmentation to promote ship detection. In SAR images, the brightness of a ship is obviously higher than that of the sea surface; therefore, the designed global relational inference layer maps features to an interaction space to learn the interaction between ship and background. A GRM based on global relational inference layers can extract the instance segmentation results of ships; meanwhile, the designed MASDM integrates the information of the object branch and the pixel branch to improve the accuracy of ship detection. We also designed a strategy to extract the masks of SAR ships to train ISASDNet. Experimental results on the SAR-Ship-Dataset and the SSDD dataset prove that ISASDNet outperforms the other algorithms. Ablation experiments show that the GRM and MASDM effectively improve the detection rate. In addition, ISASDNet performs better than the other algorithms when different amounts of data are used for training, and the noise experiments prove that ISASDNet is more robust.
In future work, we will focus on ship instance segmentation and ship detection in more complex remote sensing scenes. We expect to apply graph convolutional neural networks, transformers, and semi-supervised learning to SAR images. According to the characteristics of SAR images, we hope to design better object detection and semantic segmentation algorithms.