Efficient Instance Segmentation Paradigm for Interpreting SAR and Optical Images

Abstract: Instance segmentation in remote sensing images is challenging because it requires both object-level discrimination and pixel-level segmentation of the objects. In remote sensing applications, instance segmentation adopts the instance-aware mask, rather than the horizontal or oriented bounding box used in object detection or the category-aware mask used in semantic segmentation, to interpret the objects together with their boundaries. Despite these distinct advantages, versatile instance segmentation methods for remote sensing images are still to be developed. In this paper, an efficient instance segmentation paradigm (EISP) for interpreting synthetic aperture radar (SAR) and optical images is proposed. EISP mainly consists of the Swin Transformer to construct the hierarchical features of SAR and optical images, the context information flow (CIF) for interweaving the semantic features from the bounding box branch to the mask branch, and the confluent loss function for refining the predicted masks. The following conclusions are drawn from experiments on PSeg-SSDD (Polygon Segmentation - SAR Ship Detection Dataset) and the NWPU VHR-10 instance segmentation dataset (optical dataset): (1) Swin-L, CIF, and the confluent loss function in EISP each contribute to the overall instance segmentation performance; (2) EISP* exceeds vanilla mask R-CNN by 4.2% AP on PSeg-SSDD and by 11.2% AP on the NWPU VHR-10 instance segmentation dataset; (3) poorly segmented masks, false alarms, missing segmentations, and aliasing masks are avoided to a great extent by EISP* when segmenting SAR and optical images; (4) EISP* achieves the highest instance segmentation AP among the compared state-of-the-art instance segmentation methods.


Introduction
Thanks to the advances in remote sensing (RS) technology, the capacity and quality of synthetic aperture radar (SAR) and optical images have significantly improved, which assists researchers in characterizing targets in high-resolution Earth observation. Meanwhile, the interpretation of SAR and optical images has an essential influence on various applications, e.g., urban management, land changes, and environmental monitoring [1][2][3][4]. Correspondingly, with the dramatically increased volume of RS images, efficient and universal methods for interpreting SAR and optical images have attracted the attention of the RS field.
In recent years, the deep convolutional neural network (DCNN) has been applied in various fields that benefit from its advantages, such as automatic feature extraction, end-to-end training capability, and minimal demand for prior knowledge. In RS, such networks have been used to improve the localization accuracy and preserve the spatial details of semantic segmentation in RS images.
Compared to object detection and semantic segmentation, instance segmentation in SAR and optical images inherits the characteristic of pixel-level prediction in semantic segmentation and supplements the localization and interclass classification in object detection, which provides comprehensive interpretation to SAR and optical images. However, related works in the RS field are scarce. In terms of the general instance segmentation methods, Su et al. proposed the high-quality instance segmentation network (HQ-ISNet) to interpret RS images under the complex background [17]. As for high-resolution aerial images, consistent proposals of instance segmentation network (CPISNet) integrates the cascaded detection branches and residual convolution networks to precisely segment the aerial instances [26]. Inspired by object detection, Chen et al. designed the instance segmentation network with the bounding box attention module and bounding box filter module [27].
In this paper, to address instance segmentation under complex backgrounds and densely distributed small objects in SAR and optical images, we propose the efficient instance segmentation paradigm (EISP) for interpreting SAR and optical images. EISP inherits the top-down instance segmentation paradigm and introduces three main components tailored to the characteristics of SAR and optical images. First, the Swin Transformer is adopted to extract the hierarchical feature maps of SAR and optical images and to model the long-range dependencies of the small objects with non-overlapping window-based self-attention. Second, the flattened features for object detection are transferred by a context information flow (CIF) module to interact with the features for mask prediction. Third, the proposed confluent loss function supervises the predicted segmentation masks by combining distributional, regional, and boundary terms for general segmentation tasks.
The main contributions of this paper are summarized as follows:
• EISP is proposed for efficient instance segmentation of remote sensing images.
• The effects of the Swin Transformer, CIF, and confluent loss function on EISP are individually verified; each boosts the overall network performance.
• EISP achieves the highest instance segmentation AP on remote sensing images compared to the other state-of-the-art methods.

Semantic Segmentation
Semantic segmentation is a subtask of image segmentation. It aims at assigning a semantic category to each pixel of the input image. Each pixel within a certain semantic category is marked by the same color. By defining and detailing the space of fully convolutional networks (FCNs), Long et al. first applied them to the dense prediction task of semantic segmentation [28]. In the field of biomedical image segmentation, Ronneberger et al. proposed a medical image segmentation network, termed U-Net, which consists of a contracting path for capturing context and a symmetric expanding path for precise localization [29]. Inheriting the encoder and decoder architecture, Badrinarayanan et al. mapped the low-resolution feature maps in the encoder to full-input-resolution feature maps in the decoder for pixel-wise classification [30]. In the pyramid scene parsing network (PSPNet), Zhao et al. introduced the pyramid pooling module, which provides a global prior representation for pixel-level prediction [31]. By further exploiting Deeplab v3 [32], Chen et al. proposed Deeplab v3+, which refines the segmentation results along the object boundaries [33].

Instance Segmentation
Distinguished from semantic segmentation, instance segmentation performs pixel-wise prediction in an image and enables the discrimination of objects within the same category. Instance segmentation methods can be divided into three categories: top-down methods, bottom-up methods, and direct methods. As the name suggests, top-down methods follow the formula of detect first, then segment. Based on the object detection architecture of faster R-CNN [34] (with prior bounding box detection), He et al. added a mask branch in parallel with the object detection branch for mask prediction, termed mask R-CNN [35]. Following the original architecture of mask R-CNN, mask scoring R-CNN calibrates the misalignment between mask quality and mask score [36]. Analogous to mask R-CNN, cascade mask R-CNN [37] adds a mask branch in parallel with the object detection branch in each stage of cascade R-CNN [37] for precise instance segmentation. To overcome the limited performance gain of simply integrating the mask branch in cascade mask R-CNN, hybrid task cascade (HTC) [38] interweaves the mask branches in cascade mask R-CNN for joint multi-stage processing and adopts a fully convolutional branch to provide spatial context. Moreover, SCNet [39] incorporates feature relay and global contextual information to further reinforce the reciprocal relationships of object detection and instance segmentation in cascaded architectures.
The two-stage process in top-down instance segmentation methods slows down the segmentation speed. In contrast, bottom-up instance segmentation methods segment the objects directly and are superior in segmentation speed. By generating a set of prototype masks, Yolact [40] predicts the mask coefficients of each instance for instance segmentation. BlendMask [41] implements instance segmentation by combining instance-level information with lower-level, fine-grained semantic information. Building on center classification and distance regression, PolarMask [42] generates the instance mask by predicting the object contour in polar coordinates. Inspired by mask R-CNN, conditional convolutions for instance segmentation (CondInst [43]) achieves fast inference speed via dynamically generated conditional convolutions and FCNs. Segmenting objects by locations (SOLO [44]) views instance segmentation as the task of assigning categories to each pixel within an instance according to the instance's location and size. Despite their fast inference speed, bottom-up instance segmentation methods are inferior in segmentation precision to top-down instance segmentation methods.

The Proposed Method
An overview of the proposed EISP is illustrated in Figure 1. It consists of the Shifted Windows (Swin) Transformer [45] to extract the hierarchical features of the input SAR and optical images, the region proposal network (RPN [46,47]) and region of interest (RoI) extractor to generate the region proposals, the context information flow (CIF) to interweave the semantic features from the bounding box branch to mask branch, and the confluent loss function to refine the predicted masks.

Swin Transformer
Transformers use the attention mechanism to model long-range dependencies in the data, and they have achieved tremendous success in the natural language processing (NLP) domain. Here, we introduce the Swin Transformer to extract the multilevel features of SAR and optical images. The Swin Transformer computes self-attention within non-overlapping local windows to reduce the network complexity, and constructs a hierarchical architecture to capture multilevel feature maps for multiscale segmentation. As small objects make up the vast majority of satellite objects, the non-overlapping window-based self-attention in the Swin Transformer can effectively capture their long-range dependencies, owing to the relatively large object-to-background ratio, and eliminate interference from the complex background at the same time. The overall architecture of the Swin Transformer is illustrated in Figure 2. It contains the operations of patch partition, linear embedding, Swin Transformer block, and patch merging. Given the input RS image with the size of H × W × 3, patch partition splits it into non-overlapping 4 × 4 patches and flattens each patch, which yields image patches with the size of H/4 × W/4 × 48 (4 × 4 × 3 = 48). Then, the linear embedding layer projects the channel dimension of the image patches to an arbitrary number C. Next, the Swin Transformer block processes the image patches with shifted window-based self-attention in non-overlapping windows. With the consecutive patch merging layers and Swin Transformer blocks, the hierarchical architecture of the Swin Transformer is constructed.
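The patch partition and linear embedding steps can be illustrated with a minimal NumPy sketch. The 4 × 4 patch size follows from the H/4 × W/4 × 48 output stated above; the 512 × 512 input, the random projection weights, and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def patch_partition(image, patch=4):
    """Split an H x W x C image into non-overlapping patch x patch tiles
    and flatten each tile into one token of length patch * patch * C."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by the patch size"
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)  # (H/4, W/4, patch, patch, C)
    return tiles.reshape(h // patch, w // patch, patch * patch * c)

def linear_embedding(tokens, weight):
    """Project the channel dimension of every token to C (hypothetical weights)."""
    return tokens @ weight

rng = np.random.default_rng(0)
image = rng.random((512, 512, 3))               # assumed input size
tokens = patch_partition(image)                 # shape (128, 128, 48)
weight = rng.standard_normal((48, 96)) * 0.02   # assumed C = 96
embedded = linear_embedding(tokens, weight)     # shape (128, 128, 96)
print(tokens.shape, embedded.shape)
```

Each token simply stacks the 4 × 4 × 3 raw pixel values of one patch, which is why the channel dimension equals 48 before the projection.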

Swin Transformer Block
Segmentation tasks require per-pixel prediction on the input images. However, the computational complexity of the self-attention module in transformers is quadratic in the image size in such application scenes, which makes it intractable for a transformer to segment high-resolution remote sensing images. Therefore, the Swin Transformer block replaces the multihead self-attention (MSA) module in the block of the vision transformer (ViT) with the window-based multihead self-attention (W-MSA) module and the shifted window-based multihead self-attention (SW-MSA) module. Assuming the size of the input feature is H × W × C, the computational complexity O_MSA of the traditional MSA module is computed via

$$O_{\mathrm{MSA}} = 4HWC^{2} + 2(HW)^{2}C$$

It is obvious that O_MSA is quadratic with regard to HW. However, as W-MSA and SW-MSA compute the self-attention within each evenly partitioned window (with the size of S × S) of the image, the computational complexity of W-MSA and SW-MSA is

$$O_{\mathrm{W\text{-}MSA}} = 4HWC^{2} + 2S^{2}HWC$$

which shows a linear relationship to HW. Compared to MSA, W-MSA and SW-MSA are therefore scalable to high-resolution remote sensing images. W-MSA evenly splits the image into 2 × 2 windows with the size of M × M. In the two consecutive Swin Transformer blocks, as in Figure 3, SW-MSA shifts the partitioned windows in W-MSA by (M/2, M/2) pixels, which can also be formulated as follows:

$$m^{l} = \text{W-MSA}(\mathrm{LN}(n^{l-1})) + n^{l-1}$$
$$n^{l} = \mathrm{MLP}(\mathrm{LN}(m^{l})) + m^{l}$$
$$m^{l+1} = \text{SW-MSA}(\mathrm{LN}(n^{l})) + n^{l}$$
$$n^{l+1} = \mathrm{MLP}(\mathrm{LN}(m^{l+1})) + m^{l+1}$$

where LN denotes the layer normalization operation and MLP is the multilayer perceptron module in the transformer architectures. m^l and n^l represent the output of the (S)W-MSA module and the MLP module of block l, respectively. The Swin Transformer adopts such consecutive Swin Transformer blocks in each stage.
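To make the quadratic versus linear behaviour concrete, the following sketch evaluates the two complexity expressions, assuming the standard forms from the Swin Transformer literature: 4HWC² + 2(HW)²C for MSA and 4HWC² + 2S²HWC for W-MSA. The feature map sizes in the loop and the function names are illustrative choices only.

```python
def msa_complexity(h, w, c):
    """Global multihead self-attention: quadratic in the number of tokens h * w."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c

def wmsa_complexity(h, w, c, s):
    """Window-based self-attention with s x s windows: linear in h * w."""
    return 4 * h * w * c ** 2 + 2 * s ** 2 * h * w * c

# Compare both modules as the feature map grows (C = 96, S = 7 assumed).
for side in (32, 64, 128):
    global_cost = msa_complexity(side, side, 96)
    window_cost = wmsa_complexity(side, side, 96, 7)
    print(side, global_cost, window_cost, round(global_cost / window_cost, 1))
```

Doubling the number of tokens doubles the W-MSA cost, while the attention term of MSA grows fourfold, which is the scalability argument made above.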

Hyperparameters Setting
As illustrated in Figure 2, the number of successive Swin Transformer blocks in each stage and the number of output channels C in the linear embedding layer define the network space (the depth and width) of the Swin Transformer. Consequently, Swin Transformer small (Swin-S) has C = 96 and {2, 2, 18, 2} successive blocks in Stage1 to Stage4. Analogously, Swin Transformer base (Swin-B) has C = 128 and successive blocks of {2, 2, 18, 2}; Swin Transformer large (Swin-L) has C = 192 and successive blocks of {2, 2, 18, 2}. In terms of the window size, we maintain a size of 7 × 7 for each evenly partitioned window.
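The three variants can be summarized as a small configuration table. The sketch below is only an illustration: the `SwinConfig` class and its field names are hypothetical, and the per-stage channel widths assume that each patch merging layer doubles the channel number, as in the original Swin Transformer design (this is not stated in the text above).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SwinConfig:
    embed_dim: int          # C, output channels of the linear embedding layer
    depths: tuple           # successive blocks in Stage1 to Stage4
    window_size: int = 7    # window size used for (S)W-MSA

    def stage_channels(self):
        # Assumption: patch merging doubles the channels between stages.
        return tuple(self.embed_dim * 2 ** i for i in range(len(self.depths)))

SWIN_VARIANTS = {
    "Swin-S": SwinConfig(embed_dim=96, depths=(2, 2, 18, 2)),
    "Swin-B": SwinConfig(embed_dim=128, depths=(2, 2, 18, 2)),
    "Swin-L": SwinConfig(embed_dim=192, depths=(2, 2, 18, 2)),
}

for name, cfg in SWIN_VARIANTS.items():
    print(name, cfg.depths, cfg.stage_channels())
```

With identical depths, the three variants differ only in width, so the comparison among them isolates the effect of the channel number C.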

Context Information Flow
Motivated by the implicit mutual information between the sub-tasks of classification, localization, and mask prediction, we design the context information flow (CIF) to explicitly incorporate the deep representative features in object detection with the mask RoI features to improve the performance of mask prediction. Generally, the bounding box features provide the prior information for mask prediction. In turn, the predicted masks can supervise the bounding box features via backpropagation. Therefore, we add the CIF to build a shortcut connection between the detection branch and the mask branch to benefit both tasks. The architecture of CIF is outlined in Figure 1.
Assuming the pooled bounding box RoI features from FPN are Φ (N × 256 × 7 × 7), we flatten Φ to Φ′ and apply two fully connected (FC) layers to map the distributed feature to the target feature Q (N × 1024), which can be presented as follows:

$$Q = \mathrm{FC}(\mathrm{FC}(\Phi'; \theta_{1}); \theta_{2})$$

where FC(*; θ_i) denotes the FC layer with parameter θ_i. To be consistent with the space of samples in the mask branch, Q is sliced to Q′ with the size of P × 1024. Next, a supplemented FC layer is attached to Q′ for reassembling the context information from the detection branch. Immediately afterwards, the distributed features from the FC layer are reconstructed to the multidimensional feature M (P × 256 × 7 × 7). The process is formulated as follows:

$$M = \mathrm{Reshape}(\mathrm{FC}(Q'; \theta_{3}))$$

To match the input size (P × 256 × 14 × 14) of the mask branch and enlarge the receptive field when processing the feature M (denoted Ψ in the following), we upsample it with the content-aware reassembly module (CARM) in two steps: content-aware kernel generation and feature reassembly. The overall process is shown in Figure 4.

• Step 1: Content-aware Kernel Generation
Figure 4a illustrates the implementation of Step 1, which is responsible for generating the k_up × k_up kernel corresponding to each object location. It is composed of four sub-tasks:
(1) A channel refactor is applied to compress the channels of Ψ to reduce the computational cost and model complexity. We choose a 1 × 1 convolutional kernel to compress the input channels from C to C_m, making CIF lightweight but efficient.
(2) The content encoder, which relies on a 3 × 3 convolutional kernel, encodes the feature with 4 × k_up² output channels.
(3) Assuming the encoded feature is M′, we upsample it with the pixel shuffle operation to generate the reassembly kernel W with the size of P × k_up² × 28 × 28.
(4) Before being used in the feature reassembly process, each spatial location of W is transformed to Ω by a softmax function, which normalizes the channel-wise sum of each kernel to 1.
The procedure can be formulated as follows:

$$M' = \mathrm{Conv}(\mathrm{Conv}(\Psi; w_{1}); w_{2})$$
$$W = \mathrm{PixelShuffle}(M')$$
$$\Omega = \mathrm{Softmax}(W)$$

where Conv(*; w_i) represents the convolutional kernel with parameter w_i.

• Step 2: Feature Reassembly
Figure 4b illustrates the implementation of Step 2, which applies the content-aware kernel to reassemble the input feature in the spatial dimension. Each location α = (i, j) in the input feature Ψ is associated with an α-centered square region N(Ψ_α, k_up). Correspondingly, each k_up × k_up content-aware kernel Ω_α′ in Ω is combined with N(Ψ_α, k_up) by weighted summation, which produces the pixel α′ = (i′, j′) of the upsampled feature Ψ_1. The reassembly is described via

$$\Psi_{1}(i', j') = \sum_{n=-r}^{r} \sum_{m=-r}^{r} \Omega_{\alpha'}(n, m) \cdot \Psi(i+n, j+m), \quad r = \lfloor k_{up}/2 \rfloor$$

where · represents the element-wise weighting between Ψ(i+n, j+m) and Ω_α′. The feature upsampled by CARM contains stronger semantic information than that of traditional upsampling methods, e.g., bilinear interpolation, as it leverages the underlying context information in the original feature map. Through the context information flow from Φ to Ψ, the distributed feature for object detection is reconstructed to the size of P × 256 × 14 × 14, the same as that in mask prediction. Finally, we apply element-wise summation to Ψ_1 and the input mask feature P to generate the shortcut connection, which is shown below:

$$P' = P + \Psi_{1}$$

where P′ is the CIF-enhanced feature for mask prediction.
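The kernel normalization and feature reassembly steps of CARM can be sketched for a single RoI in NumPy as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the encoded feature is a random stand-in for the output of the two convolutions, k_up = 5, the upsampling ratio is 2 (so a 7 × 7 input gives a 14 × 14 output and kernel map in this sketch), borders use edge padding, and all function names are hypothetical.

```python
import numpy as np

def pixel_shuffle(x, ratio):
    """Rearrange (C * ratio^2, H, W) into (C, H * ratio, W * ratio)."""
    c_r2, h, w = x.shape
    c = c_r2 // (ratio * ratio)
    x = x.reshape(c, ratio, ratio, h, w).transpose(0, 3, 1, 4, 2)
    return x.reshape(c, h * ratio, w * ratio)

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reassemble(feature, kernels, k_up, ratio):
    """Weighted sum of each k_up x k_up source neighbourhood with its own kernel."""
    c, h, w = feature.shape
    pad = k_up // 2
    padded = np.pad(feature, ((0, 0), (pad, pad), (pad, pad)), mode="edge")  # assumed padding
    out = np.empty((c, h * ratio, w * ratio), dtype=feature.dtype)
    for i_up in range(h * ratio):
        for j_up in range(w * ratio):
            i, j = i_up // ratio, j_up // ratio               # source location alpha
            region = padded[:, i:i + k_up, j:j + k_up]        # neighbourhood centred on alpha
            kernel = kernels[:, i_up, j_up].reshape(k_up, k_up)
            out[:, i_up, j_up] = (region * kernel).sum(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
k_up, ratio = 5, 2
psi = rng.standard_normal((256, 7, 7))                               # reshaped detection feature
encoded = rng.standard_normal((ratio * ratio * k_up * k_up, 7, 7))   # stand-in for the encoder output
omega = softmax(pixel_shuffle(encoded, ratio), axis=0)               # each kernel sums to 1
psi_up = reassemble(psi, omega, k_up, ratio)                         # (256, 14, 14)
mask_feature = rng.standard_normal((256, 14, 14))                    # stand-in for the mask RoI feature
enhanced = mask_feature + psi_up                                     # shortcut connection
print(omega.shape, psi_up.shape, enhanced.shape)
```

Because every kernel is normalized to sum to 1, a constant input feature is reproduced unchanged at the higher resolution, which is a convenient sanity check for such a module.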

Confluent Loss Function
Similar to object detection, instance segmentation retains the object-level discrimination in the segmentation task. Empirically, researchers extend the cross entropy (CE) loss function in object detection to the binary cross entropy (BCE) loss function for instance segmentation. The BCE loss function is calculated via:

$$L_{BCE} = -\sum_{i} \sum_{(x,y)} \left[ T_{i(x,y)} \log P_{i(x,y)} + \left(1 - T_{i(x,y)}\right) \log\left(1 - P_{i(x,y)}\right) \right]$$

where T_i(x,y) is the pixel located at (x, y) of the ith level of the ground truth feature map and P_i(x,y) is the corresponding pixel of the predicted feature map. However, pixel-level prediction intrinsically requires instance segmentation to consider the regional dependencies and reduce the boundary migration, as in semantic segmentation tasks. As for regional dependencies, assuming there are two numerical sets X and Y, the dice score coefficient (DSC) can be expressed as Equation (15):

$$\mathrm{DSC}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|} \tag{15}$$

Analogously, for P and T, the operator ∩ equals the element-wise product, and the operator | | equals the sum of squares. Therefore, the binary dice (BD) loss function can be formulated as Equation (16):

$$L_{BD} = 1 - \frac{2 \sum_{i} \sum_{(x,y)} P_{i(x,y)} T_{i(x,y)}}{\sum_{i} \sum_{(x,y)} P_{i(x,y)}^{2} + \sum_{i} \sum_{(x,y)} T_{i(x,y)}^{2}} \tag{16}$$

In the next part, we supplement the boundary information to supervise the predicted mask. Given the distance map D_G of the ground truth mask, the nonsymmetric L2 distance between the predicted boundary (∂P) and the ground truth boundary (∂G) can be calculated via the regional integrals:

$$\mathrm{Dist}(\partial G, \partial P) \approx 2\left( \int_{\Omega} D_{G}(q)\, s(q)\, dq - \int_{\Omega} D_{G}(q)\, g(q)\, dq \right) \tag{17}$$

where q represents a pixel, Ω is the area enclosing the predicted contour and the ground truth contour, and s(q) and g(q) are the binary indicator functions of the predicted mask and the ground truth mask. Considering that the result of ∫_Ω D_G(q) g(q) dq hinges only on the ground truth mask and is therefore a constant, we formulate the boundary distance with ∫_Ω D_G(q) s(q) dq. Therefore, the binary boundary (BB) loss for instance segmentation is calculated as follows:

$$L_{BB} = \sum_{i} \sum_{(x,y)} \kappa_{i(x,y)} P_{i(x,y)}$$

where ι(*) denotes the distance map of a mask; T̄_i(x,y) is the inverse of the ground truth mask T_i(x,y); κ_i(x,y) is the normalized distance map ι(T̄_i(x,y)).
Thus, our confluent loss for mask prediction can be formulated as follows:

$$L_{C} = L_{BCE} + \gamma L_{BD} + \lambda L_{BB}$$

where γ and λ represent the loss weights for the BD loss (L_BD) and the BB loss (L_BB), respectively. Following [26], we set γ + λ = 3 here to maintain the ratio of 3:1 between the regional loss function and the distributional loss function. Therefore, L_C can be transformed to

$$L_{C} = L_{BCE} + (3 - \lambda) L_{BD} + \lambda L_{BB}$$

Note that the value of λ will be determined in our subsequent ablation experiments.
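A minimal NumPy sketch of how the three terms could be combined for a single mask is given below, assuming the form L_C = L_BCE + (3 − λ) L_BD + λ L_BB. Several details are assumptions rather than the authors' implementation: the mean reduction, the small constant EPS, the max-normalization of the distance map, and the brute-force distance transform (suitable only for small masks); all function names are hypothetical.

```python
import numpy as np

EPS = 1e-7  # numerical guard, an assumption of this sketch

def bce_loss(pred, target):
    pred = np.clip(pred, EPS, 1.0 - EPS)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def binary_dice_loss(pred, target):
    inter = np.sum(pred * target)                    # "intersection" as element-wise product
    denom = np.sum(pred ** 2) + np.sum(target ** 2)  # "size" as sum of squares
    return float(1.0 - 2.0 * inter / (denom + EPS))

def distance_map(mask):
    """Distance from each nonzero pixel to the nearest zero pixel (brute force)."""
    mask = mask.astype(bool)
    out = np.zeros(mask.shape)
    if mask.all() or not mask.any():
        return out
    ones, zeros = np.argwhere(mask), np.argwhere(~mask)
    diff = ones[:, None, :] - zeros[None, :, :]
    out[mask] = np.sqrt((diff ** 2).sum(-1)).min(axis=1)
    return out

def binary_boundary_loss(pred, target):
    dist = distance_map(1 - target)                  # distance map of the inverse ground truth mask
    kappa = dist / dist.max() if dist.max() > 0 else dist
    return float(np.mean(kappa * pred))

def confluent_loss(pred, target, lam=0.3):
    gamma = 3.0 - lam                                # keeps gamma + lambda = 3
    return (bce_loss(pred, target)
            + gamma * binary_dice_loss(pred, target)
            + lam * binary_boundary_loss(pred, target))

target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0
shifted = np.roll(target, 3, axis=1)                 # a poorly localized prediction
print(confluent_loss(target, target), confluent_loss(shifted, target))
```

In this sketch the boundary term penalizes predicted foreground in proportion to its distance from the ground truth region, so it complements the overlap-based dice term when masks drift away from the object.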

Experiments
In this section, we introduce the datasets for experiments, evaluation metrics, and implementation details in advance. Then, comprehensive experiments on the SAR Ship Detection Dataset (SSDD) and NWPU VHR-10 Instance Segmentation Dataset are conducted on our proposed EISP to verify its effectiveness.

SAR Ship Detection Dataset
The SAR Ship Detection Dataset (SSDD) is the first dataset for SAR imagery-based intelligent interpretation presented by Li et al. In [48], vanilla SSDD with horizontal bounding box annotation is extended to pixel-level polygon segmentation SSDD (PSeg-SSDD), which supports the instance segmentation of SAR imagery in our work. Consistent with the data volume in SSDD, PSeg-SSDD contains 1160 SAR images in total with various polarizations, resolutions, and scenes. In our experiments, we randomly divided the PSeg-SSDD into the training set and test set with the ratio of 7:3 for training and testing, respectively. The annotations of ships are standardized to COCO format with ground truth mask, area, and related bounding box for instance segmentation.

NWPU VHR-10 Instance Segmentation Dataset
The NWPU VHR-10 Instance Segmentation Dataset is extended by Wei et al. [17] with pixel-level polygon annotation from the vanilla NWPU VHR-10 Dataset, which supports the tasks of object detection, semantic segmentation, and instance segmentation in very high-resolution (VHR) optical remote sensing imagery. The NWPU VHR-10 Instance Segmentation Dataset contains 650 VHR images with annotated targets and 150 VHR images with a pure background. There are 10 classes of targets scattered in the dataset, including bridge (BR), basketball court (BC), storage tank (ST), harbor (HB), tennis court (TC), ship (SH), vehicle (VC), ground track field (GTF), baseball diamond (BD), and airplane (AI). Following the original division ratio of 7:3 in [15] that is used for generating the training set and test set, we obtain the training set and test set for our experiments.

Evaluation Metrics
Standard Microsoft Common Objects in Context (MS COCO [49]) evaluation metrics are adopted for evaluating the quantitative instance segmentation results generated on the test set. Based on the intersection over union (IoU) of the predicted results and ground truth results, the IoU ratio of each predicted result is defined by

$$\mathrm{IoU} = \frac{|P_{mask} \cap G_{mask}|}{|P_{mask} \cup G_{mask}|}$$

where the predicted mask and ground truth mask are represented by P_mask and G_mask, respectively. Given a prior IoU threshold, the predictions of instance segmentation can be categorized into true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Then, the corresponding precision and recall values are calculated via

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

Constructing the Cartesian coordinate system with the recall value as the abscissa and the precision value as the ordinate, the average precision (AP) at the prior IoU threshold is calculated by

$$AP_{IoU} = \int_{0}^{1} P(r)\, dr$$

where r is the recall value and P(r) is the precision value at that recall. Under the MS COCO evaluation metrics used in our experiments, the AP value is the average of the 10 AP_IoU values from 0.5 to 0.95 with a stride of 0.05, which is calculated as follows:

$$AP = \frac{1}{10} \sum_{IoU \in \{0.5, 0.55, \ldots, 0.95\}} AP_{IoU}$$

For a dataset with N classes, the mean AP (mAP) is the average AP value over the classes:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}$$
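The metric definitions above can be sketched in a few lines of NumPy. This is a simplified illustration, not the official MS COCO evaluation code: the AP here is a plain rectangle-rule area under the precision-recall curve rather than COCO's interpolated variant, and the function names and toy inputs are assumptions.

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve at one IoU threshold."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # rank by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))

COCO_IOU_THRESHOLDS = np.linspace(0.5, 0.95, 10)            # 0.5, 0.55, ..., 0.95

def coco_ap(ap_per_threshold):
    """Average the 10 per-threshold AP values."""
    assert len(ap_per_threshold) == len(COCO_IOU_THRESHOLDS)
    return float(np.mean(ap_per_threshold))

pred = np.zeros((4, 4)); pred[0:2, 0:2] = 1
gt = np.zeros((4, 4)); gt[0:2, 1:3] = 1
print(mask_iou(pred, gt))
print(average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2))
```

A prediction counts as a TP at a given threshold when its mask IoU with an unmatched ground truth instance reaches that threshold, so the same ranked list generally yields a lower AP as the threshold rises from 0.5 to 0.95.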

FLOPs
In computer vision, the number of trainable parameters of a convolution layer is calculated via

$$\mathrm{Params} = K \times K \times C_{in} \times C_{out}$$

where C_in and C_out are the numbers of input and output channels of the K × K convolution kernel. If the width W and height H of the input are given, the floating-point operations (FLOPs) for measuring the model complexity of CNN-based architectures are defined by:

$$\mathrm{FLOPs} = H \times W \times K \times K \times C_{in} \times C_{out}$$

Implementation Details
In our experiments, all of the methods are implemented in the PyTorch framework. Training and testing are performed on a single Nvidia Quadro RTX 6000 GPU. For training, Adam is selected as the model optimizer. Each model is trained for 12 epochs with a mini-batch size of two. The initial learning rate is set to 0.0025 and decayed by a factor of 0.1 at the 8th and 11th epochs. For testing, we select soft non-maximum suppression (Soft NMS) with a threshold of 0.5 to filter the finest bounding boxes among the predictions. Note that the images in the NWPU VHR-10 instance segmentation dataset and PSeg-SSDD are resized to 1000 × 600 pixels and 512 × 512 pixels, respectively, for training and testing.
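The step learning rate schedule described above can be written as a short helper. This is a sketch under an assumption about indexing: epochs are counted from 1 and the decay takes effect from the start of the 8th and 11th epochs; the exact convention of the training framework used by the authors may differ. The function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.0025, milestones=(8, 11), gamma=0.1):
    """Step schedule: multiply the base rate by gamma once per milestone reached.

    `epoch` is 1-based (assumption of this sketch)."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops

for epoch in range(1, 13):  # 12 training epochs
    print(epoch, learning_rate(epoch))
```

Under this convention, epochs 1 to 7 use 0.0025, epochs 8 to 10 use one tenth of it, and epochs 11 and 12 use one hundredth of it.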

Effects of the EISP
To verify the effects of the Swin Transformer, CIF, and confluent loss function on vanilla mask R-CNN and their overall utility in EISP, we individually measured the AP gain of each module and the integral AP of EISP. As reported in Table 1, Swin-L, CIF, and the confluent loss function yield AP gains of 2.3%, 1.7%, and 1.8%, respectively, over vanilla mask R-CNN on PSeg-SSDD. In addition, EISP achieves 3.6%, 4.7%, and 4.8% improvement in AP, AP 50, and AP 75, respectively. In the scale-differentiated AP indicators, EISP even achieves an 8.1% AP M improvement. In the counterpart experimental results on the NWPU VHR-10 instance segmentation dataset, Swin-L, CIF, and the confluent loss function yield AP gains of 5.2%, 1.5%, and 4.9%, respectively, over vanilla mask R-CNN. EISP achieves 8.2%, 4.8%, and 12.0% improvement in AP, AP 50, and AP 75, respectively. As for the scale-differentiated AP indicators, it even achieves ∼9.0% improvement. In terms of computational complexity, the proposed confluent loss function brings a considerable AP improvement without adding FLOPs, whereas CIF and the Swin Transformer require additional FLOPs. The qualitative results of EISP are illustrated in Figure 5, where the contours of the objects fit the corresponding ground truth masks well.

Ablation Experiments
In this section, we conduct experiments to select the optimal network structure for each module of the proposed EISP.

Experiments on Swin Transformer
Apart from the various architectures of Swin Transformers mentioned in Section 3.1.2, we add several mainstream backbone networks, including Res2Net [50], HRNet [51,52], and RegNetx [53], for comprehensive contrast experiments. As shown in Table 2, the mainstream backbone networks serve as efficient feature extractors for SAR and optical images. In the experimental results on PSeg-SSDD, HRNetv2-w32 improves the AP by 0.4% over ResNet-101; RegNetx and Res2Net achieve ∼0.9% AP improvement over ResNet-101. Overall, Swin-S, Swin-B, and Swin-L achieve 0.5%, 1.1%, and 2.3% AP improvement in segmenting the SAR images, respectively. In the counterpart experimental results on the NWPU VHR-10 instance segmentation dataset, HRNetv2-w32 and RegNetx4.0G improve the AP by 2.4% over ResNet-101; Res2Net achieves 3.0% AP improvement over ResNet-101. Overall, Swin-S, Swin-B, and Swin-L achieve 5.6%, 6.0%, and 6.4% AP improvement in segmenting the optical images, respectively. The experimental results on PSeg-SSDD and the NWPU VHR-10 instance segmentation dataset indicate that the Swin Transformer is efficient in segmenting SAR and optical images.

Experiments on CIF
Considering the efficient context information flow from the bounding box branch to the mask branch, we compress the number of feature channels in the feature reassemble step. Here, we select the channel numbers 32, 64, 128, and 256 for experiments and the results are listed in Table 3. As for the results in PSeg-SSDD, the channel number of 64 achieves salient AP performance (58.4% AP value) compared to the rest of the situations. In the counterpart results of NWPU VHR-10, the channel number of 32 shows competitive performance to the channel number of 64 (59.3% AP value vs. 59.4% AP value). In general, for segmenting the SAR and optical images, we choose the channel number of 64 for our CIF module.

Experiments on Confluent Loss Function
As described in Section 3.3, we conduct ablation experiments on the value of λ to select the optimal choice for segmenting the SAR and optical images in Table 4. In the results of PSeg-SSDD, λ = 0 and λ = 0.3 retain competitive performance for SAR images (58.4% AP value vs. 58.5% AP value). With the increase of λ value, the AP value gradually decreases from 58.5% to 55.6%. In the counterpart results of NWPU VHR-10, the AP value increases at λ = 0.3 then decreases, and the numerical span is 8.1% (62.8% AP value vs. 54.7% AP value). For comprehensive consideration of the results of PSeg-SSDD and NWPU VHR-10 instance segmentation dataset, λ = 0.3 should be the optimal selection for instance segmentation of SAR and optical images.

Ship Segmentation Result of PSeg-SSDD
To verify the general effects of the proposed EISP, we selected seven mainstream instance segmentation methods for comparison, including Yolact [40], mask R-CNN, Instaboost [54], mask scoring R-CNN (MS R-CNN), cascade mask R-CNN (CM R-CNN), hybrid task cascade (HTC), and HQ-ISNet, which cover the categories of top-down, bottom-up, and RS-dedicated instance segmentation methods. The training and test hyperparameters follow the default settings in [55] except for those described in Section 4.3. Note that the top-down and bottom-up instance segmentation methods adopt ResNet-101 and FPN as the feature extraction structure. The quantitative results are summarized in Table 5. As a bottom-up instance segmentation method, Yolact merely achieves 44.5% AP in segmenting the SAR images. However, benefiting from the direct segmentation of the SAR ships, Yolact gains 37.2% AP L, which is superior in segmenting the large ships compared to the top-down instance segmentation methods. Instaboost and MS R-CNN optimize mask R-CNN with location probability map guided mask annotations and mask quality to mask score calibration, respectively. Integrating the cascaded architectures, CM R-CNN and HTC further exceed mask R-CNN by 1.5% and 1.8% AP, respectively. In the remote sensing field, HQ-ISNet achieves state-of-the-art performance in segmenting RS images. By refactoring HQ-ISNet and applying our training and test conditions, HQ-ISNet gains 59.4% AP, leading the compared methods in the IoU-differentiated (AP 50 and AP 75) and scale-differentiated (AP S, AP M, and AP L) AP values.
As presented in Table 5, our proposed EISP obtains the highest (60.9%) AP compared to the state-of-the-art methods. It exceeds Yolact, mask R-CNN, and HQ-ISNet by 16.4%, 4.2%, and 1.5% AP in segmenting the SAR ships, respectively. In addition, it achieves the highest AP 50 (93.3%) and AP 75 (73.3%) values. As for segmenting the medium ships, EISP still yields a 4.0% AP increment with regard to HQ-ISNet. Considering the scale variance of RS images, we add multiscale training to further improve the performance of EISP. In the training phase, the images are resized to 512 × 448, 512 × 480, 512 × 512, 512 × 544, 512 × 576, and 512 × 608 pixels. In the test phase, the images retain the size of 512 × 512 pixels. We name the EISP with the multiscale training scheme EISP*. Without bells and whistles, EISP* achieves 60.9% AP, which further improves on the AP of EISP by 0.6%. In addition, it exceeds Yolact, mask R-CNN, and HQ-ISNet by 17.0%, 4.8%, and 2.1% AP in segmenting the SAR ships, respectively. At the cost of 20.8G FLOPs, EISP and EISP* exceed HQ-ISNet by 0.9% and 1.5% AP, respectively. In addition, we provide the precision-recall (PR) curve of AP 50 for each state-of-the-art method in Figure 6, where the area enclosed by the x-axis, the y-axis, and the curve represents the AP 50. As presented in the left part of Figure 6, EISP and EISP* perform better than the state-of-the-art methods under the AP 50 metric.
Apart from the quantitative results, we visualize the qualitative ship segmentation results in Figure 7. In the inshore scenes, such as the port, the state-of-the-art methods find it hard to distinguish the ships surrounded by highly reflective artificial facilities. Thus, they are prone to generate false alarms (highlighted by purple rectangles), missing segmentations (highlighted by orange rectangles), aliasing masks (highlighted by red rectangles), and poorly segmented masks (highlighted by blue rectangles).
In the offshore scenes, aliasing masks and missing segmentations occasionally appear among the densely distributed ships; however, in the counterpart results of EISP*, such defects are effectively suppressed and the segmented masks fit the ground truth closely, which cross-validates the effectiveness of EISP* on SAR images. Nevertheless, the false alarms appearing in row 9, column 4, and the aliasing masks appearing in row 9, column 5 of Figure 7 indicate that EISP* can be further improved to cope with these cases.

Instance Segmentation Result of NWPU VHR-10
In accordance with the instance segmentation experiments on PSeg-SSDD, the seven state-of-the-art instance segmentation methods are compared with our proposed EISP under the same training and test hyperparameter settings. Considering the scale variance of the optical images in the NWPU VHR-10 instance segmentation dataset, we define the image sizes of the multiscale training scheme for EISP* as 1000 × 800, 1000 × 700, 1000 × 600, 1000 × 500, and 1000 × 400 pixels, while the test image size remains 1000 × 600 pixels. In contrast to the results on PSeg-SSDD, EISP and EISP* open a considerably wider margin in segmentation precision over the state-of-the-art instance segmentation methods. The quantitative results are summarized in Table 6. The input image is scaled to 800 × 800 pixels when training the Yolact model. Specifically, Yolact still performs poorly in segmenting the optical RS images; Instaboost, MS R-CNN, CM R-CNN, and HTC reach progressively higher AP values of 58.7%, 59.4%, 60.7%, and 61.9%. Consistent with the results on PSeg-SSDD, HQ-ISNet achieves the highest (62.7%) AP among the state-of-the-art methods. The proposed EISP and EISP* reach 68.1% and 69.1% AP, respectively. Specifically, EISP exceeds Yolact, Mask R-CNN, and HQ-ISNet by 29.5%, 10.2%, and 5.4% AP in segmenting the optical objects, respectively, and EISP* exceeds them by 30.5%, 11.2%, and 6.4% AP, respectively. Under the IoU-threshold AP indicators, EISP* exceeds HQ-ISNet by 5.1% in AP50 and 6.4% in AP75. Under the scale-differentiated AP indicators, EISP* exceeds HQ-ISNet by 1.0%, 5.8%, and 13.9% in APS, APM, and APL, respectively. At a cost of 202.8G FLOPs, EISP and EISP* gain 5.4% and 6.4% AP, respectively, over HQ-ISNet.
As presented in the right part of Figure 6, the PR curves of EISP and EISP* lie above those of the remaining methods.
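The multiscale training scheme used for EISP* on both datasets can be sketched in a few lines of Python. The size pairs below are copied from the text in the order they are written. The sampling rule (one size drawn uniformly at random per training image) is an assumption, since the text lists the sizes but does not state how they are selected, and the names `TRAIN_SCALES`, `TEST_SCALE`, and `pick_scale` are hypothetical.

```python
import random

# Training sizes as listed in the text; the order of the two numbers in each
# pair is kept as written and not interpreted further here.
TRAIN_SCALES = {
    "PSeg-SSDD": [(512, 448), (512, 480), (512, 512),
                  (512, 544), (512, 576), (512, 608)],
    "NWPU VHR-10": [(1000, 800), (1000, 700), (1000, 600),
                    (1000, 500), (1000, 400)],
}

# A single fixed size is used at test time for each dataset.
TEST_SCALE = {
    "PSeg-SSDD": (512, 512),
    "NWPU VHR-10": (1000, 600),
}


def pick_scale(dataset, training, rng=random):
    """Return the target size for one image.

    Assumption: during training a size is drawn uniformly from the list;
    at test time the fixed size is always returned.
    """
    if training:
        return rng.choice(TRAIN_SCALES[dataset])
    return TEST_SCALE[dataset]


# Example: draw a few training sizes and the test size for the SAR dataset.
rng = random.Random(0)
sampled = [pick_scale("PSeg-SSDD", training=True, rng=rng) for _ in range(3)]
fixed = pick_scale("PSeg-SSDD", training=False)
```

Because the test size is itself one of the training sizes on both datasets, the model sees the test resolution during training while also being exposed to objects at other apparent scales.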
Similar to the procedure on PSeg-SSDD, we visualize the qualitative instance segmentation results in Figure 8. As illustrated in column 1 of Figure 8, state-of-the-art methods encounter difficulties, e.g., missing segmentations (highlighted by orange rectangles), aliasing masks (highlighted by red rectangles), and poorly predicted masks (highlighted by blue rectangles), in segmenting bridges with a large aspect ratio. For densely distributed objects, e.g., the tennis court in column 2, the harbor in column 4, and the basketball court in column 5, state-of-the-art methods tend to produce aliasing masks among the objects. For airplanes with complicated contours, the predicted masks of state-of-the-art methods cannot fit the ground truth masks well. In addition, false alarms occasionally appear in these methods. However, as illustrated in row 9 of Figure 8, our proposed EISP* effectively suppresses these defects and generates well-fitted masks for the objects regardless of category, which supports the effectiveness of the proposed method on optical images. Meanwhile, the false alarms in row 9, column 4 indicate that EISP* can be further improved to cope with densely packed objects.
Figure 8. Qualitative instance segmentation results of the state-of-the-art methods and the proposed EISP* on the NWPU VHR-10 instance segmentation dataset. Row 1 represents the ground truth annotations of the objects; rows 2 to 8 represent the results of the state-of-the-art methods; row 9 shows the results of the proposed EISP*. Note that the red, blue, orange, and purple rectangles denote the aliasing masks of dense objects, poorly segmented masks, missing segmentations, and false alarms, respectively.
The NWPU VHR-10 instance segmentation dataset contains 10 categories of aerial objects. Therefore, we further measure the class-wise instance segmentation results of each method and summarize them in Table 7. Among the categories, the ground track field achieves the highest (93.0%) AP with EISP*, a 7.2% AP improvement over Mask R-CNN; the airplane receives the largest (19.9%) AP improvement over Mask R-CNN (from 27.1% to 47.0%), although an AP of 47.0% still leaves room for improvement. Similarly, as top-down instance segmentation methods are weak at handling a large variance between length and width, the bridge receives the lowest (45.4%) AP with EISP* due to its large aspect ratio. The class-wise instance segmentation results of EISP* are visualized in Figure 9. Consistent with the quantitative results, each category in the NWPU VHR-10 instance segmentation dataset is segmented with well-fitted masks by the proposed EISP*.

Figure 9. Class-wise instance segmentation results of EISP* on the NWPU VHR-10 instance segmentation dataset. Panel titles recoverable from the figure: Airplane, Basketball Court, Ground Track Field, Ship, Storage Tank, Vehicle.

Discussion
Mainstream deep-learning-based methods for interpreting SAR and optical objects adopt the horizontal bounding box or the oriented bounding box, described by four coordinates for location and, for the oriented box, an azimuth coordinate that adjusts the orientation of the predicted result. However, these methods merely interpret the objects as an enclosed rectangular area; the contour and appearance of the objects are lost. In this paper, we adopt the efficient instance segmentation paradigm (EISP) to interpret SAR and optical images in a pixel-wise manner. As illustrated in Figures 5 and 7-9, the predicted masks of EISP* interpret the SAR and optical objects with a fitted boundary, pixel-level category, and mask-aware location. Despite the effectiveness of EISP and EISP* in segmenting SAR and optical images, they still make mistakes on inshore ships in SAR images, e.g., row 9, column 5 of Figure 7, owing to the complex inshore background and grayscale features, and on densely packed objects in optical images, e.g., the aliasing masks in row 9, column 4 of Figure 8. Future work will focus on reducing the signal noise of SAR images and adapting to the characteristics of small SAR ships for segmentation. For optical images, in addition to the densely packed objects, we will focus on segmenting objects with complicated contours, e.g., the airplane, to further improve the segmentation adaptability of the method.
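The difference between a rectangular interpretation and an instance mask can be illustrated with a small Python sketch. It derives the horizontal bounding box from a toy binary mask and measures how much of that box the object actually occupies. The mask, the function names, and the fill-ratio measure are illustrative assumptions and are not part of EISP.

```python
def mask_to_hbb(mask):
    """Horizontal bounding box (x_min, y_min, x_max, y_max) of a binary mask.

    mask is a list of rows holding 0/1 values; returns None for an empty mask.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


def fill_ratio(mask):
    """Fraction of the horizontal bounding box covered by the object."""
    box = mask_to_hbb(mask)
    if box is None:
        return 0.0
    x_min, y_min, x_max, y_max = box
    box_area = (x_max - x_min + 1) * (y_max - y_min + 1)
    mask_area = sum(1 for row in mask for value in row if value)
    return mask_area / box_area


# Toy example: an elongated object lying diagonally, such as a ship at an angle.
toy_mask = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
]
toy_box = mask_to_hbb(toy_mask)
toy_fill = fill_ratio(toy_mask)
```

For this toy object the box can always be recovered from the mask, but the mask cannot be recovered from the box, and more than half of the box area is background. This is the information a rectangular prediction discards.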

Conclusions
In this paper, we proposed an efficient instance segmentation paradigm (EISP) to interpret RS images (including SAR and optical images). Following the top-down instance segmentation formula, EISP adopts the Swin Transformer to construct the hierarchical features of RS images. Then, the region proposal network (RPN) and the region of interest (RoI) extractor generate the region proposals for object detection and mask prediction. Next, the context information flow (CIF) interweaves the semantic features from the bounding box branch to the mask branch. Finally, the confluent loss function is proposed for refining the predicted masks. The following experimental conclusions can be drawn on the PSeg-SSDD and NWPU VHR-10 instance segmentation datasets: (1) Swin-L, CIF, and the confluent loss function in EISP each contribute to the overall instance segmentation performance; (2) EISP* exceeds vanilla Mask R-CNN by 4.2% AP on PSeg-SSDD and by 11.2% AP on the NWPU VHR-10 instance segmentation dataset; (3) poorly segmented masks, false alarms, missing segmentations, and aliasing masks can be avoided to a great extent by EISP* in segmenting RS images; (4) EISP* achieves the highest instance segmentation AP among the compared state-of-the-art instance segmentation methods.