HQ-ISNet: High-Quality Instance Segmentation for Remote Sensing Imagery

Instance segmentation in high-resolution (HR) remote sensing imagery is one of the most challenging tasks and is more difficult than object detection and semantic segmentation. It aims to predict class labels and pixel-wise instance masks to locate instances in an image. However, few methods are currently suitable for instance segmentation in HR remote sensing images, and the complex backgrounds of these images make the task harder still. In this article, a novel instance segmentation approach for HR remote sensing imagery based on Cascade Mask R-CNN is proposed, called the high-quality instance segmentation network (HQ-ISNet). HQ-ISNet exploits an HR feature pyramid network (HRFPN) to fully utilize multi-level feature maps and maintain HR feature maps for instance segmentation of remote sensing images. Next, to refine the mask information flow between mask branches, the instance segmentation network version 2 (ISNetV2) is proposed to further improve mask prediction accuracy. Then, we construct a new, more challenging dataset based on the synthetic aperture radar (SAR) ship detection dataset (SSDD) and the Northwestern Polytechnical University very-high-resolution 10-class geospatial object detection dataset (NWPU VHR-10) for instance segmentation of remote sensing images, which can be used as a benchmark for evaluating instance segmentation algorithms on HR remote sensing images.
Finally, extensive experimental analyses and comparisons on the SSDD and the NWPU VHR-10 dataset show that (1) the HRFPN makes the predicted instance masks more accurate, which can effectively enhance the instance segmentation performance of the high-resolution remote sensing imagery; (2) the ISNetV2 is effective and promotes further improvements in mask prediction accuracy; (3) our proposed framework HQ-ISNet is effective and more accurate for instance segmentation in the remote sensing imagery than the existing algorithms.


Introduction
With the rapid development of imaging technology in the field of remote sensing, high-resolution (HR) remote sensing images are provided by many airborne and spaceborne sensors, for instance, RADARSAT-2, Gaofen-3, TerraSAR-X, Sentinel-1, Ziyuan-3, Gaofen-2 and unmanned aerial vehicles (UAV). These HR images have been applied in the national economy and military fields, such as urban monitoring, ocean monitoring, maritime management, and traffic planning [1][2][3]. In particular, applications such as military precision strike and maritime transport safety tend to take full advantage of HR remote sensing images for object detection and segmentation [3][4][5].
Traditional object detection methods in remote sensing (RS) imagery mainly pay attention to detection results with bounding boxes and rotational bounding boxes, as shown in Figure 1b,c. Cheng et al. [6] proposed an approach to improve the performance of target detection by learning the rotation-invariant CNN (RICNN) model. Ma et al. [7] applied the You Only Look Once (YOLOv3) approach to locate collapsed buildings from remote sensing images after an earthquake. Gong et al. [8] put forward a context-aware convolutional neural network (CA-CNN) method to improve the performance of object detection. Liu et al. [9] proposed a multi-layer abstraction saliency model for airport detection in synthetic aperture radar (SAR) images. Wei et al. [10] proposed an HR ship detection network (HR-SDNet) to perform precise and robust ship detection in SAR images. Deng et al. [11] devised a method to detect multiscale artificial targets in remote sensing images. An et al. [12] proposed DRBox-v2 with rotatable boxes to boost the precision and recall rates of object detection in HR SAR images. Xiao et al. [13] proposed a novel anchor generation algorithm to eliminate the deficiencies in previous anchor-based detectors. However, these detection results with bounding boxes and rotational bounding boxes do not reflect the pixel-level contours of the original targets.
Traditional semantic segmentation methods in remote sensing imagery mainly focus on pixel-level segmentation results. Shahzad et al. [14] used Fully Convolution Neural Networks to automatically detect man-made structures, especially buildings, in very HR SAR images. Chen et al. [15], based on a fully convolutional network (FCN), proposed a symmetrical dense-shortcut FCN (SDFCN) and a symmetrical normal-shortcut FCN (SNFCN) for the semantic segmentation of very HR remote sensing images. Yu et al. [16] proposed an end-to-end semantic segmentation framework that can simultaneously segment multiple ground objects from HR images. Peng et al. [17] proposed dense connection and FCN (DFCN) to automatically acquire fine-grained feature maps of semantic segmentation for HR remote sensing images. Nogueira et al. [18] proposed a novel method based on ConvNets to accomplish semantic segmentation in HR remote sensing images. However, these segmentation results cannot distinguish different instances in each category. Therefore, instance segmentation is introduced into the field of remote sensing.
Instance segmentation in remote sensing (RS) images is a complicated problem and one of the most challenging tasks [3,19]. It aims to predict both the location and the semantic mask of each instance in an image, as shown in Figure 1d. This task is much harder than object detection and semantic segmentation. However, few methods are currently suitable for instance segmentation in RS images, and the complex backgrounds of remote sensing images make instance segmentation on HR RS images even more difficult. Therefore, this paper focuses on a high-quality instance segmentation method for remote sensing images, especially for high-resolution artificial targets.
Many instance segmentation methods that use FPN structures as the backbone network have emerged in computer vision, such as Mask R-CNN [19], Cascade Mask R-CNN [20,21], and Mask Scoring R-CNN [22]. In the remote sensing field, Mou et al. [2] proposed a novel method to perform vehicle instance segmentation of aerial images and videos obtained by UAV. Su et al. [3] introduced precise region of interest (RoI) pooling into Mask R-CNN to solve the loss of accuracy caused by coordinate quantization in optical remote sensing images. However, these methods mostly utilize low-resolution representations or restore high-resolution representations for instance segmentation. Therefore, they are not well suited to pixel-level instance segmentation in HR RS images because of the large loss of spatial resolution. Furthermore, in Cascade Mask R-CNN [20,21], the lack of interactive information flow between the mask branches removes the ability to gradually adjust and enhance the masks between stages. In this article, a novel instance segmentation approach for HR remote sensing imagery based on Cascade Mask R-CNN [20,21] is proposed to address these problems, which we call the high-quality instance segmentation network (HQ-ISNet).
First, the HR feature pyramid network (HRFPN) is introduced into pixel-level instance segmentation in remote sensing images to fully utilize multi-level feature maps and maintain HR feature maps. Next, to refine the mask information flow between mask branches, the instance segmentation network version 2 (ISNetV2) is proposed to further improve mask prediction accuracy. Then, we construct a new, more challenging dataset based on the synthetic aperture radar (SAR) ship detection dataset (SSDD) and the Northwestern Polytechnical University very-high-resolution 10-class geospatial object detection dataset (NWPU VHR-10) for instance segmentation of remote sensing images, which can be used as a benchmark for evaluating instance segmentation algorithms on HR remote sensing images. Finally, the proposed HQ-ISNet is optimized in an end-to-end manner. Extensive experimental analyses and comparisons on the SSDD dataset [23] and the NWPU VHR-10 dataset [3,6] show that the proposed framework is more effective than the existing instance segmentation algorithms on HR remote sensing images.
The main contributions of this article are as follows:
(1) We introduce HRFPN into remote sensing image instance segmentation to fully utilize multi-level feature maps and maintain HR feature maps, so as to solve the problem of spatial resolution loss in FPN;
(2) We design an ISNetV2 to refine the mask information flow between mask branches, thereby improving mask prediction accuracy;
(3) We construct a new, more challenging dataset based on the SSDD and the NWPU VHR-10 dataset for instance segmentation of remote sensing images, which can be used as a benchmark for evaluating instance segmentation algorithms on HR remote sensing images. In addition, we provide a study baseline for instance segmentation in remote sensing images;
(4) Most importantly, we are the first to perform instance segmentation in SAR images.
The organization of this paper is as follows. Section 2 reviews related work on object detection and instance segmentation. Section 3 presents our instance segmentation approach. Section 4 describes the experiments, including the dataset description, evaluation metrics, experimental analysis, and experimental results. Section 5 discusses the impact of the dataset. Section 6 concludes the paper.

Object Detection
Object detection needs to both declare the existence of a target belonging to a specified category and locate it in the image with a bounding box. Existing object detectors can be roughly split into two categories. The first is one-stage object detectors, which perform object detection without proposals, such as YOLO v1-v3 [24][25][26] and the Single Shot MultiBox Detector (SSD) [27]. Fu et al. [28] put forward a Deconvolutional SSD (DSSD), which introduces additional context into the SSD to enhance detection performance. Li et al. [29] proposed Feature Fusion SSD (FSSD), which uses a feature fusion module to enhance the detection performance. Lin et al. [30] put forward RetinaNet, which utilizes Focal Loss to address the class imbalance problem. The second is two-stage object detectors, which generate proposals and then make predictions for these proposals, such as Regions with convolutional neural network features (R-CNN) [31], Fast R-CNN [32], and Faster R-CNN [33]. Lin et al. [34] proposed a feature pyramid network (FPN) to utilize multi-level features. For high-quality object detection, Cai et al. [35] proposed Cascade R-CNN, which consists of a series of detectors trained with increasing IoU thresholds. In short, compared with the one-stage detector, the two-stage detector has more accurate positioning and higher target recognition accuracy, but the one-stage detector has faster inference speed.

Instance Segmentation
Instance segmentation aims to predict both the location and the semantic mask of each instance in an image. This task is much harder than object detection and semantic segmentation. At present, the existing instance segmentation approaches can be broadly split into two categories. (1) Detection-based methods first detect objects and then perform segmentation within each bounding box. He et al. [19] proposed Mask R-CNN, which adds a parallel mask branch to Faster R-CNN to predict instance masks at the pixel level. Mask R-CNN is shown in Figure 2. Liu et al. [36] put forward the Path Aggregation Network (PANet), which boosts the information flow by adding a bottom-up path beyond FPN. Chen et al. [37] put forward MaskLab, which utilizes position-sensitive scores to acquire better segmentation results. Chen et al. [20] proposed a Hybrid Task Cascade to improve instance segmentation performance by adding semantic segmentation branches and training them together with the other branches. Huang et al. [22] proposed Mask Scoring R-CNN, which addresses the problem of scoring masks to improve the quality of the predicted instance masks. (2) Segmentation-based methods first obtain a pixel-level segmentation map of the entire image and then recognize target instances. Liang et al. [38] proposed the Proposal-Free Network (PFN) for instance-level object segmentation. Bai et al. [39] combined watershed algorithms and deep learning methods to generate image energy maps to perform instance segmentation.
In this paper, we follow the research line based on detection methods and further study the instance segmentation for remote sensing imagery.

The Methods
The proposed network will be described in detail in this section.

Detailed Description of the HQ-ISNet
The framework of HQ-ISNet, based on Cascade Mask R-CNN [21], is shown in Figure 3. First, an HR feature pyramid network (HRFPN) replaces the original FPN to fully utilize multi-level feature maps; next, the candidate proposals are generated by the RPN; finally, an instance segmentation network version 2 (ISNetV2) is used to refine the original mask branches and is executed to obtain the final instance segmentation results. In this section, we present our proposed instance segmentation approach in detail.
Figure 3. Illustration of the high-quality instance segmentation network (HQ-ISNet) approach where "HRFPN" indicates a backbone network; "RPN" indicates the proposals; "Cs" indicates the classification; "M" denotes the mask branch; "B" represents the bounding box; "H" denotes the detection head; "pool" means region feature extraction.

Backbone Network and RPN
Currently, most instance segmentation methods use FPN structures as the backbone network, such as Mask R-CNN [19]. However, these methods mostly utilize low-resolution representations or restore high-resolution representations for instance segmentation, resulting in a huge loss of spatial resolution. To solve this problem, we urgently need a backbone network that can maintain a high resolution.
Recently, the HRFPN has achieved promising results for region-level ship detection in both inshore and offshore areas of SAR images [10]. The HRFPN maintains HR feature maps throughout by connecting parallel high-to-low resolution convolutions and repeatedly exchanging information between the multi-resolution representations. In addition, FPN is a serial connection, whereas HRFPN is a parallel connection. Hence, compared with FPN, the final feature maps are semantically richer and spatially more accurate. To fully utilize multi-level feature maps and maintain HR feature maps, we introduce the HRFPN into pixel-level instance segmentation in remote sensing imagery.
As in [10], the framework of the HRFPN consists of four stages of parallel convolution streams and an HRFPN block. A detailed description of the four-phase parallel convolutional flow can be found in the literature [10,40,41].
The detailed description of the HRFPN block is shown in Figure 4. First, we denote the four outputs from high to low resolution as $\{C_2, C_3, C_4, C_5\}$. The channel dimension of each feature map is reduced via a $1 \times 1$ convolutional layer. The number of output channels of HRFPN is set to 256.
In the HRFPN block, $\mathrm{Conv}_{1 \times 1}$ and $\mathrm{Conv}_{3 \times 3}$ indicate a $1 \times 1$ convolution layer and a $3 \times 3$ convolution layer, respectively; Upsample indicates bilinear up-sampling followed by a $1 \times 1$ convolution; Downsample indicates a $3 \times 3$ convolution layer with a stride of 2; and the remaining operator indicates concatenation.
Furthermore, the candidate proposals are generated by the region proposal network (RPN) [33,34]. Specifically, each HRFPN output $P_i$ generates candidate proposals through a $3 \times 3$ convolution and two sibling $1 \times 1$ convolutions, as shown on the right side of Figure 4. The RPN relies on anchors. Following the literature [10,33,34,35], the anchor areas are set per level on $\{P_2, P_3, P_4, P_5, P_6\}$, respectively, where $P_6$ is obtained via a $3 \times 3$ convolutional layer with a stride of 2 on $P_5$. Anchors of multiple aspect ratios are used.

Instance Segmentation Network
Cai et al. [21] proposed a multi-stage architecture for object detection and instance segmentation called Cascade Mask R-CNN, which achieves promising results due to the adaptive handling of training distributions and progressive refinement of predictions. Therefore, we will implement our instance segmentation method based on Cascade Mask R-CNN to perform high-quality instance segmentation.
Cascade Mask R-CNN is obtained by direct hybridization of Cascade R-CNN and Mask R-CNN. In this implementation, each stage is similar to Mask R-CNN [19], with a mask branch, a class branch, and a box branch. The current stage will accept RPN or the box returned by the previous stage as an input, and then predict the new box and mask. For the convenience of description, we refer to the instance segmentation part in Cascade Mask R-CNN as ISNetV1, as illustrated in Figure 5a.
In ISNetV1, RoIAlign [19] is used to extract regional features from the proposals generated by the RPN or by the bounding box regression of the previous stage. Specifically, all proposals are resized to $7 \times 7$ and $14 \times 14$ by RoIAlign for the box branch and the mask branch, respectively [19,21]. As shown in Figure 5, the intersection over union (IoU) thresholds of the three detection heads are 0.5, 0.6, and 0.7, and the predictions of each stage are fed into the next stage to obtain high-quality prediction results. The detection heads in ISNetV1 have the same architecture [21]. Besides, the box branches and the class branches are consistent with the literature [10,21]. The mask branch is a small fully convolutional network (FCN) applied to each region of interest (RoI), predicting an instance mask in a pixel-to-pixel manner. Moreover, the mask branch generates small feature maps ($28 \times 28$) from each proposal through four $3 \times 3$ convolutional layers and one deconvolutional layer [19]. Finally, ISNetV1 is executed to obtain the final instance segmentation results.
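To make the role of the increasing IoU thresholds concrete, the following is a minimal sketch of how a proposal could be labeled positive or negative at each of the three stages. The helper names (`box_iou`, `positive_stages`) and the `>=` comparison are assumptions made for illustration; in the actual cascade each stage receives boxes refined by the previous stage, whereas this sketch applies all three thresholds to one fixed proposal.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# IoU thresholds of the three detection heads, as stated in the text.
STAGE_IOU_THRESHOLDS = (0.5, 0.6, 0.7)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positive_stages(proposal: Box, gt: Box) -> List[bool]:
    """For each stage, report whether the proposal would count as a positive sample.

    Assumption: a proposal is positive when its IoU with the ground truth
    is at least the stage threshold.
    """
    iou = box_iou(proposal, gt)
    return [iou >= t for t in STAGE_IOU_THRESHOLDS]
```

A proposal that overlaps the ground truth with an IoU of 0.65 is a positive for the first two heads only, which illustrates why later stages are trained on progressively better-localized samples.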
Although the ISNetV1 has achieved good results, the mask prediction performance can still be improved. As can be seen in Figure 5a, the three mask branches of ISNetV1 lack direct information flow. The instance mask prediction of each stage completely depends on the bounding box regression of the previous stage and the RoI features of the current stage, without any connection with the mask branch of the previous stage. Specifically, the mask branches of multiple stages are more like training with the data of different distributions and then the ensemble during testing, rather than playing the role of gradual adjustment and enhancement between stages, which will prevent further improvements in mask prediction precision. In Cascade R-CNN [35], the information flow of the box branch is to make the features and learning goals of the next stage relevant to the current stage, that is, to gradually improve the prediction between different stages.
To address this problem, we follow similar principles in Cascade R-CNN [35], adding a connection between the mask branches of adjacent stages to provide the information flow of the mask branches, as illustrated in Figure 5b. Specifically, the mask features from the previous stage are provided to the current stage to facilitate further interaction of the information flow. The optimized network is called ISNetV2, as illustrated in Figure 5b.
In ISNetV2, the mask branch $M_i$ is a small FCN, which consists of four consecutive $3 \times 3$ convolutional layers and one deconvolutional layer, as shown in Figure 6. The features of $M_i$ are subjected to feature embedding through a $1 \times 1$ convolution and then input to $M_{i+1}$. In other words, the feature maps before the deconvolutional layer are embedded with a $1 \times 1$ convolutional layer to align with the merged backbone features of the RoI. Lastly, the result is added to the RoI features of the next stage through an element-wise sum. The rest of ISNetV2 is consistent with ISNetV1. ISNetV2 uses the introduced bridge to let adjacent mask branches interact directly, instead of keeping the mask features separate, which further improves mask prediction accuracy.
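The information flow described above can be written compactly as follows. The notation is introduced here only for illustration and does not appear in the original text: $x_i$ is the RoI feature of stage $i$, $\mathcal{C}_i$ denotes the four $3 \times 3$ convolutional layers of $M_i$, $F_i$ is the feature map of $M_i$ before the deconvolutional layer, $\mathcal{E}_i$ is the $1 \times 1$ embedding convolution, and $m_i$ is the output of stage $i$ after the deconvolutional layer.

```latex
\begin{aligned}
F_1     &= \mathcal{C}_1(x_1), \\
F_{i+1} &= \mathcal{C}_{i+1}\bigl(x_{i+1} + \mathcal{E}_i(F_i)\bigr), \qquad i = 1, 2, \\
m_i     &= \mathrm{Deconv}_i(F_i), \qquad i = 1, 2, 3.
\end{aligned}
```

Under this reading, ISNetV1 corresponds to dropping the $\mathcal{E}_i(F_i)$ term, so that each $F_{i+1}$ depends only on $x_{i+1}$.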

Figure 6. The architecture of three-stage mask branches.

Loss Function
For an image, during training, the multi-task loss function is as follows [19,20,21,32,33,35]:

$$L = R_{cls} + R_{box} + R_{mask}$$

where $R_{cls}$, $R_{box}$, and $R_{mask}$ represent the classification loss, the regression loss, and the segmentation loss, respectively.
The bounding box regression loss $R_{box}$ is defined as in [21,35] over pairs of ground-truth and predicted bounding boxes, where $g = (g_x, g_y, g_w, g_h)$ denotes the ground-truth bounding box and the predicted bounding box is parameterized in the same way. As in [32,33], the per-box loss is the smooth $L_1$ loss. In addition, the regression targets need to be normalized [32,33,35].
The classification loss $R_{cls}$ is defined in terms of the cross-entropy loss $L_{cls}(p, y) = -\log p_y$, where $y$ is the class label and $p$ is a discrete probability distribution over the $M + 1$ categories.
The mask branch has a $K \times m \times m$ dimensional output for each RoI, which encodes $K$ binary masks of resolution $m \times m$, one for each of the $K$ classes. The segmentation risk $R_{mask}$ is defined in terms of $L_{mask}$, the binary cross-entropy loss of Mask R-CNN [19].
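As a rough illustration of the three loss terms, the sketch below implements a smooth L1 box penalty in its standard Fast R-CNN form, a cross-entropy classification loss, and a mean binary cross-entropy mask loss in plain Python. It assumes the three terms are simply summed, as in Mask R-CNN; the function names, the flat-list inputs, and the clamping constant `eps` are illustrative choices, not the authors' implementation.

```python
import math
from typing import Sequence


def smooth_l1(x: float) -> float:
    """Smooth L1 penalty on one coordinate difference (standard Fast R-CNN form)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def box_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of smooth L1 terms over the four (already normalized) box coordinates."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def cls_loss(p: Sequence[float], y: int) -> float:
    """Cross-entropy for a discrete distribution p over M + 1 classes and label y."""
    return -math.log(p[y])


def mask_loss(pred: Sequence[float], target: Sequence[int], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over the m * m pixels of one class mask."""
    total = 0.0
    for q, t in zip(pred, target):
        q = min(max(q, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(q) + (1 - t) * math.log(1.0 - q))
    return total / len(pred)


def total_loss(r_cls: float, r_box: float, r_mask: float) -> float:
    """Assumed unweighted sum of the three terms."""
    return r_cls + r_box + r_mask
```

In Mask R-CNN, only the mask of the ground-truth class contributes to the mask loss, which is why `mask_loss` takes a single flattened mask rather than all $K$ of them.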

Experiments
In this section, the instance segmentation approaches will be evaluated in high-resolution remote sensing imagery.

Dataset Description
Two datasets are used in our experiments: the SSDD dataset and the NWPU VHR-10 dataset. The instance masks for both datasets have been released at https://github.com/chaozhong2010/VHR-10_dataset_coco.

The SSDD Dataset
The SSDD dataset [23] includes 1160 SAR images with resolutions ranging from 1 to 15 m and a total of 2540 ships. We mark the instance masks directly on the SSDD dataset, using the open source LabelMe [42] project on GitHub to annotate these SAR images. Then, LabelMe converts the annotations into the COCO JSON format. The SAR image annotation process is shown in Figure 7. In all experiments, the dataset is randomly split into a training set of 70% (812 images) and a test set of 30% (348 images).

The NWPU VHR-10 Dataset
The experiments also use the NWPU VHR-10 dataset [3,6], which is a challenging ten-class geospatial object detection dataset. The positive image set in the dataset contains a total of 650 high-resolution optical remote sensing images with resolutions ranging from 0.08 to 2 m. These images were acquired from Google Earth and Vaihingen data. Su et al. [3] manually annotated the ten classes of objects in these optical remote sensing images with instance masks. Then, LabelMe converts the annotations into the COCO JSON format. Some examples of images and the corresponding annotated instance masks are shown in Figure 8. In all experiments, the dataset is randomly split into a training set of 70% (455 images) and a test set of 30% (195 images).
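The random 70/30 split used for both datasets can be sketched as below. The function name `split_dataset`, the fixed `seed`, and the file-name lists are hypothetical and serve only to make the split reproducible; integer arithmetic is used so that the split sizes match the image counts stated above.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str], train_tenths: int = 7, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the items and split them into train and test subsets.

    train_tenths=7 gives a 70% / 30% split; integer division avoids
    floating-point rounding when computing the training-set size.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_tenths // 10
    return shuffled[:n_train], shuffled[n_train:]
```

For example, 1160 SSDD images yield 812 training and 348 test images, and 650 NWPU VHR-10 images yield 455 and 195.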

Evaluation Metrics
For the instance segmentation of remote sensing imagery, the intersection over union (IoU) is the overlap rate between the ground-truth mask and the predicted mask. The IoU is calculated as follows:

$$\mathrm{IoU} = \frac{|M_p \cap M_g|}{|M_p \cup M_g|}$$

where $M_p$ represents the predicted mask and $M_g$ denotes the ground-truth mask.
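A minimal sketch of the mask IoU on binary masks stored as nested lists is given below. The `Mask` alias and the convention of returning 0.0 when both masks are empty are assumptions made for this example.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixel values


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Pixel-wise intersection over union of two binary masks of equal size."""
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks are given an IoU of 0.0.
    return inter / union if union else 0.0
```

Two masks that share 2 pixels out of 6 covered in total have an IoU of 1/3, so such a prediction would not count as a match even at the loosest threshold of 0.50.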
The performance of the instance segmentation methods in remote sensing images is quantitatively and comprehensively evaluated by the standard COCO [43] metrics. These metrics include average precision (AP), AP50, AP75, APS, APM, and APL [43]. The average precision (AP) is averaged across all 10 IoU thresholds (0.50:0.05:0.95) and all categories. Averaging over IoUs rewards detectors with better localization. A larger AP value indicates more accurate predicted instance masks and better instance segmentation performance. AP50 is computed at an IoU threshold of 0.50; AP75 is a stricter metric, computed at an IoU threshold of 0.75. Therefore, AP75 is a better indicator of mask accuracy than AP50, and a greater AP75 value indicates more accurate instance masks. APL is set for large targets (area > $96^2$); APM is set for medium targets ($32^2$ < area < $96^2$); APS is set for small targets (area < $32^2$).
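The threshold grid and the size buckets above can be sketched as follows. The function names are hypothetical, and the handling of areas exactly equal to $32^2$ or $96^2$ is an assumption, since the text leaves the boundary cases open.

```python
from typing import List, Sequence


def coco_iou_thresholds() -> List[float]:
    """The 10 IoU thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def size_bucket(area: float) -> str:
    """Assign a target to the small / medium / large bucket by pixel area.

    Assumption: an area exactly on a boundary goes to the larger bucket.
    """
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def average_over_thresholds(ap_per_threshold: Sequence[float]) -> float:
    """AP as the mean of the per-threshold average precisions."""
    return sum(ap_per_threshold) / len(ap_per_threshold)
```

AP50 and AP75 correspond to the first and sixth entries of the threshold list.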

Implementation Details
All the experiments are implemented with PyTorch and MMDetection [44]. The operating system is Ubuntu 16.04. A single GTX-1080Ti GPU is used to train and test the detectors.
For HQ-ISNet, Hybrid Task Cascade, and Cascade Mask R-CNN, we use a single GPU to train the model for 20 epochs [20,21,41,44]. The initial learning rate (LR) is set to 0.0025 for these methods and is then multiplied by 0.1 after 16 and after 19 epochs. The batch size is set to two images. We train Faster R-CNN, Mask R-CNN, Cascade R-CNN, and Mask Scoring R-CNN with a batch size of 2 for 12 epochs [20,21,41,44]. The initial learning rate is set to 0.0025 for these methods and is then multiplied by 0.1 after 8 and after 11 epochs. SGD is used to optimize the entire model, with a momentum of 0.9 and a weight decay of 0.0001. The input images are resized to 1000 px along the long axis and 600 px along the short axis by bilinear interpolation. The overall framework is optimized in an end-to-end manner. All other hyper-parameters follow the literature [10,19,20,21,22,33,35,44].
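The learning-rate schedule can be illustrated with a small step-decay function. This sketch assumes that each reduction multiplies the current rate by 0.1, which is the usual step-decay convention; `step_lr` and its parameter names are hypothetical and not taken from any library.

```python
from typing import Sequence


def step_lr(
    epochs_done: int,
    base_lr: float = 0.0025,
    steps: Sequence[int] = (16, 19),
    gamma: float = 0.1,
) -> float:
    """Learning rate used once `epochs_done` epochs have been completed.

    The rate is multiplied by `gamma` each time a milestone in `steps`
    has been reached.
    """
    n_decays = sum(1 for s in steps if epochs_done >= s)
    return base_lr * gamma ** n_decays
```

With the defaults, the rate is 0.0025 for the first 16 epochs, 0.00025 for the next three, and 0.000025 for the final epoch of the 20-epoch schedule; passing `steps=(8, 11)` gives the 12-epoch schedule.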

Results of the HQ-ISNet
The instance segmentation outcomes of the proposed approach in SAR images and remote sensing optical images are shown in Figure 9, where (a) and (c) are the ground-truth masks and (b) and (d) are the predicted instance results. As can be seen in Figure 9, HQ-ISNet is suitable for instance segmentation in HR remote sensing images. HQ-ISNet has almost no missed detections or false alarms, which gives the mask branch reliable regions in which to perform instance segmentation. The artificial targets are correctly detected and segmented, and the segmentation results of HQ-ISNet are very close to the ground truth.
To further test our network, we performed experiments on a SAR image of the port of Houston, obtained with the Sentinel-1B [47] sensor. The parameters are as follows: the resolution is 3 m, the polarization is HH, and the imaging mode is S3-StripMap. The image was annotated according to the labeling methods and principles in Section 4.1.1. As can be seen from Figure 10, HQ-ISNet successfully completed the instance segmentation task in the SAR image. Our results have almost no missed ships or false alarms, and the segmentation results are also very close to the ground truth.
From Table 1 and Table 2, we can see that HQ-ISNet, based on the ISNetV2 module and the HRFPN backbone, has the best instance segmentation performance. It achieves 67.4% and 67.2% AP on the SSDD dataset and NWPU VHR-10 dataset, respectively. More specifically, with the help of the HRFPN, our network achieves a 2.1% and 5.3% improvement in AP on the SSDD dataset and NWPU VHR-10 dataset, respectively. With the help of the ISNetV2, our network achieves a 1.3% and 4.8% improvement in AP on the SSDD dataset and NWPU VHR-10 dataset, respectively. Moreover, for the AP75 score, our network achieves a gain of 1% and 1.6% on the SSDD dataset with ISNetV2 and HRFPN, respectively, and a gain of 5.4% and 6.1% on the NWPU VHR-10 dataset with ISNetV2 and HRFPN, respectively. The AP50 has also increased on both datasets. Besides this, the APS, APM, and APL values have also improved on both datasets, except that the APL on SSDD declines under the influence of HRFPN. We will discuss this in Section 5.
The HRFPN can maintain a high resolution and solve the problem of spatial resolution loss in FPN, and the ISNetV2 refines the mask information flow between mask branches. Accordingly, the final feature maps are semantically richer and spatially more accurate, and the final predicted instance mask is also more accurate. In addition, there are only ships in SSDD, whereas the ten target categories in NWPU VHR-10 include ship, harbor, ground track field, basketball court, etc. The NWPU VHR-10 dataset, with its many complex targets, needs this rich semantic and spatial information, so its improvement is the most obvious. The results reveal that the HRFPN and ISNetV2 modules can effectively improve instance segmentation performance in remote sensing images.

Effect of HRFPN
The comparison of the outcomes of HRFPN and FPN in SAR images and remote sensing optical images is displayed in Figure 11. Mask R-CNN is used as a strong baseline for both our approach and the comparison approach. Compared with FPN, the segmentation results of HRFPN are closer to the ground-truth mask, and the instance masks of HRFPN are more accurate. It is worth noting that the instance segmentation performance of the HRFPN is better than that of the original FPN for high-resolution remote sensing imagery.
From Table 3 and Table 4, we can see that the HRFPN is more efficient than FPN in the Mask R-CNN framework for instance segmentation, with lower computational complexity and fewer parameters. The AP value is 66.0% on the SSDD dataset, a performance improvement of nearly 1.5% compared to FPN. In addition, the AP value is 60.7% on the NWPU VHR-10 dataset, a performance improvement of nearly 3.3% compared to FPN. This suggests that our approach acquires more accurate instance masks and improves the instance segmentation performance. The AP50 and AP75 scores are 96.2% and 85.0% on the SSDD dataset, which is a 0.5% and 2.4% improvement over the FPN, respectively. Moreover, the AP50 and AP75 scores are 92.7% and 65.5% on the NWPU VHR-10 dataset, which is a 1% and 2.7% improvement over the FPN, respectively. We find that AP75 improves more than AP50 on both datasets. Under the looser metric, AP50, our method may already approach the best achievable performance on the two datasets, so the improvement is not significant. However, under the more stringent metric, AP75, our method improves considerably. Therefore, the predicted instance masks are more accurate. Besides, a performance improvement is obtained for small ships (APS) on the SSDD dataset, and APM maintains the original performance. We discuss APL in Section 5. Importantly, the APS, APM, and APL scores have been greatly increased on the NWPU VHR-10 dataset. In particular, the performance improvement for small targets is the most obvious, with a gain of nearly 6.2%.
Furthermore, we find that HRFPN improves the NWPU VHR-10 dataset more significantly than the SSDD dataset. There are only ships in SSDD, and ten target categories in NWPU VHR-10 involve ship, harbor, ground track field, basketball court, etc. Our HRFPN can maintain a high resolution and solve the problem of spatial resolution loss in FPN. Hence, the final feature maps are semantically richer and spatially more accurate compared with FPN. The NWPU VHR-10 dataset with many complex targets needs this rich semantic and spatial information, so its improvement is the most obvious, especially for small targets. In short, HRFPN can effectively improve instance segmentation performance in remote sensing images, with less computational complexity and smaller parameters. In the HRFPN structure, the HRFPN-W40 achieves a 66.0% AP score on the SSDD dataset and a 60.7% AP score on the NWPU VHR-10 dataset, which is improved compared with HRFPN-W18 and HRFPN-W32, but it also increases computational complexity and the parameters.
In conclusion, the HRFPN, which fully utilizes multi-level feature maps and can maintain HR feature maps, can make the predicted instance masks more accurate and effectively improve the instance segmentation performance for the HR remote sensing images.

Table 4. Influence of the HRFPN on the NWPU VHR-10 dataset. "R-50" indicates ResNet-50; "R-101" represents ResNet-101.

Figure 12 compares the results of ISNetV1 and ISNetV2 on SAR images and optical remote sensing images. Cascade Mask R-CNN serves as a strong baseline for both our approach and the comparison approach. Compared with ISNetV1, the segmentation result of ISNetV2 is closer to the ground-truth mask, and ISNetV2 is more accurate in mask segmentation. The instance segmentation performance of ISNetV2 is clearly better than that of the original ISNetV1 on high-resolution remote sensing imagery, especially for high-resolution artificial targets. Tables 5 and 6 show that ISNetV2 is more efficient than ISNetV1 in the Cascade Mask R-CNN framework for instance segmentation, with a similar number of parameters and computational cost. With ISNetV2, Cascade Mask R-CNN performs better on the SSDD dataset, improving AP by nearly 1.4%. In addition, the AP on the NWPU VHR-10 dataset improves by nearly 4.8% over ISNetV1. This suggests that ISNetV2 refines the mask information flow between the mask branches, which promotes further improvements in mask prediction accuracy. Compared with ISNetV1, the AP50 and AP75 scores on the SSDD dataset gain 1.3% and 1%, respectively. Moreover, the AP50 and AP75 scores on the NWPU VHR-10 dataset gain 2.2% and 5.8%, respectively. The results reveal that the predicted instance mask is more accurate.
Importantly, a large performance improvement is obtained for medium ships (APM) and large ships (APL) on the SSDD dataset, and APS also improves. In addition, the APS, APM, and APL scores increase greatly on the NWPU VHR-10 dataset. As a result, the instance segmentation performance is remarkably enhanced for small, medium and large targets. Furthermore, we find that ISNetV2 improves performance on the NWPU VHR-10 dataset more significantly than on the SSDD dataset. SSDD contains only ships, whereas the ten target categories in NWPU VHR-10 include ship, harbor, ground track field, basketball court, etc. When the IoU is calculated, a larger target occupies more pixels, so an inaccurate prediction has a greater impact on the quantitative result, and small changes in the mask have a large effect on the score. Our ISNetV2 improves the mask information flow and makes the predicted results more accurate. Therefore, the NWPU VHR-10 dataset, which has larger targets, shows the most significant improvement.

Table 5. Influence of the ISNetV2 on the SSDD dataset. "R-50" indicates ResNet-50; "R-101" represents ResNet-101. " " means use ISNetV2 and "-" means use ISNetV1.

In summary, the information flow between the mask branches is refined, which promotes further improvements in mask prediction accuracy. Therefore, ISNetV2 can effectively improve instance segmentation performance in the HR remote sensing imagery.

Table 6. Influence of the ISNetV2 on the NWPU VHR-10 dataset. "R-50" indicates ResNet-50; "R-101" represents ResNet-101. " " means use ISNetV2 and "-" means use ISNetV1.

Comparison with other approaches
The qualitative results of the HQ-ISNet and the comparison methods on the SSDD dataset and the NWPU VHR-10 dataset are displayed in Figures 13 and 14 to further validate the instance segmentation performance. Row 1 is ground-truth mask; Row 2-4 are the outcomes of

As shown in Figures 13 and 14, compared with other instance segmentation methods, our approach can accurately detect and segment artificial targets in multiple remote sensing scenes. Specifically, these artificial targets are accurately covered by the predicted instance masks. HQ-ISNet has almost no missed detections or false alarms, which ensures that our mask branch performs better instance segmentation. Compared with bounding box detection, such as Faster R-CNN and Cascade R-CNN, the results of instance segmentation are closer to the silhouette of the original targets. Instance segmentation can also distinguish between different instances of the same category. The ships in Figure 13 are distinguished by different colors, as are the targets, such as airplanes, in Figure 14. Furthermore, compared with other instance segmentation methods, our approach not only has almost no missed targets or false alarms but also produces better mask segmentation results. The results on the SSDD dataset and the NWPU VHR-10 dataset imply that our method is suitable for the instance segmentation task in HR remote sensing images and has a better mask segmentation performance than the other instance segmentation algorithms.

In Tables 7 and 8, we compare the HQ-ISNet based on ISNetV2 and HRFPN with other advanced approaches on the SSDD dataset and the NWPU VHR-10 dataset to quantitatively evaluate the instance segmentation performance. These methods include Mask R-CNN [19], Mask Scoring R-CNN [22], Cascade Mask R-CNN [21] and Hybrid Task Cascade (HTC) [20] based on ResNet-FPN [45,46]. As can be observed from Table 7, the HQ-ISNet achieves the highest AP of 67.4%.
Compared with Mask R-CNN and Mask Scoring R-CNN, the HQ-ISNet achieves 2.9% and 2.6% improvements, respectively. Besides, the HQ-ISNet achieves a gain of 2.3% over Cascade Mask R-CNN. In short, compared with other instance segmentation algorithms on the SSDD dataset, our approach has a better instance segmentation performance and more accurate predicted instance masks. Moreover, the AP50 score of HQ-ISNet is 96.4%, which is a 0.7% improvement over Mask R-CNN, a 1.4% gain over Mask Scoring R-CNN, and a 1.6% improvement over Cascade Mask R-CNN. The HQ-ISNet attains an 85.8% AP75, which is a gain of 3.2% over Mask R-CNN, 3.4% over Mask Scoring R-CNN, and 2.4% over Cascade Mask R-CNN. These results indicate that the mask segmentation of HQ-ISNet is better and more precise than that of the other advanced approaches for instance segmentation on the SSDD dataset. The performance on small, medium, and large targets also improves on the SSDD dataset according to APS, APM, and APL. Across the AP indicators, our performance is comparable to that of HTC on the SSDD dataset, and on some indicators, such as AP, we exceed it.
As can be observed from Table 8, the HQ-ISNet attains a 67.2% AP, which is a gain of 9.8% over Mask R-CNN, 8.4% over Mask Scoring R-CNN, and 6.9% over Cascade Mask R-CNN. In short, compared with other instance segmentation methods on the NWPU VHR-10 dataset, our approach has a better instance segmentation performance and more accurate predicted instance masks. Moreover, the AP50 score of HQ-ISNet is 94.6%, which is a 2.9% improvement over Mask R-CNN, a 3.3% gain over Mask Scoring R-CNN, and a 2.3% improvement over Cascade Mask R-CNN. The HQ-ISNet attains a 74.2% AP75, which is a gain of 11.4% over Mask R-CNN, 9.3% over Mask Scoring R-CNN, and 7.6% over Cascade Mask R-CNN. These results indicate that the mask segmentation of HQ-ISNet is better and more precise than that of the other advanced approaches for instance segmentation on the NWPU VHR-10 dataset. The performance on small, medium, and large targets also improves greatly on the NWPU VHR-10 dataset according to the APS, APM, and APL scores. Across the AP indicators, our performance is comparable to that of HTC on the NWPU VHR-10 dataset, and on some indicators, such as AP, we exceed it.
Furthermore, the performance gain on the NWPU VHR-10 dataset is greater than that on the SSDD dataset. As in the analysis of HRFPN and ISNetV2 in Sections 4.4.2 and 4.4.3, the gain on NWPU VHR-10 is larger because of the influence of target type, target size distribution, etc. In conclusion, our HRFPN can maintain a high resolution and solve the problem of spatial resolution loss in FPN, and our ISNetV2 improves the mask information flow. The final feature maps are semantically richer and spatially more accurate. The final predicted instance mask is also more accurate. Consequently, we can conclude that the HRFPN and ISNetV2 modules can effectively improve instance segmentation performance in remote sensing images.

It can be seen from Tables 7 and 8 that HQ-ISNet performs the best overall, with a lighter computation cost and fewer parameters. Besides, our models perform better than Mask R-CNN and Mask Scoring R-CNN with a similar model size and computational complexity. Compared with Cascade Mask R-CNN, our models perform better with less computational cost and a smaller model size. Additionally, the HQ-ISNet performs similarly to the Hybrid Task Cascade under the same model size, but with less runtime. Therefore, our network is more efficient and practical than other advanced approaches in terms of model size and computational complexity.
In [20], HTC introduced semantic segmentation into the instance segmentation framework to obtain a better spatial context. Because semantic segmentation requires fine pixel-level classification of the whole image, it is characterized by strong spatial position information and strong discrimination ability for the foreground and background. By reusing the semantic information of this branch into the box and mask branches, the performance of these two branches can be greatly improved. However, to achieve this function, HTC needs a separate semantic segmentation label to supervise the training of semantic segmentation branches, which is difficult to implement without annotations. Therefore, under the same model size, we achieve a similar performance compared to HTC, but our method runs for a shorter time and is easier to implement.
In summary, compared with other advanced approaches, our network acquires more accurate instance masks and improves the instance segmentation performance in HR remote sensing imagery. There are two main reasons for this. One is that HRFPN fully utilizes multi-level feature maps and can maintain HR feature maps. The other is that ISNetV2 refines the mask information flow between the mask branches.

Discussion
We found that the APL metric on the SSDD dataset fluctuated greatly, so we counted the number of target instances in SSDD according to the definition of large (area > $96^2$), medium ($32^2$ < area < $96^2$) and small (area < $32^2$) targets in Section 4.2. As can be observed in Figure 15, ship instances are mainly concentrated in the small and medium target ranges, and APL fluctuates greatly because there are too few large ships in SSDD. According to the AP calculation formula [10], a small number of missed detections and false alarms will cause huge changes in the APL value, because our instance segmentation method relies on detection performance. In NWPU VHR-10, the target instances are mainly concentrated in the large and medium target ranges, but its number of small targets is still significantly larger than the number of large targets in SSDD.
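The size-based counting described above can be sketched as follows. This is an illustration only, not the script used to produce Figure 15. The names `size_bucket` and `count_by_size` and the example areas are hypothetical, and assigning areas exactly equal to $32^2$ or $96^2$ to the medium bucket is an assumption, since the definition above leaves the boundaries open.

```python
SMALL_MAX_AREA = 32 ** 2   # areas below this are "small"
LARGE_MIN_AREA = 96 ** 2   # areas above this are "large"


def size_bucket(area):
    """Map an instance mask area (in pixels) to small / medium / large."""
    if area < SMALL_MAX_AREA:
        return "small"
    if area > LARGE_MIN_AREA:
        return "large"
    # Assumption: boundary values (exactly 32^2 or 96^2) fall in "medium".
    return "medium"


def count_by_size(areas):
    """Count how many instances fall into each size bucket."""
    counts = {"small": 0, "medium": 0, "large": 0}
    for area in areas:
        counts[size_bucket(area)] += 1
    return counts


# Hypothetical instance areas in pixels, not taken from SSDD or NWPU VHR-10.
example_areas = [500, 1500, 9000, 10000]
example_counts = count_by_size(example_areas)
```

When one bucket holds only a handful of instances, a single missed or spurious detection changes the matched fraction for that bucket substantially, which is consistent with the APL fluctuation noted above.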
In addition, we calculate the variance to estimate the uncertainty of the quality metrics. We train and test our model five times to calculate the variance. As can be observed from Table 9, APL has the largest variance in SSDD. Figure 15 shows that this is caused by too few large targets. The variance of the other indicators is relatively small. Thus, our results are reliable.
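A per-metric variance over repeated runs can be computed as in the sketch below. It assumes the sample variance (dividing by n - 1), since the estimator used for Table 9 is not stated here. The function name `variance_per_metric`, the metric keys, and the scores are hypothetical placeholders, not values from Table 9.

```python
import statistics


def variance_per_metric(runs):
    """Sample variance of each metric across repeated train/test runs.

    `runs` maps a metric name to the list of scores from each run.
    """
    return {name: statistics.variance(scores) for name, scores in runs.items()}


# Placeholder scores from five hypothetical runs (NOT results from the paper).
placeholder_runs = {
    "stable_metric": [10.0, 10.0, 10.0, 10.0, 10.0],
    "noisy_metric": [10.0, 11.0, 12.0, 11.0, 11.0],
}
variances = variance_per_metric(placeholder_runs)
```

With only five runs the estimate is itself coarse, so it is best read as a rough indication of which metrics are stable rather than as a precise uncertainty bound.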

Conclusions
In this article, we put forward an instance segmentation approach based on Cascade Mask R-CNN for instance segmentation in HR remote sensing images, which is called HQ-ISNet. The HQ-ISNet adopts an HRFPN to fully utilize multi-level feature maps and maintain HR feature maps for remote sensing images' instance segmentation. Moreover, to refine mask information flow between mask branches, the instance segmentation network version 2 (ISNetV2) is proposed to promote further improvements in mask prediction accuracy. Then, we construct a new, more challenging dataset based on the SSDD and the NWPU VHR-10 dataset for remote sensing images' instance segmentation and it can be used as a benchmark for evaluating instance segmentation algorithms in the high-resolution remote sensing images. Experimental conclusions can be drawn on the SSDD and the NWPU VHR-10 dataset: (1) the HRFPN makes the predicted instance masks more accurate, which can effectively promote the instance segmentation performance of the HR remote sensing imagery; (2) the ISNetV2 is effective and promotes further improvements in mask prediction accuracy; (3) our proposed framework HQ-ISNet is effective and more accurate for instance segmentation in the remote sensing imagery than the existing algorithms. In future work, we will further study instance segmentation in SAR images.