CPISNet: Delving into Consistent Proposals of Instance Segmentation Network for High-Resolution Aerial Images

Abstract: Instance segmentation of high-resolution aerial images is challenging compared to object detection and semantic segmentation in remote sensing applications. It adopts boundary-aware mask predictions, instead of traditional bounding boxes, to locate the objects of interest at the pixel level. Meanwhile, instance segmentation can distinguish densely distributed objects within a certain category by different colors, which is unavailable in semantic segmentation. Despite these distinct advantages, few methods are dedicated to high-quality instance segmentation for high-resolution aerial images. In this paper, a novel instance segmentation method for high-resolution aerial images, termed the consistent proposals of instance segmentation network (CPISNet), is proposed. Following the top-down instance segmentation formula, it adopts the adaptive feature extraction network (AFEN) to extract multi-level bottom-up augmented feature maps at the design space level. Then, the elaborated RoI extractor (ERoIE) is designed to extract the mask RoIs via the refined bounding boxes from the proposal consistent cascaded (PCC) architecture and the multi-level features from AFEN. Finally, a convolution block with shortcut connections is responsible for generating the binary mask for instance segmentation. Experimental conclusions can be drawn on the iSAID and NWPU VHR-10 instance segmentation datasets: (1) each individual module in CPISNet contributes to the overall instance segmentation utility; (2) CPISNet* exceeds vanilla Mask R-CNN by 3.4%/3.8% AP on the iSAID validation/test set and by 9.2% AP on the NWPU VHR-10 instance segmentation dataset; (3) aliasing masks, missing segmentations, false alarms, and poorly segmented masks can be avoided to some extent by CPISNet; (4) CPISNet achieves high instance segmentation precision for aerial images and interprets the objects with fitting boundaries.


Introduction
With the rapid development of observation and imaging techniques in the remote sensing field, the quantity and quality of very high-resolution (VHR) optical remote sensing images provided by airborne and spaceborne sensors have significantly increased, which simultaneously puts forward new demands for the automatic analysis and understanding of remote sensing images. At present, VHR images are applied in a wide scope of fields, e.g., urban planning, precision agriculture, and traffic monitoring. Meanwhile, with their strong feature extraction and end-to-end training capabilities, deep convolutional neural network (DCNN)-based algorithms show their superiority in the sub-tasks of computer vision, such as object detection, semantic segmentation, and instance segmentation. Driven by the huge application demands and prospects, researchers have developed the high-resolution SAR images dataset (HRSID) for instance segmentation and ship detection. Ref. [13] introduced precise region of interest (RoI) pooling for Mask R-CNN [14] to segment multi-category instances in VHR remote sensing images. Ref. [15] came up with a sequence local context (SLC) module to avoid confusion among densely distributed ships. Ref. [16] introduced a semantic boundary-aware multitask learning network for vehicle instance segmentation. Ref. [17] presented a large-scale instance segmentation dataset for aerial images that contains 655,451 instances across 2806 HR images. Ref. [18] proposed a marine oil spill instance segmentation network to identify the similarity of oil slicks and other elements. Despite these prior works, dedicated instance segmentation algorithms for high-resolution aerial images remain scarce.
In this paper, we propose a novel instance segmentation network for high-resolution aerial images, termed the consistent proposals of instance segmentation network (CPISNet), which maintains consistent proposals between object detection and instance segmentation in a cascaded architecture. CPISNet consists of three procedures. First, the adaptive feature extraction network (AFEN) is responsible for extracting the multi-level feature maps. Second, the single RoI extractor (SRoIE) and the bounding box regression branch construct the cascaded architecture, and the refined proposals from the last cascaded stage are transmitted to the elaborated RoI extractor (ERoIE) for mask RoI pooling while maintaining consistent proposals. Third, a sequence of fully convolutional blocks with shortcut connections replaces the interspersed FCN used in the cascaded architectures of Cascade Mask R-CNN and HTC.
The main contributions of this paper are summarized as below:
• CPISNet is proposed for multi-category instance segmentation of aerial images;
• The effects of AFEN, ERoIE, and the proposal consistent cascaded (PCC) architecture on CPISNet are individually verified, and each boosts the integral network performance;
• CPISNet achieves the best instance segmentation AP on high-resolution aerial images compared with other state-of-the-art methods.

Object Detection
The primary task of object detection is locating each object within a rectangular area defined by a bounding box. Generally, existing object detection methods can be divided into two formats: one-stage and two-stage methods. One-stage methods omit the time-consuming process of preparing region proposals and generate bounding boxes directly, e.g., You Only Look Once (YOLO) v1-v4 [19][20][21][22], the Single Shot MultiBox Detector (SSD) [23], and RetinaNet [24]. Ref. [25] proposed Fully Convolutional One-Stage Object Detection (FCOS) to eliminate predefined anchors and detect objects in a per-pixel prediction formula. Ref. [26] adopted keypoint triplets for object detection to suppress the number of incorrect object bounding boxes and presented CenterNet for one-stage object detection. Ref. [27] came up with R3Det to progressively regress rotated bounding boxes from coarse to fine granularity. Relatively, two-stage methods first generate region proposals with a preliminary screening network such as the Region Proposal Network (RPN) [28,29] and then perform classification and localization via the related network branches. The methods derived from the Region with Convolutional Neural Network (R-CNN) [28], e.g., Fast R-CNN [30] and Faster R-CNN [31], constitute the mainstream two-stage methods. Generally, a feature pyramid network (FPN) [29] is attached to the feature extraction network to generate high-level semantic feature maps. Based on the basic architecture of Faster R-CNN, Cascade R-CNN [32] integrates a sequence of detection branches and trains them with increasing Intersection over Union (IoU) thresholds to improve accuracy. To sum up, one-stage methods are superior in detection speed but weaker in detection precision, while two-stage methods are the opposite.

Instance Segmentation
Instance segmentation aims at predicting the instance-level mask and pixel-level category of the objects. Mainstream instance segmentation methods can be roughly divided into top-down methods and bottom-up methods. Top-down methods follow the detect-then-segment paradigm. Fully Convolutional Instance-aware Semantic Segmentation (FCIS [33]) inherits the region proposals generated by RPN to jointly integrate position-sensitive score maps and the FCN for semantic segmentation. On the basis of Faster R-CNN, Mask R-CNN adds a mask branch to predict the instance-aware mask on each RoI. The path aggregation network (PANet [34]) proposed bottom-up path aggregation to boost the information flow that propagates in top-down instance segmentation methods. Mask Scoring R-CNN [35] presents a mask IoU head to improve the quality of the predicted masks. Hybrid Task Cascade (HTC [36]) proposed joint multi-stage processing of the mask branch and detection branch. Bottom-up methods aim at grouping the pixels of each instance in an image and predicting the corresponding semantic category. PolarMask [37] uses polar coordinates to classify the instance center and regress dense distances. Segmenting objects by locations (SOLO [38]) uses the location and size of each instance to assign the pixel category, which transfers instance segmentation into a pixel-wise classification problem. SOLOv2 [39] extends SOLO with mask kernel prediction, mask feature learning, and matrix non-maximum suppression (Matrix NMS). BlendMask [40] presents a blender module inspired by both top-down and bottom-up methods. Analogously, top-down methods perform well in segmentation precision while bottom-up methods are superior in segmentation speed.

The Proposed Method
Our CPISNet follows the top-down instance segmentation formula: the objects are detected first, and instance-wise segmentation is then performed on each RoI. The detailed architecture of CPISNet is shown in Figure 1. First, AFEN extracts the multi-level bottom-up augmented feature maps at the design space level. Second, the SRoIE and ERoIE extract the RoIs from the region proposals of RPN and the multi-level feature maps of AFEN. Finally, the cascaded bounding box detection architecture and the mask branch reconstructed with shortcut connections refine the bounding box detection results and generate high-quality segmentation masks, respectively. The outputs of the detection branch and mask branch constitute the instance segmentation result of CPISNet.
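To make the pipeline concrete, the following schematic sketch traces these three procedures; the module names and call signatures (afen, rpn, sroie, bbox_heads, eroie, mask_head) are placeholders for illustration, not the authors' implementation.

```python
# Schematic sketch of the CPISNet pipeline; every module here is a
# placeholder callable, not the authors' actual code.
def cpisnet_forward(image, afen, rpn, sroie, bbox_heads, eroie, mask_head):
    feats = afen(image)                      # multi-level bottom-up augmented features
    proposals = rpn(feats)                   # initial region proposals
    for bbox_head in bbox_heads:             # cascaded detection stages
        rois = sroie(feats, proposals)       # single RoI extractor per stage
        proposals, scores = bbox_head(rois)  # refined boxes feed the next stage
    mask_rois = eroie(feats, proposals)      # elaborated RoI extractor on final boxes
    masks = mask_head(mask_rois)             # shortcut-connected convolution blocks
    return proposals, scores, masks
```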

The Adaptive Feature Extraction Network
Our adaptive feature extraction network (AFEN) is separately introduced in two parts: the backbone network and multi-level feature extraction network.

Backbone Network
Instead of inheriting an individually designed feature extraction network, we introduce RegNetx [41] as the backbone network, which processes high-resolution aerial images at the design space level.
As illustrated in Figure 2, RegNetx consists of the stem (a 3 × 3 convolution with a stride of 2), the stages (consecutive network blocks), and the head (average pooling followed by a fully connected layer), which is the same as classic backbone networks such as ResNet. At the structural level, classic backbone networks regard the combination of a 1 × 1 convolution, a 3 × 3 convolution, and a 1 × 1 convolution, each followed by batch normalization and ReLU, as a block. On this basis, RegNetx replaces the standard 3 × 3 convolution with a 3 × 3 group convolution [42] with the hyperparameter $g_i$ to optimize the rudimentary residual bottleneck structure in the block. Meanwhile, classic backbones, e.g., ResNet, keep the same expansion ratio of block width (number of feature layers) among stages and manually set the depth of network blocks, e.g., 3, 4, 6, and 3 network blocks for stages 1 to 4 of ResNet-50, respectively. In contrast, RegNetx interpretably parametrizes the width and depth of the network blocks among stages with a quantized linear function. First, the width $v_i$ of the $i$-th network block is computed via a linear parameterization:

$v_i = w_0 + w_a \cdot i, \quad 0 \le i < d$

where the default parameters $w_0 > 0$, $w_a > 0$, and $d$ represent the initial width, slope, and network depth, respectively. However, as $v_i$ should be an integer, we supplement the default constraint $w_m$ to compute $s_i$ via the following formulation:

$v_i = w_0 \cdot w_m^{s_i}, \quad \text{i.e.,} \quad s_i = \log_{w_m}(v_i / w_0)$

Then, $s_i$ is rounded to compute the quantized width $u_i$ of the $i$-th network block as follows:

$u_i = w_0 \cdot w_m^{\lfloor s_i \rceil}$

Considering that the width of each network block is restricted by the hyperparameter $g_i$ of the group convolution, $u_i$ is further normalized to an integer multiple of $g_i$ via:

$\tilde{u}_i = (u_i / g_i)^{*} \cdot g_i$

where $*$ represents the rounding operation. Finally, the network blocks with the same width $\tilde{u}_i$ constitute a certain stage of RegNetx. From a quantitative point of view, given the hyperparameters $w_0$, $w_a$, $w_m$, $g_i$, and $d$, the width $\tilde{u}_i$ of the $i$-th residual block is obtained, which simultaneously defines the universal RegNet. Meanwhile, by employing the flop regime [43], the hyperparameters $w_0$, $w_a$, $w_m$, $g_i$, and $d$ of the top-performing models define the design space of RegNetx. Compared to classic backbones such as ResNet, RegNetx inherits the merit of the shortcut connection and further explores the design space from the block and stage to the whole backbone network structure. Based on the general describable network architecture in Figure 2 but with the distinct hyperparameter settings of RegNetx, the output width of each stage, the number of blocks, and the group ratio for ResNet and RegNetx are summarized in Table 1. Obviously, RegNetx has reduced output widths and a flexible expansion ratio of output width between consecutive stages. Moreover, by implementing group convolution in each network block, the model size of RegNetx is more lightweight than that of ResNet.

Multi-Level Feature Extraction Network
In top-down instance segmentation networks, FPN shows notable performance in multi-scale instance segmentation. As the edges and instance parts in low-level features can improve the localization capability of FPN, we introduce bottom-up path augmentation (BPA) for FPN to improve the semantic representation of the output feature maps. The lowest level of BPA is regarded as the same as that of FPN, i.e., $B_0 = F_0$. Each upper layer $B_i$ of BPA is constructed from $B_{i-1}$ and the FPN layer $F_i$ via:

$B_i = \mathrm{Conv}_{3\times3}\big(\mathrm{Conv}_{3\times3}(B_{i-1}; \theta_1) + F_i; \theta_2\big)$

where $\theta_1$ and $\theta_2$ represent the weights of the two 3 × 3 convolution layers, and the first convolution downsamples $B_{i-1}$ with a stride of 2 to match $F_i$. As an extension of FPN, the outputs $B_i$ of BPA are regarded as the output multi-level feature maps of AFEN.
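Under the assumption of a shared 256-channel FPN, the BPA computation above can be sketched in PyTorch as follows (a minimal sketch, not the exact AFEN implementation):

```python
import torch.nn as nn

# Minimal sketch of bottom-up path augmentation (BPA): B_0 = F_0, and each
# B_i fuses a stride-2 3x3 convolution of B_{i-1} (theta_1) with the FPN
# layer F_i, followed by a second 3x3 convolution (theta_2).
class BPA(nn.Module):
    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        self.down = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # theta_1
            for _ in range(num_levels - 1)])
        self.fuse = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, stride=1, padding=1)  # theta_2
            for _ in range(num_levels - 1)])

    def forward(self, fpn_feats):          # [F_0, ..., F_3], high to low resolution
        outs = [fpn_feats[0]]              # B_0 = F_0
        for i in range(1, len(fpn_feats)):
            b = self.down[i - 1](outs[-1]) + fpn_feats[i]
            outs.append(self.fuse[i - 1](b))
        return outs                        # [B_0, ..., B_3]
```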
As illustrated in Figure 3, the backbone network and the multi-level feature extraction network constitute the overall network architecture of our AFEN.
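To make the width quantization above concrete, the following sketch computes per-block widths from the RegNetx hyperparameters; the parameter values shown are illustrative (RegNetx-4.0GF-like), not necessarily the exact configuration used in this paper.

```python
import numpy as np

# Sketch of the RegNetx width quantization: v_i (linear widths), s_i
# (log-space exponents), u_i (quantized widths), and rounding to a
# multiple of the group width g.
def regnet_widths(w0, wa, wm, g, d):
    i = np.arange(d)
    v = w0 + wa * i                                # v_i = w0 + wa * i
    s = np.round(np.log(v / w0) / np.log(wm))      # s_i, rounded
    u = w0 * np.power(wm, s)                       # u_i = w0 * wm^s_i
    return (np.round(u / g) * g).astype(int)       # multiples of g

widths = regnet_widths(w0=96, wa=38.65, wm=2.43, g=40, d=23)
stage_widths, blocks_per_stage = np.unique(widths, return_counts=True)
# Blocks sharing the same width constitute one stage of RegNetx.
```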

The RoI Extractors
As for top-down instance segmentation methods, RPN is responsible for preliminarily predicting the candidate region proposals, which initially screens out the positive samples among the predictions. To map the coordinate-based region proposals to the multi-level features from FPN, Ref. [31] proposed the RoI extractor, which selects matched region proposals for each output level of FPN and pools them with RoI Pooling to generate RoIs for object detection. Based on these previous explorations, we design corresponding RoI extractors for our CPISNet.

Single RoI Extractor
Generally, in top-down instance segmentation methods, the speed of mask prediction is limited by object detection, as the methods execute the detect-then-segment formula. A decrease in object detection speed will in turn slow down the segmentation speed and further reduce the overall network speed. Therefore, we adopt a single RoI extractor for each stage of our subsequent object detection network.
Assuming the output multi-level feature maps from FPN are $\{F_0, F_1, F_2, F_3\}$, the initially screened-out $i$-th bounding box from RPN is denoted as $b_i = (x_l^i, y_l^i, x_r^i, y_r^i)$, where $(x_l^i, y_l^i)$ and $(x_r^i, y_r^i)$ represent the bottom-left and top-right coordinates of the bounding box, respectively. Therefore, the area $S_i$ of the $i$-th bounding box is calculated as:

$S_i = (x_r^i - x_l^i)(y_r^i - y_l^i)$

Following the above-mentioned bounding box area $S_i$, the FPN level $k_i$ of the $i$-th bounding box is calculated as:

$k_i = \left\lfloor \log_2\left(\sqrt{S_i} / 56\right) \right\rfloor$

where $k_i$ indexes the $k$-th level of FPN, and the denominator 56 denotes the smallest threshold scale of $56^2$ for the 0-th level mapping, which is defined by the canonical ImageNet [44] pre-training size. Following this schedule, each bounding box is mapped to a certain level of FPN. Next, the bounding box and the corresponding FPN level are pooled by RoIAlign [14] to generate the RoI via:

$RoI_i = \mathrm{RoIAlign}(b_i, F_{k_i})$

where $RoI_i$ represents the $i$-th RoI pooled from the $i$-th bounding box and $F_{k_i}$. In this paper, we present the single RoI extractor (SRoIE) to extract the RoIs prepared for the object detection branch. The architecture of SRoIE is shown in Figure 4.
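A minimal sketch of this level-mapping rule, with the canonical scale of 56 and four FPN levels as described above (clamping to the valid levels is an assumption of common practice):

```python
import math

# Sketch of the level-mapping rule: a box of area S_i is assigned to FPN
# level k_i = floor(log2(sqrt(S_i) / 56)), clamped to {F_0, ..., F_3}.
def map_box_to_fpn_level(box, num_levels=4, canonical_scale=56):
    xl, yl, xr, yr = box                         # bottom-left / top-right corners
    area = max(xr - xl, 0) * max(yr - yl, 0)     # S_i
    if area <= 0:
        return 0
    k = math.floor(math.log2(math.sqrt(area) / canonical_scale))
    return min(max(k, 0), num_levels - 1)

level = map_box_to_fpn_level((100, 100, 212, 212))  # a 112 x 112 box maps to level 1
```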

Elaborated RoI Extractor
Distinguished from the heuristic selection schedule in SRoIE, we adopt a pre-elaborate, aggregate, and post-elaborate schedule to construct our elaborated RoI extractor (ERoIE). The architecture of ERoIE is illustrated in Figure 5.
Objects (e.g., planes, harbors, and helicopters) in aerial images exhibit geometric variations due to the overlooking angle, local characteristics, etc., which may impede the network from integrally presenting the shape of an object. Consequently, we choose the dynamic convolutional network (DCN) [45,46] to deal with such variations. Assuming the output multi-level feature maps from FPN are $\{F_0, F_1, F_2, F_3\}$, with strides of $\{4, 8, 16, 32\}$ (relative to the original image) for RPN, all the region proposals (RPs) from RPN are pooled within each $F_i$ by RoIAlign via:

$RoI_i = \mathrm{RoIAlign}(RPs, F_i)$

where $RoI_i$ represents the RoIs pooled at the $i$-th level. Here, all the RPs are regarded as indispensable elements for RoI pooling. Then, each $RoI_i$ is preliminarily elaborated by a 5 × 5 dynamic convolution:

$DRoI_i = \mathrm{dynamic\_conv}_{5 \times 5}(RoI_i)$

where $\mathrm{dynamic\_conv}$ denotes the dynamic convolutional network; see [45,46] for more details. Next, the $DRoI_i$ of all levels are aggregated via element-wise addition:

$ARoI = \sum_i DRoI_i$

Finally, we adopt the global context block (GCB) to post elaborate the aggregated RoIs via:

$ERoI = ARoI + \mathrm{Conv}_{1\times1}\Big(\mathrm{RL}\Big(\mathrm{Conv}_{1\times1}\Big(\textstyle\sum_j \beta_j \, ARoI_j; \gamma_2\Big)\Big); \gamma_3\Big), \quad \beta_j = \frac{\exp\big(\mathrm{Conv}_{1\times1}(ARoI_j; \gamma_1)\big)}{\sum_m \exp\big(\mathrm{Conv}_{1\times1}(ARoI_m; \gamma_1)\big)}$

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are the weights of the 1 × 1 convolutions, RL represents the consecutive ReLU and Layer Normalization operations, $\beta_j$ represents the global context weight at the $j$-th spatial position normalized by the softmax function, and $ARoI_j$ denotes the feature of $ARoI$ at position $j$. The output $ERoI$ is regarded as the elaborated RoI feature of our ERoIE.
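The following is a minimal PyTorch sketch of such a global context block; it follows the GCNet formulation that the text refers to, applies Layer Normalization before ReLU (the common GCNet order; the "RL" wording above leaves the order ambiguous), and uses illustrative channel sizes.

```python
import torch
import torch.nn as nn

# Minimal sketch of a global context block (GCB): gamma_1 produces the
# softmax weights beta_j over spatial positions, gamma_2 and gamma_3 form
# the bottleneck transform, and the result is added back to the input.
class GlobalContextBlock(nn.Module):
    def __init__(self, channels=256, ratio=1 / 16):
        super().__init__()
        hidden = max(int(channels * ratio), 1)
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)      # gamma_1
        self.v1 = nn.Conv2d(channels, hidden, kernel_size=1)   # gamma_2
        self.rl = nn.Sequential(nn.LayerNorm([hidden, 1, 1]), nn.ReLU())
        self.v2 = nn.Conv2d(hidden, channels, kernel_size=1)   # gamma_3

    def forward(self, x):                                      # x: (N, C, H, W)
        n, c, h, w = x.shape
        beta = self.attn(x).view(n, 1, h * w).softmax(dim=-1)  # beta_j
        ctx = torch.bmm(x.view(n, c, h * w), beta.transpose(1, 2))
        ctx = ctx.view(n, c, 1, 1)                             # global context vector
        return x + self.v2(self.rl(self.v1(ctx)))              # fusion by addition
```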

Proposal Consistent Cascaded Architecture for Instance Segmentation
Cascaded architecture was first introduced in object detection. Cai et al. [32] proposed a stage-by-stage object detector termed Cascade R-CNN, which leverages the output of the previous stage to meet the demand for a high-quality sample distribution in the next stage. Similar to the formula of extending Faster R-CNN to Mask R-CNN, Cascade Mask R-CNN attaches a mask branch parallel to the object detection branch in each stage to perform instance segmentation, which can be formulated via:

$R^{Box}_t = P_b(x, \hat{Pb}_{t-1}), \quad \hat{Pb}_t = B_t(R^{Box}_t)$

$R^{Mask}_t = P_m(x, \hat{Pb}_{t-1}), \quad \hat{Pm}_t = M_t(R^{Mask}_t)$

where $R^{Box}_t$ and $R^{Mask}_t$ represent the bounding box RoI features pooled by the bounding box RoI extractor $P_b$ and the mask RoI features pooled by the mask RoI extractor $P_m$ in the $t$-th stage, respectively; $x$ is the multi-scale feature map from FPN; and $\hat{Pb}_t$ and $\hat{Pm}_t$ denote the bounding box predicted by the bounding box branch $B_t$ and the mask predicted by the mask branch $M_t$, respectively. Obviously, $M_t$ is individually invoked in each stage, which is computationally inefficient.
To exploit the reciprocal relationship between detection and segmentation in the cascaded architecture, Chen et al. [36] proposed HTC to interweave them for joint stage-by-stage processing. Retaining the merits of Cascade Mask R-CNN as formulated above, HTC connects the mask branches of consecutive stages via:

$F_t = \mathrm{Conv}_{1\times1}(\bar{R}^{Mask}_{t-1}; \omega_t)$

$\bar{R}^{Mask}_t = R^{Mask}_t + F_t, \quad \hat{Pm}_t = M_t(\bar{R}^{Mask}_t)$

where $\mathrm{Conv}_{1\times1}(\cdot\,; \omega_t)$ represents a 1 × 1 convolution with weight $\omega_t$, $F_t$ is the mask information flow from stage $t-1$ to stage $t$, and $\bar{R}^{Mask}_t$ denotes the interweaved mask feature for mask prediction. An intuitive comparison of the cascaded architectures of Cascade Mask R-CNN and HTC is illustrated in Figure 6a,b. Unfortunately, these two cascaded architectures ignore the consistency of the sample IoU distribution for mask prediction, which potentially degrades instance segmentation precision [47]. In this paper, we introduce the proposal consistent cascaded (PCC) architecture to realize high-quality instance segmentation for high-resolution aerial images with a novel cascaded architecture. The network architecture of PCC is shown in Figure 6c.
In the PCC architecture, we inherit the cascaded bounding box stages of Cascade Mask R-CNN but abandon the additional mask branch in each detection stage to eliminate the disparity of the samples' IoU distributions between training and testing. As an alternative, we attach the mask branch to the last stage of the detection branch. The pipeline is formulated as follows:

$R^{Mask} = P_m(x, \hat{Pb}_T), \quad \hat{Pm} = M_n(R^{Mask})$

where $R^{Mask}$ is pooled with the refined bounding boxes $\hat{Pb}_T$ from the last stage $T$, and $M_n$ is the mask branch, which contains $n$ consecutive blocks of stacked convolutions. Each block $M$ contains two 3 × 3 convolutions with a shortcut connection via:

$M(R^{Mask}_n) = R^{Mask}_n + \mathrm{Conv}_{3\times3}\big(\mathrm{Conv}_{3\times3}(R^{Mask}_n; \theta_1); \theta_2\big)$

where $R^{Mask}_n$ denotes the input of the $n$-th block, and $\theta_1$ and $\theta_2$ are the weights of the two 3 × 3 convolutions. At the structural level, PCC not only ensures that instance segmentation is performed on the basis of precise localization but also eliminates the intermediate noisy boxes of mask prediction. Moreover, moderately adjusting the depth of the mask branch can tweak the quality of the mask predictions.
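A minimal PyTorch sketch of one shortcut-connected block of the PCC mask branch, with illustrative channel sizes and an assumed ReLU placement:

```python
import torch.nn as nn

# Minimal sketch of one block of the PCC mask branch: two 3x3 convolutions
# (theta_1, theta_2) plus a shortcut connection; the ReLU placement is an
# assumption, and channel sizes are illustrative.
class MaskBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)  # theta_1
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)  # theta_2
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))           # shortcut connection

# e.g., stacking the 8 blocks found best in Table 7:
mask_branch = nn.Sequential(*[MaskBlock(256) for _ in range(8)])
```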

Experiments
In this section, we will separately introduce the datasets, loss functions, evaluation metrics, and implementation details. Next, experiments on these prerequisites are implemented to verify the effectiveness of our proposed CPISNet.

The Datasets
We select two mainstream instance segmentation datasets of high-resolution aerial images for our experiments: the Instance Segmentation in Aerial Images Dataset (iSAID [17]) and the NWPU VHR-10 instance segmentation dataset [11].
The NWPU VHR-10 instance segmentation dataset is the extended version of the NWPU VHR-10 dataset [48,49] provided by [11], which supplies a pixel-wise annotation for each instance. There are 10 object categories in total: airplane (AI), baseball diamond, ground track field, vehicle (VC), ship, tennis court, harbor, storage tank, basketball court, and bridge. The dataset consists of 650 very high-resolution (VHR) aerial images containing targets and 150 VHR images of pure background. In our experiments, it is divided into a training set (70% of the images) and a test set (30% of the images) for training and testing, respectively.

Evaluation Metrics
Following instance segmentation in natural scenes, we adopt the MS COCO evaluation metrics to evaluate the effectiveness of the methods. Similar to object detection, the AP of an instance segmentation result is defined over the IoU, which is calculated as the overlap ratio between the predicted mask and the ground truth mask:

$IoU = \frac{|M_p \cap M_g|}{|M_p \cup M_g|}$

where $M_p$ and $M_g$ denote the predicted mask and the ground truth mask, respectively. Based on a certain IoU threshold, the precision and recall values are defined from the instance-wise classification results via:

$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}$

where TP, FP, and FN represent true positives, false positives, and false negatives, respectively. Meanwhile, the AP of the predicted results is calculated through:

$AP = \int_0^1 P(r)\,dr$

where $P$ is the precision value and $r$ is the recall value. Generally, the AP value is calculated by averaging over 10 IoU thresholds ranging from 0.5 to 0.95 with a stride of 0.05. In addition to AP, the MS COCO evaluation metrics also include the single-threshold metrics $AP_{50}$ (IoU = 0.5) and $AP_{75}$ (IoU = 0.75). Moreover, $AP_S$, $AP_M$, and $AP_L$ measure the AP of small (area $< 32^2$ pixels), medium ($32^2 <$ area $< 96^2$ pixels), and large (area $> 96^2$ pixels) instances, respectively.
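For concreteness, a minimal sketch of the mask IoU computation on binary masks:

```python
import numpy as np

# Sketch of the mask IoU above: the overlap ratio between a predicted
# binary mask M_p and a ground truth binary mask M_g. A prediction counts
# as a TP at threshold t when its best-matching IoU is at least t; COCO AP
# then averages over t = 0.50, 0.55, ..., 0.95.
def mask_iou(m_p: np.ndarray, m_g: np.ndarray) -> float:
    m_p, m_g = m_p.astype(bool), m_g.astype(bool)
    inter = np.logical_and(m_p, m_g).sum()
    union = np.logical_or(m_p, m_g).sum()
    return float(inter) / union if union > 0 else 0.0
```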

The Loss Functions
For simplicity, we choose the cross entropy loss function for object classification, which is defined as:

$L_{cls}(x, class) = -\log\frac{\exp(x_{class})}{\sum_{j=1}^{c}\exp(x_j)}$

where $class$ is the ground truth category label, and $x$ and $c$ denote the predicted scores of the categories and the number of categories, respectively. The smooth L1 loss is responsible for regressing the bounding boxes via:

$L_{reg} = \frac{1}{N}\sum_i \mathrm{smooth}_{L1}(p_i - g_i), \quad \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$

where $p_i$ is the predicted bounding box, $g_i$ is the ground truth bounding box, and $N$ denotes the number of predicted bounding boxes. Following [14], we select the binary cross entropy (BCE) loss for mask prediction, which can be represented via:

$L_{mask} = -\frac{1}{N}\sum_i \big[T_i \log P_i + (1 - T_i)\log(1 - P_i)\big]$

where $P_i$ denotes the predicted value of the pixel with coordinates $(x, y)$ in the predicted mask, and $T_i$ is the corresponding ground truth value with coordinates $(x, y)$ in the ground truth mask.
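A minimal sketch of the three losses using PyTorch built-ins, with illustrative tensor shapes rather than the actual training pipeline:

```python
import torch
import torch.nn.functional as F

# Sketch of the three losses with PyTorch built-ins. Shapes are
# illustrative: 8 RoIs, 16 classes (e.g., 15 categories plus background),
# and 28 x 28 mask predictions.
logits = torch.randn(8, 16)                      # classification scores x
labels = torch.randint(0, 16, (8,))              # ground truth labels
cls_loss = F.cross_entropy(logits, labels)       # cross entropy loss

pred_boxes = torch.randn(8, 4)                   # p_i
gt_boxes = torch.randn(8, 4)                     # g_i
reg_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)  # smooth L1 loss

pred_masks = torch.rand(8, 28, 28)               # P_i (probabilities)
gt_masks = torch.randint(0, 2, (8, 28, 28)).float()  # T_i
mask_loss = F.binary_cross_entropy(pred_masks, gt_masks)  # BCE loss
```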

Implementation Details
All the models in our experiments are implemented with the PyTorch framework. A single RTX 3090 with 24 GB memory is adopted for training and testing the models. We select stochastic gradient descent (SGD) as the optimizer for each model. In the training phase, with an initial learning rate of 0.0025, each model is trained for 12 epochs with a mini-batch size of 2, and the learning rate is multiplied by 0.1 at the 8-th and 11-th epochs. As for image size, each image in the NWPU VHR-10 instance segmentation dataset is resized to 1000 × 600 pixels for training and testing. Moreover, soft non-maximum suppression (Soft-NMS) [50] with a threshold of 0.5 is selected as the bounding box filter. The increasing IoU thresholds for the stages of the cascaded architectures are set at 0.5, 0.6, and 0.7, respectively.
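A minimal sketch of this optimization schedule in PyTorch; the model is a placeholder, and the momentum and weight decay values follow those reported for the comparison settings later in the paper:

```python
import torch

# Sketch of the schedule above: SGD at lr = 0.0025, decayed by a factor of
# 0.1 at epochs 8 and 11 over a 12-epoch run. `model` is a placeholder.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                            momentum=0.9, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[8, 11], gamma=0.1)

for epoch in range(12):
    # ... one training epoch with a mini-batch size of 2 ...
    scheduler.step()
```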

Ablation Experiments
In this section, we conduct comprehensive experiments on AFEN, ERoIE, and PCC to verify the effects of our proposed CPISNet. All the experiments are based on the Mask R-CNN (meta top-down instance segmentation formula) with ResNet-101 backbone network. Moreover, we select the iSAID validation set to test our instance segmentation results.

Effects of CPISNet
The instance segmentation results of AFEN, ERoIE, and PCC are individually reported in Table 2. Quantitatively, AFEN, ERoIE, and PCC all perform well in segmenting aerial objects (gaining 0.6%, 0.9%, and 1.9% AP increments, respectively). With regard to CPISNet, it yields 2.6% AP increments over vanilla Mask R-CNN under the same training and testing conditions. Across the various AP indicators, CPISNet even gains 3.1% AP_50 and 5.3% AP_L increments, respectively.

Experiments on AFEN
As shown in Table 3, HRNet and RegNetx both serve as efficient backbone networks for instance segmentation in high-resolution aerial images. The HRNetw32-HRFPN and RegNetx3.2GF-FPN structures achieve 0.3% and 0.1% higher AP than ResNet101-FPN, respectively, while our proposed AFEN achieves even higher mask prediction precision: a 0.3% AP gain from BPA under the RegNetx-3.2GF backbone, and a 0.6% AP gain with AFEN-4.0GF compared to ResNet101-FPN. The results of the ablation experiments indicate that AFEN is efficient in high-quality feature extraction for high-resolution aerial images.

Experiments on ERoIE
In this subsection, we implemented three stages of experiments, including effects of the preliminarily elaborated module, effects of the post elaborated module, and effects of the integral ERoIE, to verify the rationality of ERoIE.

Stage 1: Effects of the Preliminarily Elaborated Module
On the basis of the experiments in [53], we follow the criterion of selecting the most effective convolution layer for the preliminarily elaborated module and set DCN with a kernel size of 5 here, consistent with the previous statement in Section 3.2.2. Further, we compare the effects of the single-level and fused-level elaborated strategies when implementing DCN as the preliminarily elaborated module (element-wise experiments). Please note that DCN is additionally selected as the default post elaborated module here. Assuming the pooled RoI features from RoIAlign are B_2, B_3, B_4, and B_5, the corresponding feature maps are recorded as B_2-level, B_3-level, B_4-level, and B_5-level, respectively. As shown in Table 4, DCN for B_3-level elaboration and post-processing outperforms the remaining forms by up to 0.3% AP in the single-level elaborated strategy, which is the same as B_1 + B_2 + B_3 in the fused-level elaborated strategy.

Stage 2: Effects of the Post Elaborated Module
Stage 2 focuses on evaluating the effects of the post elaborated module on ERoIE. In this context, we individually measure the global enhancement capability of GCB and DCN for post-processing. Without loss of generality, we replace the DCN with GCB for post elaboration in stage 1. As shown in Table 5, the effects of GCB are similar to those of DCN in the single-level elaborated strategy. In particular, B_0 + B_1 + B_2 + B_3 in the fused-level elaborated strategy yields 0.6% higher AP than B_0-level in the single-level elaborated strategy. Therefore, we select DCN with the fused-level elaborated strategy of B_0 + B_1 + B_2 + B_3 to preliminarily elaborate the RoIs and GCB to post elaborate the aggregated RoIs, which formulates ERoIE.

Stage 3: Effects of the Integral ERoIE
Stage 3 researches the effects of the integral ERoIE formula. The results are shown in Table 6. Without appendages, ERoIE performs similarly to SRoIE. When omitting the preliminary elaboration, adding post-processing with GCB/DCN improves AP by 0.3% and 0.6%, respectively. As for our integral ERoIE (the best result in Table 6), it yields 0.9% higher AP than SRoIE, which verifies the effectiveness of ERoIE in instance segmentation for high-resolution aerial images.

Experiments on PCC
In this subsection, we have implemented two groups of experiments, including selecting the depth of mask branch and the effects of PCC, to verify the rationality of PCC.

Group 1: Selecting the Depth of Mask Branch
Distinguished from the scattered mask branches (each containing four consecutive convolution layers) in the stages of the cascaded architectures of Cascade Mask R-CNN and HTC, the mask branch in PCC stacks consecutive convolution layers with a shortcut connection around every 2 convolution layers (denoted as a block). Here, the depth of the mask branch is equal to the number of blocks. As shown in Table 7, with a gradually increasing number of blocks, PCC yields up to 0.6% AP increments until 8 blocks; with more than 8 blocks, the AP of PCC begins to drop. It is worth mentioning that even with 2 blocks (equal in convolution layers to the scattered mask branch), PCC improves 1.3% AP over vanilla Mask R-CNN, which additionally verifies the effectiveness of PCC.

Group 2: Effects of PCC
Group 2 evaluates the superiority of PCC at the structural level. Therefore, we compare the performance of PCC with the Cascade Mask Branch (the cascaded architecture of Cascade Mask R-CNN) and the Mask Information Flow (the cascaded architecture of HTC). Table 8 lists the instance segmentation results of the cascaded architectures with ResNet-50 and ResNet-101 backbone networks. Compared to the Cascade Mask Branch and the Mask Information Flow, PCC with ResNet-50 outperforms them by 1.0% and 0.4% AP, respectively, and the same holds for PCC with ResNet-101. Moreover, PCC maintains significant increments in threshold AP (AP_50 and AP_75) and area AP (AP_S, AP_M, and AP_L).

Instance Segmentation Results on iSAID
To measure the instance segmentation capability of CPISNet at the integral model level, we select five state-of-the-art top-down instance segmentation methods for a fair comparison: Mask R-CNN, Mask Scoring R-CNN (MS R-CNN), Cascade Mask R-CNN (CM R-CNN), HTC, and SCNet, with the default training and testing hyperparameters as in [54], except for the dedicated hyperparameters introduced in Section 4.4. All the state-of-the-art methods adopt ResNet-101 and FPN for multi-scale feature extraction; the momentum and weight decay for SGD are set at 0.9 and 0.0001, respectively. Correspondingly, CPISNet adopts AFEN-4.0GF for multi-scale feature extraction here; the momentum and weight decay for SGD are set at 0.9 and 0.00005, respectively. Meanwhile, we report frames per second (FPS) and model size to evaluate the practical engineering applicability of each method. As presented in Table 9, our CPISNet achieves the highest AP of 38.6% among the compared methods. Compared with the non-cascaded methods, CPISNet yields 2.6% and 1.7% AP increments over Mask R-CNN and MS R-CNN, respectively, with similar model size. Compared with the cascaded methods, CPISNet still maintains over 1% AP increments (1.7%, 1.2%, and 1.3% AP over CM R-CNN, HTC, and SCNet, respectively) with reduced model size.
Considering the scale variance of objects in high-resolution aerial images, we further introduce a multi-scale training strategy to improve the scale sensitivity of our CPISNet, yielding CPISNet*. During training, the aerial images are rescaled to 1200 × 800, 1000 × 800, 800 × 800, 600 × 800, and 400 × 800 pixels. For testing, the size of the aerial images remains 800 × 800 pixels. Without bells and whistles, CPISNet* further improves AP by 0.8% with the same model size as CPISNet and only slightly inferior FPS. In general, CPISNet* outperforms Mask R-CNN by 3.4% AP. Across the threshold AP indicators, AP_50 and AP_75 improve by 4.0% and 3.6%, respectively. Moreover, CPISNet* outperforms vanilla Mask R-CNN by 3.9%, 3.3%, and 4.5% in segmenting small, medium, and large objects in high-resolution aerial images, respectively. Qualitatively, we provide a comparison of the visualized instance segmentation results of vanilla Mask R-CNN and our proposed CPISNet* in Figure 7. As illustrated in row 2 of Figure 7, the instance segmentation results of Mask R-CNN contain aliasing masks, missing segmentations, and poorly segmented masks. Our proposed CPISNet* can effectively suppress such defects in instance segmentation for high-resolution aerial images.
To examine the detailed results of CPISNet*, we report the class-wise AP of each method for each aerial category in Table 10. Notably, storage tank achieves 80.5% AP (the highest AP among the 15 aerial categories) and ship obtains a 7.2% AP improvement (the highest AP improvement among the 15 aerial categories) on the iSAID validation set. Meanwhile, we observe that for some categories, e.g., tennis court and roundabout, CPISNet* yields ∼5% AP improvement over Mask R-CNN. Qualitatively, we visualize the class-wise instance segmentation results in Figure 8; each subfigure presents representative aerial categories. Identical to the quantitative results, CPISNet* is capable of segmenting hard samples in high-resolution aerial images, e.g., densely distributed objects (row 1, columns 3-4), small objects (row 2, columns 3-4), and objects with nonrigid boundaries (row 1, column 1 and row 3, column 3). The quantitative and qualitative results on the iSAID validation set indicate that our proposed CPISNet is more effective in segmenting aerial objects than the state-of-the-art methods.
Following [17], we further measure the generalization ability of the state-of-the-art methods on the iSAID test set. Please note that the quantitative results in Tables 11 and 12 are tested on the official evaluation server. As shown in Table 11, compared to vanilla Mask R-CNN, our CPISNet* achieves an even larger AP improvement on the iSAID test set (3.8%) than on the iSAID validation set (3.4%), which reflects the strong generalization ability of CPISNet*. Across the various AP indicators, CPISNet* still exceeds vanilla Mask R-CNN by over 4%. Table 12 reports the class-wise AP of the methods. Intuitively, the small vehicle and helicopter categories challenge the generalization ability of instance segmentation methods due to their small size and unique geometric variations. However, our proposed CPISNet* not only improves the AP of small vehicle and helicopter but also remains in the ascendancy for the other categories, e.g., 6.7% AP increments for basketball court.

Instance Segmentation Results on NWPU-VHR-10 Dataset
Similar to the experiments on iSAID, we supplement instance segmentation experiments on the NWPU VHR-10 dataset to further verify the rationality of CPISNet. We select the same comparison methods as in the iSAID experiments. Considering the image size of the NWPU VHR-10 dataset, we define CPISNet* as CPISNet with the multi-scale training strategy, rescaling the images to 1000 × 1200, 1000 × 1000, 1000 × 800, 1000 × 600, and 1000 × 400 pixels. Distinguished from the results on iSAID, CPISNet and CPISNet* widen the instance segmentation performance gap over the state-of-the-art methods on high-resolution aerial images. As shown in Table 13, CPISNet* achieves the highest AP of 67.5% among the state-of-the-art methods and yields 9.2% AP increments over Mask R-CNN. Compared to SCNet, CPISNet* yields 5.2% AP increments and a 27.0% reduced model size at the cost of merely 2.1 FPS. Across the various AP indicators, CPISNet* exceeds Mask R-CNN by over 10% (11.4% in AP_50 and 16.5% in AP_L). Moreover, as illustrated in Figure 9, CPISNet* can suppress false alarms, deal with nonrigid boundaries, and accurately distinguish densely distributed objects (unavailable in semantic segmentation).
Figure 9. Comparison of the visualized instance segmentation results of vanilla Mask R-CNN and our proposed CPISNet* on the NWPU VHR-10 instance segmentation dataset. Rows 1 to 3 denote the ground truth, Mask R-CNN results, and CPISNet* results, respectively. Please note that the red, orange, and blue rectangles in row 2 represent aliasing masks of dense objects, false alarms, and poorly segmented masks, respectively.
Next, we report the class-wise instance segmentation results in Table 14. As shown in Table 14, ground track field receives the highest AP value of 92.5% among the 10 categories. Moreover, airplane, tennis court, and basketball court achieve remarkable AP increments of 14.7%, 14.9%, and 14.0% over Mask R-CNN, respectively. For some particular categories, e.g., bridge with its unbalanced aspect ratio and airplane with its irregular boundary, the AP value is dramatically improved but still relatively low. Qualitatively, we visualize the class-wise instance segmentation results in Figure 10. Identical to the quantitative results, the predicted masks of CPISNet* fit the object boundaries well, and each category interpreted by CPISNet* completely presents its characteristics.

Discussion
Considering the defects of object detection and semantic segmentation in high-resolution aerial images, we employ instance segmentation to interpret the objects in high-resolution aerial images, which can locate the objects with their boundaries, classify the objects at the pixel level, and distinguish the objects within a certain category by different colors. The superiority of instance segmentation in high-resolution aerial images can be observed in Figures 7-10. Despite the effectiveness of CPISNet in segmenting aerial objects on both the iSAID and NWPU VHR-10 instance segmentation datasets, it still encounters difficulties in segmenting nested objects, e.g., the ground track field and soccer ball field. Moreover, objects with small size, large aspect ratio, and irregular boundaries challenge the precision of instance segmentation. Future research will focus on tackling the above-mentioned problems and, in addition, on improving the FPS of the proposed model while maintaining high instance segmentation precision, in two aspects. First, as the pairwise convolution layers in the mask branch are mapped with shortcut connections, implementing a channel pruning operation may accelerate mask prediction (similar to optimizing the backbone network) and further improve the FPS of the model. Second, the reusable bounding box stages in the cascaded architecture of CPISNet may reduce its inference speed. Therefore, to increase the FPS of the model, it may be useful to replace the shared fully connected layers with layers such as a global average pooling layer in each bounding box stage.

Conclusions
In this paper, we propose a novel instance segmentation network for interpreting multi-category aerial objects, termed CPISNet. CPISNet follows the top-down instance segmentation formula. First, it adopts AFEN to extract the multi-level bottom-up augmented feature maps at the design space level. Second, ERoIE is designed to extract the mask RoIs via the refined bounding boxes output from PCC and the multi-level features output from AFEN. Finally, a convolution block with shortcut connections generates the binary mask for instance segmentation. Experimental conclusions can be drawn on the iSAID and NWPU VHR-10 instance segmentation datasets: (1) each individual module in CPISNet contributes to the overall instance segmentation utility; (2) CPISNet* exceeds vanilla Mask R-CNN by 3.4%/3.8% AP on the iSAID validation/test set and by 9.2% AP on the NWPU VHR-10 instance segmentation dataset; (3) aliasing masks, missing segmentations, false alarms, and poorly segmented masks can be avoided to some extent by CPISNet; (4) CPISNet achieves high instance segmentation precision for aerial images and interprets the objects with fitting boundaries.