Panoptic SwiftNet: Pyramidal Fusion for Real-time Panoptic Segmentation

Dense panoptic prediction is a key ingredient in many existing applications such as autonomous driving, automated warehouses, and remote sensing. Many of these applications require fast inference over large input resolutions on affordable or even embedded hardware. We propose to achieve this goal by trading off backbone capacity for multi-scale feature extraction. In comparison with contemporaneous approaches to panoptic segmentation, the main novelties of our method are efficient scale-equivariant feature extraction, cross-scale upsampling through pyramidal fusion, and boundary-aware learning of pixel-to-instance assignment. The proposed method is very well suited for remote sensing imagery due to the huge number of pixels in typical city-wide and region-wide datasets. We present panoptic experiments on Cityscapes, Vistas, COCO and the BSB Aerial dataset. Our models outperform the state of the art on the BSB Aerial dataset while being able to process more than a hundred 1MPx images per second on an RTX3090 GPU with FP16 precision and TensorRT optimization.


Introduction and related work
Panoptic segmentation [20] recently emerged as one of the most important recognition tasks in computer vision. It combines object detection [37] and semantic segmentation [29] in a single task. The goal is to assign a semantic class and instance index to each image pixel. Panoptic segmentation finds its applications in a wide variety of fields such as autonomous driving [44], automated warehouses, smart agriculture [12], and remote sensing [9]. Many of these applications require efficient inference in order to support timely decisions. However, most of the current state of the art does not meet that requirement. State-of-the-art panoptic approaches [19,40,36,16] are usually based on high-capacity backbones in order to ensure a large receptive field which is required for accurate recognition in large-resolution images. Some of them are based on instance segmentation approaches and therefore require complex post-processing in order to fuse instance-level detections with pixel-level classification [19,40].
Unlike all previous approaches to panoptic segmentation, we propose to leverage multi-scale convolutional representations [11,34] in order to increase the receptive field, and to decrease pressure on backbone capacity through scale equivariance [21]. Put simply, coarse input resolution may provide a more appropriate view of large objects. Furthermore, a mix of features from different scales is both semantically rich and positionally accurate. This allows our model to achieve competitive recognition performance and real-time inference at megapixel resolutions.
Early work on joint semantic and instance segmentation evaluated model performance separately on the two tasks [43,10]. The field gained much attention with the emergence of a unified task called panoptic segmentation and the corresponding metric, panoptic quality [20]. This offered an easy transition path due to a straightforward upgrade of the annotations: panoptic segmentation labels can often be created by combining existing semantic and instance segmentation ground truth. Thus, many popular recognition datasets [7,27,31] were able to release panoptic supervision in a short amount of time, which facilitated further research.
Most of the recent panoptic segmentation methods fall under one of two categories: box-based (or top-down) and box-free (also known as bottom-up) methods. Box-based methods predict the bounding box of each instance and the corresponding segmentation mask [19,40,30]. Usually, semantic segmentation of stuff classes is predicted by a parallel branch of the model. The two branches have to be fused in order to produce panoptic predictions. Panoptic FPN [19] extends the popular Mask R-CNN [13] with a semantic segmentation branch dedicated to the stuff classes. Their fusion module favors instance predictions in pixels where both branches predict a valid class. Concurrent work [40] followed a similar extension idea, but proposed a different post-processing step. UPSNet [40] stacks all instance segmentation masks and stuff segmentation maps and applies the softmax to determine the instance index and semantic class in each pixel. EfficientPS [30] combines an EfficientNet backbone with custom upsampling and panoptic fusion. Although it improves efficiency with respect to [19,40], its inference speed is still far from real-time.
Box-free methods do not detect instances through bounding box regression [42,5,24,38]. Most of these methods group pixels into instances during a post-processing stage. DeeperLab [42] extends a typical semantic segmentation architecture [4] with a class-agnostic instance segmentation branch. This branch detects five keypoints per instance and groups pixels according to multiple range-dependent maps of offset vectors. Panoptic Deeplab [5] outputs a single heatmap of object centers and a dense map of offset vectors which associates each thing pixel with the centroid of the corresponding instance. Panoptic FCN [24] also detects object centers; however, it aggregates instance pixels through regressed per-instance kernels. Recent approaches [38,6,23] propose unified frameworks for all segmentation tasks, including panoptic segmentation. These frameworks include special transformer-based modules that recover mask-level embeddings and their discriminative predictions. Pixel embeddings are assigned to masks according to their similarity with respect to mask embeddings. This approach seems inappropriate for real-time processing of large images due to the large computational complexity of mask-level recognition.
Semantic segmentation of remote-sensing images is a popular and active line of research [41,28,45,25]. On the other hand, only a few papers consider panoptic segmentation of remote-sensing images [9,8,17]. Nevertheless, panoptic segmentation of remote-sensing images might support many applications, e.g. automatic compliance assessment of construction work with respect to urban legislation. Recently, the novel BSB Aerial dataset has been proposed [9]. The dataset collects more than 3000 aerial images of an urban area in Brasilia together with the accompanying panoptic ground truth. Their experiments involve Panoptic FPN [19] with ResNet-50 and ResNet-101 backbones [14]. In contrast, our method combines multi-scale ResNet-18 [14] features with pyramidal fusion [34], and delivers better PQ performance with significantly faster inference.
Multi-scale convolutional representations were introduced for semantic segmentation [11]. Further work [34] showed that they can outperform spatial pyramid pooling [46] in terms of effective receptive field. To the best of our knowledge, this is the first work that explores the suitability of multi-scale representations for panoptic segmentation. Similar to [5], our method relies on center and offset prediction for instance segmentation. However, our decoder is shared between the instance and semantic segmentation branches. Furthermore, it learns with respect to a boundary-aware objective and aggregates multi-scale features using pyramidal fusion in order to facilitate recognition of large objects.
The summarized contributions of this work are as follows: • To the best of our knowledge, this is the first study of multi-scale convolutional representations [11] for panoptic segmentation. We show that pyramidal fusion of multi-scale convolutional representations significantly improves panoptic segmentation of high-resolution images compared to the single-scale variant.
• We point out that panoptic performance can be significantly improved by training pixel-to-instance assignment through a boundary-aware learning objective.
• Our models outperform the previous state of the art in efficient panoptic segmentation [5] in terms of generalization performance, while offering around 60% faster inference. Our method achieves state-of-the-art panoptic performance on the BSB Aerial dataset while being at least 50% faster than previous work [9].

Method
Our panoptic segmentation method builds on scale-equivariant features [11], and cross-scale blending through pyramidal fusion [34]. Our models deliver three dense predictions [5]: semantic segmentation, instance centers and offsets for pixel-to-instance assignment. We attach all three heads to the same latent representation at 4× subsampled resolution. The regressed centers and offsets give rise to class-agnostic instances. Panoptic segmentation maps are recovered by fusing predictions from all three heads [5].

Upsampling multi-scale features through pyramidal fusion
We start from a ResNet-18 backbone [14] and apply it to a three-level image pyramid that consists of the original resolution as well as 2× and 4× downsampled resolutions. The resulting multi-scale representation only marginally increases the computational effort with respect to the single-scale baseline. On the other hand, it significantly increases the model's ability to recognize large objects in high-resolution images. The proposed upsampling path effectively increases the receptive field through cross-scale feature blending, also known as pyramidal fusion [34]. Pyramidal fusion has proved beneficial for semantic segmentation performance, and we hypothesize that the same holds for panoptic segmentation. In fact, a panoptic model needs to observe the whole instance in order to stand a chance of recovering the centroid in spite of occlusions and articulated motion. This is hard to achieve when targeting real-time inference on large images, since computational complexity is linear in the number of pixels. We conjecture that pyramidal fusion is an efficient way to increase the receptive field and enrich the features with wider context. Figure 1 illustrates the resulting architecture. Yellow trapezoids represent residual blocks of a typical backbone which operate on 4×, 8×, 16× and 32× subsampled resolutions. The trapezoid color indicates feature sharing: the same instance of the backbone is applied to each of the input images. Thus our architecture complements scale-equivariant feature extraction with scale-aware upsampling. Skip connections from the residual blocks are projected with a single 1 × 1 convolution (red squares) in order to match the number of channels in the upsampling path. Features from different pyramid levels are combined with elementwise addition (green circles). The upsampling path consists of five upsampling modules.
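The shared-backbone pyramid extraction can be sketched as follows. This is an illustrative PyTorch snippet, not the actual implementation: `TinyBackbone` is a hypothetical stand-in for ResNet-18, and the channel counts are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for ResNet-18: four stages that produce
    features at 4x, 8x, 16x and 32x subsampled resolution."""
    def __init__(self, ch=32):
        super().__init__()
        self.stem = nn.Conv2d(3, ch, 7, stride=4, padding=3)
        self.stages = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(3))

    def forward(self, x):
        feats = [F.relu(self.stem(x))]
        for stage in self.stages:
            feats.append(F.relu(stage(feats[-1])))
        return feats  # features at /4, /8, /16, /32

def extract_pyramid(backbone, image):
    """Apply the SAME backbone instance to 1x, 0.5x and 0.25x resolutions,
    so that feature extraction is shared (scale-equivariant) across scales."""
    outputs = []
    for scale in (1.0, 0.5, 0.25):
        x = image if scale == 1.0 else F.interpolate(
            image, scale_factor=scale, mode='bilinear', align_corners=False)
        outputs.append(backbone(x))  # parameters shared across pyramid levels
    return outputs

backbone = TinyBackbone()
pyramids = extract_pyramid(backbone, torch.randn(1, 3, 256, 256))
```

The key design choice is that the three pyramid levels reuse one set of backbone parameters, so the multi-scale representation costs extra computation but no extra capacity.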
Each module fuses features coming from the backbone via the skip connections with features from the previous upsampling stage. The fusion is performed through elementwise addition and a single 3 × 3 convolution. Subsequently, the fused features are upsampled by bilinear interpolation. The last upsampling module outputs a feature tensor which is four times subsampled w.r.t. the input resolution. This feature tensor represents a shared input to the three dense prediction heads that recover the following three tensors: semantic segmentation, center heatmap and offset vectors. Each head consists of a single 1 × 1 convolution and 4× bilinear upsampling. Such a design of the prediction heads is significantly faster than in Panoptic Deeplab [5], because it avoids inefficient depthwise-separable convolutions with large kernels on large-resolution inputs. Please note that each convolution in the above analysis actually corresponds to a BN-ReLU-Conv unit [15].
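A minimal PyTorch sketch of one such upsampling module, assuming the BN-ReLU-Conv ordering described above (module and argument names are ours, not the paper's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleModule(nn.Module):
    """One upsampling module: a 1x1-projected skip connection is added
    elementwise to the features from the previous stage, fused with a
    BN-ReLU-Conv 3x3 unit, and bilinearly upsampled 2x."""
    def __init__(self, skip_ch, ch=256):
        super().__init__()
        self.proj = nn.Conv2d(skip_ch, ch, 1)       # "red square" projection
        self.fuse = nn.Sequential(                  # BN-ReLU-Conv unit [15]
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, up, skip):
        x = self.fuse(up + self.proj(skip))         # elementwise addition
        return F.interpolate(x, scale_factor=2.0,
                             mode='bilinear', align_corners=False)

m = UpsampleModule(skip_ch=64)
out = m(torch.randn(1, 256, 16, 16), torch.randn(1, 64, 16, 16))
```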
The proposed upsampling path has to provide both instance-agnostic features for semantic segmentation as well as instance-aware features for panoptic regression. Consequently, we use 256 channels along the upsampling path, which corresponds to twice larger capacity than in efficient semantic models [34].
The proposed design differs from previous work, which advocates separate upsampling paths for semantic and instance-specific predictions [5].

Figure 1: Panoptic SwiftNet with three-way multi-scale feature extraction, pyramidal fusion, a common upsampling path, and three prediction heads. Yellow trapezoids denote residual blocks (RB). Red squares represent 1 × 1 convolutions that adjust the number of feature maps so that the skip connection can be added to the upsampling stream. Blue trapezoids represent upsampling modules. The gray rectangle represents the three prediction heads. Modules with the same color share parameters. Numbers above the lines (ND) specify the number of channels. Numbers below the lines (/N) specify the downsampling factor w.r.t. the original resolution of the input image.

Boundary-Aware Learning of Pixel-to-Instance Assignment
Regression of center offsets requires large prediction changes at instance boundaries. A displacement of only one pixel must distinguish between two completely different instance centers. We conjecture that such abrupt changes require a lot of capacity. Furthermore, it is intuitively clear that the difficulty of offset regression is inversely proportional to the distance from the instance boundary.
Hence, we propose to prioritize the pixels at instance boundaries by learning offset regression through a boundary-aware loss [34,47]. We divide each instance into four regions according to the distance from the boundary. Each of the four regions is assigned a different weight factor. The largest weight factor is assigned to the region closest to the border, and the weights diminish towards the interior of the instance. Consequently, we formulate the boundary-aware offset loss as:

L_BAOL = 1/(H·W) · Σ_{i=1}^{H} Σ_{j=1}^{W} w_{i,j} · ||O_{i,j} − O^{GT}_{i,j}||_1   (1)

In the above equation, H and W denote the height and width of the input image, O_{i,j} and O^{GT}_{i,j} the predicted and ground-truth offset vectors at location (i, j), and w_{i,j} the boundary-aware weight at the same location. Note that this is quite different from earlier works [34,47], which are all based on the focal loss [26]. To the best of our knowledge, this is the first use of boundary modulation in a learning objective based on L1 regression.
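A minimal numpy sketch of the boundary-aware offset loss described above, assuming offsets of shape (H, W, 2) and a precomputed per-pixel weight map (the function name is ours):

```python
import numpy as np

def boundary_aware_offset_loss(offsets, offsets_gt, weights):
    """Per-pixel L1 error between predicted and ground-truth offsets,
    modulated by boundary-aware weights (zero on stuff pixels) and
    normalized by the number of pixels H*W.
    offsets, offsets_gt: (H, W, 2) arrays; weights: (H, W) array."""
    l1 = np.abs(offsets - offsets_gt).sum(axis=-1)   # per-pixel L1 norm
    return float((weights * l1).sum() / weights.size)
```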

Compound learning objective
We train our models with a compound loss consisting of three components. The semantic segmentation loss L_SEM is expressed as the usual per-pixel cross entropy [29]. In Cityscapes experiments we additionally use online hard-pixel mining and consider only the top 20% of pixels with the largest loss [4]. The center regression loss L_CEN corresponds to the L2 loss between the predicted center heatmap and the ground-truth heatmap [5]. Such a learning objective is common in heatmap regression for keypoint detection [39,32]. The offset loss L_BAOL corresponds to the modulated L1 loss as described in (1). The three losses are modulated with hyper-parameters λ_SEM, λ_CEN, and λ_BAOL:

L = λ_SEM · L_SEM + λ_CEN · L_CEN + λ_BAOL · L_BAOL   (2)

We express these hyper-parameters relative to the contribution of the segmentation loss by setting λ_SEM = 1. We set λ_CEN = 200 in early experiments so that the contribution of the center loss becomes approximately equal to the contribution of the segmentation loss. We validate λ_BAOL = 0.0025 in early experiments on the Cityscapes dataset. Figure 2 (b) shows ground-truth targets for a single training example on the BSB Aerial dataset. The semantic segmentation labels associate each pixel with categorical semantic ground truth (top-left). Ground-truth offsets point towards the respective instance centers; notice the abrupt changes at instance-to-instance boundaries (top-right). The ground-truth instance-center heatmap is crafted by Gaussian convolution (σ = 5) of a binary image where ones correspond to instance centers (bottom-left). We craft the offset-weight ground truth by thresholding the distance transform [2] with respect to instance boundaries. Note that pixels of stuff classes do not contribute to the offset loss since the corresponding weights are set to zero (bottom-right).
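The ground-truth construction described above can be sketched with scipy. This is a hedged illustration: the distance thresholds and region weights below are illustrative defaults of ours, not the paper's validated values.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, gaussian_filter

def center_heatmap(centers, shape, sigma=5.0):
    """Gaussian convolution of a binary image of instance centers."""
    binary = np.zeros(shape, dtype=np.float64)
    for r, c in centers:
        binary[r, c] = 1.0
    return gaussian_filter(binary, sigma)

def offset_weights(instance_mask, thresholds=(2, 4, 8), weights=(8, 4, 2, 1)):
    """Boundary-aware weight map from the distance transform w.r.t. instance
    boundaries. Stuff pixels (mask == 0) get weight zero; weights diminish
    towards the instance interior."""
    dist = distance_transform_edt(instance_mask > 0)  # distance to boundary
    w = np.zeros_like(dist)
    inside = instance_mask > 0
    w[inside] = weights[-1]                           # innermost region
    for t, wt in zip(thresholds[::-1], weights[-2::-1]):
        w[inside & (dist <= t)] = wt                  # overwrite towards border
    return w
```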

Recovering panoptic predictions
We recover panoptic predictions through post-processing of the three model outputs as in Panoptic Deeplab [5]. We first recover instance centers by non-maximal suppression of the center heatmap. Second, each thing pixel is assigned the closest instance center according to the corresponding displacement from the offset map. Third, each instance is assigned a semantic class by taking arg-max over the corresponding semantic histogram. This voting process presents an opportunity to improve the original semantic predictions, as demonstrated in Figure 4 (col. 1). Finally, the panoptic map is constructed by associating instance indices from step 2 with aggregated semantic evidence from step 3.
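A hedged numpy sketch of steps two to four above (center detection in step one is omitted; all names and array shapes are our assumptions, not the paper's implementation):

```python
import numpy as np

def assemble_panoptic(semseg, centers, offsets, thing_mask):
    """semseg: (H, W) class ids; centers: (K, 2) detected instance centers;
    offsets: (H, W, 2) displacements towards the instance center;
    thing_mask: (H, W) bool. Returns per-pixel (semantic, instance_id)."""
    H, W = semseg.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # step 2: assign each thing pixel to the closest predicted center
    pointed = np.stack([ys + offsets[..., 0], xs + offsets[..., 1]], -1)
    d = np.linalg.norm(pointed[..., None, :] - centers[None, None], axis=-1)
    inst = np.where(thing_mask, d.argmin(-1) + 1, 0)   # 0 denotes stuff
    # step 3: majority vote over semantic predictions inside each instance
    sem = semseg.copy()
    for k in range(1, len(centers) + 1):
        m = inst == k
        if m.any():
            sem[m] = np.bincount(semseg[m]).argmax()
    # step 4: the pair (sem, inst) constitutes the panoptic map
    return sem, inst
```

Note how the majority vote in step three can overrule locally incorrect semantic predictions, which is the improvement opportunity mentioned above.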
We greatly improve the execution speed of our Python implementation by implementing custom CUDA modules for steps 3 and 4. Still, the resulting CUDA implementation requires approximately the same time as optimized TensorRT inference in our experiments on Jetson AGX. This suggests that further improvement of the inference speed may require rethinking of the post-processing step.

Experiments
We consider multi-scale models with pyramidal fusion based on ResNet-18 [14]. We evaluate the performance of our panoptic models with standard metrics [20]: panoptic quality (PQ), segmentation quality (SQ), and recognition quality (RQ). In some experiments, we also evaluate our panoptic models on semantic and instance segmentation tasks, and show the corresponding performance in terms of mean intersection over union (mIoU) and average precision (AP). For completeness, we now briefly review the used metrics.
Mean intersection over union (mIoU) measures the quality of semantic segmentation. Let P_c and G_c denote all pixels that are predicted and labeled as class c, respectively. Then, equation (3) defines IoU_c as the ratio between the intersection and the union of P_c and G_c:

IoU_c = |P_c ∩ G_c| / |P_c ∪ G_c|   (3)

This ratio is also known as the Jaccard index. Finally, the mIoU metric is simply the average of IoU_c over all classes in C.
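For concreteness, a minimal numpy implementation of the mIoU metric (function and argument names are ours):

```python
import numpy as np

def miou(pred, gt, num_classes):
    """Mean IoU: per-class Jaccard index averaged over classes.
    Classes absent from both prediction and ground truth are skipped."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = (p | g).sum()
        if union:
            ious.append((p & g).sum() / union)
    return float(np.mean(ious))
```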
Panoptic quality (PQ) measures the similarity between the predicted and the ground-truth panoptic segments [20]. Each panoptic segment collects all pixels labeled with the same semantic class and instance index. As with intersection over union, PQ is first computed for each class separately, and then averaged over all classes. In order to compute the panoptic quality, we first need to match each ground-truth segment g with a predicted segment p. The segments are matched if they overlap with IoU > 0.5. Matched predicted segments are considered true positives (TP), unmatched predicted segments false positives (FP), and unmatched ground-truth segments false negatives (FN). Equation (4) shows that PQ is proportional to the average IoU of true positives, and inversely proportional to the number of false positives and false negatives:

PQ = Σ_{(p,g)∈TP} IoU(p, g) / (|TP| + ½|FP| + ½|FN|)   (4)
PQ can be further factorized into segmentation quality (SQ) and recognition quality (RQ) [20]. This can be achieved by multiplying and dividing eq. (4) by the number of true positives |TP|:

SQ = Σ_{(p,g)∈TP} IoU(p, g) / |TP|,   RQ = |TP| / (|TP| + ½|FP| + ½|FN|),   PQ = SQ · RQ

Finally, we briefly recap the average precision (AP) for instance segmentation. In recent years, this name has become a synonym for COCO mean average precision [27]. This metric averages the traditional AP over all classes and 10 different IoU thresholds (from 0.5 to 0.95). In order to compute the traditional AP, each instance prediction needs to be associated with a prediction score. The AP measures the quality of the ranking induced by the prediction score. It is computed as the area under the precision-recall curve, which is obtained by considering all possible thresholds of the prediction score. A prediction is considered a true positive if it overlaps some ground-truth instance by more than the selected IoU threshold.
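A minimal sketch of the per-class PQ computation and its SQ/RQ factorization, given precomputed segment matches (names are ours; segment matching at IoU > 0.5 is assumed to have been done beforehand):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ, SQ and RQ for one class. `matched_ious` holds the IoU of each
    matched (IoU > 0.5) prediction/ground-truth pair, i.e. the true
    positives; num_fp / num_fn count unmatched predictions / ground truth."""
    tp = len(matched_ious)
    if tp + num_fp + num_fn == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0       # mean IoU of TPs
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)     # F1-like recognition term
    return sq * rq, sq, rq
```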
Our extensive experimental study considers four different datasets. We first present our results on the BSB Aerial dataset [9], which collects aerial images. Then, we evaluate our models on two road-driving datasets with high-resolution images: Cityscapes [7] and Mapillary Vistas [31]. Finally, we present an evaluation on the COCO dataset [27], which gathers a very large number of medium-resolution images from personal collections. Moreover, we provide a validation study of our panoptic model on Cityscapes. We chose Cityscapes for this study because of its appropriate size and high-resolution images, which provide a convenient challenge for our pyramidal fusion design. Lastly, we measure the inference speed of our models on different GPUs with and without TensorRT optimization.

BSB Aerial dataset
The BSB Aerial dataset contains 3000 training, 200 validation and 200 test images of urban areas in Brasilia, Brazil. All images have the same resolution of 512 × 512 pixels and are densely labeled with 3 stuff (street, permeable area, and lake) and 11 thing classes (swimming pool, harbor, vehicle, boat, sports court, soccer field, commercial building, commercial building block, residential building, house, and small construction). The highest number of pixels is annotated as 'permeable area' because this class considers different types of natural soil and vegetation.
We train our models for 120000 iterations on batches consisting of 24 random crops with resolution 512 × 512 pixels. We use the ADAM [18] optimizer with a base learning rate of 3·10^−4, which we polynomially decay to 10^−7. We augment input images with random scale jitter and horizontal flipping. We also validate image rotation for data augmentation and present the outcomes in Table 2. Table 1 compares our experimental performance on the BSB Aerial dataset with the related work. Our model based on ResNet-18 outperforms Panoptic FPN with stronger backbones by a large margin. We note improvements over Panoptic FPN with ResNet-50 and ResNet-101 of 5.7 and 4.2 PQ points on the validation data, and 5.9 and 3.1 PQ points on the test data. Besides being more accurate, our model is also significantly faster. In fact, our model is 1.5 times faster than untrained Panoptic-FPN-ResNet-50. Note that we estimate the inference speed of Panoptic FPN with a randomly initialized model which detects less than one instance per image.
Image rotation is rarely used for data augmentation on typical road-driving datasets. It makes little sense to encourage the model's robustness to rotation when the camera pose is nearly fixed and consistent across all images from both train and validation subsets. However, it seems that rotation in aerial images could simulate other possible poses of the acquisition platform, and thus favor better generalization on test data. Table 2 validates this hypothesis on the BSB Aerial dataset. We train our model in three different setups. The first setup does not use rotation. The second rotates each training example by a randomly chosen angle from the set {0°, 90°, 180°, 270°}. The third setup rotates each training example by an angle sampled uniformly from [0°, 360°). Somewhat surprisingly, the results show that rotated inputs decrease the generalization ability on BSB val. The effect is even stronger when we sample arbitrary rotations. Visual inspection suggests that this happens due to the constrained acquisition process. In particular, we have noticed that in all inspected images the shadows point in roughly the same direction. We hypothesize that the model learns to use this fact as a kind of orientation landmark for offset prediction. This is similar to road-driving scenarios, where models usually learn the bias that the sky covers the top part of the image while the road covers the bottom. Clearly, rotation of the training examples prevents the model from learning these biases, which can hurt the performance on the validation set. The experiments support this hypothesis because the deterioration is much smaller in semantic than in panoptic performance. We remind the reader that, unlike panoptic segmentation, semantic segmentation does not require offset predictions. Figure 3 shows the predictions of our model on four scenes from BSB Aerial val.
Rows show: input image, semantic segmentation, centers heatmap, offset directions, and panoptic segmentation. Panoptic maps designate instances with different shades of the color of the corresponding class.
The presented scenes illustrate the diversity of the BSB dataset. We can notice a large green area near the lake (col. 1), but also an urban area with streets and houses (col. 4). There is great variability in object size. For example, some cars in column 2 are only 10 pixels wide, while the building in column 3 is nearly 400 pixels wide. Our model deals with this variety quite well. We correctly detect and segment most of the cars, as well as larger objects such as buildings. However, in the first column we notice that most of the soccer field in the top-right part of the image is mistakenly segmented as permeable area. Interestingly, center detection and offset prediction seem quite accurate, but semantic segmentation failed to discriminate between the two classes. This is a fairly common and reasonable mistake because the two classes are visually similar. In the last column, we notice that the road segments are not connected. However, this is caused by the labeling policy, which considers trees above the road as permeable area or unlabeled.

Cityscapes
The Cityscapes dataset contains 2975 training, 500 validation and 1525 test images. All images are densely labeled. We train the models for 90000 iterations with the ADAM optimizer on 1024 × 2048 crops with batch size 8. This can be carried out on 2 GPUs, each with 20 GiB of RAM. We augment the crops through horizontal flipping and random scaling with a factor between 0.5 and 2.0. As before, we polynomially decay the learning rate, here from 5·10^−5 to 10^−7. Table 3 compares our panoptic performance and inference speed with other methods from the literature. Panoptic SwiftNet based on ResNet-18 (PSN-RN18) delivers competitive panoptic performance with respect to models with more capacity, while being significantly faster. In comparison with Panoptic Deeplab [5] based on MobileNet v2, our method is more accurate in all three subtasks and also 61% faster when measured in the same runtime environment. We observe that methods based on Mask R-CNN [36,19] achieve poor inference speed and non-competitive semantic segmentation performance in spite of larger backbones.

COCO
The COCO dataset [27] contains digital photographs collected by Flickr users. Images are annotated with 133 semantic categories [3]. The standard split proposes 118K images for training, 5K for validation and 20K for testing. We train for 200K iterations with the ADAM optimizer on 640 × 640 crops with batch size 48. This can be carried out on 2 GPUs, each with 20 GiB of RAM. We set the learning rate to 4·10^−4 and use polynomial decay. Table 4 presents our results on the COCO validation set. We compare our performance with previous work by evaluating 640 × 640 crops. We achieve the best accuracy among the methods with real-time execution speed. Although designed for large-resolution images, pyramidal fusion performs competitively even on COCO images with a median resolution of 640 × 480 pixels.

Mapillary Vistas
The Mapillary Vistas dataset [31] collects high-resolution images of road-driving scenes taken under a wide variety of conditions. It contains 18K train, 2K validation and 5K test images densely labeled with 65 semantic categories. We train for 200K iterations with the ADAM optimizer. We increase the training speed with respect to the Cityscapes configuration by reducing the crop size to 1024 × 1024 and increasing the batch size to 16. We compose the training batches by over-sampling crops of rare classes. During evaluation we resize the input image so that the longer side equals 2048 pixels while maintaining the original aspect ratio [5]. Similarly, during training we randomly resize the input image so that the mean resolution matches the one used in evaluation. Table 5 presents the performance evaluation on Vistas val. Our model achieves comparable accuracy w.r.t. the literature, while being much faster. These models are slower than their Cityscapes counterparts from Table 3 for two reasons. First, the average resolution of Vistas images in our training and evaluation experiments is 1536 × 2048 while the Cityscapes resolution is 1024 × 2048. Second, these models also have slower classification and post-processing steps due to the larger numbers of classes and instances. Figure 4 shows qualitative results on three scenes from the validation sets of Cityscapes, Mapillary Vistas and COCO. The rows show: input image, semantic segmentation, center heatmap, offset directions and panoptic segmentation. Column 1 shows a Cityscapes scene with a large truck on the left. Semantic segmentation mistakenly recognizes a blob of pixels in the bottom-left as class bus instead of class truck. However, the panoptic map shows that the post-processing step succeeds in correcting the mistake since the correct predictions outvote the incorrect ones. Hence, the instance segmentation of the truck is completely correct. Column 2 shows a scene from Mapillary Vistas.
We observe that our model correctly differentiates all distinct instances. Column 3 shows a scene from COCO val. Looking at the offset predictions, we notice that the model hallucinates an object in the top-right part. However, this does not affect the panoptic segmentation because that part of the scene is classified as sky (a stuff class) in semantic segmentation, so the offset predictions in these pixels are not even considered. Figure 5 compares our model with Panoptic FPN on one scene each from COCO and Cityscapes. The columns present the input image alongside the corresponding panoptic predictions. We zoom in on circular image regions where the differences between the two models are most significant. The figure reveals that our model produces more accurate instance masks on larger objects. Conversely, Panoptic FPN often misclassifies boundary pixels as background instead of as part of the instance. This is likely due to generating instance segmentation masks on a small, fixed-size grid [13]. In contrast, our pixel-to-instance assignments are trained at fine resolution with a boundary-aware objective. We also observe that our model sometimes merges clusters of small instances in the distance. We believe this is because of the center-based object detection and the corresponding non-maximum suppression. Box-based approaches perform better in such scenarios, which is consistent with the AP evaluation in Table 3.

Validation study on the Cityscapes dataset
This study quantifies the contribution of pyramidal fusion and the boundary-aware offset loss to panoptic segmentation accuracy. Table 6 compares pyramidal fusion with spatial pyramid pooling [46,22] as alternatives for providing global context. We train three separate models with 1, 2, and 3 levels of pyramidal fusion, as well as a single-scale model with spatial pyramid pooling of 32× subsampled features [22]. We observe that spatial pyramid pooling improves AP and mIoU by almost 4 percentage points (cf. rows 1 and 2). This indicates that standard models are unable to capture global context due to an undersized receptive field. This is likely exacerbated by the fact that we initialize with parameters obtained by training on 224 × 224 ImageNet images. We note that two-level pyramidal fusion outperforms the SPP model across all metrics. Three-level pyramidal fusion achieves further improvements across all metrics: PYRx3 outperforms PYRx2 by 1 PQ point, 2.8 AP points, and 1.1 mIoU points. Table 7 explores the upper performance bounds w.r.t. particular model outputs. These experiments indicate that semantic segmentation represents the most important challenge towards accurate panoptic segmentation. Perfect semantic segmentation improves panoptic quality by roughly 20 points. However, we believe that rapid progress in semantic segmentation accuracy is unlikely, given the wide popularity of the problem. When comparing oracle centers and oracle offsets, we observe that the latter brings significantly larger improvements: 3 PQ and 10 AP points over the regular model. In early experiments, most of the offset errors were located at instance boundaries, which motivated the boundary-aware offset loss (1). Table 8 explores the influence of the boundary-aware offset loss (1) on PSN-RN18 accuracy. We consider three variants based on the number of regions that divide each instance.
In each variant, we set the largest weight to 8 for the region closest to the border, and then gradually reduce it to 1 for the most distant region. Note that we set the overall weight of the offset loss to λ = 0.01 when we do not use the boundary-aware formulation (1). We observe that the boundary-aware loss with four regions per instance brings noticeable improvements across all metrics. The largest improvement is in instance segmentation performance, which increases by 1.1 AP points. Figure 6 evaluates the speed of our optimized models across different input resolutions. All models have been run from Python within the TensorRT execution engine. We have used TensorRT to optimize the same model for two graphics cards and two precisions (FP16 and FP32). Note, however, that these optimizations involve only the network inference, while the post-processing step is executed with a combination of PyTorch [35] and cupy [33].

Inference speed after TensorRT optimization
Interestingly, FP16 brings an almost 2× improvement across all experiments, although the RTX3090 declares the same peak performance for FP16 and FP32. This suggests that our performance bottleneck is more likely memory bandwidth than computing power.

Figure 6: Inference speed of a three-level PSN-RN18 across different input resolutions for two graphics cards and two precisions. All configurations involve the same model after optimizing it with TensorRT. We start the clock when an image is synchronized in CUDA memory and stop it when the post-processing is completed. We capture realistic post-processing times by averaging all datapoints over all images from Cityscapes val, which accounts for the dependence of the post-processing time on scene complexity.

The figure shows that our model achieves real-time inference even on high-resolution 2MPx images. Our PSN-RN18 model achieves more than 100 FPS at 1MPx resolution with FP16 on an RTX3090.

Conclusion
We have proposed a novel panoptic architecture based on multi-scale features, cross-scale blending, and bottom-up instance recognition. Ablation experiments indicate clear advantages of multi-scale architectures and boundary-aware learning for panoptic performance. Experiments with oracle components suggest that semantic segmentation represents a critical ingredient of panoptic performance. Panoptic SwiftNet-RN18 achieves state-of-the-art generalization performance and the fastest inference on the BSB Aerial dataset. The proposed method is especially appropriate for remote sensing applications as it efficiently processes very high-resolution inputs. It also achieves state-of-the-art performance on Cityscapes, as well as competitive performance on Vistas and COCO among all models aiming at real-time inference. Source code will be publicly available upon acceptance.