LiDAR-Based Real-Time Panoptic Segmentation via Spatiotemporal Sequential Data Fusion

Abstract: Fast and accurate semantic scene understanding is essential for mobile robots to operate in complex environments. An emerging research topic, panoptic segmentation, serves such a purpose by performing the tasks of semantic segmentation and instance segmentation in a unified framework. To improve the performance of LiDAR-based real-time panoptic segmentation, this study proposes a spatiotemporal sequential data fusion strategy that fuses points in "thing classes" based on accurate data statistics. The data fusion strategy increases the proportion of valuable data in unbalanced datasets and thus mitigates the adverse impact of class imbalance in the limited training data. Subsequently, by improving the codec network, the multiscale features shared by the semantic and instance branches are efficiently aggregated to achieve accurate panoptic segmentation for each LiDAR scan. Experiments on the publicly available SemanticKITTI dataset showed that our approach achieves an effective balance between accuracy and efficiency and that it is also applicable to other point cloud segmentation tasks.


Introduction
A prerequisite for an intelligent robot to perform an assigned task efficiently is an accurate "understanding" of the working environment and the expected impact. Achieving such understanding involves investigating a series of theoretical and technical issues, such as environmental perception, environment representation, and spatial inference of intelligent machines, which are key common technologies for the new generation of artificial intelligence (AI) and open up a new research territory for the surveying and mapping community [1]. Due to its fundamental role in the enabling technologies, environmental perception has attracted much attention in recent years. With the rapid development of AI and robotics, such abilities need to be transformed from processing two-dimensional space-time, static past time, and abstract and abbreviated symbolic expression to processing three-dimensional space-time, dynamic present time, and fine and rich three-dimensional reproduction of realistic scenes [2]. Furthermore, the revolution of deep learning for robotic vision and the availability of large-scale benchmark datasets have advanced research on the key capabilities of environmental perception, such as semantic segmentation [3], instance segmentation [4], object detection and multi-object tracking [5]. However, most research focuses on category/object-wise improvement for the individual tasks (e.g., reasoning of a single category in semantic segmentation, recognition of an individual object in instance segmentation), which falls short of the practical need to provide a holistic environment understanding for intelligent robots. Therefore, many researchers aim to bridge the gap with panoptic segmentation and multi-object detection and tracking.
Panoptic segmentation was originally proposed by combining semantic segmentation with instance segmentation in a unified framework [6], which aims to unify point-wise semantic annotation of "stuff classes" (background classes) and instance ID annotation of "thing classes" (foreground classes).
Though panoptic segmentation has garnered extensive attention in the image domain, it is still in its infancy for point cloud processing. Depending on the scene and the sensor, point cloud panoptic segmentation methods can be categorized into indoor RGB-D-based methods and outdoor LiDAR-based methods. For indoor scenes, some progress has been made in terms of algorithms and benchmark datasets for panoptic segmentation of the dense and homogeneous point clouds obtained from RGB-D sensors. The typical approach is to voxelize the point clouds and then extract or learn features directly point by point for clustering. However, on the one hand, it is difficult to apply the idea of indoor RGB-D-based methods to outdoor tasks because of the sparsity, disorder, and uneven distribution of LiDAR point clouds in outdoor scenes. On the other hand, the small number of LiDAR point cloud datasets with accurate annotation information has constrained otherwise flourishing research on point cloud panoptic segmentation in outdoor scenes. Since LiDAR is less susceptible than vision sensors to light and weather conditions, it is now extensively used in environmental perception applications, such as robotic mapping, autonomous driving, 3D reconstruction, and other areas [3]. To pave the way for research on LiDAR-based scene understanding, Behley et al. introduced the SemanticKITTI dataset [7], which provides point-wise annotation of each LiDAR scan in the KITTI dataset. The authors also set up benchmark tasks, such as semantic segmentation, panoptic segmentation, semantic scene completion, and dynamic object segmentation. For clarification, we illustrate panoptic segmentation as well as related tasks using sample data (sequence 08) from SemanticKITTI in Figure 1. Currently, panoptic segmentation of LiDAR point clouds is achieved by either voxelizing the point clouds (with cubes or cylinders) for 3D convolution [8] or downscaling the point clouds into images for 2D convolution [9].
Voxelization usually yields high accuracy, but its intensive memory consumption and expensive computation cost make it impractical for real-time applications [10]. Downscaled point cloud projections represent each frame as a regular image of the same scale, and a variety of mature 2D image convolution methods are available for panoptic segmentation [11]. Although it is feasible for real-time data processing, the downscaled projection method inevitably loses some information and thus needs accuracy improvement.
Figure 1. Visualization of typical LiDAR-based outdoor scene understanding tasks: (a) raw scan using viridis as a colormap, (b) semantic segmentation of environmental elements for category-level understanding, (c) instance segmentation of environmental elements in specific categories for object-level understanding, (d) panoptic segmentation for comprehensive understanding of the environmental element categories and specific category entities, and (e) legends for visualizing segmentation results; the segmentation results of the stuff classes below also follow this legend. Note that the number of entities varies in each scan, and thus the corresponding color for each entity in the thing classes is randomly generated. The color coding of thing classes in instance segmentation and panoptic segmentation does not apply to the remaining sections.
In general, recently proposed LiDAR-based panoptic segmentation methods suffer from either low efficiency (resulting from the high computational complexity of point cloud processing) or low accuracy (due to downscaling point clouds to achieve real-time performance), so efficiency and accuracy appear difficult to balance. However, applications of LiDAR, such as autonomous driving and mobile mapping, require real-time processing of point cloud data. The characteristics of LiDAR point clouds, the practical need for real-time processing, and the accuracy requirements for scene understanding motivate the research on real-time panoptic segmentation of LiDAR point clouds in this study. In other words, our goal is to improve the accuracy of panoptic segmentation while meeting real-time requirements. Specifically, considering that the LiDAR point cloud is sparse, susceptible to occlusion, and distance-dependent (i.e., dense at close range and sparse at far range), this study proposes a spatiotemporal sequential data fusion strategy based on "thing classes" point fusion, which yields high-density data for "thing classes" objects. To meet the real-time requirements, the point cloud was downscaled and projected as a 2D bird's-eye view (BEV) image and represented in polar coordinates instead of cartesian coordinates. Furthermore, multiscale features were efficiently extracted and aggregated by improving the codec network with UNet as the backbone; the resulting network is called Polar-UNet3+. Within the proposed network, the semantic and instance branches shared the features for semantic and instance predictions, respectively. Finally, the semantic segmentation predictions were fused with the instance segmentation predictions to remove the ambiguity in point-wise segmentation, thereby yielding the panoptic segmentation prediction for a single scan.
As for the choice of benchmark dataset, the nuScenes dataset [12] does not provide ground truths for panoptic segmentation; thus, the experimental evaluation was performed only with the SemanticKITTI dataset. The experimental results show that our method achieves an effective balance between accuracy and time efficiency.
The major contributions of this study are as follows: (1) A data fusion strategy based on "thing classes" points was proposed, which increased the proportion of valuable data in the unbalanced datasets. It could be combined with random sampling to balance the size of input data for training. This data fusion method could alleviate the adverse impact caused by the uneven category distribution of a point cloud. The experiments showed that this pattern was also applicable to other LiDAR point cloud segmentation tasks. (2) An improved UNet based on polar coordinate representation was used as a strong shared backbone, which could effectively aggregate shallow and deep features at full scales. The polar coordinate representation could minimize the negative impact caused by the inherent "the farther, the sparser" feature of LiDAR point clouds.
The remainder of this paper is organized as follows: Section 2 presents a review of related work. Section 3 elaborates the details of our method. The experimental results on the SemanticKITTI dataset are presented in Section 4. Finally, Section 5 gives a brief conclusion and outlook.

Related Work
Point cloud segmentation is fundamental to scene understanding, which requires an understanding of both global geometry and a combination of fine-grained features at each point to enable point-wise segmentation. This section presents research work related to semantic, instance, and panoptic segmentation for scene understanding with varying focuses, namely category prediction of environmental elements, object recognition of environmental elements, and an overall understanding of the environment, respectively.

Semantic Segmentation of Point Clouds
Early semantic segmentation of point clouds usually employed methods such as support vector machines, conditional random fields, or random forests, which offered limited semantic annotation classes and low accuracy rates. Deep learning, with its powerful learning capabilities at the feature level, has progressively become the dominant approach for semantic segmentation of point clouds. In terms of application scenarios, point cloud semantic segmentation can be divided into indoor and outdoor applications; segmentation methods can be divided into four types: point convolution, image convolution, voxel convolution, and graph convolution [13]. Point clouds of an indoor scene are usually acquired by RGB-D sensors and characterized by limited spatial coverage, dense data points, and evenly distributed point clouds. For indoor scenes, PointNet [14] and PointNet++ [15], the most representative point convolution methods, extract local feature information for semantic segmentation through multiscale domain information of points. Researchers in [16] voxelized the point cloud to achieve indoor semantic segmentation by extracting aggregated voxel features. In [17], the point cloud was represented as a set of interconnected simple shapes with hyper points, and directed graphs with attributes were used to capture the results with contextual information for semantic segmentation. For outdoor scenes, image convolution methods are more often used when real-time performance is considered. This category of methods downscales the 3D point cloud data by projecting them onto a 2D image plane, which is semantically segmented and then projected back onto the 3D data. In [18], images of the point cloud data were captured from all angles, and the 3D data were reduced in a multi-view format, for which the difficulty lies in selecting the viewpoint.
Remote Sens. 2022, 14, 1775
Some studies [19][20][21] subjected the point cloud data to spherical projection to yield range images for semantic segmentation. PolarNet [22] projected the point cloud as a BEV and employed circular convolution to achieve semantic segmentation. When real-time performance is not required, voxel convolution methods dominate; they rasterize the point cloud and convolve it in the form of voxels to address the effects of disorder in the point cloud data. 3DCNN-DQN-RNN [23] and VolMap [24] were early works, but their performance was limited by the choice of voxel size and thus needs to be improved. Cylinder3D [25] considered the data distribution characteristics of the LiDAR point cloud, subjected the point cloud to cylindrical partitioning, and applied 3D sparse convolution to yield impressive segmentation performance. Nevertheless, the "the farther, the sparser" characteristic of LiDAR point clouds is still a challenge for voxel convolution methods.

Point Cloud Instance Segmentation
Similar to image instance segmentation methods, point cloud instance segmentation methods can be divided into two groups: proposal-based methods and proposal-free methods. Proposal-based methods first perform 3D object detection and then generate instance mask predictions for every object. Three-dimensional semantic instance segmentation (3D-SIS) [26] learns color and geometry features from RGB-D scans to achieve indoor semantic instance segmentation. Generative shape proposal networks (GSPNs) [27] develop a shape-aware module by enforcing geometric understanding to generate proposals. 3D-BoNet [28] formulates the bounding box generation task as an optimization problem and uses two independent branches to generate proposals and mask predictions. Proposal-based methods require multistage training and encounter the challenge of redundant proposals, which makes it difficult to obtain real-time instance segmentation results.
Proposal-free methods usually regard instance segmentation as a clustering task based on semantic segmentation, and the research efforts focus on feature learning and clustering of instance points. The similarity group proposal network (SGPN) [29] is a pioneering development in which highly similar points are clustered by constructing a similarity matrix for the learned point features. A variety of instance segmentation methods based on similarity assessment have emerged immediately after that report. Reference [30] described 2D-3D hybrid feature learning based on a global representation of 2D BEVs combined with local point clouds and clustered instance points by using the mean shift algorithm. PointGroup [31] uses 3D sparse convolution to extract semantic information and guides instance generation through the offset prediction branch. As a subsequent clustering step to semantic segmentation, proposal-free methods are usually not computationally intensive, but the accuracy of their instance segmentation is limited by the performance of semantic feature extraction and the effect of sparse point clustering. At this stage, the challenge of the proposal-free methods is to improve the accuracy of instance partitioning while maintaining the efficiency of partitioning.

Point Cloud Panoptic Segmentation
As for feature extraction, point cloud panoptic segmentation puts semantic segmentation and instance segmentation under a unified framework in accordance with the four segmentation methods (i.e., point convolution, image convolution, voxel convolution, and graph convolution), and the proposal-based or proposal-free methods are used for instance point inference. Building on existing approaches of semantic and instance segmentation, research on panoptic segmentation focuses on aspects such as the disambiguation between semantic and instance predictions and the efficient clustering of instance points. Multiobject panoptic tracking (MOPT) [32] and EfficientLPS [33] downscale point cloud data into distance images for feature extraction and fuse semantic predictions with instance prediction results based on confidence levels to remove point-by-point prediction ambiguities. Panoptic-DeepLab [34] clusters neighboring points according to a prediction of the regression centers of the instances. DSNet [8] incorporates a learnable dynamic shift module based on Cylinder3D [25] to select different bandwidths for clustering sparse instance points. Panoptic-PolarNet is based on PolarNet [22] to downscale point clouds to BEV images to predict regression centers and offsets for real-time panoptic segmentation. Based on KPConv [35], 4D-PLS [5] achieves panoptic segmentation and target tracking by fusing multi-frame point clouds and using a density-based clustering method.

Method
Our work was inspired by Panoptic-PolarNet [9] and aimed to achieve improvement from four perspectives: The "thing classes" point data were fused in specific spatial and temporal domains; a robust shared backbone network, namely, Polar-UNet3+, was created by improving the codec network; a parallel semantic segmentation branch and instance segmentation branch were used to generate separate predictions; and finally, the semantic segmentation predictions and instance segmentation predictions were merged to yield the panoptic segmentation results. This section describes in detail the modules and the workflow of our method.

Overview
Our method draws on the panoptic segmentation process in Panoptic-PolarNet [9], and the design of each component was inspired by various distinguished methods, which resulted in the methodological framework shown in Figure 2. The spatiotemporal sequential data fusion module enabled the fusion of "thing classes" points in a specific spatiotemporal domain. This module worked with the random sampling module to reduce the amount of data input to the backbone network. The cylindrical partitioning module transformed the point cloud data from a cartesian coordinate representation into a polar coordinate representation and combined the point-wise features obtained through multilayer perceptron (MLP) processing with cylindrical voxelized partitioning to generate the cylindrical partitioning features. The backbone network module obtained a robust backbone network, Polar-UNet3+, by using full-scale skip connections, and the semantic segmentation branch shared the features extracted from the backbone network with the instance segmentation branch to generate predictions. The prediction fusion module fused the panoptic segmentation results using majority voting to eliminate the semantic segmentation ambiguity within instances caused by point-wise segmentation.
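The majority-voting step in the prediction fusion module can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fuse_by_majority_vote` and the convention that instance ID 0 means "no instance" are assumptions made only for this example. Each instance takes the semantic label that most of its points were given, which removes conflicting point-wise labels inside one instance.

```python
from collections import Counter


def fuse_by_majority_vote(semantic, instance_ids):
    """Assign every point of an instance the most common semantic label in it.

    semantic     -- per-point semantic class predictions
    instance_ids -- per-point instance IDs; 0 means "no instance"
                    (this convention is an assumption of the sketch)
    Returns a list of (semantic_label, instance_id) pairs, one per point.
    """
    # Collect the semantic votes cast by the points of each instance.
    votes = {}
    for label, inst in zip(semantic, instance_ids):
        if inst != 0:
            votes.setdefault(inst, Counter())[label] += 1

    # The winning label of an instance is its most frequent semantic label.
    winner = {inst: c.most_common(1)[0][0] for inst, c in votes.items()}

    # Points without an instance keep their own semantic prediction.
    return [
        (winner[inst], inst) if inst != 0 else (label, 0)
        for label, inst in zip(semantic, instance_ids)
    ]


# Toy predictions (hypothetical values): one point of instance 1 was
# labelled "truck" although most of the instance was labelled "car".
semantic_pred = ["car", "car", "truck", "road", "person"]
instance_pred = [1, 1, 1, 0, 2]
panoptic = fuse_by_majority_vote(semantic_pred, instance_pred)
```

In this toy input the third point is relabelled "car", while the "road" point, which belongs to no instance, is left unchanged.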

Figure 2. Overview of our proposed method. We first aligned "thing classes" point clouds together within the temporal window. After fusing them with the current scan, random sampling was performed as input (Section 3.2). Then, a network called Polar-UNet3+ (Section 3.4) was introduced, a strong backbone used for semantic segmentation and instance detection based on cylinder partitioning (Section 3.3). Finally, the panoptic prediction was obtained by fusing the above predictions.
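The cylinder partitioning referred to in the overview can be sketched as a mapping from cartesian points to cells of a polar grid. This is an illustrative sketch only: the function names, the grid resolution, and the range limits (`r_max`, `z_min`, `z_max`, `grid`) are hypothetical values chosen for the example and are not taken from the paper.

```python
import math
from collections import defaultdict


def polar_cell(x, y, z, r_max=50.0, z_min=-3.0, z_max=1.5, grid=(480, 360, 32)):
    """Map a cartesian point to (radial, angular, height) cell indices.

    All default ranges and the grid size are illustrative assumptions.
    Returns None when the point lies outside the partitioned volume.
    """
    n_r, n_theta, n_z = grid
    r = math.hypot(x, y)          # distance from the sensor in the ground plane
    theta = math.atan2(y, x)      # azimuth in [-pi, pi]
    if r >= r_max or not (z_min <= z < z_max):
        return None
    i_r = int(r * n_r / r_max)
    # Shift the azimuth to [0, 2*pi] and wrap the upper edge back to cell 0.
    i_theta = int((theta + math.pi) / (2 * math.pi) * n_theta) % n_theta
    i_z = int((z - z_min) * n_z / (z_max - z_min))
    return i_r, i_theta, i_z


def partition(points, **grid_kwargs):
    """Group point indices by the polar cell they fall into."""
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        cell = polar_cell(x, y, z, **grid_kwargs)
        if cell is not None:
            cells[cell].append(idx)
    return dict(cells)


# Two nearby points share a cell; the third falls into a different one.
cells = partition([(10.0, 0.0, 0.0), (10.01, 0.0, 0.0), (0.0, 20.0, 0.0)])
```

Because every angular cell spans the same angle, cells far from the sensor cover a larger ground area than cells close to it, which is one way to see why a polar grid copes better with the "the farther, the sparser" distribution than a uniform cartesian grid.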

Multi-Scan Fusion via Foreground Point Selection
In panoptic segmentation, movable environmental elements are usually assigned to the "thing classes", and immovable environmental elements are assigned to the "stuff classes". In the SemanticKITTI dataset, for example, the number of "thing classes" was set to eight ("car", "truck", "bicycle", "motorcycle", "other-vehicle", "person", "bicyclist", "motorcyclist"), while that of "stuff classes" was set to 11 ("road", "sidewalk", "parking", "other-ground", "building", "vegetation", "trunk", "terrain", "fence", "pole", "traffic-sign"). To better analyze the distribution of various environmental elements in a single scan of point cloud data, we sampled every 100 frames in the sequence 00 data and counted the average percentage of various environmental elements in 46 frames of data, as shown in Table 1. (Considering the size of the table, we selected the proportion of 10 frames of data for display and the average proportion of 46 frames of data sampled in the last row.) As the table shows, the average percentage of environmental elements in the "thing classes" was 8.9%, while the average percentage of environmental elements in the "stuff classes" was 91.1%. It was challenging to achieve accurate instance segmentation of the "thing classes" environmental elements with less than 10% of the data while still meeting real-time requirements. Inspired by the Range Sparse Net (RSN) [36], we proposed a spatiotemporal sequential data fusion strategy based on "thing classes" point fusion to improve the performance of panoptic segmentation by fusing multiple scans of "thing classes" point data to yield a relatively complete portrayal of the "thing classes" instances. RSN achieves real-time object detection by predicting the "thing classes" points.
RSN first downscales the LiDAR point cloud into an image, then segments the "thing classes" points and "stuff classes" points, and finally applies a convolution operation to the chosen "thing classes" points to yield 3D object detection.
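The class statistics described above (sampling every 100th scan of a sequence and averaging the share of "thing classes" points) can be sketched as follows. This is a minimal illustration under stated assumptions: labels are represented as per-point class-name strings, the helper names are hypothetical, and the toy scan is invented data rather than SemanticKITTI data.

```python
from collections import Counter

# The eight "thing classes" listed in the text for SemanticKITTI.
THING_CLASSES = {
    "car", "truck", "bicycle", "motorcycle",
    "other-vehicle", "person", "bicyclist", "motorcyclist",
}


def class_proportions(labels):
    """Return {class_name: fraction of points} for one scan."""
    counts = Counter(labels)
    total = len(labels)
    return {name: count / total for name, count in counts.items()}


def thing_fraction(labels):
    """Fraction of the points in one scan that belong to any "thing class"."""
    if not labels:
        return 0.0
    return sum(1 for name in labels if name in THING_CLASSES) / len(labels)


def mean_thing_fraction(scans, step=100):
    """Average thing fraction over every `step`-th scan of a sequence."""
    sampled = scans[::step]
    return sum(thing_fraction(s) for s in sampled) / len(sampled)


# Toy scan with 2 "thing" points out of 10 (hypothetical labels).
toy_scan = ["road"] * 5 + ["building"] * 3 + ["car", "person"]
toy_share = thing_fraction(toy_scan)
```

With a step of 100, a sequence of 4541 scans would be sampled at indices 0, 100, ..., 4500, which gives the 46 frames mentioned in the text.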
As for autonomous driving in urban environments, where LiDAR collects environmental information at a fixed position relative to the vehicle, only part of the instance-level environmental elements can be captured in a single scan of data. In other words, the complete information on the instance-level environmental elements cannot be acquired in a single scan of data. As the position of the LiDAR acquisition changes, multiple views of the environment are continuously collected. By fusing multiple scans of data within a certain time window, theoretically, a relatively complete picture of the instances can be acquired. Our goal was to improve the panoptic segmentation performance by integrating the "thing classes" data in a specific spatial and temporal domain during training. For point cloud data $S_t$ at the current moment $t$ and a time-window threshold $n$, we fused the "thing classes" point data $\mathrm{ThS}_m$ ($t - n < m < t - 1$) located within the time window $(t - n, t - 1)$ and the point cloud data $S_t$ for the current moment. The fused point cloud data were taken as input for single-frame panoptic segmentation. Point cloud fusion was performed based on the pose description and coordinate transformation method in simultaneous localization and mapping (SLAM). Firstly, each point $P_i = (x, y, z, i)$ contained in the point cloud data $S_t$ was represented in the form of a homogeneous coordinate, namely $P_i = (x, y, z, 1)$, where $(x, y, z)$ is the cartesian coordinate of $P_i$ and $i$ is the intensity value. Then, the adjacent frame data were transformed into the same coordinate system for fusion based on the pose description matrix $T_{S_{t-1} S_t} \in \mathbb{R}^{4 \times 4}$, where $T_{S_{t-1} S_t}$ consists of a rotation matrix $R_{S_{t-1} S_t} \in SO(3)$ and a translation vector $k_{S_{t-1} S_t} \in \mathbb{R}^{3}$. The pose description matrix for any two frames of data transformed by coordinates could be expressed as
$$T_{S_m S_t} = T_{S_m S_{m+1}} \, T_{S_{m+1} S_{m+2}} \cdots T_{S_{t-2} S_{t-1}} \, T_{S_{t-1} S_t},$$
and the coordinate transformation process could be expressed as
$$\mathrm{ThS}_m^{t} = \left\{ T_{S_m S_t} P_i \mid P_i \in \mathrm{ThS}_m \right\},$$
where $\mathrm{ThS}_m^{t}$ denotes the "thing classes" points of scan $m$ expressed in the coordinate system of $S_t$. Figure 3 shows a schematic diagram of instance-level point cloud fusion based on multiple scans of data (taking a car as an example).
Considering that in the environment there are elements that are either stationary or moving at different speeds, the quality of the point cloud data fusion might be degraded when the "thing classes" elements are moving too fast or when a large time-window threshold is selected. Hence, the time-window threshold $n$ was set to 3 in this study. Figure 4 shows a schematic diagram of foreground environmental element point cloud data fusion based on the past three scans.
The "thing classes" environmental element points in the time window were selected and fused with the current scan through coordinate transformation to obtain relatively complete instance information of "thing classes" environmental elements.
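The fusion of past "thing classes" points into the current scan can be sketched as a chain of rigid transformations. The following is a sketch under stated assumptions, not the authors' code: poses are plain nested lists, each step pose is assumed to map coordinates of one scan into the frame of the next scan (so the product order in the code follows that convention and may differ from the notation in the text), and the window is taken to be the past n scans, in line with the statement that the past three scans are fused. All function names are hypothetical. Plain lists are used to keep the example dependency-free; an array library would vectorize the same arithmetic.

```python
IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]


def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform(pose, point):
    """Apply a 4x4 pose to (x, y, z, intensity); intensity passes through."""
    x, y, z, intensity = point
    h = (x, y, z, 1.0)  # homogeneous coordinate of the point
    tx, ty, tz = (sum(pose[r][c] * h[c] for c in range(4)) for r in range(3))
    return (tx, ty, tz, intensity)


def fuse_thing_points(current_scan, past_thing_points, step_poses):
    """Fuse past "thing classes" points into the frame of the current scan.

    current_scan      -- points of the current scan S_t
    past_thing_points -- thing points of the past n scans, oldest first
    step_poses        -- one 4x4 pose per past scan, oldest first; pose k is
                         assumed to map scan k into the frame of scan k + 1
    """
    fused = list(current_scan)
    chain = IDENTITY
    # Walk backwards from scan t-1 so that `chain` always maps the scan
    # being processed all the way into the frame of S_t.
    for points, step in zip(reversed(past_thing_points), reversed(step_poses)):
        chain = matmul4(chain, step)
        fused.extend(transform(chain, p) for p in points)
    return fused


# Toy example: scan t-2 -> t-1 is a 1 m shift along x,
# scan t-1 -> t is a 90 degree rotation about z.
SHIFT_X = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
ROT_Z90 = [[0, -1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
fused = fuse_thing_points(
    current_scan=[(5.0, 5.0, 0.0, 0.9)],
    past_thing_points=[[(0.0, 0.0, 0.0, 0.5)], [(1.0, 0.0, 0.0, 0.7)]],
    step_poses=[SHIFT_X, ROT_Z90],
)
```

The point from scan t-2 is first shifted and then rotated, which shows why the order of the chained poses matters.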

Polar-Cylindrical Partitioning
When analyzing the characteristics of LiDAR point clouds, we found that, viewed from an overhead perspective, the 2D aerial view shows an approximately circular distribution of point data [22], and that in 3D the point cloud shows a cylindrical distribution centered on the sensor [25]. In addition, LiDAR point clouds have an inherent "the farther, the sparser" feature: point density decreases as distance increases. Because of this feature, the cube-voxelization method that is suitable for homogeneous and dense indoor point clouds is not well suited to outdoor LiDAR point clouds. Based on the cylindrical distribution of point cloud data and the "the farther, the sparser" feature, PolarNet [22] and Cylinder3D [25] apply cylindrical voxelization to point clouds and achieve semantic segmentation by using two-dimensional convolution and three-dimensional sparse convolution, respectively. Building on these findings, we adopted cylindrical partitioning of the point cloud data according to polar coordinates.
With the commonly used regular cubic voxelization method, the equal-sized grids contain significantly fewer data points as the distance increases, while the grids close to the center of the sensor contain more data points. By contrast, the polar coordinate system-based cylindrical voxelization method generates equal-angle sectors whose size varies with the radius: the volume of a sector closer to the sensor center is smaller than that of a sector farther away. Compared with conventional voxelization, cylindrical partitioning therefore places fewer data points in closer grids and more data points in more distant grids. Overall, the point data become more evenly distributed among the grids as the distance varies. Each point $CP_i = (x, y, z, i)$ in the input data $F_t$ at the current moment $t$ is transformed into a polar coordinate point $PP_i = (\rho, \theta, z, i)$ by the following function:

$$\rho = \sqrt{x^2 + y^2}, \qquad \theta = \arctan\left(\frac{y}{x}\right)$$

After the coordinate transformation, the position of the point in the cylindrical partition is determined based on the height $z$, radius $\rho$, and azimuth $\theta$. Then, the points within each sector in the partition are given a fixed-length feature vector by an MLP. This vector is used as an input feature for the backbone network.
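The polar transformation and the assignment of points to cylindrical cells can be sketched as below. The grid size (480, 360, 32) is the one used in the ablation study; the radial range of 0 to 50 m and the height range of -3 to 1.5 m are assumptions made for this example only, as are the function names.

```python
import numpy as np

def cart_to_polar(points):
    """Convert (x, y, z) columns to (rho, theta, z)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)  # quadrant-aware arctan(y / x), in [-pi, pi]
    return np.stack([rho, theta, z], axis=1)

def cylindrical_grid_index(points, grid_size=(480, 360, 32),
                           rho_range=(0.0, 50.0), z_range=(-3.0, 1.5)):
    """Return the integer (rho, theta, z) cell index of every point.

    rho_range and z_range are assumed bounds; points outside are clipped
    into the outermost cells.
    """
    polar = cart_to_polar(points)
    lower = np.array([rho_range[0], -np.pi, z_range[0]])
    upper = np.array([rho_range[1], np.pi, z_range[1]])
    size = np.asarray(grid_size)
    cell = (upper - lower) / size
    clipped = np.clip(polar, lower, upper)
    idx = np.floor((clipped - lower) / cell).astype(np.int64)
    # Values exactly on the upper bound fall into the last cell.
    return np.minimum(idx, size - 1)
```

Since every azimuth cell spans the same angle, cells far from the sensor cover a larger area than cells near it, which offsets the lower point density at long range.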

Backbone Design
The design of the backbone network is central to achieving fast and accurate panoptic segmentation. Considering the research progress in real-time panoptic segmentation at the current stage, we chose the lightweight and efficient UNet [37] as the infrastructure and drew on the contribution of UNet3+ [38] in improving network performance to create our backbone network, Polar-UNet3+. Figure 6 illustrates a schematic comparison of UNet and Panoptic-PolarNet. The key concern in network design is the model's ability to efficiently extract and integrate features at all levels. As a classical "U-type" codec architecture, UNet integrates the features extracted in the encoding and decoding stages by using skip connections; its idea of feature integration is to integrate deep features upon the completion of encoding without paying attention to shallow features. To fully integrate shallow features, UNet++ [39] uses nested connections to integrate shallow features from the first encoding round, thereby improving performance, but the nested connections may impair real-time performance. UNet3+ goes a step further by using full-scale skip connections instead of nested connections, complemented by a deep supervision strategy. This enables the integration of feature information at full scale, thereby further improving performance while ensuring real-time performance. We proposed Polar-UNet3+ for LiDAR point cloud segmentation based on UNet3+. Polar-UNet3+ consists of four encoders and three decoders. Each decoder integrates multiscale features from the encoders or other decoders by upsampling or max pooling, thereby allowing each decoder to gain shallow and deep features at full scale, as shown in Figure 7. In contrast to UNet and Panoptic-PolarNet, Polar-UNet3+ integrates multiscale features through full-scale skip connections.
The features extracted by Polar-UNet3+ are shared by parallel semantic and instance branches. The semantic branch provides point-wise predicted labels, and the instance branch provides regressed centers for the predicted instances. In the semantic branch, the unbalanced amounts of data for the various environmental elements in the point cloud pose a challenge for the semantic segmentation of small categories. Each scan contains a significantly higher proportion of "stuff classes" environmental elements than "thing classes" environmental elements. This imbalance frequently causes the semantic segmentation network to favor the categories with a high proportion during training, which makes it difficult to fully learn the categories with a very low proportion. To minimize the performance loss of the semantic segmentation network due to the imbalance in the data, we chose a weighted cross-entropy loss function $L_{sem}$ in the semantic branch:

$$L_{sem} = -\sum_{i} \delta_i \, p(\hat{y}_i) \log p(y_i)$$

where $\delta_i$ denotes a weight value that depends on the inverse of the frequency of occurrence $f_i$ of each category of environmental elements, $p(y_i)$ denotes the predicted value for category $i$, and $p(\hat{y}_i)$ represents the true value for category $i$.
Concerning the instance branch, based on the regression center and with comprehensive consideration given to the clustering effect and running time, this study employed the mean shift clustering algorithm to cluster the "thing classes" data points to generate the instance ID. 4D-PLS [5] and Panoptic-PolarNet [9] employ a density clustering algorithm, while DS-Net [8] incorporates a dynamic shifting algorithm by improving the mean shift clustering algorithm.
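For the semantic branch, the weighted cross-entropy loss described above can be sketched as follows. The text only states that the weight depends on the inverse of the class frequency, so the plain inverse `1 / f_i`, the small `eps` term, and the averaging over points are assumptions of this example; the function names are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes, eps=1e-6):
    """Per-class weights from label frequencies (assumed form: 1 / f_i)."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    freq = counts / counts.sum()
    return 1.0 / (freq + eps)

def weighted_cross_entropy(probs, labels, weights):
    """Weighted cross-entropy for one-hot ground truth.

    probs:   (N, C) predicted class probabilities per point.
    labels:  (N,) ground-truth class ids.
    weights: (C,) per-class weights.
    """
    p_true = probs[np.arange(len(labels)), labels]
    per_point = -weights[labels] * np.log(np.clip(p_true, 1e-12, 1.0))
    return per_point.mean()
```

With one-hot ground truth, the sum over categories reduces to the single term of the true class, so rare classes contribute more to the loss per point than frequent ones.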
Compared with the density clustering algorithm, the mean shift clustering algorithm is less sensitive to density variations and noise points and is thus more suitable for sparse LiDAR point clouds. Moreover, unlike the dynamic shifting algorithm, the mean shift clustering algorithm does not require iterative operations over bandwidths and kernel functions or differentiation operations, thus effectively reducing the clustering time. We first chose the mean squared error loss $L_{mse}$ as the loss function for the regression center prediction in the instance branch and then used $L_{ins}$ as the loss function for the instance ID. In summary, we used the overall loss function $L$ to train Polar-UNet3+.
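To make the clustering step concrete, the following is a minimal flat-kernel mean shift in NumPy. It is a generic textbook version, not the implementation used in this study: the bandwidth, the mode-merging threshold of half the bandwidth, and the function name are assumptions, and the pairwise distance matrix makes it suitable only for small point sets.

```python
import numpy as np

def mean_shift(points, bandwidth, max_iter=50, tol=1e-4):
    """Cluster points with a flat-kernel mean shift.

    points:    (N, D) array, e.g. positions of "thing classes" points.
    bandwidth: radius of the flat kernel (assumed to be chosen by the user).
    Returns (labels, centers).
    """
    points = np.asarray(points, dtype=np.float64)
    modes = points.copy()
    for _ in range(max_iter):
        # Distance from every current mode to every original point.
        dist = np.linalg.norm(modes[:, None, :] - points[None, :, :], axis=2)
        within = (dist <= bandwidth).astype(np.float64)
        counts = within.sum(axis=1, keepdims=True)
        sums = within @ points
        # Move each mode to the mean of its neighbours (keep it if it has none).
        new_modes = np.where(counts > 0, sums / np.maximum(counts, 1.0), modes)
        shift = np.linalg.norm(new_modes - modes, axis=1).max()
        modes = new_modes
        if shift < tol:
            break
    # Merge modes that converged to (almost) the same location.
    labels = np.empty(len(points), dtype=np.int64)
    centers = []
    for i, mode in enumerate(modes):
        for k, center in enumerate(centers):
            if np.linalg.norm(mode - center) < bandwidth / 2:
                labels[i] = k
                break
        else:
            centers.append(mode)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)
```

Each resulting label can serve as an instance ID for the points that converged to the same mode.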

Prediction Merging
Ideally, the points inside a predicted instance should have consistent semantic labels. In other words, all points with the same instance ID should share the same semantic label, and the same point should not be assigned to different instance IDs. However, in the class-independent instance segmentation method, it is inevitable that the semantic labels of points with the same instance ID are inconsistent, because the clustering of points into instances in the instance branch cannot consider the semantic categories predicted in the semantic branch. To resolve the ambiguity between point-wise semantic labels and shared instance IDs within each instance, we followed the majority voting strategy of the class-independent panoptic segmentation method: the semantic label held by the largest share of the points inside an instance is taken as the semantic label of that instance ID, thereby correcting the points with inconsistent semantic labels. This simple and efficient majority voting strategy ensured consistency between instance segmentation and semantic segmentation with minimal consumption of computational resources.
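The majority voting strategy can be sketched in a few lines. The example assumes that semantic labels are non-negative integers and that instance ID 0 marks points without an instance ("stuff classes"); both conventions and the function name are assumptions of this illustration.

```python
import numpy as np

def majority_vote(sem_pred, inst_pred, no_instance_id=0):
    """Make semantic labels consistent within every predicted instance.

    sem_pred:  (N,) non-negative integer semantic labels.
    inst_pred: (N,) integer instance IDs; no_instance_id marks "stuff" points.
    """
    merged = sem_pred.copy()
    for inst_id in np.unique(inst_pred):
        if inst_id == no_instance_id:
            continue
        mask = inst_pred == inst_id
        # Most frequent semantic label inside the instance
        # (ties resolve to the smallest label id).
        merged[mask] = np.bincount(sem_pred[mask]).argmax()
    return merged
```

For instance, an instance whose points are labeled 1, 1, and 2 ends up with label 1 for all three points, while points outside any instance keep their original labels.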

Experiments
In this section, we evaluate our approach on the public dataset SemanticKITTI. We trained for 80 epochs on a hardware platform with eight GeForce RTX 2080 Ti graphics processing units (GPUs), each with 11 GB of video memory. After 50 epochs of training, the random sampling strategy was applied. After training, validation and testing were performed with a single graphics card.

Dataset and Metrics
To fully evaluate the performance of our method and to benchmark it against other panoptic segmentation algorithms, the SemanticKITTI dataset was chosen in this study for training, validation, and testing. The SemanticKITTI dataset provides 43,552 frames of point cloud data (sequences 00-10 have point-wise ground-truth annotations). Moreover, challenges for multiple segmentation tasks were launched at the dataset's official website, where participants' methods are assessed with uniform metrics. Following the split used by the dataset and by other algorithms, 19,130 frames of point cloud data from sequences 00-07 and 09-10 were used as the training dataset, 4071 frames of data from sequence 08 were used as the validation dataset, and 20,351 frames of point cloud data from sequences 11-21 were used as the test dataset. For the semantic segmentation and panoptic segmentation of single-frame data, 19 categories needed to be labeled. For the semantic segmentation of multiple scans, 25 categories needed to be annotated (six additional moving categories). The evaluation metric for semantic segmentation was the mean intersection-over-union (mIoU):

$$\mathrm{mIoU} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FP_i + FN_i}$$

where $C$ denotes the number of categories, and $TP_i$, $FP_i$, and $FN_i$ denote the true positive, false positive, and false negative values for category $i$, respectively. The evaluation metrics for panoptic segmentation [6,40] were the panoptic quality, segmentation quality, and recognition quality, denoted as PQ, SQ, and RQ, respectively, for all categories; $PQ^{Th}$, $SQ^{Th}$, and $RQ^{Th}$, respectively, for the "thing classes"; and $PQ^{St}$, $SQ^{St}$, and $RQ^{St}$, respectively, for the "stuff classes". Additionally, $PQ^{\dagger}$ was reported, in which only SQ is used in place of PQ for the "stuff classes". PQ, RQ, and SQ were defined as follows:

$$PQ = \frac{\sum_{(p,g) \in TP} \mathrm{IoU}(p, g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}, \qquad SQ = \frac{\sum_{(p,g) \in TP} \mathrm{IoU}(p, g)}{|TP|}, \qquad RQ = \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}$$

where $(p, g)$ ranges over the matched pairs of predicted and ground-truth segments, so that $PQ = SQ \times RQ$.

Table 2 presents the results of the quantitative evaluation of the method proposed in this study on the test dataset.
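The metric definitions above translate directly into code. The sketch below computes mIoU from per-class counts and PQ, SQ, and RQ for one class from the IoUs of the matched segment pairs; the function names are hypothetical, and the matching of predicted to ground-truth segments is assumed to have been done beforehand.

```python
import numpy as np

def mean_iou(tp, fp, fn):
    """mIoU from per-class TP, FP, FN counts (every class assumed present)."""
    tp, fp, fn = (np.asarray(a, dtype=np.float64) for a in (tp, fp, fn))
    return float(np.mean(tp / (tp + fp + fn)))

def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of every matched (prediction, ground truth) segment pair.
    num_fp:       number of unmatched predicted segments.
    num_fn:       number of unmatched ground-truth segments.
    """
    num_tp = len(matched_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if num_tp == 0 or denom == 0:
        return 0.0, 0.0, 0.0
    sq = float(np.sum(matched_ious)) / num_tp
    rq = num_tp / denom
    return sq * rq, sq, rq
```

As a worked example, two matched segments with IoUs of 0.8 and 0.6, one unmatched prediction, and one unmatched ground-truth segment give SQ = 0.7, RQ = 2/3, and PQ ≈ 0.467.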
We used a frame rate of 10 Hz as the grouping criterion to compare the panoptic segmentation performance of each algorithm and to verify the effectiveness of the method proposed in this study. Note that the results of the comparison methods were obtained from the competition, and the 10 Hz criterion matches the data acquisition frequency, which is commonly used as the criterion of real-time performance. Panoptic-PolarNet uses cylindrical partitioning for instance segmentation with UNet as the backbone network, 4D-PLS adopts multi-scan data fusion to achieve panoptic segmentation of single-scan data, and DS-Net applies 3D sparse convolution on cylindrical partitions together with the dynamic shifting clustering algorithm with multiple iterations. The method proposed in this study drew to varying degrees on these three panoptic segmentation methods. Compared with Panoptic-PolarNet, our approach improved PQ by 0.5% and FPS by 1 Hz. Compared with 4D-PLS, our approach improved PQ by 4.3%. The PQ obtained by DS-Net is 1.3% higher than that obtained by our method, but DS-Net runs at merely 3.4 Hz, which does not allow for real-time panoptic segmentation.

Performance and Comparisons
Furthermore, we investigated the key components for performance improvements based on the experimental results. As discussed in the related work (Section 2), robust panoptic segmentation can be realized using a strong backbone, excellent clustering methods, or smart data fusion strategies. Firstly, based on the powerful backbone Cylinder3D, DS-Net proposes a dynamic shifting clustering method to improve the performance of panoptic segmentation. Cylinder3D adopts 3D sparse convolution and obtains a 67.8% mIoU score in the single-scan semantic segmentation competition, which helps DS-Net reach a 55.9% PQ score. By contrast, our backbone obtained an mIoU score of less than 60% in the single-scan semantic segmentation competition. Thanks to the "thing classes" data fusion, we could approach the performance of DS-Net while meeting the real-time requirement. Secondly, 4D-PLS performs panoptic segmentation by fusing multi-scan data. Although multi-scan data are fused, the proportion of "thing classes" data remains unchanged. In our method, only the "thing classes" data were fused, which increased the proportion of valuable data for instance segmentation and in turn improved the performance of panoptic segmentation. Finally, compared with Panoptic-PolarNet, our method adopted a stronger backbone, Polar-UNet3+, which aggregates features at full scale and achieved better performance with a shorter panoptic segmentation time. Table 3 shows the per-category quantitative evaluation results obtained by our proposed method on the test dataset. The quantitative evaluation results demonstrate that this method achieved a balance between accuracy and efficiency in panoptic segmentation. The visualizations of panoptic segmentation results on the validation split of SemanticKITTI are shown in Figure 8.
The figure consists of five subfigures, which correspond to the panoptic segmentation results of five scans sampled at 800-frame intervals in the validation split. Each subfigure consists of four parts: the ground truth of the panoptic segmentation, the semantic segmentation predictions, the instance segmentation predictions, and the panoptic segmentation predictions. Note that we used semantic-kitti-api [44] to visualize the panoptic ground truths and predicted labels. During each visualization, the color corresponding to each instance was assigned randomly. Therefore, in Figure 8, the color of an instance differs between the ground truths and the predicted labels. The qualitative comparison shows that we could accurately predict each instance and that the error mainly came from the point-wise annotations inside the instance, which will be a direction of improvement in the future. The visualization results of panoptic segmentation on the test split of SemanticKITTI are shown in Figure 9. The figure consists of eleven subfigures, which correspond to the panoptic segmentation results of each sequence sampled at the 500th scan in the test split (since the 500th scan of sequence 16 contained only a few pedestrians and sequence 17 only had 491 scans, the 300th scan was selected for sequences 16 and 17). Each subfigure consists of four parts: the raw scan, the semantic segmentation predictions, the instance segmentation predictions, and the panoptic segmentation predictions. Because the test split did not provide ground truths for comparison, we did not use red rectangular boxes for identification.

Ablation Study
We present an ablation analysis of our approach, which considers the "thing classes" point cloud fusion, random sampling, Polar-UNet3+, and grid size. All results were compared on the validation set, sequence 08. Firstly, we set the grid size to (480, 360, 32) and investigated the contribution of each key component of the method. The results are shown in Table 4. As a key part of the performance improvement, Polar-UNet3+ boosted PQ by 0.8%, the "thing classes" point fusion brought a further 0.9% improvement, and the random sampling module improved the stability of the method. After analyzing the performance of each component, we explored the effect of voxelization size on segmentation time and performance; the results are shown in Table 5. For LiDAR point clouds, which are dense at close range and sparse at far range, finer voxels did not deliver significant performance gains but rather compromised segmentation efficiency.

Other Applications
The improvement of panoptic segmentation performance benefited from the "thing classes" data fusion and random sampling strategies. To fully evaluate the effectiveness and generalization of our method, we chose the more challenging semantic segmentation task and the moving object semantic segmentation task to verify our method. The semantic segmentation of moving objects involves training on the basis of multiple scans as input and labeling the semantic on a single scan (the environmental elements of 25 categories need to be predicted, and the environmental elements of specific categories need to be distinguished whether they are moving or not). Compared with moving object segmentation, moving object semantic segmentation not only distinguishes the state of environmental elements (moving or static) but also needs to accurately label all environmental elements. Compared with the semantic segmentation of a single scan, moving object semantic segmentation adds six categories: moving car, moving truck, moving other vehicle, moving person, moving bicyclist, and moving motorcyclist.
On the basis of ensuring real-time performance, we expanded the Polar-UNet3+ used for panoptic segmentation, which includes four encoders and three decoders, to MS-Polar-UNet3+, which includes five encoders and four decoders. The main reason for deepening the network was that semantic segmentation does not need to predict the instance ID, and the saved computation can be used to deepen the network structure, as shown in Figure 10.
In the process of multiple-scan fusion, we defined "thing classes data" as moving-class data, including "moving car", "moving truck", "moving other vehicle", "moving person", "moving bicyclist", and "moving motorcyclist". In addition, "1 + P + M" was used to represent the multiple-scan fusion mode, in which P is the number of complete previous scans and M is the number of scans in which only moving-class points are fused. In the selection of P, due to the limitation of the hardware platform, it was difficult for us to integrate the past four scans, as in SemanticKITTI [7], and we set P to 2. In the selection of M, because the points in the moving classes account for a small proportion of the data, we referred to the conclusion in Lidar-MOS [45] and set M to 8. In the training process, following the training parameter settings of the panoptic segmentation task, the grid size was (480, 360, 32), the number of training epochs was 80, and the random sampling strategy was applied at the 50th epoch. Table 6 shows the quantitative evaluation results of our method on the test split of SemanticKITTI. The moving object semantic segmentation results of Cylinder3D [25] come from the official evaluation website of SemanticKITTI [46], and we could not confirm the number of scans it fuses. Note that the existence of moving objects increases the difficulty of semantic segmentation. Cylinder3D [25] obtains 67.8% mIoU in the single-scan semantic segmentation task, while it obtains only 52.5% mIoU in the multiple-scan semantic segmentation task. Our method obtained 52.9% mIoU on the basis of ensuring real-time performance.
The quantitative evaluation results of our proposed approach in panoptic segmentation and moving object semantic segmentation on the test split of SemanticKITTI show that our method achieved an effective balance between segmentation accuracy and efficiency, and the foreground data fusion and random sampling strategies could be popularized and applied to other LiDAR-based point cloud segmentation networks.
Furthermore, we combined the segmentation results of dynamic objects with SLAM to evaluate the effectiveness of our method. Moving objects in the environment produce wrong data associations, which affect the pose estimation accuracy of SLAM algorithms. If moving objects can be removed accurately, the performance of SLAM algorithms should improve. We chose MULLS [47] as the SLAM benchmark. We used three kinds of input data for the experiments: the raw scan, the scan from which dynamic objects were filtered out according to the semantic ground truth (abbreviated as Dynamic Moving GT), and the scan from which dynamic objects were filtered out according to the semantic predictions obtained by our method (abbreviated as Dynamic Moving).
For the evaluation of the SLAM algorithm, the absolute pose error (APE) was applied as the quantitative evaluation index, using Sim(3) Umeyama alignment in the calculation. We then used the evo tool [48] to evaluate the estimated pose results, including the root mean square error (RMSE), the mean error, the median error, and the standard deviation (Std.). Table 7 compares the APE of the translation component for the different inputs based on MULLS. Figure 11 shows the APE visualization results for the different inputs. This figure consists of eleven subfigures, which correspond to sequences 00-10 in the SemanticKITTI dataset, respectively. According to Table 7 and Figure 11, filtering out dynamic objects could significantly improve the accuracy of pose estimation. Considering that most of the potentially dynamic objects in the KITTI dataset are actually static in the environment, the experimental results demonstrate the effectiveness of our method in segmenting dynamic objects, which improved the accuracy of pose estimation and enhanced SLAM performance.
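The APE statistics reported for the translation component can be illustrated with the sketch below. It assumes that the estimated and ground-truth positions are already time-associated and aligned (for example, by a Sim(3) Umeyama alignment) and is not the evo implementation; the function name and the use of the population standard deviation are assumptions.

```python
import numpy as np

def ape_translation_stats(est_xyz, gt_xyz):
    """Summary statistics of the translational absolute pose error.

    est_xyz, gt_xyz: (N, 3) arrays of aligned, time-associated positions.
    """
    err = np.linalg.norm(np.asarray(est_xyz) - np.asarray(gt_xyz), axis=1)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mean": float(np.mean(err)),
        "median": float(np.median(err)),
        "std": float(np.std(err)),  # population standard deviation (assumed)
    }
```

Comparing these statistics for the raw scans and for the scans with dynamic objects removed gives the kind of comparison summarized in Table 7.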

Conclusions
This study contributes to achieving fast and accurate scene understanding for autonomous driving in urban environments. Specifically, this study proposes a spatiotemporal sequential data fusion strategy based on the statistical analysis of the characteristics of LiDAR point cloud data. By fusing only "thing classes" points, this strategy improved segmentation performance while adding only a small amount of valuable data, and it allowed the size of the training input to be controlled by random sampling. Moreover, this strategy greatly mitigated the adverse effects of the uneven distribution of categories, which stems from the inherent characteristics of urban environments, in the limited training data. We believe that this strategy is also applicable to other LiDAR point cloud segmentation tasks. The codec network was further improved by establishing full-scale skip connections to efficiently aggregate the multiscale features shared by the semantic and instance branches and to enable accurate panoptic segmentation of a single scan in the consistency fusion module.
To evaluate the proposed method, the SemanticKITTI dataset with point-wise annotations of semantic and instance information was chosen for the experimentation. Experimental results on the SemanticKITTI dataset suggest that our proposed method could achieve an effective balance between accuracy and efficiency. To summarize, this study is an active exploration into the research on scene understanding for intelligent robots with real-time panoptic segmentation of LiDAR point clouds as the core. In the future, we will focus on the fast and accurate scene understanding of complex dynamic environments.
