Fast 3D Semantic Mapping on Naturalistic Road Scenes

Xuanpeng Li 1,*, Dong Wang 1, Huanxuan Ao 1, Rachid Belaroussi 2 and Dominique Gruyer 2
1 School of Instrument Science and Engineering, Southeast University, 210006 Nanjing, Jiangsu, China
2 COSYS/LIVIC, IFSTTAR, 25 allée des Marronniers, 78000 Versailles, France
* Correspondence: li_xuanpeng@seu.edu.cn
† This paper is an extended version of our paper published in: Li, Xuanpeng, et al. Fast semi-dense 3D semantic mapping with monocular visual SLAM. In: 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017. pp. 385-390.


Introduction
Naturalistic scene understanding plays a key background role in most vision-based mobile robots. For example, autonomous navigation in outdoor scenes demands a rapid and comprehensive understanding of the surroundings for obstacle avoidance and path planning. Vehicle movement in limited temporal and spatial contexts always requires knowledge of what an object is, where it is located, and what surrounds the ego-vehicle.
Robotic maps, such as the occupancy grid map and OctoMap, traditionally provide a geometric representation of the environment. However, they lack any correlation between map points and semantic knowledge, so they cannot be directly utilized in naturalistic road scenes.
Scene parsing is an important and promising step toward addressing this issue. It benefits from state-of-the-art Deep Convolutional Neural Networks (DCNNs), which achieve better 2D pixel labeling than traditional methods. Combined with Simultaneous Localization and Mapping (SLAM) technology, an automobile can then locate itself while recognizing surrounding objects at the pixel level. For instance, it could allow an autonomous vehicle to accomplish certain high-level tasks, such as "parking at the free place on the right" and "stopping at the crosswalk". This form of semantically annotated 3D representation provides mobile robots with the abilities of understanding, interaction, and navigation in various scenes.
Semantic segmentation has long been an active topic. Most methods have focused on increasing segmentation accuracy and have seen major improvements [1-3]. However, they usually demand high-power computing resources, which is unsuitable for embedded platforms. Several recent studies focus on the balance between computing cost and the accuracy of object detection, classification, and 2D pixel labeling [4,5], achieving better performance on embedded and mobile platforms.
Compared to SLAM with scaled sensors, such as stereo and RGB-D cameras, monocular visual SLAM is a promising technology: monocular vision is flexible, inexpensive, and, most importantly, widely available on recent vehicles. Scaled sensors provide reliable measurements within their specific ranges, but they cannot switch seamlessly between scenes of different scales, such as indoor and outdoor, and they normally require large storage resources.
Most man-made environments, e.g., road scenes, usually exhibit distinctive spatial relations among the various classes of objects. Capturing, modeling, and utilizing such relations can enhance semantic segmentation performance in 3D semantic mapping [6]. In this paper, we exploit a monocular SLAM method that provides cues of 3D spatial information and utilize a state-of-the-art DCNN to build a 3D scene understanding system for road scenes. Moreover, a Bayesian 2D-3D transfer and a map regularization process are exploited to generate a reconstruction that is consistent in both the spatial and the semantic context. In our monocular mapping system, the 3D map is incrementally reconstructed from a sequence of automatically selected keyframes and the corresponding semantic information. There is no need to label every frame in a sequence, which saves a considerable amount of computation. We refer the reader to Figure 1 for an illustration. Different from the frame-skipping strategy proposed by Hermans et al. [7] and McCormac et al. [8], our method works well under fast camera motion. Since the 3D map has globally consistent depth information, it can be regularized in terms of spatial structure. The regularization aims to remove distinctive outliers and make components of the point cloud map more consistent, i.e., local points with the same semantic label should be close in 3D space. Two datasets, Cityscapes [9] and KITTI [10], are used to evaluate our approach. Several raw videos are used to reconstruct 3D maps with semantic labels.
This paper is organized as follows. Section 2 reviews related work. The problem formulation is presented in Section 3. The 3D semantic mapping is described in Section 4, including the semantic segmentation, the monocular visual SLAM, the Bayesian incremental fusion, and the global regularization. Section 5 presents the results of 2D semantic inference and 3D semantic mapping. Finally, Section 6 concludes the paper and discusses possible extensions of our work.

Related Work
Our work is motivated by [8], which contributes an indoor 3D semantic SLAM from RGB-D input. It aims at a dense 3D map based on ElasticFusion SLAM [11] with semantic labeling. Pixel-wise semantic information is acquired from a deconvolutional semantic segmentation network [12] using the scaled RGB image and the depth as input. Depth information is also used to update each surfel's depth and normal to construct the 3D dense map during loop closure. An earlier work, SLAM++ [13], creates a map with semantically defined objects, but it is limited to a predefined database of hand-crafted template models. In this paper, we make use of an incremental Bayesian fusion strategy together with state-of-the-art visual SLAM and semantic segmentation.
Visual SLAM methods are usually categorized as sparse, semi-dense, or dense, depending on how images are aligned. Feature-based methods exploit only a limited set of feature points, typically image corners, blobs, or line segments; classic examples are MonoSLAM [14] and ORB-SLAM [15,16]. They are not suitable for 3D semantic mapping due to their rather sparse feature points. To better exploit image information and avoid the cost of feature computation, direct dense SLAM systems, such as the surfel-based ElasticFusion [11] and Dense Visual SLAM [17], have been proposed recently. However, direct dense image alignment is well-established mainly for RGB-D and stereo sensors and is computationally demanding. Semi-dense methods like Large-Scale Direct SLAM (LSD-SLAM) [18] and Semi-direct Visual Odometry (SVO) [19] offer the possibility of a synchronized 3D semantic mapping system.
Deep CNNs have proven effective in image semantic segmentation. The works in [22,23] boost the real-time performance of semantic segmentation without losing much accuracy. The state-of-the-art DeepLab-v3+ [5] contains a simple yet effective decoder module to refine segmentation results, especially along object boundaries. Furthermore, by adopting the MobileNet-v2 encoder in its encoder-decoder structure, DeepLab-v3+ can achieve a better trade-off between precision and runtime.
In the area of scene understanding and mapping, recent research increasingly employs 3D priors of objects.
Salas-Moreno et al. [13] project 3D object meshes onto the RGB-D frame in a graphical SLAM framework.
Valentin et al. [24] propose a triangulated mesh representation of the scene from multiple depth measurements and exploit a Conditional Random Field (CRF) to capture the consistency of the 3D object mesh. Kundu et al. [25] exploit a CRF over joint voxels to infer semantic information and occupancy. Sengupta and Sturgess [26] use a stereo camera, estimated poses, and a CRF to infer a semantic octree representation of the 3D scene. Vineet et al. [27] propose an incremental dense stereo reconstruction and semantic fusion technique to handle dynamic objects in large-scale outdoor scenes. Kochanov et al. [28] employ scene flow measurements to incorporate temporal updates into the mapping of dynamic environments. Landrieu et al. [29] introduce a regularization framework to obtain spatially smooth semantic labeling of 3D point clouds from a point-wise classification, considering the uncertainty associated with each label. The Gaussian Process (GP) is another popular method for map inference. Jadidi et al. [30] exploit GPs to learn the structural and semantic correlation between map points; this technique also incorporates OctoMap to handle sparse measurements and missing labels. To improve the training and query time complexity of GP-based semantic mapping, Gan et al. [31] further introduce a Relevance Vector Machine (RVM) inference technique for efficient map queries at any resolution.
Our semi-dense approach is also inspired by dense 3D semantic mapping methods [6,7,32,33] in both indoor and outdoor scenes. The major contributions of these works involve the 2D-3D transfer and the map regularization. In particular, Hermans et al. [7] propose an efficient 3D CRF to regularize 3D semantic maps consistently, considering the influence between neighboring 3D points (voxels). In this work, we adopt a similar strategy to improve the performance of 3D semantic reconstruction in road scenes. The key concepts are:
• a 3D semantic mapping system based on monocular vision,
• integration of monocular SLAM and scene parsing into a 3D semantic representation,
• exploiting the correlation between semantic and geometric information to enforce spatial consistency,
• active sequence downsampling and sparse semantic segmentation to achieve real-time performance and reduce storage.

Following the comparison in [27], we list the characteristics of our approach and of some related work in Table 1.
Table 1. Comparison with some related work: M = monocular camera, S/D = stereo/depth camera, L = Lidar, O = outdoor, I = incremental, SDT = sparse data structures, RT = real time

Notation
The target is to estimate the 3D semantic map M comprising a pose-graph of keyframes with semantic maps, taken from a monocular camera. Let I_i : Ω_i → R^3 denote an H × W RGB image at the frame indexed by i. Keyframes are extracted from the image sequence according to the camera pose T_ji of frame i with respect to the previous keyframe j. We define the ith keyframe as a tuple K_i = {I_i, D_i, V_i, S_i}, where D_i is the inverse depth map associated with image I_i, V_i the associated inverse depth variance map, and S_i the semantic map; D_i and V_i are defined only on the subset of pixels Ω_D_i ⊂ Ω_i with large intensity gradient, i.e., they are semi-dense. The keyframes are consecutively stacked in a pose-graph G = (V, E), where V = {K_0, ..., K_n} is the set of keyframes and E = {S_ji} is the set of sim(3) similarity constraints from keyframe i to keyframe j, with scale factor s_ji > 0. With respect to the world frame W, normally taken as the first keyframe K_0, the pose of the keyframe indexed by i is denoted T_iW. For a sequence of n keyframes, the nth keyframe's pose is obtained by composing the inter-keyframe transforms, T_nW = T_n,n-1 ∘ ... ∘ T_1,0. The 3D map M is reconstructed by projecting the inverse depth maps of all keyframes, where each 3D point P can be labeled as one of the solid semantic classes in the label space L = {l_1, l_2, ..., l_k}, such as Road, Building, and Tree. We use X = {X_1, X_2, ..., X_M} to denote the set of random variables corresponding to the 3D points P_i, i ∈ {1, ..., M}, where each variable X_i ∈ X takes a value l_i from the predefined label space L.
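For intuition, the chaining of inter-keyframe poses can be sketched with plain 4 × 4 homogeneous transforms. This is a simplification: the actual pose-graph edges are sim(3) similarity transforms that also carry a scale factor s_ji, and the rotations and translations below are illustrative only.

```python
import numpy as np

def make_pose(R, t):
    """Build a 4x4 homogeneous transform from a rotation R (3x3) and a translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def compose_world_pose(relative_poses):
    """Chain inter-keyframe transforms to obtain the pose of the latest
    keyframe with respect to the world frame (the first keyframe K_0)."""
    T = np.eye(4)  # world frame coincides with K_0
    for T_rel in relative_poses:
        T = T @ T_rel
    return T

# Two keyframes, each displaced 1 m forward along the optical axis
step = make_pose(np.eye(3), np.array([0.0, 0.0, 1.0]))
T_2W = compose_world_pose([step, step])
print(T_2W[:3, 3])  # accumulated translation: [0. 0. 2.]
```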

3D semantic mapping
Our target is to build a 3D semantic map with semi-dense, consistent label information online while the image sequence is captured by a forward-facing moving monocular camera. Given an image sequence, the inference of the 3D semantic map is posed as X* = argmax_X P(X | K_0^n), which can be estimated by maximum a posteriori (MAP) inference. Compared to the model used in [25], our observation is updated continuously rather than using all existing measurements at once. Thus, we adopt an incremental fusion strategy to estimate the 3D semantic map by incorporating newly arriving keyframes. Correspondingly, the approach is decoupled into three separately running processes, as shown in Figure 2.
Figure 2. Framework of our method: the input is the sequence of RGB frames, denoted I. There are three separate processes: a keyframe selection process, a 2D semantic segmentation process, and a 3D reconstruction with semantic optimization process. Keyframes K are conditionally extracted from the sequence based on the distance between poses. The following frames refine the depth map and the variance map of each keyframe until a new keyframe is extracted. The 2D semantic segmentation module predicts the pixel-level classes of each newly arrived keyframe. Finally, the keyframes are incrementally explored to reconstruct the 3D map with semantic labeling, which is then regularized by a dense CRF.
In the system, the monocular SLAM process maintains and tracks a global map of the environment, which contains a number of keyframes connected by pose-pose constraints with associated probabilistic semi-dense depth maps. It runs in real time on a CPU. Represented as point clouds, the map gives a semi-dense and highly accurate 3D reconstruction of the environment. Meanwhile, the second process, the 2D semantic segmentation, generates the pixel-level classification of the extracted keyframes; a fast deep CNN model predicts the semantic information on a GPU. In addition, an incremental fusion process for the semantic label optimization operates in parallel. It builds a locally optimal correspondence between semantic labeling and voxels in the 3D point cloud. To obtain a globally optimal 3D semantic segmentation, we make use of information from neighboring 3D points, involving distance, color similarity, and semantic label. This updates each voxel's position and corresponding semantic label, yielding a globally consistent 3D semantic map.

2D Scene Parsing
We explore the DeepLab-v3+ deep neural network proposed by Chen et al. [5]. Two important components in the DeepLab series are the atrous convolution and atrous spatial pyramid pooling (ASPP), which enlarge the field of view of filters and explicitly combine feature maps at multiple scales. The improvements in DeepLab-v3+ involve the encoder-decoder structure and the augmentation of the ASPP module with image-level features. The former captures sharper object boundaries by regaining spatial information, while the latter encodes multi-scale contextual information to capture long-range context. These contributions let DeepLab handle both large and small objects and achieve a better trade-off between precision and runtime.
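As a minimal illustration of how atrous convolution enlarges the field of view without adding parameters, the following 1-D sketch (not the actual DeepLab implementation) samples the input with gaps of size `rate`:

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution with 'valid' padding.
    Sampling the input with gaps of size `rate` enlarges the filter's
    effective receptive field from k to (k - 1) * rate + 1."""
    k = len(w)
    span = (k - 1) * rate + 1            # effective receptive field
    out_len = len(x) - span + 1
    y = np.empty(out_len)
    for i in range(out_len):
        y[i] = sum(w[j] * x[i + j * rate] for j in range(k))
    return y

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
print(atrous_conv1d(x, w, rate=1))  # ordinary convolution, receptive field 3
print(atrous_conv1d(x, w, rate=2))  # same 3 taps, receptive field 5
```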
For the semantic segmentation of road scenes, we exploit the Cityscapes and KITTI datasets and adopt the predefined 19-class label space L = {l_1, l_2, ..., l_19}, which contains Road, Sidewalk, Building, Wall, and so on. We use all semantically annotated images in the Cityscapes dataset for training and fine-tune the model on the KITTI dataset.
Note that no depth information is involved in the training process. For inference, we keep the original resolution of the input image according to the dataset.

Semi-Dense SLAM
We explore LSD-SLAM to track the camera trajectory and build consistent, large-scale maps of the environment. LSD-SLAM is a real-time, semi-dense 3D mapping method with several advantages. First, its scale-aware image alignment algorithm directly estimates the similarity transform between two keyframes across environments of different scales, such as office rooms (indoor) and urban roads (outdoor). Second, it is a probabilistic approach that incorporates the noise of the estimated depth maps into tracking via propagation of uncertainty. Moreover, it integrates easily with different kinds of sensors, such as monocular, stereo, and panoramic cameras, for various applications. These features yield reliable tracking and mapping even in challenging surroundings.
LSD-SLAM has three major components: tracking, depth map estimation, and map optimization. Spatial regularization and outlier removal are incorporated into the depth map estimation with small-baseline stereo comparisons. In addition, a direct, scale-drift-aware image alignment is carried out on the existing keyframes to detect scale drift and loop closures. Due to the inherent correlation between the depth map and the tracking accuracy, the depth residual is used to estimate sim(3) similarity constraints between keyframes.
Consequently, a 3D point cloud map is built from a set of keyframes with estimated depth maps by minimizing the image alignment error. The map is continuously optimized in the background using g2o pose-graph optimization. The approach runs at 25 Hz on an Intel i7 CPU. More details, such as keyframe selection and depth estimation, can be found in [18].

Incremental Fusion
There may be a large number of inconsistent 2D semantic labels between consecutive frames, due to sensor noise, the complexity of real-world environments, and failures of the scene parsing model. Incremental fusion of the semantic labels from the stacked keyframes allows associating probabilistic labels in a Bayesian way when combined with the depth map propagation between keyframes in LSD-SLAM. We detail the incremental semantic fusion with depth estimation below.
The camera projection function π(·) : R^3 → R^2 maps a point P = [x, y, z]^T in 3D space to a 2D point p = [x', y']^T on the image plane I_i in the camera coordinate system, p = π(P). Since this projection is nonlinear, for computational efficiency the transformation is augmented into the homogeneous coordinate system, p_h = K P, where the matrix K is referred to as the camera matrix. Given a 3D point P_W in the world reference system, its mapping to the image plane I_i in homogeneous coordinates is calculated as p_h = K T_iW P_W, where T_iW is the pose of the camera in the world reference system. The Euclidean coordinates are then recovered as p = [x_h/z_h, y_h/z_h]^T. From this point on, any point p or P is assumed to be in homogeneous coordinates, and we drop the h subscript unless stated otherwise.
Correspondingly, given the inverse depth estimate d for a pixel p = [x', y']^T in I_i of the keyframe K_i, the inverse projection function is P = π^-1(p, d) = (1/d) K^-1 p, with p in homogeneous coordinates. The depth map of a keyframe is continuously refined using the following frames until a new keyframe is defined. In reference to Equations 4 and 5, the 3D point in the world reference system is derived as P_W = T_iW^-1 π^-1(p, d), where the homogeneous transformation matrix T = [R t; 0 1] has the property T^-1 = [R^T -R^T t; 0 1]. Once a new frame is chosen to become a keyframe K_j, its depth map D_j is initialized by projecting points from the previous keyframe into it. The information of existing, close-by keyframes is propagated to the new keyframe for its initialization and semantic probabilistic refinement. The point in the depth map of the new keyframe is obtained by p' = π(T_jW P_W). Overall, we have a Gaussian-distributed transformation between keyframes, regarded as p ∈ I_i → P_W → p' ∈ I_j.
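The projection/inverse-projection round trip can be sketched as follows. The pinhole intrinsics in `K` are illustrative values, not a calibration from this work:

```python
import numpy as np

# Illustrative pinhole intrinsics (assumed values, not a real calibration)
K = np.array([[718.9, 0.0, 607.2],
              [0.0, 718.9, 185.2],
              [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)

def project(P_cam):
    """pi: map a 3D point in camera coordinates to pixel coordinates."""
    p_h = K @ P_cam              # homogeneous image point [x_h, y_h, z_h]
    return p_h[:2] / p_h[2]      # Euclidean pixel [x_h/z_h, y_h/z_h]

def back_project(p, inv_depth):
    """pi^{-1}: recover the camera-frame 3D point from a pixel and its
    inverse depth estimate d (depth z = 1/d)."""
    p_h = np.array([p[0], p[1], 1.0])
    return (K_inv @ p_h) / inv_depth

P = np.array([1.0, 2.0, 10.0])   # a point 10 m in front of the camera
p = project(P)
P_back = back_project(p, inv_depth=1.0 / P[2])
print(np.allclose(P, P_back))    # the round trip recovers the point: True
```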
The class label corresponding to a 3D point P is denoted X : P → l ∈ L. Note that the label Sky is removed from L for the 3D semantic mapping. Our target is to obtain the independent probability distribution of each 3D point over the class labels, P(X | K_0^i), given the sequence of existing keyframes K_0^i = {K_0, K_1, ..., K_i} in the pose-graph G.
We explore a recursive Bayesian fusion to refine the probability distribution of the 3D points with each new keyframe's update, P(X | K_0^i) ∝ p(K_i | K_0^(i-1), X) P(X | K_0^(i-1)). Applying the first-order Markov assumption to p(K_i | K_0^(i-1), X), we have p(K_i | K_0^(i-1), X) = p(K_i | X). We assume that P(X) does not change over time, and there is no need to calculate the normalization factor explicitly. According to the formulations above, the semantic probability distribution over all given keyframes can be recursively updated as P(X | K_0^i) ∝ p(K_i | X) P(X | K_0^(i-1)). The incremental fusion refines the semantic labels of the points in 3D space based on the pose-graph of keyframes. It can handle inconsistent 2D semantic labels, although its performance relies on the depth estimation. In addition, the map geometry is another useful feature that can further improve the performance of 3D semantic mapping. The following section describes how we use a dense CRF to regularize the 3D semantic map by exploiting the map geometry, which propagates semantic information between spatial neighbors.
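The recursion above can be sketched per 3D point as a product of per-keyframe likelihoods with renormalization. The class scores below are made-up softmax outputs for illustration, not real network predictions:

```python
import numpy as np

def bayes_update(prior, likelihood):
    """One recursive fusion step for a single 3D point:
    P(X | K_0..i) proportional to p(K_i | X) * P(X | K_0..i-1), renormalized."""
    post = prior * likelihood
    return post / post.sum()

# Three classes, e.g. {Road, Building, Vegetation}; uniform prior
p = np.full(3, 1.0 / 3.0)

# Per-keyframe class scores for the pixel this 3D point projects to.
# One keyframe mislabels the point; fusion recovers from the outlier.
observations = [
    np.array([0.7, 0.2, 0.1]),
    np.array([0.2, 0.6, 0.2]),   # inconsistent observation
    np.array([0.8, 0.1, 0.1]),
    np.array([0.7, 0.2, 0.1]),
]
for obs in observations:
    p = bayes_update(p, obs)

print(p.argmax())  # class 0 wins despite the single inconsistent frame
```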

Map Regularization
The dense CRF is widely used in 2D semantic segmentation to enhance segmentation performance. Some previous works [6,7,32] apply it to the 3D map to model contextual relations between class labels in a fully connected graph. It is a heuristic approach that assumes the influence between neighbors should be proportional to their distance and their visual and geometrical similarity [7].
The CRF model is defined as a graph composed of unary potentials as nodes and pairwise potentials as edges, but the size of the model makes traditional inference algorithms impractical. Thanks to Krähenbühl and Koltun's work [35], a highly efficient approximate inference algorithm handles this issue by defining the pairwise edge potentials as a linear combination of Gaussian kernels. We apply this efficient dense CRF inference to maximize label agreement between similar 3D points as follows. Assume the 3D semantic map M containing M 3D points is defined as a random field. A CRF (M, X) is characterized by a Gibbs distribution P(X | M) = (1/Z(M)) exp(-E(X | M)), where E(X | M) is the Gibbs energy and Z(M) is the partition function. The maximum a posteriori (MAP) labeling of the random field is X* = argmax_X P(X | M), which is converted into minimizing the Gibbs energy via the mean-field approximation and a message passing scheme.
We employ the associative hierarchical CRF [32,36], which integrates the unary potential ψ_i, the pairwise potential ψ_i,j, and the higher-order potential ψ_c into the Gibbs energy at different levels of the hierarchy (voxels and supervoxels): E(X | M) = Σ_i ψ_i(X_i) + Σ_(i<j) ψ_i,j(X_i, X_j) + Σ_c ψ_c(X_c), where the indexes i, j ∈ {1, ..., M} correspond to different 3D points P_i, P_j in the 3D map M.
Unary Potential: The unary potential is defined as the negative logarithm of the probabilistic label for a given 3D point, ψ_i(X_i = l) = -log P(X_i = l | K_0^i). This term is the cost of 3D point P_i taking an object label l ∈ L based on the incremental semantic probabilistic fusion above. The unary output for each point is produced independently; thus, the MAP labeling produced by the unary potentials alone is generally inconsistent.
Pairwise Potentials: The pairwise potential is modeled as a log-linear combination of m Gaussian edge potential kernels, ψ_i,j(X_i, X_j) = μ(X_i, X_j) Σ_m w^(m) k^(m)(f_i, f_j), where μ(·,·) is a label compatibility function weighting the Gaussian kernels k^(m)(f_i, f_j), and f denotes the feature vector of a 3D point P, including its position, RGB appearance, and the surface normal of the reconstructed surface. Here μ(·,·) is a Potts model, μ(X_i, X_j) = [X_i ≠ X_j], i.e., 1 if the labels differ and 0 otherwise. This term encourages consistency over pairs of neighboring points for local smoothness of the 3D semantic map. We employ two Gaussian kernels for the pairwise potentials, following [7]. The first is an appearance kernel, k^(1)(f_i, f_j) = exp(-|P_i - P_j|^2 / (2 θ_P,c^2) - |c_i - c_j|^2 / (2 θ_c^2)), where c is the RGB color vector of the corresponding 3D point. This kernel builds long-range connections between 3D points with a similar appearance.
The second, a spatial smoothness kernel, enforces a local, appearance-agnostic smoothness among 3D points with similar normal vectors, k^(2)(f_i, f_j) = exp(-|P_i - P_j|^2 / (2 θ_P,n^2) - |n_i - n_j|^2 / (2 θ_n^2)), where n denotes the respective surface normal. The surface normals are computed using the Triangulated Meshing using Marching Tetrahedra (TMMT) proposed in [32]. Note that the original method targets a dense labeling with stereo vision. Since LSD-SLAM only generates semi-dense 3D point clouds, we modify TMMT to extract a triangulated mesh only within limited ranges of short distances between 3D points.
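A minimal sketch of the appearance kernel and the Potts compatibility follows; the θ bandwidths are illustrative choices, not the learned parameters of this work:

```python
import numpy as np

def appearance_kernel(P_i, P_j, c_i, c_j, theta_p=0.5, theta_c=30.0):
    """Gaussian appearance kernel: close to 1 when two 3D points are near
    in space AND similar in RGB, so same-looking neighbors pull toward
    the same label. The theta bandwidths here are illustrative."""
    d_pos = np.sum((P_i - P_j) ** 2) / (2.0 * theta_p ** 2)
    d_col = np.sum((c_i - c_j) ** 2) / (2.0 * theta_c ** 2)
    return np.exp(-d_pos - d_col)

def potts(l_i, l_j):
    """Potts compatibility: the pairwise penalty applies only when labels differ."""
    return 1.0 if l_i != l_j else 0.0

P1, P2 = np.zeros(3), np.array([0.1, 0.0, 0.0])
red, dark_red = np.array([200.0, 30.0, 30.0]), np.array([190.0, 35.0, 25.0])
print(appearance_kernel(P1, P2, red, dark_red))  # near 1: strong smoothing pull
print(potts('Road', 'Road'), potts('Road', 'Car'))
```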
Higher-Order Potential: The higher-order term ψ_c(X_c | c) encourages the 3D points (voxels) in a given segment to take the same label and penalizes partial inconsistency of supervoxels, as described in [36]. It is defined as ψ_c(X_c | c) = min_(l ∈ L) (γ_c^l + k_c N_c^l), where γ_c^l represents the cost if all voxels in the segment take the label l, and N_c^l = Σ_(i∈c) δ(X_i ≠ l) is the number of 3D points inconsistent with label l, penalized with a factor k_c regarded as the inconsistency cost.
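The segment cost can be sketched as follows, assuming for simplicity a uniform per-label cost γ; the robust P^n-style form shown here is one reading of the description above, not the exact implementation:

```python
def higher_order_potential(labels_in_segment, label_space, gamma=1.0, k_c=0.5):
    """Segment (supervoxel) cost sketch: for each candidate label l, pay
    gamma plus k_c per voxel disagreeing with l; the segment takes the
    cheapest assignment, so mostly-consistent segments pay only mildly."""
    best = None
    for l in label_space:
        n_inconsistent = sum(1 for x in labels_in_segment if x != l)
        cost = gamma + k_c * n_inconsistent
        best = cost if best is None else min(best, cost)
    return best

segment = ['Road'] * 8 + ['Car'] * 2   # a mostly-consistent supervoxel
print(higher_order_potential(segment, ['Road', 'Car', 'Building']))  # 2.0
```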
The parameters θ_P,c, θ_c, θ_P,n, θ_n, θ_P,s, θ_s specify the ranges within which points with similar features affect each other. They can be obtained by piecewise learning.

Experiments and Results
We demonstrate the performance of our approach on the KITTI dataset [10], which contains a variety of urban scene sequences involving many moving objects under various lighting conditions. It consists of several subsets, such as the semantic, odometry, and detection datasets, and is thus very challenging for both datasets and models, as shown in Table 2. We benchmark our semantic mapping system on KITTI odometry sequences, some containing turns, as shown in Figure 3. These road-scene frames come in two resolutions, 1242 × 375 and 1226 × 370.
Our system runs on an Intel Core i7-5960K CPU and a NVIDIA Titan X GPU for online process.
The KITTI sequences are mostly captured at 10 Hz, well below the roughly 60 Hz frame rate that LSD-SLAM normally expects, and LSD-SLAM also has difficulty handling severe turning of the platform. Given these limits of monocular LSD-SLAM, we choose 6 sequences for evaluation.
In the following, we show qualitative results of our approach in Section 5.1; the quantitative results of our evaluation, including a runtime analysis of our semantic mapping approach, are presented in Section 5.2.

Qualitative Results
First, we present qualitative results on the KITTI semantic dataset in Figure 4. Then, we use the trained model to make predictions on the KITTI odometry dataset; the results are exemplified in Figure 5.
Take the sequence odometry_03 as an example of our semantic mapping approach. The sequence consists of 801 RGB frames on an urban road of about 560 m, plus a camera calibration file. Figure 6 shows the semantic reconstruction with close-up views, including large-scale annotations such as road and building and even small-scale objects like traffic signs. Note that we discard some keyframes at the beginning, due to the random initialization of LSD-SLAM.
Figure 6. Qualitative results of 3D semantic mapping from the sequence odometry_03. Our approach not only reconstructs and labels entire outdoor scenes that include roads, sidewalks, and buildings, but also accurately recovers thin objects such as traffic signs and trees. The close-up views show the details of the map.

Quantitative Results
For the quantitative performance of our approach, we focus on the 2D semantic segmentation and the runtime of the entire system, since the quality of the 3D reconstruction mainly depends on the LSD-SLAM method.
Semantic Segmentation: Table 3 shows the quantitative results of 2D semantic segmentation with different DeepLab-v3+ models on the KITTI datasets. We evaluate these models by the mean intersection-over-union (mIoU) score, the model size, and the computational runtime. The mIoU score is defined as mIoU = (1/k) Σ_i TP_i / (TP_i + FP_i + FN_i), in terms of the true/false positives/negatives for a given class i. We do not resize the images when evaluating the models here, whereas for the 3D semantic mapping process we halve the resolution of the input images to trade off accuracy against computational speed.
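The mIoU computation can be sketched from a class confusion matrix; the matrix entries below are made up for illustration, not results from Table 3:

```python
import numpy as np

def mean_iou(conf):
    """mIoU from a k x k confusion matrix (rows = ground truth,
    cols = prediction): IoU_i = TP_i / (TP_i + FP_i + FN_i)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp   # predicted class i but labeled otherwise
    fn = conf.sum(axis=1) - tp   # labeled class i but predicted otherwise
    iou = tp / (tp + fp + fn)
    return iou.mean()

# Toy 3-class confusion matrix (illustrative counts)
conf = np.array([[50,  5,  5],
                 [ 2, 30,  8],
                 [ 3,  7, 40]])
print(round(mean_iou(conf), 4))
```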
During training, the models are initialized with checkpoints pre-trained on various datasets, including ImageNet [37] and MS-COCO [38]. In the training step on the Cityscapes dataset, we directly use the ImageNet-pretrained checkpoints as initialization. Note that we employ a MobileNet-v2-based model pre-trained on the MS-COCO dataset, while the Xception-71-based model has been pre-trained on both ImageNet and MS-COCO. These pre-trained models can be accessed on GitHub.

Figure 1. Overview of our system: from a monocular image sequence, keyframes are selected to obtain their 2D semantic information, which is then transferred to the 3D reconstruction to build the 3D semantic map.
The keyframe tuple also includes the associated inverse depth variance map. The depth map and variance are defined on the subset of pixels Ω_D_i ⊂ Ω_i, i.e., semi-dense: only available for image regions of large intensity gradient. The symbol S_i : Ω_S_i → R represents the full-resolution semantic map holding the maximum object-class probability from the semantic segmentation process.

The KITTI dataset contains 2D semantic segmentation data with 200 labeled training images and 200 test images. Its data format and metrics conform with the Cityscapes dataset [9]. The Cityscapes dataset involves 19 classes with high-quality pixel-level annotations of 5000 images at a resolution of 2048 × 1024, including 2975 training images, 500 validation images, and 1525 test images. In our experiment, we train the model on Cityscapes and then fine-tune it on KITTI, taking the volume of each dataset into account. For the training of the 2D semantic segmentation model, various encoder models in DeepLab-v3+ are evaluated, including ResNet, Xception, and MobileNet. We find that the "poly" learning-rate policy of stochastic gradient descent works better than the "step" policy on these datasets. The TensorFlow library is employed for training and inference on a workstation with 4 Nvidia Titan X GPU cards. The hyper-parameters used in training are set as listed in Table 2.

Figure 5. Instances of 2D semantic segmentation on the KITTI odometry set

Table 2. Hyper-parameters used in the training step