Dense RGB-D semantic mapping with Pixel-Voxel neural network

For intelligent robotics applications, extending 3D mapping to 3D semantic mapping enables robots not only to localize themselves with respect to the scene's geometrical features but also to simultaneously understand the higher-level meaning of the scene context. Most previous methods treat geometric 3D reconstruction and scene understanding independently, notwithstanding the fact that joint estimation can boost the accuracy of semantic mapping. In this paper, a dense RGB-D semantic mapping system with a Pixel-Voxel network is proposed, which can perform dense 3D mapping while simultaneously recognizing and semantically labelling each point in the 3D map. The proposed Pixel-Voxel network obtains global context information by using PixelNet to exploit the RGB image and, meanwhile, preserves accurate local shape information by using VoxelNet to exploit the corresponding 3D point cloud. Unlike existing architectures that fuse score maps from different models with equal weights, we propose a Softmax weighted fusion stack that adaptively learns the varying contributions of PixelNet and VoxelNet and fuses the score maps of the two models according to their respective confidence levels. The proposed Pixel-Voxel network achieves state-of-the-art semantic segmentation performance on the SUN RGB-D benchmark dataset. The runtime of the proposed system can be boosted to 11-12 Hz, enabling near real-time performance on an i7 8-core PC with a Titan X GPU.


I. INTRODUCTION
Real-time 3D semantic mapping is desired in many robotics applications, such as autonomous navigation and robot arm manipulation. Including semantic information in a dense 3D map is much more useful than geometric information alone in robot-human or robot-environment interaction. It enables robots to perform advanced tasks such as "nuclear waste classification and sorting" or "autonomous warehouse package delivery" more intelligently.
A variety of well-known methods such as RGB-D SLAM [1], Kinect Fusion [2] and ElasticFusion [3] can generate dense or semi-dense 3D maps from RGB-D videos. However, those 3D maps contain no semantic-level understanding of the observed scenes. Meanwhile, semantic segmentation has made significant progress thanks to convolutional neural networks. Thus far, FCN [4], SegNet [5] and DeepLab [6] are the most popular methods for RGB semantic segmentation. FuseNet [7] and LSTM-CF [8] take advantage of both RGB and depth images to improve semantic segmentation. PointNet [9] is the forerunner of 3D semantic segmentation that consumes an unordered point cloud.
During RGB-D mapping, both an RGB image with rich contextual information and a point cloud with rich 3D geometric information can be obtained directly. To date, no existing method makes use of both the RGB image and the point cloud for semantic segmentation and mapping.
In this paper, we propose a dense RGB-D semantic mapping system with a Pixel-Voxel neural network, which can perform dense 3D mapping while simultaneously recognizing and semantically labelling each point in the 3D map. The main contributions of this paper can be summarized as follows:
• A Pixel-Voxel network consuming an RGB image and a point cloud is proposed, which obtains global context information through PixelNet and, meanwhile, preserves accurate local shape information through VoxelNet. This mutually reinforcing model achieves state-of-the-art semantic segmentation performance on the SUN RGB-D dataset.
• A Softmax weighted fusion stack is proposed to adaptively learn the varying contributions of different models. It fuses the score maps from different models according to their respective confidence levels. The number of input models for fusion can be arbitrary, and the stack can be inserted into any kind of network to perform fusion-style end-to-end learning.
• A dense 3D semantic mapping system integrating the Pixel-Voxel network with RGB-D SLAM is developed. Its runtime can be boosted to 11-12 Hz using an i7 8-core PC with a Titan X GPU, which nearly satisfies the requirements of real-time applications.
The rest of this paper is organized as follows. Related work is reviewed in Section II. The details of the proposed methods are introduced in Section III. The experimental results and analyses are given in Section IV. Finally, we conclude the paper in Section V.

II. RELATED WORK
The existing works are categorized and described in the following two subsections, dense 3D semantic mapping in Section II-A and semantic segmentation in Section II-B, followed by a discussion in Section II-C.

A. Dense 3D semantic mapping
The first kind of methods, such as SLAM++ [10], can only recognise known 3D objects from a pre-defined database. For semantic mapping, they are therefore limited to situations where many repeated and identical objects are present.
For the second kind of methods, both [12] and [13] adopt hand-designed features with Random Decision Forests to perform per-pixel label predictions of the incoming RGB videos. All the semantically labelled images are then associated using visual odometry to generate the semantic map. Because of the state-of-the-art performance provided by CNN-based scene understanding, SemanticFusion [14] integrates deconvolution neural networks [20] with ElasticFusion [3] into a real-time capable (25 Hz) semantic mapping system. All three methods require fully connected CRF [21] optimization as offline post-processing, i.e., the best-performing semantic mapping is not obtained by an online system. Zhao et al. [16] proposed the first system to perform simultaneous 3D mapping and pixel-wise material recognition. It integrates CRF-RNN [22] with RGB-D SLAM [1], and post-processing optimization is not required. Keisuke et al. [15] proposed a real-time dense monocular CNN-SLAM, which can perform depth prediction and semantic segmentation simultaneously from a single image using a deep neural network.
All the above methods mainly focus on semantic segmentation using a single image and only perform 3D label refinement through a recursive Bayesian update over a sequence of images. However, they do not take full advantage of the associated information provided by multiple viewpoints of a scene. Yu et al. [17] proposed DA-RNN, integrated with Kinect Fusion [2], for 3D semantic mapping. DA-RNN employs a recurrent neural network to tightly combine the information contained in multiple viewpoints of an RGB-D video stream to improve the semantic segmentation performance. Ma et al. [18] proposed a multi-view consistency layer which can use multi-view context information for object-class segmentation from multiple RGB-D views. It utilizes the visual odometry trajectory from RGB-D SLAM [1] to warp semantic segmentation between two viewpoints. In addition, Armin et al. [19] proposed a network architecture for spatially and temporally coherent semantic co-segmentation and mapping of complex dynamic scenes from multiple static or moving cameras.

B. Semantic segmentation
According to the type of input data, semantic segmentation can be further grouped into three main sub-categories: RGB [4][5][6][20][22], RGB-D [7][8][23] and 3D [9][24] semantic segmentation.
FCN [4] is the first network trained end-to-end for semantic segmentation instead of relying on hand-crafted features. It replaces the fully connected layers of the classification network with convolution layers to output a coarse map and utilizes a skip architecture to refine it. DeconvNet [20], composed of deconvolution and unpooling layers, utilizes fractionally strided convolutions to alleviate the limited resolution of the labelling. SegNet [5] proposed an encoder-decoder architecture, which records the indices of max pooling for up-sampling. DeepLab [6] makes use of dilated convolutions [25] to increase the receptive field without down-sampling the feature map. CRF as RNN [22] reformulates the mean-field inference in dense CRF as an RNN architecture, which enables it to be integrated with a CNN as a fully end-to-end network.
FuseNet [7] fuses RGB and depth image cues in a single encoder-decoder CNN architecture for RGB-D semantic segmentation. The LSTM-CF [8] network fuses contextual information from multiple channels of the RGB and depth images by stacking several convolution layers and a long short-term memory layer. FuseNet normalises the depth values into the interval [0, 255] to give them the same value range as color images, while the LSTM-CF network transforms the depth image into an HHA image so that it has 3 channels like the color image. The HHA representation can improve depth-based semantic segmentation; however, it has a high computational cost and hence cannot be computed in real time. In addition, STD2P [23] proposes a novel superpixel-based multi-view convolutional neural network for RGB-D semantic segmentation, which uses a spatio-temporal pooling layer to aggregate information over space and time.
The forerunner work PointNet [9] provides a unified architecture for both classification and segmentation which consumes raw unordered point clouds as input. PointNet only employs a single max-pooling to generate the global feature which describes the original input cloud; thus it does not capture the local structures induced by the 3D metric space the points live in. The improved version, PointNet++ [24], is a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set, which enables it to learn local features with increasing contextual scales.

C. Discussion
For RGB semantic segmentation, CNN-based methods always struggle with the balance between global and local information. The global context information can alleviate local ambiguities to improve the recognition performance, while local information, i.e., shape information, is crucial for accurate per-pixel predictions. However, after several pooling layers, the resolution of the feature map decreases significantly, which means that a lot of shape information is lost. How to increase the receptive field to get more global context information while preserving a high-resolution feature map is still an open problem. 3D geometric data such as a point cloud, which has an additional dimension, can provide very useful spatial information. However, because a point cloud is unordered, the conventional pooling layer cannot be used, and it is difficult to obtain context information at different scales. On the other hand, the resolution of the point cloud does not decrease because of the absence of conventional pooling layers, i.e., the original spatial information of the data is kept.
Intuitively, combining an RGB-based network and a point cloud-based network can alleviate each other's drawbacks and exploit each other's strengths. During RGB-D mapping, both the RGB image and the point cloud can be obtained directly from the RGB-D camera, which enables a combination of the context information from the RGB image and the 3D shape information from the point cloud for semantic mapping. That is the main reason why a dense RGB-D semantic mapping system with a Pixel-Voxel neural network is proposed in this paper.
In addition, the networks in [4][7][8] all simply fuse the score maps from different models using equal weights. However, each model should contribute differently in different situations and for different categories. So in this paper, a Softmax weighted fusion stack is proposed for adaptively learning the varying contributions of each model.

III. PROPOSED METHOD
A. Overview
The pipeline of dense RGB-D semantic mapping with a Pixel-Voxel neural network is illustrated in Figure 1. The RGB image and point cloud are obtained directly from an RGB-D camera, the Kinect V2. The RGB and point cloud data-pair of each key-frame is fed into the Pixel-Voxel network, as shown in Figure 2, for semantic segmentation. Then the semantically labelled point clouds are combined incrementally through the visual odometry of RGB-D SLAM. Meanwhile, the label probability of each voxel is refined by a recursive Bayesian update. Finally, the dense 3D semantic map is generated.

B. Pixel neural network
The PixelNet comprises three units: truncated CNN, context stack and skip architecture. The input of PixelNet is an RGB image. For the truncated CNN, VGG-16 or ResNet (truncated after pool5), pre-trained on ImageNet, can be employed as the baseline. After the truncated CNN, the resolution of the feature map is 32 times lower than that of the input image, i.e., significant shape information is dropped.
Inspired by [26], the context stack sits on top of the pre-trained truncated CNN and is composed of a chain of six 5 × 5 × 512 convolution stacks (Conv + BN + ReLU). For the VGG-16 network, the receptive fields after the pool5 and fc6 layers are 212 × 212 and 404 × 404 respectively, which is not large enough to cover the 512 × 512 image that we used. The receptive field of the context stack can be described as below. Here $RF_0$ and $S_0$ are the receptive field and stride product before the first context stack. $RF_j$, $S_j$ and $k_j$ are the receptive field, stride and kernel size of context stack $j$, and $n = 6$ is the number of context stacks. The context stack expands the receptive field progressively to cover all the elements in the current feature map (the whole original image). In addition, the score maps of all the context stacks are fused together to aggregate multi-scale context information. The spatial dimensionality of the feature maps in the context stack remains unchanged.
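The growth of the receptive field through the context stack can be illustrated with a short sketch. It assumes the standard recurrence for stacked convolutions, $RF_j = RF_{j-1} + (k_j - 1) \cdot S_{j-1}$ with $S_j$ the cumulative stride, which may differ in form from the expression used by the authors. The function name and the stride-1 default are illustrative choices; stride 1 is consistent with the unchanged spatial dimensionality of the context stack.

```python
def receptive_field_after(rf0, s0, kernels, strides=None):
    """Return the receptive field after each stacked convolution.

    rf0, s0: receptive field and cumulative stride before the first layer.
    kernels: kernel size of each layer; strides default to 1 (assumption).
    """
    strides = strides or [1] * len(kernels)
    rf, jump = rf0, s0
    fields = []
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump  # each layer adds (k - 1) steps of the current jump
        jump *= s             # cumulative stride seen by the next layer
        fields.append(rf)
    return fields


# VGG-16 after pool5: receptive field 212, cumulative stride 32.
context_fields = receptive_field_after(rf0=212, s0=32, kernels=[5] * 6)
print(context_fields)  # [340, 468, 596, 724, 852, 980]

# Sanity check against the text: a 7x7 fc6 on top of pool5 gives 404.
fc6_field = receptive_field_after(rf0=212, s0=32, kernels=[7])
print(fc6_field)  # [404]
```

Under these assumptions the third context stack already exceeds 512 pixels, and the sixth reaches 980, so the final stack covers the whole 512 × 512 input.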
The skip architecture consists of 3 skip stacks (Conv + BN + ReLU + Conv(score)) following pool2, pool3 and pool4 respectively. To prevent the training from diverging, a smaller learning rate is usually adopted for the skip architecture, as mentioned in [4]. The skip architecture in PixelNet, however, can be trained using a larger learning rate because batch normalization stabilizes the back-propagated error signals. The skip architecture retains the low-level features of the RGB image.

C. Voxel neural network
The input of VoxelNet is an unordered point cloud, represented as a set of 3D points $\{p_i \mid i = 1, 2, \ldots, n\}$ stored in an $n \times 6$ long vector, where $n$ is the number of points. $p_i$ is a feature vector containing 6 dimensions of information: the position $x, y, z$ in world coordinates and the color $R, G, B$.
Here $f_{mlp}$ is the multi-layer perceptron network, i.e., Conv + BN + ReLU, and $k$ is the number of multi-layer perceptron networks before max pooling. Its kernel size is 1 × 1 and each point shares the same convolution weights. Inspired by PointNet [9], we also use the max pooling operation $M$ as the invariant function; its kernel size is $n \times 1$. Then the new per-point features are extracted through the multi-layer perceptron network using the combined global and local point features, where $m$ is the last multi-layer perceptron network. The reshape operation $R$ transforms the shape of the score map from $n \times 1$ to $h \times w$ through back-projection according to the $x, y, z$ values and the camera intrinsic parameters, so that it can be fused with the score map of PixelNet.
VoxelNet keeps the spatial dimensionality of the input data unchanged, so it preserves all the original shape information.
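The combination of local and global point features described above can be sketched in NumPy. This is a minimal illustration in the spirit of PointNet, not the authors' implementation: batch normalization is omitted, the layer widths (64 and 128) are arbitrary, and the function names are hypothetical.

```python
import numpy as np


def shared_mlp(points, weights, biases):
    """Apply the same linear layer + ReLU to every point (a 1x1 convolution)."""
    feats = points
    for w, b in zip(weights, biases):
        feats = np.maximum(feats @ w + b, 0.0)
    return feats


def global_local_features(points, weights, biases):
    """Concatenate per-point (local) features with a max-pooled global feature."""
    local = shared_mlp(points, weights, biases)             # (n, d)
    global_feat = local.max(axis=0, keepdims=True)          # (1, d), order-invariant
    tiled = np.repeat(global_feat, local.shape[0], axis=0)  # (n, d)
    return np.concatenate([local, tiled], axis=1)           # (n, 2d)


rng = np.random.default_rng(0)
n = 1024
points = rng.normal(size=(n, 6))  # x, y, z, R, G, B per point
weights = [rng.normal(size=(6, 64)) * 0.1, rng.normal(size=(64, 128)) * 0.1]
biases = [np.zeros(64), np.zeros(128)]

combined = global_local_features(points, weights, biases)

# Shuffling the input points only shuffles the output rows.
perm = rng.permutation(n)
combined_perm = global_local_features(points[perm], weights, biases)
```

Because max pooling is symmetric in its inputs, the global half of each row is the same for every ordering of the points, which is why an unordered point cloud can be consumed directly.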

D. Softmax weighted fusion
Unlike simply fusing score maps from different models using equal weights, a Softmax weighted fusion stack is designed to learn the varying contribution of each model in different situations for different categories.
To be precise, let the score maps $F_1, F_2, \ldots, F_n \in \mathbb{R}^{c \times h \times w}$ be generated by $n$ different models, where $c$ is the number of categories and $h \times w$ is the shape of the score map.
$f_{conv}$ is the convolution operation and $W_{conv} \in \mathbb{R}^{n \cdot c \times n \cdot c \times 1 \times 1}$ are the weights of the convolution operation. $F_{fusion} \in \mathbb{R}^{n \cdot c \times h \times w}$ is the fusion score map. The convolution operation can learn the correlations of the multiple score maps from the $n$ different models.
The Softmax operation normalizes the channel values of the fusion score map; the normalized values are the corresponding weights of the score maps, which denote how confidently each model can be relied on.
$F_{sum} \in \mathbb{R}^{c \times h \times w}$ is the weighted fusion score map, $\odot$ is the element-wise multiplication operation and $\mathbf{1} \in \mathbb{R}^{h \times w}$. This Softmax weighted fusion stack can fuse the score maps of an arbitrary number of models, and it can also be inserted into any kind of network to be trained end-to-end. As shown in Figure 2, it fuses 3 score maps from PixelNet and VoxelNet according to their respective confidence levels.
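A possible forward pass of the Softmax weighted fusion stack is sketched below. It follows the shapes given in the text (a 1 × 1 convolution over the $n \cdot c$ concatenated channels, then a Softmax, then a weighted sum), but it assumes that the Softmax is taken across the $n$ models for each class and pixel, which the text does not state explicitly. The function name and the bias-free convolution are illustrative choices.

```python
import numpy as np


def softmax_weighted_fusion(score_maps, w_conv):
    """Fuse n score maps of shape (c, h, w) with learned per-model confidences.

    w_conv: (n*c, n*c) weights of a 1x1 convolution (no bias, for brevity).
    Returns the fused map (c, h, w) and the weights (n, c, h, w).
    """
    n = len(score_maps)
    c, h, w = score_maps[0].shape
    stacked = np.concatenate(score_maps, axis=0)        # (n*c, h, w)
    fusion = np.einsum("oi,ihw->ohw", w_conv, stacked)  # 1x1 convolution
    fusion = fusion.reshape(n, c, h, w)
    fusion = fusion - fusion.max(axis=0, keepdims=True) # numerical stability
    weights = np.exp(fusion)
    weights /= weights.sum(axis=0, keepdims=True)       # Softmax over models (assumed axis)
    fused = (weights * np.stack(score_maps, axis=0)).sum(axis=0)
    return fused, weights


rng = np.random.default_rng(0)
maps = [rng.normal(size=(4, 5, 6)) for _ in range(3)]

# With all-zero convolution weights every model gets confidence 1/3,
# so the stack reduces to equal-weight averaging.
fused_equal, weights_equal = softmax_weighted_fusion(maps, np.zeros((12, 12)))

# With arbitrary weights the confidences vary per class and pixel.
fused, weights = softmax_weighted_fusion(maps, rng.normal(size=(12, 12)))
```

The zero-weight case shows that equal-weight fusion is a special case of this stack, so learning the convolution weights can only add flexibility over the equal-weight baseline.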

E. Class-weighted loss function
Imbalanced class distribution is quite common in most datasets. Focusing more on the rare classes to boost their recognition accuracy can therefore improve the average recognition performance significantly, although the overall recognition performance decreases slightly. We adopt the class-weighted negative log-likelihood as the loss function. Here $L$ is the likelihood function, $S$ is the training data, $F_i$ is the final score map and $y_i$ refers to the training label. $\mathbb{1}_{y_i = j}$ is a function that returns 1 if $y_i = j$ and 0 otherwise. $p_j$ is the occurrence frequency of class $j$ and $2^{\lceil \log_{10}(\delta / p_j) \rceil}$ is the weight of class $j$. $\delta$ is the frequency threshold below which a class is considered rare, and $\lceil \cdot \rceil$ is the integer ceiling operation. In this way, the rare classes are assigned a higher weight that grows exponentially. $\delta$ is set to 2.5% following the 85%-15% rule in [27], i.e., the frequency sum of all the rare classes is 15%.

F. RGB-D SLAM
RGB-D SLAM [1] is employed for dense 3D mapping. Its visual odometry provides the transformation between two adjacent semantically labelled point clouds, which is used for generating a global semantic map and enabling incremental semantic label fusion.
RGB-D SLAM is a graph-based SLAM system which consists of a front-end unit and a back-end unit. The former processes the RGB-D data to calculate geometric relationships between key-frames through visual features based on RANSAC. The latter registers pairs of image frames to construct a pose graph. Subsequently, G2O (see footnote 2) is used for graph optimization to obtain a maximum likelihood solution for the camera trajectory. Finally, the point clouds are combined incrementally to generate a dense global 3D map.
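Returning to the class-weighted loss of Section III-E, the weighting scheme can be sketched as follows. The weight $2^{\lceil \log_{10}(\delta / p_j) \rceil}$ is taken from the text; giving non-rare classes ($p_j \geq \delta$) a weight of 1 and averaging the loss over pixels are assumptions made here for illustration, and the function names are hypothetical.

```python
import math

import numpy as np


def class_weights(freqs, delta=0.025):
    """Weight 2^ceil(log10(delta / p)) for rare classes, 1 otherwise (assumption)."""
    weights = []
    for p in freqs:
        if p < delta:
            weights.append(2.0 ** math.ceil(math.log10(delta / p)))
        else:
            weights.append(1.0)
    return np.array(weights)


def weighted_nll(probs, labels, weights):
    """Class-weighted negative log-likelihood, averaged over pixels.

    probs: (num_pixels, num_classes) predicted probabilities.
    labels: (num_pixels,) integer ground-truth classes.
    """
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-weights[labels] * np.log(picked)))


# Illustrative class frequencies: two common classes and two rare ones.
freqs = [0.5, 0.3, 0.01, 0.001]
w = class_weights(freqs)
print(w)  # [1. 1. 2. 4.]

uniform = np.full((1, 4), 0.25)
loss_common = weighted_nll(uniform, np.array([0]), w)
loss_rare = weighted_nll(uniform, np.array([3]), w)
```

With these illustrative frequencies, a class ten times rarer receives twice the weight, which is the exponential growth mentioned in the text.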

G. 3D label refinement
After obtaining the semantically labelled point clouds from different viewpoints, label hypotheses are fused by a recursive Bayesian update to refine the 3D semantic map. Each voxel in the semantic point cloud stores both the label value and the corresponding discrete probability. The voxels from different viewpoints can be transformed to the same coordinate frame through the visual odometry of RGB-D SLAM. Then the voxel's label probability distribution can be updated by means of a recursive Bayesian update as in Equation 9:
$$P(l_i \mid I_{1,\ldots,k}) = \frac{1}{Z} \, P(l_i \mid I_{1,\ldots,k-1}) \, P(l_i \mid I_k) \tag{9}$$
where $l_i$ is the label prediction, $I_k$ is the $k$-th frame and $Z$ is the normalizing constant. The update is applied to all label probabilities of each voxel to generate a proper distribution.
Footnote 2: http://www.openslam.org/g2o
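A recursive Bayesian label update of this kind can be sketched in a few lines. The sketch assumes the common form in which the stored distribution of a voxel is multiplied element-wise by the prediction for the new frame and then renormalized; the function name and the example probabilities are illustrative.

```python
import numpy as np


def bayesian_label_update(prior, likelihood, eps=1e-12):
    """Multiply the stored label distribution by a new prediction and renormalize.

    prior, likelihood: arrays whose last axis runs over the classes.
    """
    posterior = prior * likelihood
    z = posterior.sum(axis=-1, keepdims=True)  # normalizing constant Z
    return posterior / np.maximum(z, eps)


# One voxel, three classes, starting from a uniform distribution.
voxel = np.full(3, 1.0 / 3.0)
for frame_prediction in ([0.6, 0.3, 0.1], [0.7, 0.2, 0.1]):
    voxel = bayesian_label_update(voxel, np.array(frame_prediction))
```

Two frames that agree on the first class push its probability from 0.6 to about 0.86, which illustrates how observations from several viewpoints sharpen the label of a voxel.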

IV. EXPERIMENTS
A large-scale indoor scene dataset, the SUN RGB-D dataset, is adopted for the Pixel-Voxel network evaluation. It contains 5285 synchronized RGB-D image pairs for training/validation and 5050 synchronized RGB-D image pairs for testing. The RGB-D image pairs have different resolutions and are captured by 4 different RGB-D sensors: Kinect V1, Kinect V2, Xtion and RealSense. The SUN RGB-D scene understanding challenge is to segment 37 indoor scene classes such as table, chair, sofa, window and door. Pixel-wise annotation is available, and the class instances are extremely unbalanced. As mentioned in Section III-E, the rareness frequency threshold is set to 2.5% in the class-weighted loss function following the 85%-15% rule.

A. Data augmentation and preprocessing
For the PixelNet training, all the RGB images are resized to the same resolution 512 × 512 through a bilateral filter. We randomly flip the RGB image horizontally and scale the RGB image slightly to augment the RGB training data.
For the VoxelNet training, there is still no large-scale ready-made 3D point cloud dataset available. We generated the point cloud using the RGB-D image pairs and the corresponding camera intrinsic parameters. Similar as mentioned in [7]

B. Network training
The whole training process can be divided into 3 stages: PixelNet training, VoxelNet training and Pixel-Voxel network training. All the networks are trained using SGD with momentum. The batch size is set to 10, the momentum is fixed to 0.9 and the weight decay is fixed to 0.0005. The new parameters are randomly initialized using a Gaussian distribution with variance $10^{-2}$.
In the PixelNet training stage, the step learning policy is adopted. The learning rate is initialized to $10^{-3}$ and is divided by 10 after 15 epochs (25 epochs in total). The learning rate of newly-initialized parameters is set to 10 times that of the pre-trained parameters.
In the VoxelNet training stage, the polynomial learning policy is adopted. The learning rate is initialized to $10^{-3}$, the power is set to 0.9 and the maximum number of iterations is set to 50000.
In the Pixel-Voxel network training stage, we load the pre-trained PixelNet and VoxelNet models and then fine-tune the whole network on the synchronized RGB and point cloud data. Because there are three Softmax weighted fusion stacks in the network, three rounds of fine-tuning are required. The same learning policy as in the VoxelNet training is adopted. The learning rate of newly-initialized parameters in each Softmax weighted fusion stack is set to 10 times that of the fine-tuned parameters.
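The two learning-rate policies can be written out explicitly. The sketch below assumes the usual definitions of these policies in common deep learning frameworks: the step policy multiplies the rate by 0.1 every 15 epochs, and the polynomial policy scales the base rate by $(1 - \text{iter} / \text{max\_iter})^{\text{power}}$. The function names are illustrative and the exact variant used by the authors is not stated in the text.

```python
def step_lr(base_lr, epoch, step_epochs=15, gamma=0.1):
    """Step policy: multiply the rate by gamma every step_epochs epochs."""
    return base_lr * gamma ** (epoch // step_epochs)


def poly_lr(base_lr, iteration, max_iter=50000, power=0.9):
    """Polynomial policy: decay smoothly to zero at max_iter."""
    return base_lr * (1.0 - iteration / max_iter) ** power


# PixelNet stage: 25 epochs, rate divided by 10 after epoch 15.
pixelnet_rates = [step_lr(1e-3, epoch) for epoch in range(25)]

# VoxelNet and fine-tuning stages: polynomial decay over 50000 iterations.
voxelnet_rates = [poly_lr(1e-3, it) for it in (0, 25000, 50000)]
```

A parameter group that is newly initialized would use ten times the value returned by these functions, following the ratio given in the text.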

C. Overall performance
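Following [4], three standard performance metrics for semantic segmentation are used for the Pixel-Voxel network evaluation: pixel accuracy, mean accuracy and mean IoU. The three metrics are defined as below:
$$\text{pixel accuracy} = \frac{\sum_i n_{ii}}{\sum_i t_i}, \qquad \text{mean accuracy} = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{t_i}, \qquad \text{mean IoU} = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}$$
where $n_{cl}$ is the number of classes, $n_{ij}$ is the number of pixels of class $i$ classified as class $j$, and $t_i = \sum_j n_{ij}$ is the total number of pixels belonging to class $i$.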
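As a worked example, the three metrics can be computed from a confusion matrix whose entry $(i, j)$ counts the pixels of class $i$ predicted as class $j$. The function name and the toy two-class matrix are illustrative, and the sketch assumes that every class occurs at least once in the ground truth.

```python
import numpy as np


def segmentation_metrics(confusion):
    """Return (pixel accuracy, mean accuracy, mean IoU) from a confusion matrix.

    confusion[i, j]: number of pixels of class i classified as class j.
    Assumes every class has at least one ground-truth pixel.
    """
    confusion = np.asarray(confusion, dtype=float)
    n_ii = np.diag(confusion)           # correctly classified pixels per class
    t_i = confusion.sum(axis=1)         # ground-truth pixels per class
    predicted_i = confusion.sum(axis=0) # pixels predicted as each class
    pixel_acc = n_ii.sum() / t_i.sum()
    mean_acc = np.mean(n_ii / t_i)
    mean_iou = np.mean(n_ii / (t_i + predicted_i - n_ii))
    return pixel_acc, mean_acc, mean_iou


# Toy example with two classes and 20 pixels.
pixel_acc, mean_acc, mean_iou = segmentation_metrics([[8, 2], [1, 9]])
```

For this toy matrix the pixel accuracy and mean accuracy are both 0.85, while the mean IoU is lower (about 0.74) because IoU also penalizes false positives.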
The qualitative results of the Pixel-Voxel network on the SUN RGB-D dataset are shown in Fig. 3. Because 3D shape information is preserved through VoxelNet, it can be seen that the results have accurate boundary shapes, such as the shape of the bed, the toilet and especially the legs of furniture.
The comparisons of overall performance and class-wise accuracy on the SUN RGB-D dataset are shown in Table I and Table II. The class-wise IoU of the Pixel-Voxel network is also provided. We achieved 79.04% overall pixel accuracy (a 0.64% improvement), 57.65% mean accuracy (a 4.25% improvement) and 44.24% mean IoU (a 1.94% improvement) over the state-of-the-art method [28]. Improvements in class-wise accuracy are achieved on 30 classes. In addition, the method in [28] is very slow because it uses computationally expensive CRF optimization at different scales.
The system with a pre-trained network is tested in a real-world environment, i.e., a living room and bedroom containing a curtain, a bed, etc., as shown in Figure 4. It can be seen that most of the voxels are correctly segmented and the results have accurate boundary shapes. However, some voxels at the boundaries are still assigned wrong predictions. Some erroneous predictions are caused by up-sampling the data through a bilateral filter to the same size as the Kinect V2 data. Another reason is that the network is trained using the public SUN RGB-D dataset but tested using real-world data, so some errors result from variations in illumination, categories, etc. In addition, the noise of the Kinect V2 also causes some erroneous predictions.
The runtime performance of our system is 5-6 Hz using the QHD data from the Kinect V2. During real-time RGB-D mapping, only a few key-frames are used for mapping. Most of the frames are discarded because of the small variation between two consecutive frames, so it is not necessary to segment all the frames in the sequence but only the key-frames. As mentioned in [12], a runtime performance of 5 Hz nearly satisfies the requirements of real-time dense 3D semantic mapping. The runtime performance can be boosted to 11-12 Hz using half-scale data, which is a trade-off between runtime and accuracy.
All the source code will be published upon acceptance of this paper. A real-time demo can be found at https://youtu.be/UbmfGsAHszc.

V. CONCLUSION
In this paper, a dense RGB-D semantic mapping system is developed for real-time applications. The runtime of the system can be boosted to 11-12 Hz using an i7 8-core PC with a Titan X GPU. A Pixel-Voxel network is proposed that achieves state-of-the-art semantic segmentation performance on the SUN RGB-D benchmark dataset. The proposed Pixel-Voxel Network integrates: 1) PixelNet that aggregates

Fig. 1: The pipeline of dense RGB-D semantic mapping with the Pixel-Voxel neural network. The RGB image and point cloud are obtained directly from an RGB-D camera, the Kinect V2. The RGB and point cloud data-pair of each key-frame is fed into the Pixel-Voxel network for semantic segmentation. Then the semantically labelled point clouds are combined incrementally through the visual odometry of RGB-D SLAM. Meanwhile, the label probability of each voxel is refined by a recursive Bayesian update. Finally, the dense 3D semantic map is generated.

Fig. 2: The architecture of the Pixel-Voxel Network. The PixelNet comprises three units: truncated CNN, context stack and skip architecture. The VoxelNet is composed of the convolution stacks, the local and global information combination stack and the reshape layer. The network obtains global context information through PixelNet and, meanwhile, preserves accurate local shape information through VoxelNet. The Softmax weighted fusion stack fuses 3 score maps from PixelNet and VoxelNet according to their respective confidence levels in different situations.
As in [7], 514 training and 558 testing RGB-D image pairs are excluded, because their raw depth images contain many invalid values, which would give strongly misleading supervision during training. We also randomly flip the 3D point cloud horizontally to augment the point cloud training data. Training VoxelNet on the original point clouds would be computationally very expensive, so we uniformly down-sample the original point cloud to sparse point clouds at 3 different scales, containing 16384, 4096 and 1024 points respectively. Please note that the input data of VoxelNet is an unordered point cloud stored in a long vector.
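The point cloud preparation for VoxelNet can be sketched as follows. The back-projection uses the standard pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy`, and it yields points in the camera frame rather than world coordinates. The text does not specify how the uniform down-sampling is done, so picking evenly spaced indices is an assumption, as are the function names and the toy image.

```python
import numpy as np


def depth_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    """Back-project a depth image into an (m, 6) array of x, y, z, R, G, B rows.

    depth: (h, w) depth values, 0 marking invalid pixels.
    rgb: (h, w, 3) color image aligned with the depth image.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0                      # skip invalid depth readings
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.column_stack([x, y, z, rgb[valid]])


def uniform_downsample(points, target):
    """Keep `target` points at evenly spaced indices (assumed sampling scheme)."""
    idx = np.linspace(0, len(points) - 1, num=target).astype(int)
    return points[idx]


# Toy 4x4 image with one invalid depth pixel and hypothetical intrinsics.
depth = np.ones((4, 4))
depth[0, 0] = 0.0
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
cloud = depth_to_point_cloud(depth, rgb, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
sparse = uniform_downsample(cloud, 4)
```

In the setting described above, `target` would take the values 16384, 4096 and 1024 to produce the three scales.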

Fig. 3: The qualitative results (best viewed in colour) of the Pixel-Voxel network on the SUN RGB-D dataset. For the different scenes, the following images are displayed: RGB image (row 1), 3D point cloud (row 2), ground truth image (row 3), 2D semantic image (row 4) and 3D semantic point cloud (row 5). The Pixel-Voxel network produces results with accurate boundary shapes, such as the shape of the bed, the toilet and especially the legs of furniture.

Fig. 4: The dense 3D map and dense 3D semantic map (best viewed in colour) of a living room and bedroom.

TABLE I: The comparison of overall performance on the SUN RGB-D dataset. Some reported results are copied from [5].

TABLE II: The comparison of class-wise accuracy on the SUN RGB-D dataset. Not all the methods in Table I provide the class-wise accuracy in their papers. The class-wise IoU of the Pixel-Voxel network (PVNet) is also provided.