The SUN RGB-D dataset contains 5285 synchronized RGB-D image pairs for training/validation and 5050 synchronized RGB-D image pairs for testing. The RGB-D image pairs are captured at different resolutions by 4 different RGB-D sensors: Kinect V1, Kinect V2, Xtion and RealSense. The task is to segment 37 indoor scene classes such as table, chair, sofa, window, door, etc. Pixel-wise annotations are available for the dataset. However, the extremely unbalanced distribution of class instances makes the task very challenging, so the rareness frequency threshold in the class-weighted loss function is set following the 85–15% rule.
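Neither the threshold nor the exact weighting scheme is reproduced here; the following is a minimal sketch of one plausible reading of the 85–15% rule, in which the frequent classes that together account for 85% of the pixel mass keep unit weight and the remaining rare classes are up-weighted (the function name and the `boost` factor are hypothetical).

```python
import numpy as np

def class_weights_85_15(pixel_counts, boost=2.0):
    """Sketch of rare-class up-weighting under an 85-15% reading.

    Classes are sorted by pixel frequency; the most frequent classes
    that together cover 85% of all pixels keep weight 1.0 and the
    remaining rare classes get `boost` (a hypothetical factor).
    """
    counts = np.asarray(pixel_counts, dtype=np.float64)
    freq = counts / counts.sum()
    order = np.argsort(freq)[::-1]                 # most frequent first
    cum = np.cumsum(freq[order])
    covered = np.concatenate(([0.0], cum[:-1]))    # mass covered before each class
    weights = np.ones_like(freq)
    weights[order[covered >= 0.85]] = boost        # classes outside the 85% mass
    return weights

# Example: a highly unbalanced 5-class pixel distribution.
print(class_weights_85_15([900_000, 80_000, 10_000, 7_000, 3_000]))
# -> [1. 2. 2. 2. 2.]
```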
The NYU V2 dataset provides 1449 synchronized, pixel-wise annotated RGB-D image pairs captured by Kinect V1, including 795 frames for training/validation and 654 frames for testing. The task is to segment 13 indoor scene classes similar to those of the SUN RGB-D dataset. Compared with other, larger RGB-D datasets, the NYU V2 dataset provides raw RGB-D videos rather than discrete single frames. Therefore, using the odometry of RGB-D SLAM, semantic segmentation based on multiple frames can be evaluated for dense semantic mapping.
4.1. Data Augmentation and Preprocessing
For the PixelNet training, all the RGB images are resized to the same resolution through bilateral filtering. We randomly flip the RGB images horizontally and rescale them slightly to augment the RGB training data.
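As an illustration, a minimal sketch of this augmentation; the flip probability of 0.5 and the rescale range are assumptions, since the text only says "slightly".

```python
import random
from PIL import Image

def augment_rgb(img: Image.Image, scale_range=(0.95, 1.05)) -> Image.Image:
    """Randomly flip an RGB image horizontally and rescale it slightly.

    `scale_range` is a hypothetical range; the paper does not state one.
    """
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    s = random.uniform(*scale_range)
    w, h = img.size
    return img.resize((int(w * s), int(h * s)), Image.BILINEAR)
```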
For the VoxelNet training, there is still no large-scale, ready-made 3D point cloud dataset available. We therefore generated the point clouds from the RGB-D image pairs and the corresponding camera intrinsic parameters through back-projection, i.e., Equation (5), for the SUN RGB-D and NYU V2 datasets. Following [14], 514 training and 558 testing RGB-D image pairs containing invalid values, which might lead to incorrect supervision during training, are excluded from the SUN RGB-D dataset. We also randomly flip the 3D point clouds horizontally to augment the training data. Using the original point clouds for VoxelNet training would incur a huge computational cost, so we uniformly down-sample each original point cloud to sparse point clouds at 3 different scales, containing 16,384, 4096 and 1024 points, respectively.
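Assuming Equation (5) is the standard pinhole back-projection, the generation and down-sampling steps can be sketched as follows (function names are illustrative):

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Back-project a depth image (H x W, metres) to an N x 3 point cloud
    using the standard pinhole model (assumed to match Equation (5)):
        x = (u - cx) * z / fx,  y = (v - cy) * z / fy,  z = depth[v, u]
    Pixels with invalid (zero) depth are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    valid = z > 0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)

def downsample(points, n):
    """Uniformly sample n points (with replacement if the cloud is small)."""
    idx = np.random.choice(len(points), n, replace=len(points) < n)
    return points[idx]

# The three sparse scales used for VoxelNet training:
# cloud = backproject(depth, fx, fy, cx, cy)
# sparse = [downsample(cloud, n) for n in (16384, 4096, 1024)]
```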
4.2. Network Training
The whole training process can be divided into 3 stages: PixelNet training, VoxelNet training and Pixel-Voxel network training. Firstly, PixelNet and VoxelNet are each trained separately. Then, the pre-trained weights are inherited for the Pixel-Voxel network training.
All the networks are trained using stochastic gradient descent with momentum. The batch size is set to 10, and the momentum and weight decay are fixed throughout training. The new parameters are randomly initialized from a zero-mean Gaussian distribution. The step learning policy is adopted for the PixelNet training, and the polynomial (poly) learning policy is adopted for the VoxelNet and Pixel-Voxel network training. The learning rate of the newly-initialized parameters is set to 10-times higher than that of the pre-trained parameters. Because there are 3 softmax weighted fusion stacks, 3 rounds of fine-tuning are required during the Pixel-Voxel network training.
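For reference, the two learning-rate policies as defined in Caffe, which the training follows; the concrete `base_lr`, `gamma`, `step_size` and `power` values are placeholders, since the paper's numbers are not reproduced here.

```python
def step_lr(base_lr, gamma, step_size, it):
    """Caffe 'step' policy: lr = base_lr * gamma^(floor(it / step_size))."""
    return base_lr * gamma ** (it // step_size)

def poly_lr(base_lr, power, it, max_iter):
    """Caffe 'poly' policy: lr = base_lr * (1 - it / max_iter)^power."""
    return base_lr * (1.0 - it / max_iter) ** power

# Newly-initialized layers use a 10-times higher learning rate than the
# pre-trained layers (lr_mult = 10 in the Caffe layer parameters).
```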
4.3. Overall Performance
Following [11], three standard performance metrics for semantic segmentation are used for the evaluation: pixel accuracy, mean accuracy and mean intersection over union (IoU). Let $n_{cl}$ denote the number of classes, $n_{ij}$ the number of pixels of class $i$ classified as class $j$ and $t_i = \sum_j n_{ij}$ the total number of pixels belonging to class $i$. The three metrics are defined as:

Pixel accuracy: $\sum_i n_{ii} / \sum_i t_i$

Mean accuracy: $\frac{1}{n_{cl}} \sum_i n_{ii} / t_i$

Mean IoU: $\frac{1}{n_{cl}} \sum_i n_{ii} / \left( t_i + \sum_j n_{ji} - n_{ii} \right)$
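These definitions translate directly into a few lines of code; the sketch below computes all three metrics from a confusion matrix (it assumes every class appears at least once in the ground truth, otherwise the per-class terms would need masking):

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels of class i predicted as class j."""
    conf = np.asarray(conf, dtype=np.float64)
    n_ii = np.diag(conf)           # correctly classified pixels per class
    t_i = conf.sum(axis=1)         # total pixels of class i (ground truth)
    pred_i = conf.sum(axis=0)      # pixels predicted as class i
    pixel_acc = n_ii.sum() / t_i.sum()
    mean_acc = np.mean(n_ii / t_i)
    mean_iou = np.mean(n_ii / (t_i + pred_i - n_ii))
    return pixel_acc, mean_acc, mean_iou
```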
In the experiment on the SUN RGB-D dataset, the performance of the Pixel-Voxel network and all the baselines is evaluated on a single frame. In the second experiment, the results are obtained by fusing multiple frames (provided by the raw data). More specifically, visual odometry is employed to associate the pixels in consecutive frames, and then a Bayesian-update-based 3D refinement is used to fuse all predictions. Similar strategies are used in the baseline methods, i.e., Hermans et al. [21], SemanticFusion [22] and Ma et al. [28].
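The exact refinement is not reproduced here; the sketch below shows the standard recursive Bayesian label update used in this line of work (e.g., SemanticFusion [22]): each 3D point keeps a per-class distribution that is multiplied by the softmax prediction of every newly associated pixel and renormalized.

```python
import numpy as np

def bayesian_update(prior, likelihood, eps=1e-12):
    """One recursive Bayesian update of a per-class label distribution.

    prior:      current per-class probabilities of a 3D point (sums to 1)
    likelihood: softmax probabilities from the newly associated pixel
    """
    posterior = prior * likelihood
    return posterior / (posterior.sum() + eps)

# Fusing three frames' predictions for one point (toy 3-class example):
p = np.full(3, 1.0 / 3.0)                      # uniform initial belief
for obs in ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.4, 0.1]):
    p = bayesian_update(p, np.asarray(obs))
print(p.argmax(), p)                           # class 0 dominates after fusion
```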
From Figure 5 and Figure 6, it is clear that combining VoxelNet with PixelNet improves the edge prediction significantly. Because VoxelNet preserves 3D shape information, the results have accurate boundaries, such as the shape of the bed, the toilet and especially the legs of the furniture.
The comparisons of overall performance on the SUN RGB-D and NYU V2 datasets are shown in Table 1 and Table 2, and the class-wise accuracies on the two datasets are shown in Table 3 and Table 4. The class-wise IoU of the Pixel-Voxel network is also provided. On both datasets, combining the VoxelNet edge refinement increases the overall pixel accuracy only slightly, for both the VGG-16 and the ResNet101 backbone, while the mean accuracy shows a significant increase for both backbones; the exact figures are given in Table 1 and Table 2.
Modelling the global context information while simultaneously preserving the local shape information are the two key problems in CNN-based semantic segmentation. The main idea of the Pixel-Voxel network is to leverage the advantages of two complementary modalities: it extracts high-level context features from RGB and fuses them with low-level geometric features from the point cloud. The improvement can be attributed to three parts: the hierarchical convolutional stack in PixelNet, the boundary refinement by VoxelNet and the softmax weighted fusion stack. First, the hierarchical convolutional stack can learn high-level contextual information through an incrementally-enlarged receptive field. As shown in Table 1 and Table 2, the standalone PixelNet achieves very competitive performance. Second, the proposed VoxelNet can refine the 3D object boundaries by learning low-level geometrical features from the point clouds. As shown in Figure 5, the objects have finer boundaries after combining with VoxelNet, and as shown in Table 1 and Table 2, the quantitative performance improves significantly through the 3D shape refinement from VoxelNet. Third, the proposed softmax weighted fusion layer can adaptively learn the confidence of each modality, so the predictions from different modalities can be fused more effectively. As shown in Table 1 and Table 2, the quantitative results also increase slightly through the fusion stack. Note that the overall accuracy cannot be improved significantly, as pixels/voxels on object edges occupy only a very small percentage of all pixels/voxels. However, the mean accuracy experiences a substantial improvement due to the increased accuracy on rare classes, for which the edge pixels occupy a relatively large percentage of all pixels.
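The exact layer definition is given elsewhere in the paper; as an illustration, one plausible form of such a fusion, with a learnable scalar weight per modality passed through a softmax so that the modality confidences sum to one:

```python
import numpy as np

def softmax_weighted_fusion(scores, w):
    """Fuse per-modality score maps with softmax-normalized weights.

    scores: list of M arrays of shape (C, H, W), one per modality
    w:      M learnable scalars; softmax(w) gives each modality's confidence
    """
    conf = np.exp(w - np.max(w))
    conf = conf / conf.sum()                       # softmax over modalities
    return sum(c * s for c, s in zip(conf, scores))

# Toy example: PixelNet trusted more than VoxelNet (w[0] > w[1]).
pixel_scores = np.random.randn(37, 4, 4)           # 37 SUN RGB-D classes
voxel_scores = np.random.randn(37, 4, 4)
fused = softmax_weighted_fusion([pixel_scores, voxel_scores],
                                np.array([1.0, 0.3]))
```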
Most state-of-the-art methods employ multi-scale CRF or a 2D/3D graph to refine the object boundaries. Their main limitation is slow inference, caused by the heavy use of computationally expensive multi-resolution CRF or graph optimization. Although their performance is slightly better than ours, these methods are unlikely to be applicable to real-time robotics applications. Our method preserves fine boundary shapes by learning low-level features from the 3D geometry data. There is no such optimization step in the Pixel-Voxel network, so it is faster than most state-of-the-art methods.
4.4. Dense RGB-D Semantic Mapping
The dense RGB-D semantic mapping system is implemented under the ROS (http://www.ros.org/) framework and executed on a desktop with an Intel i7-6800k CPU (3.4 GHz) and an NVIDIA TITAN X GPU (12 GB). Kinect V2 is used to obtain the RGB images and point clouds, and the IAI Kinect2 package (https://github.com/code-iai/iai_kinect2/) is employed to interface with ROS and calibrate the Kinect2 camera. The Pixel-Voxel network is implemented using the Caffe (http://caffe.berkeleyvision.org/) toolbox and trained on a TITAN X GPU, accelerated by CUDA and CUDNN.
The system with the pre-trained network was also tested in real-world environments, e.g., a living room and a bedroom containing a curtain, bed, etc., as shown in Figure 7. It can be seen that most of the points are correctly segmented and the results have accurate boundaries, but some points on the boundaries are still assigned wrong labels. Some erroneous predictions are caused by up-sampling the data through a bilateral filter to the same size as the Kinect V2 data. Furthermore, the network was trained on the SUN RGB-D and NYU V2 datasets but tested on real-world data, so some errors occur due to illumination variance, category variance, etc. In addition, the noise of the Kinect V2 also causes some prediction errors.
Using the quad high definition (QHD) data from Kinect2, with the RGB image resized and the point cloud down-sampled to three scales (16,384 × 1, 4096 × 1 and 1024 × 1), the runtime performance of our system with the VGG16 and the ResNet101 backbone is reported in Table 5. During real-time RGB-D mapping, only a few key-frames are used for mapping; most frames are abandoned because of the small variance between two consecutive frames, so it is not necessary to segment all the frames in the sequence, only the key-frames. As mentioned in [21], a 5-Hz runtime performance is nearly sufficient for real-time dense 3D semantic mapping. It is worth noting that the running time can be boosted further using half-sized data, with a corresponding decline in segmentation performance, so there is a trade-off between performance requirements and time consumption. The inference running times of the Pixel-Voxel network for different data sizes can be found in Table 5, and the corresponding decline in performance can be found in Table 6.
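The key-frame criterion is not spelled out here; a minimal sketch, assuming a frame is kept only when its mean absolute grayscale difference from the last key-frame exceeds a hypothetical threshold `tau`:

```python
import numpy as np

def is_keyframe(frame, last_keyframe, tau=12.0):
    """Keep a frame only if it differs enough from the last key-frame.

    `tau` is a hypothetical threshold on the mean absolute grayscale
    difference; near-identical consecutive frames are abandoned.
    """
    diff = np.abs(frame.astype(np.float32) - last_keyframe.astype(np.float32))
    return float(diff.mean()) > tau
```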