Fast 3D reconstruction with semantic information in road scenes is of great requirements for autonomous navigation. It involves issues of geometry and appearance in the field of computer vision. In this work, we propose a fast 3D semantic mapping system based on the monocular vision by fusion of localization, mapping, and scene parsing. From visual sequences, it can estimate the camera pose, calculate the depth, predict the semantic segmentation, and finally realize the 3D semantic mapping. Our system consists of three modules: a parallel visual Simultaneous Localization And Mapping (SLAM) and semantic segmentation module, an incrementally semantic transfer from 2D image to 3D point cloud, and a global optimization based on Conditional Random Field (CRF). It is a heuristic approach that improves the accuracy of the 3D semantic labeling in light of the spatial consistency on each step of 3D reconstruction. In our framework, there is no need to make semantic inference on each frame of sequence, since the 3D point cloud data with semantic information is corresponding to sparse reference frames. It saves on the computational cost and allows our mapping system to perform online. We evaluate the system on road scenes, e.g., KITTI, and observe a significant speed-up in the inference stage by labeling on the 3D point cloud.
This is an open access article distributed under the Creative Commons Attribution License
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited