PLIN: A Network for Pseudo-LiDAR Point Cloud Interpolation

LiDAR sensors provide dependable 3D spatial information at a low frequency (around 10 Hz) and have been widely applied in autonomous driving and unmanned aerial vehicles (UAVs). However, in a multi-sensor system, the frame rate of the camera, which is higher (around 20 Hz), has to be decreased to match that of the LiDAR. In this paper, we propose a novel Pseudo-LiDAR interpolation network (PLIN) to increase the frequency of LiDAR sensor data. PLIN can generate temporally and spatially high-quality point cloud sequences to match the high frequency of cameras. To achieve this goal, we design a coarse interpolation stage guided by consecutive sparse depth maps and their motion relationship. We also propose a refined interpolation stage guided by the realistic scene. Using this coarse-to-fine cascade structure, our method can progressively perceive multi-modal information and generate accurate intermediate point clouds. To the best of our knowledge, this is the first deep framework for Pseudo-LiDAR point cloud interpolation, which has appealing applications in navigation systems equipped with LiDAR and cameras. Experimental results demonstrate that PLIN achieves promising performance on the KITTI dataset, significantly outperforming the traditional interpolation method and the state-of-the-art video interpolation technique.


I. INTRODUCTION
In many computer vision tasks, dense and precise depth information of an outdoor environment is extremely important for various applications of autonomous driving and robotics. Recently, in 3D object detection [15], [18], 3D semantic segmentation [23], [24], and depth completion tasks [12], [16], [21], point clouds obtained by LiDAR have gained more and more attention due to their accurate spatial information. However, LiDAR sensors suffer from a low frequency for an autonomous driving system. Therefore, the time stamps of the LiDAR and other sensors such as cameras do not match. In order to match the scenes collected by these two sensors and achieve system synchronization, the frequency of the camera has to be decreased to that of the LiDAR sensor, which significantly wastes resources and results in inferior performance for high-speed applications. Therefore, it is quite appealing for autonomous driving systems to increase the frequency of the LiDAR sensor, and to further pursue high-quality synchronized perception in a multi-sensor system.
Fig. 1. Overall pipeline of the proposed method. PLIN aims to address the frequency mismatch between camera and LiDAR sensors, generating both temporally and spatially high-quality point cloud sequences. Our method takes three consecutive color images and two sparse depth maps as inputs, and interpolates an intermediate dense depth map, which is further transformed into a Pseudo-LiDAR point cloud using camera intrinsics.
To address the above problem, one possible solution is to interpolate an intermediate point cloud using two consecutive point clouds. However, directly working in 3D space and generating a new point cloud is challenging. Therefore, previous studies [3], [9], [19] prefer to perform different tasks on 2D depth maps or other projected views. Moreover, the target point cloud can be constructed in the form of Pseudo-LiDAR [22] using known camera intrinsics. For Pseudo-LiDAR interpolation, an intermediate depth map is first generated and then back-projected into 3D space. This method is superior to direct point cloud interpolation methods in terms of feasibility and efficiency.
Interpolation techniques have been widely used in many computer vision and robotics tasks, and can be classified into two categories, i.e., temporal interpolation [1], [8], [14] and spatial interpolation [10], [12], [26]. In video processing, video interpolation aims to temporally generate an intermediate frame using two consecutive frames. This technique has attracted increasing attention due to the growing demand for high-quality slow-motion videos. For example, Peleg et al. [14] formulated the interpolated motion estimation problem as classification rather than regression. This method achieves real-time temporal interpolation for high-resolution videos. Bao et al. [1] perceived both the depth and flow information to address the strong occlusion problem during new frame synthesis. To interpolate more in-between frames, Jiang et al. [8] proposed a variable-length multi-frame interpolation method to generate a frame at any time step between two given frames. However, this temporal interpolation technique has not yet been investigated in the field of point cloud interpolation.
In contrast to video interpolation, depth completion spatially fills missing depth values in a sparse depth map to generate a dense depth map. This technique has become an essential enhancement process for LiDARs as they usually only provide sparse measurements. For example, Ma et al. [12] fed the concatenation of a sparse depth map and a color image into an encoder-decoder network to produce a dense depth map using self-supervised learning. To obtain more accurate dense depth maps, Zhang et al. [26] employed a weight matrix to describe a surface normal and occlusion boundary. To perceive both surface normal and contextual information, Lee et al. [10] presented an end-to-end convolutional neural network for depth completion, which consists of a geometry network and a context network.
Motivated by the aforementioned studies, we propose a Pseudo-LiDAR point cloud interpolation network (PLIN) to generate both temporally and spatially high-quality point cloud sequences. The overall pipeline of the proposed method is illustrated in Fig. 1. Specifically, PLIN consists of a motion guidance module, a scene guidance module, and a transformation module. In the motion guidance module, we obtain bidirectional optical flow maps from color images, then warp two sparse depth maps into an approximate intermediate depth map. The original and warped depth maps, as well as the estimated optical flow maps, are fed into a coarse interpolation network to generate a coarse intermediate depth map. Subsequently, to produce a more accurate and dense depth map, we design a refined interpolation network with the guidance of realistic scenes. This scene guidance module generates a refined depth map using the intermediate color image and the estimated depth map. Finally, in the transformation module, the refined depth map is used to construct the target Pseudo-LiDAR point cloud using camera parameters. Compared with video interpolation and depth completion tasks, our method simultaneously performs interpolation in both spatial and temporal domains. To the best of our knowledge, this is the first deep framework for Pseudo-LiDAR point cloud interpolation. Experimental results demonstrate that PLIN achieves promising performance on the KITTI [4] dataset.
In summary, the contributions of the proposed method are threefold:
• To mitigate the low frequency limitation of LiDAR sensors, we present a Pseudo-LiDAR interpolation network to generate both temporally and spatially high-quality point cloud sequences.
• We use the bidirectional optical flow as an explicit motion guidance for interpolation. In addition, a warping layer is applied to improve the accuracy of depth prediction by approximating an intermediate frame. Finally, the in-between color image is leveraged to provide rich texture information of the realistic scene for more accurate and dense spatial reconstruction.
• We evaluate the proposed model on the KITTI benchmark [4], where it reasonably recovers the original intermediate 3D scene and outperforms other interpolation methods.

II. APPROACH
In this section, we describe the detailed architecture of the proposed Pseudo-LiDAR interpolation network (PLIN). Given three consecutive color images captured by a camera and two depth maps obtained by LiDAR, we first interpolate an intermediate 2D dense depth map and then back-project the depth map into Pseudo-LiDAR using known camera intrinsics. In addition, we explore the guidance of motion and scene to generate a realistic dense depth map, and adopt a warping layer to improve the accuracy of spatial reconstruction. As illustrated in Fig. 2, PLIN consists of a motion guidance module, a scene guidance module, and a transformation module. As a benefit of this coarse-to-fine cascade structure, our method can progressively perceive the multi-modal information and generate a temporally and spatially high-quality point cloud sequence. Finally, we introduce the overall training loss function of PLIN in this section.

A. Intermediate Depth Map Interpolation
In this part, we introduce the method for intermediate depth map synthesis. We first present a baseline network to generate an interpolation map using only two consecutive sparse depth maps. Then, to construct more reasonable slow-motion results, we use the motion information included in a bidirectional optical flow to guide the interpolation process. Moreover, a warping operation is applied to the input depth maps to produce an intermediate coarse depth map, which contains the explicit motion relationship. Finally, we use the in-between color image to refine the coarse depth map with the guidance of the scene, resulting in a more accurate and dense intermediate depth map.
1) Baseline Network: As mentioned in Section I, due to the challenges of 3D point clouds, previous works [3], [9], [19] perform different vision tasks on 2D depth maps or other projected views. Inspired by this principle, we first interpolate an intermediate depth map using two consecutive depth maps.
Given two sparse depth maps $d_{t-1}$ and $d_{t+1}$, our goal is to synthesize a depth map $d_t$ for the intermediate frame.
A straightforward way is to train a baseline network (i.e., an encoder-decoder structure) to predict the depth value of each pixel in the intermediate frame. Specifically, the encoder consists of a set of convolutions that increase the number of channels and reduce the feature resolution. The decoder has a symmetric structure. Moreover, to let different levels of information complement each other, the network structure contains multiple skip connections that combine low-level and high-level features at the same spatial resolution. Before being fed into the encoder, the consecutive sparse depth maps are each processed by a convolutional layer with eight 3×3 kernels, and the extracted features are concatenated as the input to the network. All convolutions in the baseline network are followed by batch normalization and a ReLU layer, with the exception of the last convolution layer, where a linear activation function is used. In the encoder part, we use ResNet-34 [5] as our backbone. In the decoder part, five fractionally-strided convolutional layers are used to increase the resolution of the feature maps. After these five convolutional layers, a multi-channel feature map is obtained. The feature map is then passed through a 1×1 convolution to generate a single-channel depth map. Thus, the intermediate dense depth map derived from the baseline network can be expressed as:
$$d_t = H_b(d_{t-1}, d_{t+1}), \tag{1}$$
where $d_{t-1}$ is the depth map of the previous frame, $d_{t+1}$ is the depth map of the next frame, and $H_b$ is an interpolation function learned by the baseline network.
The baseline network adopts a brute-force approach to learn the relationship between the intermediate depth map and the adjacent depth maps without any other guidance. However, due to the sparsity of the data, the complex motion relationship among $d_t$, $d_{t-1}$, and $d_{t+1}$ is difficult to estimate using only depth maps, and thus the obtained result usually shows an inferior appearance with blur artifacts. The results generated by the baseline network are shown in Section III-B. For the interpolation task, neural networks should not only learn to reproduce the appearance of the two input depth distributions, but also accurately perceive the motion relationship among consecutive depth maps. In order to achieve more reasonable interpolation results, we introduce optical flow to guide the generation of dense depth maps.
2) Motion Guidance Module: To consider the motion relationship between consecutive sparse depth maps, we design a motion guidance module that exploits optical flow to explicitly guide the generation of dense depth maps. In the video interpolation task, the optical flow is often used as an important input component, because it represents the direction and magnitude of motion. Inspired by video interpolation, we introduce the optical flow into our Pseudo-LiDAR interpolation problem.
Instead of directly investigating the optical flow on sparse depth maps, we learn the motion relationship on dense color images due to their abundant and precise contextual information. Recently, deep neural networks have shown excellent performance in optical flow estimation [6], [7], [25]. Given two consecutive color images $C_{t-1}$ and $C_{t+1}$, video interpolation [1], [8], [14] aims to generate the intermediate color image $C_t$ using a bidirectional optical flow:
$$F_{t-1 \to t+1} = H_f(C_{t-1}, C_{t+1}), \quad F_{t+1 \to t-1} = H_f(C_{t+1}, C_{t-1}), \tag{2}$$
where $H_f$ is an optical flow estimation function learned by neural networks. Assuming that the motion of adjacent frames is smooth, the optical flow $F_{t \to t-1}$ for color images $C_t$ and $C_{t-1}$, and the optical flow $F_{t \to t+1}$ for color images $C_t$ and $C_{t+1}$ can be calculated as follows:
$$F_{t \to t-1} = -\tfrac{1}{2} F_{t-1 \to t+1}, \quad F_{t \to t+1} = \tfrac{1}{2} F_{t-1 \to t+1}, \tag{3}$$
or
$$F_{t \to t-1} = \tfrac{1}{2} F_{t+1 \to t-1}, \quad F_{t \to t+1} = -\tfrac{1}{2} F_{t+1 \to t-1}. \tag{4}$$
Eq. 3 and Eq. 4 can be further combined into the following equation:
$$F_{t \to t-1} = -\tfrac{1}{4} F_{t-1 \to t+1} + \tfrac{1}{4} F_{t+1 \to t-1}, \quad F_{t \to t+1} = \tfrac{1}{4} F_{t-1 \to t+1} - \tfrac{1}{4} F_{t+1 \to t-1}. \tag{5}$$
Different from the aforementioned video interpolation task, we focus on the generation of an intermediate point cloud. Note that the intermediate color image $C_t$ is available in this problem because the frequency of the camera is higher than that of the LiDAR. Therefore, we can directly obtain $F_{t \to t-1}$ and $F_{t \to t+1}$ using the optical flow estimation network:
$$F_{t \to t-1} = H_f(C_t, C_{t-1}), \quad F_{t \to t+1} = H_f(C_t, C_{t+1}). \tag{6}$$
Considering that LiteFlowNet [6] outperforms FlowNet2 [7] on the challenging Sintel final pass [2] and KITTI benchmarks [4], while being 30 times smaller in model size and 1.36 times faster in running speed, we use LiteFlowNet to estimate the optical flow of the consecutive color images. The original sparse depth maps and the bidirectional optical flow are jointly fed into the motion guidance module. Thus, the intermediate depth map can be expressed by:
$$d_t = H_m(d_{t-1}, d_{t+1}, F_{t \to t-1}, F_{t \to t+1}), \tag{7}$$
where $H_m$ is a depth map interpolation function learned by the motion guidance module.
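The smooth-motion approximation used in video interpolation can be illustrated with a small NumPy sketch. It assumes linear motion and an intermediate frame exactly at the temporal midpoint, and averages the two single-direction estimates in the manner of [8]. The function name `midpoint_flows` and the `(H, W, 2)` flow layout are illustrative assumptions, not part of PLIN, which estimates the intermediate flows directly from $C_t$.

```python
import numpy as np

def midpoint_flows(flow_fwd, flow_bwd):
    """Approximate the flows from the midpoint frame t to frames t-1 and t+1.

    flow_fwd: flow from frame t-1 to frame t+1, shape (H, W, 2)
    flow_bwd: flow from frame t+1 to frame t-1, shape (H, W, 2)
    Assumes linear motion and that t lies halfway between the two frames.
    """
    # Each direction alone gives +/- half of the two-frame flow;
    # averaging both estimates yields the 1/4 coefficients.
    flow_t_prev = -0.25 * flow_fwd + 0.25 * flow_bwd
    flow_t_next = 0.25 * flow_fwd - 0.25 * flow_bwd
    return flow_t_prev, flow_t_next

# Usage: a scene moving 4 pixels to the right between t-1 and t+1.
fwd = np.zeros((2, 3, 2))
fwd[..., 0] = 4.0
bwd = -fwd
to_prev, to_next = midpoint_flows(fwd, bwd)
```

For perfectly linear motion the result is half of the two-frame displacement in each direction, which is what the midpoint assumption predicts.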
To make full use of the information provided by the optical flow, we leverage the bidirectional optical flow to directly produce an approximate intermediate depth map $\hat{d}_t$ by means of a warping operation:
$$\hat{d}_t = \gamma \cdot \mathrm{warp}(d_{t-1}, F_{t \to t-1}) + (1 - \gamma) \cdot \mathrm{warp}(d_{t+1}, F_{t \to t+1}), \tag{8}$$
where $\gamma$ is the weighting factor of the two input depth maps and $\mathrm{warp}$ is a backward warping function that can be implemented using bilinear interpolation [8], [11]. This warping layer transfers the depth maps of the adjacent frames to the position of the intermediate frame using the estimated optical flow. Rather than simply feeding the optical flow into neural networks, we further utilize the explicit motion relationship to build an approximate intermediate depth map, contributing to a more accurate 3D reconstruction.
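A minimal NumPy sketch of such a warping layer is given below, assuming a weighted sum of the two backward-warped depth maps with weight $\gamma$ and $1-\gamma$. The function names, the `(H, W, 2)` flow layout with `(dx, dy)` ordering, the border clamping, and the default `gamma=0.5` are all assumptions made for illustration; the paper does not specify them.

```python
import numpy as np

def backward_warp(depth, flow):
    """Sample `depth` at (x + dx, y + dy) using bilinear interpolation.

    depth: (H, W) array; flow: (H, W, 2) array with flow[..., 0] = dx
    and flow[..., 1] = dy. Sampling positions are clamped to the border.
    """
    h, w = depth.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0
    wy = y - y0
    top = (1 - wx) * depth[y0, x0] + wx * depth[y0, x1]
    bottom = (1 - wx) * depth[y1, x0] + wx * depth[y1, x1]
    return (1 - wy) * top + wy * bottom

def warped_intermediate(d_prev, d_next, flow_t_prev, flow_t_next, gamma=0.5):
    """Blend the two warped depth maps; gamma=0.5 is an assumed default."""
    return (gamma * backward_warp(d_prev, flow_t_prev)
            + (1 - gamma) * backward_warp(d_next, flow_t_next))
```

Note that on sparse depth maps, plain bilinear sampling blends valid depths with empty (zero) pixels; a practical implementation would probably need to handle validity explicitly, which this sketch does not attempt.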
Therefore, the input of the motion guidance module includes the two consecutive sparse depth maps, the estimated bidirectional optical flow, and the warped intermediate depth map. The interpolated intermediate depth map $d_t^c$ produced by the motion guidance module can be formulated as:
$$d_t^c = H_m(d_{t-1}, d_{t+1}, F_{t \to t-1}, F_{t \to t+1}, \hat{d}_t). \tag{9}$$
3) Scene Guidance Module: In contrast to the sparse point cloud, color images have richer and denser texture information, which significantly benefits complete scene understanding. In order to obtain more precise and dense interpolation results, we design a scene guidance module to refine the coarse depth map $d_t^c$ derived by the motion guidance module. We first use two convolutional layers with 8 channels each to extract features of the coarse depth map and the color image, respectively. Subsequently, the convolved features are fused to form the input of the refined interpolation network. The refined interpolation network is a lightweight U-Net structure [17] with fewer layers than the coarse interpolation network. Specifically, the encoder contains five convolutional layers and the decoder contains four deconvolutional layers. Batch normalization and the ReLU activation function are applied to all convolutional and deconvolutional layers, except for the last deconvolutional layer, which uses a linear activation function. In addition, there are skip connections between feature maps with the same spatial resolution, so that local and global information complement each other. Thus, the intermediate dense depth map generated by the scene guidance module $H_s$ can be expressed as:
$$d_t = H_s(d_t^c, C_t). \tag{10}$$

B. Transformation Module
Once the intermediate depth map is generated, the point cloud can be constructed in the form of Pseudo-LiDAR. According to the pinhole camera model, each spatial point $(x, y, z)$ corresponds to pixel coordinates $(u, v, d)$, where $d$ is the depth value. Using the camera parameters, the interpolated depth map $d_t$ can be converted into the coordinates of a point cloud. Here, we can derive the 3D position $(x, y, z)$ of each pixel $(u, v)$ in the camera coordinate system as
$$z = d_t(u, v), \quad x = \frac{(u - c_u)\, z}{f_u}, \quad y = \frac{(v - c_v)\, z}{f_v}, \tag{11}$$
where $(c_u, c_v)$ is the pixel position corresponding to the center of the camera aperture, and $f_v$ and $f_u$ are the vertical and horizontal focal lengths, respectively. By converting all pixels in the depth map into 3D coordinates, we obtain a set of points $\{(x_i, y_i, z_i)\}_{i=1}^{n}$, where $n$ is the number of points. The point cloud obtained from the intermediate depth map is referred to as Pseudo-LiDAR [22].
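The back-projection under the pinhole model can be sketched in NumPy as follows. The function name `depth_to_pseudo_lidar` and the choice to drop pixels with non-positive depth are assumptions for illustration; the intrinsics `f_u`, `f_v`, `c_u`, `c_v` would come from the camera calibration.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, f_u, f_v, c_u, c_v):
    """Back-project an (H, W) depth map into an (n, 3) array of points
    (x, y, z) in the camera coordinate system."""
    h, w = depth.shape
    # v indexes rows (vertical pixel coordinate), u indexes columns.
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    valid = depth > 0  # assumption: zero depth marks a missing value
    z = depth[valid]
    x = (u[valid] - c_u) * z / f_u
    y = (v[valid] - c_v) * z / f_v
    return np.stack([x, y, z], axis=1)

# Usage with made-up intrinsics on a tiny depth map.
points = depth_to_pseudo_lidar(np.full((2, 2), 10.0), 100.0, 100.0, 0.0, 0.0)
```

The resulting points live in the camera frame; expressing them in the LiDAR frame would additionally require the camera-to-LiDAR extrinsics, which this sketch leaves out.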

C. Loss Function
The overall loss function of PLIN is a linear combination of the coarse depth loss and the refined depth loss. The ground truth depth map of the intermediate frame is used to supervise the network prediction. We adopt the L2 loss between the predicted dense depth map $pred$ and the ground truth $gt$ as follows:
$$L(pred, gt) = \| pred - gt \|_2^2. \tag{12}$$
Our final loss function can be expressed as follows:
$$L_{total} = w_1 L(d_{coarse}, gt) + w_2 L(d_{refined}, gt), \tag{13}$$
where $d_{coarse}$ refers to the intermediate depth map of the coarse interpolation network, $d_{refined}$ refers to the intermediate dense depth map of the refined interpolation network, and $w_1$ and $w_2$ are weights that balance the two loss terms. In this work, $w_1$ and $w_2$ are empirically set to 0.1 and 1, respectively.
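The weighted combination of the coarse and refined losses can be sketched as below. Two details are assumptions rather than statements from the paper: the loss is averaged over pixels, and only pixels with a positive ground truth value are supervised (KITTI ground truth is not fully dense). The function names are illustrative.

```python
import numpy as np

def l2_loss(pred, gt):
    """Mean squared error over pixels that have ground truth.

    Assumption: gt == 0 marks pixels without ground truth, which are ignored.
    """
    mask = gt > 0
    if not mask.any():
        return 0.0
    return float(np.mean((pred[mask] - gt[mask]) ** 2))

def plin_loss(d_coarse, d_refined, gt, w1=0.1, w2=1.0):
    """Linear combination of coarse and refined losses (w1=0.1, w2=1)."""
    return w1 * l2_loss(d_coarse, gt) + w2 * l2_loss(d_refined, gt)

# Usage: a coarse map that is off by 1 everywhere and a perfect refined map.
gt = np.array([[1.0, 0.0], [2.0, 0.0]])
loss = plin_loss(gt + 1.0, gt, gt)
```

With these weights the refined output dominates the objective, while the coarse term still provides a supervision signal to the first stage.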

III. EXPERIMENTS
In this section, we first describe the training dataset and strategy of the proposed PLIN network. We then perform several ablation experiments to verify the effectiveness of different modules in our network. To demonstrate its superiority, we also compare our method with a traditional method and an advanced video interpolation method. As illustrated in Fig. 3, the depth maps obtained by our method show clear boundaries and display denser distributions than the ground truth dense depth maps.

A. Dataset and Strategy
1) Dataset: The main application scenario of our model is on-board LiDARs for outdoor scenes. Our experiments were performed on the KITTI depth completion dataset [20] and the raw data dataset [4]. The KITTI dataset provides depth information and color images. The dataset contains 85,898 training samples, 6,852 validation samples, and 1,000 test samples. Considering that the training set contains some frame sequences with tiny motion, we select 40,000 scenes with relatively large motion to train our network.
2) Strategy: Since the upper part of the depth map obtained by LiDAR projection does not provide any depth information, our network takes 1216×256 images obtained by bottom-cropping the original images. In addition, we perform data augmentation operations such as random flipping and color adjustment on the training data. The whole network was trained in an end-to-end manner. We used the Adam optimizer with an initial learning rate of $10^{-5}$, and the learning rate was dropped by a factor of 0.1 after every five epochs. Our network was implemented in PyTorch [13] with a batch size of 1 and trained on a 1080Ti GPU for about 60 hours.
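The bottom crop and the step learning-rate schedule described above can be written down in a few lines. This is a sketch under assumptions: epochs are counted from 0, and any excess width is removed with a horizontal center crop, which the text does not specify. The function names are illustrative.

```python
import numpy as np

def learning_rate(epoch, base_lr=1e-5, decay=0.1, step=5):
    """Step schedule: multiply the rate by `decay` after every `step` epochs.

    Assumption: `epoch` is 0-indexed.
    """
    return base_lr * decay ** (epoch // step)

def bottom_crop(image, out_h=256, out_w=1216):
    """Keep the bottom `out_h` rows of an (H, W, ...) array.

    Assumption: if the image is wider than `out_w`, crop around the
    horizontal center.
    """
    h, w = image.shape[:2]
    left = (w - out_w) // 2
    return image[h - out_h:, left:left + out_w]

# Usage on an array with the 1216x352 size of KITTI depth completion frames.
cropped = bottom_crop(np.zeros((352, 1216, 3)))
rates = [learning_rate(e) for e in (0, 4, 5, 10)]
```

In PyTorch, the same schedule would typically be expressed with a step-based learning-rate scheduler attached to the Adam optimizer.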

B. Ablation Study
To evaluate the effectiveness of each module, we perform an ablation study on the proposed network. First, we conduct three experiments on the coarse interpolation network.
• The baseline network only takes two consecutive sparse depth maps as the input (baseline).
• The forward and backward sparse depth maps and estimated optical flow maps are fed into the baseline network (baseline + flow).
• The baseline network receives the forward and backward depth maps, the bidirectional optical flow, and the depth maps derived by the warping layer (baseline + warp flow).
Similarly, there are two experiments on the refined interpolation network as follows.
• The refined network takes the intermediate color image and two depth maps as its inputs (baseline + rgb).
• The complete configuration, including the coarse interpolation network with motion guidance using the warping operation, and the refined interpolation network with scene guidance (ours).
For the evaluation of interpolated depth maps, we choose four metrics: the root mean square error (RMSE), mean absolute error (MAE), root mean square error of the inverse depth (iRMSE), and mean absolute error of the inverse depth (iMAE). Similar to [10], [12], [26], we primarily focus on RMSE, which is the leading metric on the depth completion benchmark. The results of the ablation study are listed in Table I. The ablation study shows that the complete network (ours) achieves the best interpolation performance. Regarding the individual modules of PLIN, the baseline network achieves more accurate results with the guidance of motion, owing to the optical flow provided between consecutive frames. Moreover, compared with the direct use of optical flow, the warping layer significantly improves the interpolation performance because of the more explicit intermediate representation. As a benefit of the rich texture information in color images, the baseline network with the guidance of the scene outperforms the baseline network that only takes two consecutive depth maps as inputs. For the other, minor evaluation metrics, the motion guidance module slightly increases their values, as the optical flow estimated by LiteFlowNet [6] contains some noise. To intuitively compare these different performances, we visualize the interpolated results of a scene obtained by the above methods in Fig. 4. The complete network generates the most realistic details and distributions of the intermediate point cloud.
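The four metrics can be computed as in the following sketch. It assumes that pixels without ground truth (or with a non-positive prediction) are excluded, and it returns values in the units of the inputs; any conversion to the benchmark's reporting units is left to the caller. The function name `depth_metrics` is illustrative.

```python
import numpy as np

def depth_metrics(pred, gt):
    """RMSE, MAE, iRMSE, and iMAE over pixels with valid depth.

    Assumption: a pixel is valid when both gt and pred are positive,
    which also keeps the inverse depths finite.
    """
    valid = (gt > 0) & (pred > 0)
    p, g = pred[valid], gt[valid]
    err = p - g
    inv_err = 1.0 / p - 1.0 / g
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "iRMSE": float(np.sqrt(np.mean(inv_err ** 2))),
        "iMAE": float(np.mean(np.abs(inv_err))),
    }

# Usage: the third pixel has no ground truth and is ignored.
metrics = depth_metrics(np.array([1.0, 4.0, 5.0]), np.array([2.0, 4.0, 0.0]))
```

Because the inverse-depth metrics weight nearby points more heavily, they can move in a different direction from RMSE and MAE when errors concentrate at close range.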

C. Comparison Results
Because PLIN is the first work on point cloud interpolation, we only compare our method with the traditional interpolation method that averages the two consecutive depth maps and the state-of-the-art video interpolation network Super Slomo [8]. Note that we retrain the Super Slomo network using the KITTI depth completion dataset [20].
1) Quantitative Comparison: Table II reports the quantitative evaluation results of different methods. The traditional method averages two consecutive depth maps to obtain an intermediate depth map. However, the pixel values are relatively sparse and there is no obvious correspondence between the two maps, so the traditional method is not well suited to the interpolation of point clouds. Moreover, the video interpolation network [8] cannot handle the point cloud interpolation problem due to the challenging motion perception on the sparse depth map. Compared with these two methods, the proposed PLIN network is specially designed for the point cloud interpolation task and jointly guided by the explicit motion and realistic scenes, achieving the best performance.
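For reference, the averaging baseline can be sketched as follows. How missing pixels are treated is not stated in the text, so this version makes an assumption: it averages where both maps have a measurement and copies the single available value elsewhere. The function name `average_interpolation` is illustrative.

```python
import numpy as np

def average_interpolation(d_prev, d_next):
    """Pixel-wise average of two sparse depth maps.

    Assumption: zero marks a missing measurement; pixels seen in only
    one map keep that value, pixels seen in neither stay zero.
    """
    d_prev = np.asarray(d_prev, dtype=float)
    d_next = np.asarray(d_next, dtype=float)
    valid_prev = d_prev > 0
    valid_next = d_next > 0
    count = valid_prev.astype(float) + valid_next.astype(float)
    total = d_prev * valid_prev + d_next * valid_next
    out = np.zeros_like(total)
    np.divide(total, count, out=out, where=count > 0)
    return out

# Usage on a single image row.
row = average_interpolation([2.0, 0.0, 0.0, 4.0], [4.0, 6.0, 0.0, 0.0])
```

Since no motion is modeled, an object that moves between the two frames leaves depth values at both of its positions in the output, which is consistent with the spurious objects reported for this baseline.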
2) Visual Comparison: For visual comparison, we show three interpolated results achieved by different methods. As illustrated in Fig. 5, because it relies on plain averaging, the traditional method generates fake objects which do not exist in the original scene. Super Slomo [8], however, produces disordered point clouds due to its insufficient learning capability on motion and scenes. In contrast, our model produces sharper outlines and boundaries for small objects such as cars and people. In addition, the overall distribution of the Pseudo-LiDAR is more similar to that of the ground truth point cloud.

IV. CONCLUSIONS
In this paper, we have proposed a network to generate both temporally and spatially high-quality point cloud sequences. In order to gradually perceive different modal conditions, we adopted a coarse-to-fine cascade structure. Specifically, the bidirectional optical flow explicitly guides consecutive sparse depth maps to generate an intermediate depth map, which is further improved by the warping layer. To obtain more accurate and dense depth information, the scene guidance module exploits the intermediate color image to refine the coarse depth map. To the best of our knowledge, this is the first deep framework for Pseudo-LiDAR interpolation, which increases the frequency of the LiDAR sensor and has appealing applications for more efficient multi-sensor systems.

Fig. 2. Overview of the proposed Pseudo-LiDAR interpolation network (PLIN). The whole architecture consists of three modules: the motion guidance module, the scene guidance module, and the transformation module.

Fig. 3. Results of interpolated depth maps obtained by PLIN. For each example, we show the intermediate color image, sparse depth map, dense depth map, and our result. Our method can recover the original depth information and generate much denser distributions.

Fig. 4. Visual results of the ablation study. We show the color image, interpolated dense depth map, two views of the generated Pseudo-LiDAR, and enlarged areas. The complete network produces a more accurate depth map, and the distribution and shape of its Pseudo-LiDAR are more similar to those of the ground truth point cloud.

Fig. 5. Visual comparisons of the point clouds obtained by different methods. We show the intermediate color images, ground truth, and interpolation results of Pseudo-LiDAR point clouds by three methods. Our model produces sharper outlines and boundaries for small objects such as cars and people.

TABLE I. ABLATION STUDY: PERFORMANCE ACHIEVED BY OUR NETWORK WITH AND WITHOUT EACH MODULE.

TABLE II. QUANTITATIVE EVALUATION RESULTS OF THE TRADITIONAL INTERPOLATION METHOD, SUPER SLOMO [8], AND OUR METHOD.