The cost of multi-beam LiDAR has always been a significant obstacle to the widespread adoption of autonomous driving and mobile intelligent robots. The point clouds captured by low-cost LiDAR are typically sparse and have low resolution, which limits the performance of deep learning models and makes data annotation more challenging. Furthermore, because these point clouds are collected from moving autonomous vehicles or mobile intelligent robots, the data carry temporal information in addition to spatial information. Our objective is to generate stable and accurate dense point clouds of road scenes from consecutive sparse point clouds with temporal features. To this end, we propose TSE-UNet, a model designed to upsample point clouds from low-cost LiDAR for the task of point cloud super-resolution.
The entire pipeline of the proposed method is illustrated in
Figure 1. First, datasets for training and evaluating the model are collected in both the CARLA simulator and on our real-world platform: CARLA32-128 and Ruby32-128 (the latter captured with a 128-beam LiDAR, RS-128), each pairing 32-beam sparse point clouds with 128-beam dense ones. Next, we recast the point cloud super-resolution task from a 3D spatial interpolation problem into a 2D image super-resolution problem, reducing the computational complexity of the deep learning model and enabling real-time application. Specifically, based on coordinate mapping, we project the 3D sparse point clouds onto sequential 2D projection images and then encode these images into feature maps using transposed convolution layers. The feature maps are then processed by the proposed TSE-UNet model. Built on the encoder–decoder structure of UNet, the model takes multiple consecutive frames, i.e., multiple consecutive feature maps, as input. We combine the feature maps of adjacent frames, enhance important features temporally with channel attention modules, and reinforce the spatial correlations between adjacent frames with spatial attention modules, yielding more accurate object localization and semantic information. The encoder further enlarges the spatial receptive field and downsamples the features through dilated convolution layers, progressively extracting high-level semantic features. The decoder starts from the encoder's highest-level semantic feature map and restores low-level semantics, ultimately outputting the super-resolution prediction for the current frame. Finally, the prediction is mapped back to 3D space through reverse projection. Subsequent sections explain each part of the pipeline in detail.
3.2. Point Cloud Projection and Back-Projection
The density of a point cloud varies with the number of laser beams and the rotation speed of the LiDAR, resulting in different vertical and horizontal angular resolutions. Based on the 3D coordinates $(x, y, z)$ of the points and the corresponding vertical angle, azimuth angle, and offset of each laser beam, we project the point clouds from 3D space onto a two-dimensional plane according to Equation (1), making these data easier to process with deep learning methods designed for two-dimensional images. Subsequently, we use Equation (2) to back-project the model's prediction results into 3D space.
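To make the coordinate mapping concrete, the following NumPy sketch implements a spherical projection of this kind and its inverse. It is an illustration only: the image size, vertical field of view, and the use of range values as pixel intensities are our assumptions, and the per-beam offset correction of Equations (1) and (2) is omitted.

```python
import numpy as np

# Illustrative parameters (assumptions, not the paper's exact values):
H, W = 32, 1024                 # rows = laser beams, cols = azimuth bins
FOV_UP, FOV_DOWN = 15.0, -25.0  # vertical field of view in degrees

def project_to_2d(points):
    """Project an (N, 3) point cloud to an (H, W) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)          # range of each point
    omega = np.degrees(np.arcsin(z / r))        # vertical angle
    alpha = np.degrees(np.arctan2(y, x))        # azimuth angle
    # Map angles to pixel coordinates (row 0 = top of the vertical FOV).
    u = (FOV_UP - omega) / (FOV_UP - FOV_DOWN) * (H - 1)
    v = (alpha + 180.0) / 360.0 * (W - 1)
    u = np.clip(np.round(u).astype(int), 0, H - 1)
    v = np.clip(np.round(v).astype(int), 0, W - 1)
    image = np.zeros((H, W), dtype=np.float32)
    image[u, v] = r                             # store range as pixel value
    return image

def back_project_to_3d(image):
    """Recover an (N, 3) point cloud from an (H, W) range image."""
    u, v = np.nonzero(image)
    r = image[u, v]
    omega = np.radians(FOV_UP - u / (H - 1) * (FOV_UP - FOV_DOWN))
    alpha = np.radians(v / (W - 1) * 360.0 - 180.0)
    x = r * np.cos(omega) * np.cos(alpha)
    y = r * np.cos(omega) * np.sin(alpha)
    z = r * np.sin(omega)
    return np.stack([x, y, z], axis=1)
```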
Figure 4 illustrates the results of projecting 3D point cloud data from CARLA32-128 and Ruby32-128 onto 2D images. It can be observed that the widths of the corresponding 2D images are the same for CARLA-32 and CARLA-128 because of their identical horizontal FOV and angular resolution during data acquisition. However, the difference lies in the heights of the 2D images, with the image for CARLA-128 being four times taller than that for CARLA-32, as shown in the first and second rows of
Figure 4. Similarly, Ruby-32 and Ruby-128, shown in the third and fourth rows of
Figure 4, share the same characteristics. Recovering the dense point cloud from the sparse one therefore amounts to quadrupling the height of the projection image, which is exactly the task our TSE-UNet model addresses.
3.3. TSE-UNet Model
To reconstruct a dense point cloud from a sparse one, the model must both extract features and high-level semantics from the sparse input and reconstruct fine-grained low-level information from those high-level semantics. This requirement coincides with the design of the semantic segmentation model UNet. In addition, because the point clouds are collected dynamically by an autonomous vehicle, the semantics of a given frame are strongly correlated with those of the preceding frames, and each point in the current frame is strongly spatially correlated with its neighboring points. We therefore combine a temporal and spatial feature-enhanced (TSE) module with UNet, yielding TSE-UNet, to accomplish the task of point cloud super-resolution.
Figure 5 shows the architecture of the TSE-UNet.
TSE-UNet incorporates the encoder–decoder structure and skip connections of UNet while introducing adjacent temporal feature aggregation and TSE modules.
The encoder is responsible for extracting high-level semantic features from the point cloud data. It consists of a pre-processing layer and four stages.
The input to the pre-processing layer includes the current frame and the projected images of the previous 15 frames, denoted as $\{I_{t-15}, \ldots, I_{t-1}, I_t\}$, where each $I_k$ has dimension $1 \times H \times W$. The pre-processing layer applies the following formula to each input image:

$$F_k = \mathrm{TConv}\big(\mathrm{TConv}(I_k)\big), \qquad k = t-15, \ldots, t,$$

where $\mathrm{TConv}$ represents the transposed convolution operation with a stride of 2 in the height dimension. The dimension of $F_k$ is $C \times 4H \times W$, where $C$ is the number of feature channels produced by the pre-processing layer. The purpose of the pre-processing is to increase the height of the sparse point cloud image to four times its original size, aligning it with the height of the dense point cloud for pixel-level predictions.
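As a concrete illustration, the pre-processing layer can be written in PyTorch roughly as follows. This is a minimal sketch assuming the 4× height increase is obtained by stacking two stride-(2, 1) transposed convolutions; the channel count $C = 32$, kernel sizes, and ReLU activations are illustrative choices rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PreProcess(nn.Module):
    """Raise the height of a sparse projection image from H to 4H."""
    def __init__(self, channels=32):  # channel count C is an assumed value
        super().__init__()
        self.layers = nn.Sequential(
            # Each transposed convolution has stride 2 in the height
            # dimension only, so two of them give the 4x height increase.
            nn.ConvTranspose2d(1, channels, kernel_size=(4, 3),
                               stride=(2, 1), padding=(1, 1)),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels, channels, kernel_size=(4, 3),
                               stride=(2, 1), padding=(1, 1)),
            nn.ReLU(inplace=True),
        )

    def forward(self, image):          # image: (N, 1, H, W)
        return self.layers(image)      # output: (N, C, 4H, W)

# Example: a 32-beam projection image of width 1024.
x = torch.randn(1, 1, 32, 1024)
print(PreProcess()(x).shape)           # torch.Size([1, 32, 128, 1024])
```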
The structure is consistent across the four stages, each comprising adjacent temporal feature aggregation, a TSE module, and a down-sampling large-scale dilated convolution. Taking the operations on $F_t$ and $F_{t-1}$ in stage 1 as an example, the adjacent temporal feature aggregation operation is computed by the following formula:

$$A_t = \mathrm{Concat}(F_t, F_{t-1}),$$

where $\mathrm{Concat}$ denotes the concatenation operation along the depth (channel) dimension, resulting in $A_t$ with dimensions $2C \times 4H \times W$.
Subsequently, $A_t$ passes through the TSE module:

$$\begin{aligned} \tilde{A}_t &= \mathrm{Conv}(A_t),\\ E_t &= \tilde{A}_t \odot \sigma\big(\mathrm{AvgPool}(\tilde{A}_t)\big),\\ T_t &= E_t \odot \sigma\big(\mathrm{PWConv}(E_t)\big), \end{aligned}$$

where $\mathrm{Conv}$ represents a 2D convolution with a kernel size of 3 and a stride of 1, utilized to further integrate features between adjacent frames; $\sigma$ denotes the sigmoid function; and $\mathrm{AvgPool}$ represents channel-wise average pooling, which calculates the importance of the temporal features in the channel dimension and injects it into the original features through element-wise multiplication ($\odot$), thereby achieving temporal feature enhancement. $\mathrm{PWConv}$ is a point-wise convolution with a kernel size of 1, a stride of 1, and an output dimension of 1, used to assess the importance of the spatial dimensions and inject it into the original features through element-wise multiplication, completing spatial feature enhancement. The TSE module maintains the feature dimensions at $2C \times 4H \times W$. Then, the computation continues with the following formula:

$$O_t^{1} = \mathrm{DConv}(T_t),$$

where $\mathrm{DConv}$ denotes a dilated convolution with a kernel size of 3, a stride of 2, and a dilation rate of 2, employed to enlarge the receptive field while accomplishing feature downsampling. The resulting feature $O_t^{1}$ has dimensions $2C \times 2H \times \frac{W}{2}$. At this point, the calculation of adjacent-frame features in a single stage is complete. The computation in the other stages of the encoder and between other adjacent frames follows the same procedure.
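The following PyTorch sketch of a single encoder stage reflects our reading of the formulas above; the channel counts and padding choices are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TSEStage(nn.Module):
    """One encoder stage: aggregation, TSE enhancement, dilated downsampling."""
    def __init__(self, channels=32):   # channel count C is an assumed value
        super().__init__()
        c2 = 2 * channels
        # Fuse the concatenated adjacent-frame features (kernel 3, stride 1).
        self.fuse = nn.Conv2d(c2, c2, kernel_size=3, stride=1, padding=1)
        # Point-wise convolution producing a single-channel spatial map.
        self.pwconv = nn.Conv2d(c2, 1, kernel_size=1, stride=1)
        # Dilated convolution (kernel 3, stride 2, dilation 2) for downsampling.
        self.down = nn.Conv2d(c2, c2, kernel_size=3, stride=2,
                              dilation=2, padding=2)

    def forward(self, f_cur, f_prev):            # each: (N, C, 4H, W)
        a = torch.cat([f_cur, f_prev], dim=1)    # aggregation -> (N, 2C, 4H, W)
        a = self.fuse(a)
        # Temporal (channel) attention: average over the spatial dimensions.
        chan = torch.sigmoid(a.mean(dim=(2, 3), keepdim=True))  # (N, 2C, 1, 1)
        e = a * chan
        # Spatial attention: single-channel importance map.
        spat = torch.sigmoid(self.pwconv(e))     # (N, 1, 4H, W)
        t = e * spat
        return self.down(t)                      # (N, 2C, 2H, W/2)

f_t, f_t1 = torch.randn(2, 1, 32, 128, 1024)     # two adjacent-frame features
print(TSEStage()(f_t, f_t1).shape)               # torch.Size([1, 64, 64, 512])
```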
The decoder also consists of four stages, each including a feature upsampling layer that produces features with the same dimensions as the corresponding encoder level, so that the encoder features can be concatenated through the skip connections and participate in the subsequent computation. The decoder ultimately outputs a prediction (denoted as $P$) with dimensions $1 \times 4H \times W$. This prediction is compared, via the structural similarity index measure (SSIM), against the dense point cloud projection image corresponding to the current input frame $I_t$, denoted as $G$. The SSIM value is calculated as follows:

$$\mathrm{SSIM}(G, P) = \frac{(2\mu_G \mu_P + c_1)(2\sigma_{GP} + c_2)}{(\mu_G^2 + \mu_P^2 + c_1)(\sigma_G^2 + \sigma_P^2 + c_2)},$$

where $\mu_G$ and $\mu_P$ represent the mean values of the images $G$ and $P$, respectively; $\sigma_G^2$ and $\sigma_P^2$ are their variances; and $\sigma_{GP}$ is the covariance. $c_1$ and $c_2$ are constants used to prevent division by zero in the denominator. In practical applications, the loss value is computed as $\mathcal{L} = 1 - \mathrm{SSIM}(G, P)$, where a smaller value indicates greater similarity between the images. All model parameters are then optimized through backpropagation.
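For reference, the SSIM loss described above can be computed as in the following PyTorch sketch. It evaluates the formula globally over each image (rather than over local windows, as is also common), and the constants $c_1 = 0.01^2$ and $c_2 = 0.03^2$ for inputs scaled to $[0, 1]$ are conventional assumptions rather than the paper's stated values.

```python
import torch

def ssim_loss(g, p, c1=0.01 ** 2, c2=0.03 ** 2):
    """Loss = 1 - SSIM(G, P), computed over whole images scaled to [0, 1].

    g, p: tensors of shape (N, 1, 4H, W); returns the mean loss over the batch.
    """
    dims = (1, 2, 3)
    mu_g = g.mean(dim=dims)
    mu_p = p.mean(dim=dims)
    var_g = g.var(dim=dims, unbiased=False)
    var_p = p.var(dim=dims, unbiased=False)
    cov = ((g - g.mean(dim=dims, keepdim=True)) *
           (p - p.mean(dim=dims, keepdim=True))).mean(dim=dims)
    ssim = ((2 * mu_g * mu_p + c1) * (2 * cov + c2) /
            ((mu_g ** 2 + mu_p ** 2 + c1) * (var_g + var_p + c2)))
    return (1 - ssim).mean()

g = torch.rand(1, 1, 128, 1024)        # ground-truth dense projection image
p = torch.rand(1, 1, 128, 1024)        # model prediction
print(ssim_loss(g, p))                 # smaller means more similar images
```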