1. Introduction
Among the various representations of 3D data, point clouds are widely used for 3D data processing due to their small data size and fine rendering capability. Real-world point cloud data are typically acquired using laser scanners, stereo cameras, and low-cost RGB-D depth cameras. However, occlusion, light reflection, transparent surface materials, limited sensor resolution, and restricted viewing angles cause the loss of geometric and semantic information. Restoring the original complete shape from such limited, partial point cloud data has therefore become a hot research topic in point cloud processing, with important value for 3D reconstruction and target identification.
Because 3D point clouds are unstructured and disordered, most deep learning-based methods for processing 3D data convert point clouds into sequential image collections [
1] or voxel-based 3D data representations [
2] in the practical task of dealing with point cloud completion. However, multiple views and voxel-based representations lead to unnecessary data redundancy and limit the output resolution. PCN [
3] and FoldingNet [
4] focus on learning the general features of a category rather than the local details of a specific object during the completion process and are less effective in recovering the local details of the complete point cloud. Most of the subsequent works [
5,
6,
7,
8,
9] use a two-stage model, where the missing point cloud is first passed through a “simple” network, such as multilayer perceptrons (MLPs), to generate a coarse complete point cloud, and the coarse point cloud is then fed into a more “complex” network, such as auto-encoders and transformers, to enhance its local details. However, because the two stages differ in network complexity, such models mainly learn a mapping from coarse to fine point clouds and lack the ability to generate point clouds with fine local detail directly from the partial input.
In this paper, we propose a new, conceptually simple, general framework and end-to-end approach for point cloud completion. As shown in
Figure 1, this approach consists of two connected networks: a multi-resolution encoder and a progressive deconvolution decoder. In the first network, the multi-view missing point cloud generation method produces an “approximate” missing point cloud from the complete one; the two form a point cloud pair that serves as the basis for training, shown as the masked points in
Figure 1. In this process, we note that previous point cloud feature extraction methods, such as PointNet++ [
10] and DGCNN [
11], both first partition the point cloud into blocks using the k-nearest neighbors algorithm and then extract the features of each block with convolutional neural networks (CNNs) or customized convolutional modules. However, converting the whole point cloud into blocks in this way is costly and redundant and ignores the interaction between blocks. We propose a simpler interactive feature fusion module: iterative farthest point sampling (IFPS) is used to sample key points in the complete point cloud, and the k nearest neighbors of each key point form a point cloud block, which reduces the computational cost and the redundancy between blocks. Further, to increase the feature interaction among the blocks, we introduce the attention mechanism, which has proven highly effective in both computer vision and natural language processing. Finally, because this interactive feature fusion module loses some feature information, we also employ multi-resolution feature extraction to capture deeper features of the missing point cloud. The missing point clouds are mapped into latent feature vectors through this network framework, as shown in
Figure 1.
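For illustration, the sampling-and-grouping step described above can be sketched as follows. This is a minimal NumPy sketch of iterative farthest point sampling followed by k-nearest-neighbor grouping, not the paper's exact implementation; the sizes used below (64 key points, blocks of 16 neighbors) are arbitrary choices for the example.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively pick the point farthest from the already-chosen key points."""
    n = points.shape[0]
    chosen = np.zeros(n_samples, dtype=int)
    dist = np.full(n, np.inf)
    chosen[0] = 0  # start from an arbitrary point
    for i in range(1, n_samples):
        # update each point's distance to the nearest already-chosen key point
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        chosen[i] = int(np.argmax(dist))
    return chosen

def knn_groups(points, key_idx, k):
    """Form a point cloud block of k nearest neighbors around each key point."""
    d = np.linalg.norm(points[None, :, :] - points[key_idx, None, :], axis=2)
    return np.argsort(d, axis=1)[:, :k]  # (n_keys, k) neighbor indices

# toy usage: 2048 random points -> 64 key points, blocks of 16 neighbors
pts = np.random.rand(2048, 3)
keys = farthest_point_sampling(pts, 64)
blocks = knn_groups(pts, keys, 16)
print(blocks.shape)  # (64, 16)
```

Because only the key points are expanded into blocks, far fewer blocks are formed than with per-point grouping, which is the source of the cost and redundancy reduction described above.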
In the second network framework, the potential feature vectors are fed into the progressive deconvolution decoder to predict point clouds with fine details localization and completeness, as shown in
Figure 1. In this process, we observe that structures such as PCN [
3] and MSN [
7], which output the complete point cloud directly at the decoder stage, involve a complicated generation process and struggle to recover the local details of the complete point cloud. For example, the basic shape of a chair can be recovered, but the connections between the chair legs are ignored. We instead adopt a progressive point cloud generation method that predicts point clouds at different resolutions from layers of different depths. However, in progressive upsampling, classical linear and bilinear interpolation are both non-learning methods that cannot adapt to different classes of 3D models; they use only the point cloud of the previous resolution, completely ignoring the information already present in the missing point cloud at the current resolution. Therefore, inspired by PDGN [
12], we propose a learning-based trilinear interpolation method, which can simultaneously use the coordinate space, feature space, and neighborhood information of each point in the missing point cloud of the same resolution to generate fine-grained missing regions.
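To make the idea concrete, the following is a minimal sketch of this kind of learnable three-neighbor interpolation: each point's three nearest neighbors are found in coordinate space, coordinate and feature differences are scored by a weight matrix (`W` below is a random stand-in for the learned MLP, not trained parameters), and a softmax over the three scores blends the neighbor features. This illustrates the principle only, not the exact module of our decoder.

```python
import numpy as np

def trilinear_interp_features(coords, feats, W):
    """For each point, blend the features of its 3 nearest neighbors with
    weights computed from coordinate/feature differences (learned in
    practice; W is a random stand-in here)."""
    n = coords.shape[0]
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)           # exclude the point itself
    nbr = np.argsort(d, axis=1)[:, :3]    # (n, 3) neighbor indices
    out = np.zeros_like(feats)
    for i in range(n):
        # difference vectors in coordinate space and feature space
        diff = np.concatenate(
            [coords[nbr[i]] - coords[i], feats[nbr[i]] - feats[i]], axis=1)
        logits = diff @ W                              # (3, 1) scores
        w = np.exp(logits) / np.exp(logits).sum()      # softmax over 3 neighbors
        out[i] = (w * feats[nbr[i]]).sum(axis=0)       # weighted feature blend
    return out

rng = np.random.default_rng(0)
coords = rng.random((128, 3))
feats = rng.random((128, 8))
W = rng.standard_normal((3 + 8, 1))       # stand-in for learned weights
new_feats = trilinear_interp_features(coords, feats, W)
print(new_feats.shape)  # (128, 8)
```

Because the weights come from both coordinate and feature differences, the interpolation can adapt per point, unlike fixed linear or bilinear schemes.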
Finally, we compare the predicted complete point cloud with the ground-truth complete point cloud during training using a multi-level CD loss to guide our method. After the prediction model is trained, missing point clouds are fed into the model to predict the complete point clouds. We evaluated the method on the ModelNet40 dataset [
13] and ShapeNet-Part dataset [
14], and both achieved excellent performance. In addition, we took into account that the current PCN [
3] and Vrc-Net [
9] datasets are complicated to produce and are largely limited to supervised learning. Accordingly, we propose a simple and efficient multi-view missing point cloud generation method based on a 3D point cloud hidden point removal algorithm [
15], and we generate eight viewpoints of missing point clouds for each complete 3D point cloud model in the ModelNet40 and ShapeNet-Part datasets, providing a database for subsequent in-depth research. The main contributions of PCA-Net can be summarized as follows:
We propose a new end-to-end approach to the point cloud completion network framework.
We propose an interactive feature fusion module that uses an attention mechanism to increase the interactive fusion of features between each point cloud block.
We develop a new progressive deconvolution decoder network that uses learning-based trilinear interpolation to generate complete point clouds with fine local details.
This paper also proposes a simple and efficient multi-view missing point cloud generation method, which provides a database for subsequent in-depth research.
The rest of this paper is organized as follows.
Section 2 describes the related work. In
Section 3, we present the progressive end-to-end point cloud completion model.
Section 4 presents the multi-view missing point cloud datasets generation.
Section 5 presents the experimental results, and
Section 6 presents the conclusions.
2. Related Work
Our work builds on prior work in several domains: point-based deep learning, attention mechanism, and point cloud completion.
Point-based deep learning. Current research on 3D deep learning for point clouds follows two main approaches. One is to transform the 3D point cloud into regular structured data and then process it with existing deep learning methods, chiefly by transforming 3D objects into collections of 2D views [
16], which can then be processed using CNNs, transformers, etc. However, this approach increases the computational effort and loses 3D geometric information. There is also the voxelization of 3D objects [
17], but this approach leads to a heavy memory burden and high computational complexity. Another approach is to construct special operations suitable for 3D unstructured geometric data for 3D deep learning. PointNet [
18] was the first to directly combine deep learning with 3D point clouds. Subsequently, PointNet++ [
10] was proposed to group and layer point clouds and use PointNet to capture the local and global information of point clouds. Point-GNN [
19] combines 3D point clouds with graph neural networks (GNNs), which are widely used in 2D tasks, and achieves good results. PointCNN [
20] proposes a convolution operation on irregular point cloud data via the X-transform. DGCNN [
11] proposes an EdgeConv operator that can learn point cloud features by local topology. Recent work has shown a very competitive and compelling performance on standard datasets. For example, the state-of-the-art methods SpecGCN [
21], SpiderCNN [
22], DGCNN [
11], and PointCNN [
20] achieve excellent accuracy on object classification on the ModelNet40 dataset.
Attention mechanism. The attention mechanism aims to mimic the human visual system by focusing attention on features relevant to the target rather than on the whole scene containing some irrelevant background. For image-related tasks, attentional maps can be generated based on spatial [
23] or channel-related information [
24], while some approaches combine both for better information integration. Currently, there is a corresponding integration of attention mechanisms with many domains as well. Swin Transformer [
25] computes self-attention within shifted windows for image feature processing, which limits the cost of self-attention and brings higher efficiency and accuracy. There is also SwinFusion [
26], which combines self-attention intra-domain fusion units and cross-attention inter-domain fusion units to mine and integrate long-range dependencies within and across domains. In addition, point cloud processing often utilizes self-attention structures, which model dependencies between elements without assuming a specific order. Among them, PointASNL [
27] uses the self-attention mechanism to obtain finer local point group features, and PCT [
28] applies transformers to point cloud features to enhance their characterization, achieving strong performance in the classification, segmentation, and completion of 3D point clouds.
Point cloud completion. With the in-depth study of deep learning on 3D point clouds, the completion task has gradually become one of recovering the shape of a complete 3D point cloud from a partial point cloud input. Among these methods, PCN [
3] maps the global features learned from a partial input point cloud to a coarse complete point cloud and restores it with secondary refinement by a folding decoder. TopNet [
5] proposes to predict the complete point cloud shape using a tree-structured decoder. AtlasNet [
6] further represents the 3D shape as a collection of parametric surface elements and generates more complex shapes by mapping 2D meshes to 3D surface elements. MSN [
7] predicts a complete but coarse-grained point cloud, a set of parametric surface elements, through a linear folding generation method as the first stage. Then, in the second stage, the coarse-grained predicted point cloud is merged with the input point cloud by a novel sampling algorithm to generate a fine-grained point cloud. GRNet [
8] converts the point cloud into an equally spaced voxel grid, extracts features in the grid using 3D convolutional layers, and feeds the extracted 3D feature vectors into a gridding reverse layer to generate the predicted point cloud. Regarding PF-Net [
29], a point cloud fractal network is proposed to repair missing point clouds while maintaining the original partial spatial arrangement; it takes the partial point cloud as input and outputs only the missing part rather than the whole object. VrcNet [
9] proposes a variational framework network to repair missing point clouds by learning feature information of the complete point cloud in the auto-encoder and optimizes the network by another point cloud enhancement of the local details.
4. Multi-View Missing Point Cloud Datasets Generation
The amount of data is crucial for training deep learning models, but paired partial-and-complete point cloud data are difficult to obtain. Both PCN [
3] and PF-Net [
29] use 3D mapping software to draw some common objects in reality and generate missing point clouds by some missing methods. Among them, the missing point clouds generated in PF-Net [
29] differ from the missing point clouds captured by everyday depth cameras and LiDAR, and thus do not achieve a good completion effect. In addition, PCN [
3] uses third-party software to guide the generation of missing point clouds. Although the point clouds generated this way closely resemble those encountered in daily life, the production process is more complicated. Another major drawback is that the paired data generated by these methods can only be applied to supervised learning and are less applicable to current research in the self-supervised and unsupervised fields. Katz [
15] proposes a method for 3D point cloud hidden point removal: given a single viewpoint, the set of visible points in the 3D point cloud is controlled by a visibility threshold R, where a larger R marks more points as visible. Based on this method, we design a simpler and more efficient way of generating partial point clouds from 3D point clouds with multiple viewpoints, choosing R = 100, although the resolution of the generated partial point clouds varies.
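Conceptually, the generation procedure can be sketched as follows: spherical flipping about the viewpoint followed by a convex hull test, as in the hidden point removal algorithm of [15], repeated over several camera poses. This is a minimal NumPy/SciPy illustration only; the circular pose placement and camera distance below are illustrative assumptions, not the exact poses used in our pipeline.

```python
import numpy as np
from scipy.spatial import ConvexHull

def hidden_point_removal(points, camera, r_factor=100.0):
    """Return indices of points visible from `camera`: spherically flip the
    cloud about the viewpoint, then keep the points that land on the convex
    hull of the flipped cloud plus the viewpoint."""
    p = points - camera                        # move the viewpoint to the origin
    norms = np.linalg.norm(p, axis=1, keepdims=True)
    radius = norms.max() * r_factor            # visibility threshold R
    flipped = p + 2.0 * (radius - norms) * p / norms
    hull = ConvexHull(np.vstack([flipped, [[0.0, 0.0, 0.0]]]))
    return np.array([i for i in hull.vertices if i < len(points)])

def multi_view_partials(points, n_views=8, dist=3.0):
    """Partial clouds from n_views camera poses evenly spaced on a circle."""
    partials = []
    for k in range(n_views):
        ang = 2.0 * np.pi * k / n_views
        cam = dist * np.array([np.cos(ang), np.sin(ang), 0.5])
        partials.append(points[hidden_point_removal(points, cam)])
    return partials

# toy usage: points on a unit sphere; each view sees roughly one side
rng = np.random.default_rng(0)
pts = rng.standard_normal((2048, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
views = multi_view_partials(pts)
print([len(v) for v in views])
```

Each partial cloud contains only the points facing its camera, which is why the resolution of the generated partial point clouds varies from view to view.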
The 3D point clouds of the ShapeNet-Part and ModelNet40 datasets are taken as the experimental objects. Missing point clouds are generated by our method and paired with the complete point clouds to form the training dataset. Borrowing from the dataset generation in PCN, the following
Figure 5a shows the camera pose map. Each orange circle indicates a camera pose, where the relative poses between the eight camera poses are fixed, but each training camera pose is randomly selected, and
Figure 5b–i show the missing point clouds generated by the eight camera poses, respectively.
Compared with previous missing point cloud datasets, PCN [
3], PF-Net [
29], and VrcNet [
9], our dataset has the following advantages: 1. The object is a 3D point cloud that can be embedded in the network without additional storage space, making experiments fast and convenient. 2. With a uniformly distributed camera view, the number of camera poses can be adjusted on demand, and using fewer complete shapes during training allows better evaluation of the ability to generate complete 3D shapes from partial inputs. 3. The number of missing-and-complete point cloud pairs can be increased arbitrarily, generating corresponding missing point clouds for complete point clouds of different resolutions. 4. The generated data can also be used for other missing point cloud tasks, such as classification, alignment, key point extraction, and new self-supervised learning settings.
5. Experiments
Experimental datasets. The ModelNet40 dataset [
13] contains 12311 CAD models in 40 object classes and is widely used for point cloud shape classification and surface normal estimation benchmarking. The standard 9843 objects were used for training and 2468 objects for evaluation. The ShapeNet-Part dataset [
14] contains 13 different object categories with a total of 14,473 shapes (11,705 for training and 2768 for testing). All input point cloud data are centered at the origin with coordinates normalized to [−1, 1]. The training point cloud data were created by sampling 2048 points per object with farthest point sampling (FPS).
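The normalization just described can be sketched as follows; a minimal NumPy version, assuming the scale factor is the largest absolute coordinate after centering (one common convention for mapping into [−1, 1]).

```python
import numpy as np

def normalize_cloud(points):
    """Center a point cloud at the origin and scale coordinates to [-1, 1]."""
    centered = points - points.mean(axis=0)   # move centroid to the origin
    scale = np.abs(centered).max()            # largest absolute coordinate
    return centered / scale

# toy usage: an off-center, unnormalized cloud
pts = np.random.rand(2048, 3) * 10 + 5
norm = normalize_cloud(pts)
print(norm.min() >= -1.0, norm.max() <= 1.0)
```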
Evaluation metrics. The proposed method is evaluated by calculating the Chamfer Distance (CD) between the predicted complete shape and the true complete shape. Considering the sensitivity of the CD to outliers, the distance between object surfaces is additionally evaluated with the F-score, the harmonic mean of precision and recall at a fixed distance threshold.
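For reference, the two metrics can be computed as follows. This is a minimal NumPy sketch; note that some works use squared distances in the CD, and the F-score threshold `tau` below is an arbitrary choice for the example, not the value used in our evaluation.

```python
import numpy as np

def pairwise_dist(a, b):
    """Matrix of Euclidean distances between two point sets."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance: mean nearest-neighbor distance both ways."""
    d = pairwise_dist(pred, gt)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def f_score(pred, gt, tau=0.01):
    """Harmonic mean of precision and recall at distance threshold tau."""
    d = pairwise_dist(pred, gt)
    precision = (d.min(axis=1) < tau).mean()  # predicted points near the gt
    recall = (d.min(axis=0) < tau).mean()     # gt points covered by prediction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# identical clouds give the ideal scores: CD of 0 and F-score of 1
pts = np.random.rand(1024, 3)
print(chamfer_distance(pts, pts), f_score(pts, pts))
```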
Implementation details. The method proposed in this paper is implemented in PyTorch; all modules are trained with the ADAM optimizer at an initial learning rate of 0.0001 and a batch size of 16 on a Tesla P100 GPU. Batch normalization (BN) and ReLU activation units are used in the encoder, and the number of adjacent points is set separately for each resolution of the 2048-point input. In the decoder, the output sizes of the deconvolution networks at the different resolutions are 256, 512, 1024, and 2048, respectively. An MLP maps the high-dimensional features to point cloud coordinates, and a Tanh activation function is applied after the final MLP. Each network was trained for 100 epochs separately.
5.1. Unsupervised Point Cloud Completion Results
Quantitative evaluation. The missing point cloud generation method introduced in
Section 4 is used to generate missing-and-complete point cloud pairs from the 40 classes of high-quality 3D point clouds in ModelNet. These point cloud pairs are divided into a training set (9843 shape pairs) and a test set (2468 shape pairs), and none of the point cloud pairs in the test set are included in the training set. The training process divides the ModelNet dataset [
13] into ModelNet10 (containing ten categories) and ModelNet40 (containing 40 categories), which allows a better evaluation of the method’s ability to generate complete shapes under the missing condition. The same training strategy was used to train all methods on each dataset. The CD loss and F-scores of all evaluated methods trained on the ModelNet10 dataset are shown in
Table 1 and
Table 2, respectively. The tables also show each class’s CD loss and F-score values for comparison. The CD loss and F-score values of all evaluated methods trained on the ModelNet40 dataset are shown in
Table 3. The method in this paper outperforms the existing competing methods in terms of both CD loss and F-score.
To validate the applicability of the method to other datasets, it was also evaluated on the benchmark ShapeNet-Part dataset [
14], where it was trained using the same training strategy as the previous datasets. The CD losses of all the evaluated methods are shown in
Table 4. The experiments validate that the method in this paper outperforms the existing competing methods.
Qualitative evaluation. The results of the qualitative comparison are shown in
Figure 6. Compared with other methods, PCA-Net keeps the input missing point cloud unchanged while recovering the missing structure. For example, for the missing legs of the chair and table (the first and second rows in
Figure 6), we can not only predict the location of the missing legs accurately but also make the recovered shape more uniform. In the third row of
Figure 6, we reconstruct the complete chandelier from one-half of the chandelier. The other methods ignore the interface between the lamp cord and the shade, but our approach retains this detail of the complete chandelier. In the missing car point cloud in the fourth row of
Figure 6, most other methods only recover the shape of the car and the tires, while we recover the more complex cab shape of the car. In the fifth row of the guitar point cloud in
Figure 6, the missing part of the point cloud makes the completion more difficult, but we can still generate a relatively complete and fine completion result. Therefore, PCA-Net can effectively reconstruct the complete shape when dealing with some more complex structures.
5.2. Supervised Point Cloud Completion Results
The above experiments follow a self-supervised training process, whereas the standard 3D point cloud completion benchmark is supervised. To evaluate the completion capability of PCA-Net under supervised learning, missing point clouds are generated on the ModelNet10 dataset (using the method in
Section 4), normalized to [−1, 1], and paired with the complete point clouds. Training follows the same protocol as the 3D point cloud completion benchmark. The results for PCA-Net and the other methods are given in
Table 5; the comparison of CD loss and F-score values shows that PCA-Net outperforms the other compared methods.
5.3. Qualitative Evaluation of PCA-Net Network
Analysis of attention mechanisms. To demonstrate the effectiveness of the attention module for efficient interaction between multi-scale features, PointCNN [
20] is used as the baseline model. We trained PointCNN [
20], PCA-Point (attention using point-attention), PCA-CBAM (attention using convolutional block attention module), PCA-Offset (attention using offset-attention), and PCA-SE (attention using squeeze and excitation), and the overall classification accuracy was evaluated. The experimental results are shown in
Table 6, and PCA-SE shows the best performance.
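As a reference for the best-performing variant, squeeze-and-excitation style channel attention over point features can be sketched as follows. This is a plain NumPy illustration with random matrices standing in for learned weights, and the reduction ratio `r` is an assumption; it shows the principle, not the exact module used in PCA-SE.

```python
import numpy as np

def se_attention(feats, w1, w2):
    """Squeeze-and-excitation: global average pool over points ('squeeze'),
    two-layer bottleneck with a sigmoid gate ('excitation'), then reweight
    each feature channel. w1/w2 are stand-ins for learned weights."""
    squeeze = feats.mean(axis=0)                  # (c,) channel descriptor
    hidden = np.maximum(squeeze @ w1, 0.0)        # ReLU bottleneck, (c/r,)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # sigmoid gate in (0, 1), (c,)
    return feats * gate                           # channel-wise reweighting

rng = np.random.default_rng(0)
feats = rng.random((2048, 64))                    # 2048 points, 64 channels
r = 4                                             # assumed reduction ratio
w1 = rng.standard_normal((64, 64 // r))
w2 = rng.standard_normal((64 // r, 64))
out = se_attention(feats, w1, w2)
print(out.shape)  # (2048, 64)
```

The gate rescales whole channels rather than individual points, which is what distinguishes this channel-attention design from the point-wise and offset variants compared above.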
Analysis of skip connection settings in the decoder. We try different decoder designs to generate complete point clouds and verify the necessity of skip connections in the decoder by comparing two methods: learning-based bilateral interpolation and learning-based trilinear interpolation. Validating the network improvements on the ModelNet10 dataset, as can be seen from
Table 7, the learning-based trilinear interpolation method improves the completion results compared to the learning-based bilateral interpolation method.
5.4. Shape Completion on Real-World Partial Scans
To further evaluate PCA-Net, validation was performed using partial car data from the KITTI dataset [
37], which consists of LiDAR-captured point clouds. The method is trained on the ShapeNet-Part dataset [
14] to complete the sparse LiDAR data in the KITTI dataset [
37], and the qualitative completion results are shown in
Figure 7, where the target point cloud generated by PCA-Net is more complete and smooth than that of PCN [
3].