Point Projection Network: A Multi-View-Based Point Completion Network with Encoder-Decoder Architecture

Recently, unstructured 3D point clouds have been widely used in remote sensing applications. However, incomplete point clouds inevitably arise, primarily owing to limited viewing angles and occlusion. Point cloud completion is therefore an urgent problem in point cloud data applications. Most existing deep learning methods first generate a rough framework from the global characteristics of an incomplete point cloud, and then generate the complete point cloud by refining that framework. However, such point clouds are undesirably biased toward the average of existing objects, meaning that the completion results lack local details. Thus, we propose a multi-view-based shape-preserving point completion network with an encoder-decoder architecture, termed the point projection network (PP-Net). PP-Net completes and optimizes a defective point cloud in a projection-to-shape manner in two stages. First, a new feature point extraction method is applied to projections of the point cloud to extract feature points in multiple directions. Second, more realistic complete point clouds with finer profiles are produced by encoding and decoding the feature points from the first stage. Meanwhile, projection losses in multiple directions are combined with an adversarial loss to optimize the model parameters. Qualitative and quantitative experiments on the ShapeNet dataset indicate that our method achieves good results among learning-based point cloud shape completion methods in terms of chamfer distance (CD) error. Furthermore, PP-Net is robust to the deletion of multiple parts and to different levels of incompleteness.


Introduction
With the rapid development of 3D scanning technology, point clouds, as irregular sets of points that represent 3D geometry, have been widely used in various modern vision tasks, such as remote sensing applications [1][2][3], robot navigation [4][5][6], autonomous driving [7][8][9], and object pose estimation [10][11][12]. However, owing to occlusion, limited viewing angles, and sensor resolution, real-world 3D point clouds captured by LiDAR and/or depth cameras are often irregular and incomplete. Therefore, point cloud completion has always been an urgent problem in point cloud data applications. Most traditional methods of shape completion are based on the geometric assumption [13][14][15] that the incomplete area and some parts of the input are geometrically symmetric. These assumptions significantly limit the real-world applications of such methods. For example, Poisson surface reconstruction [16][17][18] can usually repair holes in 3D model surfaces, but discards fine-scale structures. Another geometry-based shape completion approach is retrieval matching or shape similarity [19][20][21]. The matching process of such methods is time consuming, scaling with the database size, and cannot tolerate noise in the input 3D shape. Owing to the structural assumptions and matching time of traditional methods, deep learning-based completion has become the dominant approach; however, as noted above, it tends to lose local detail. To solve these problems, this study uses a multi-view-based method with an encoder-decoder architecture to leverage both the structure and the local information of sparse 3D data.
Multi-view-based methods [42][43][44][45][46][47] project shapes into multiple views to extract profile features of point clouds in multiple directions. MVCNN [42] employed a multi-view approach to 3D shape classification for the first time: 2D renderings obtained from different perspectives of the 3D model are used to train a 3D shape classifier, and the multi-view features are max-pooled into a global descriptor to assist classification. MHBN [43] uses harmonized bilinear pooling to generate global descriptors, which integrate local convolution features to make the global descriptor more compact. On this basis, several other methods [44][45][46] have been proposed to improve recognition accuracy. More recently, View-GCN [47] by Wei et al. applies graph convolutional networks to multiple views, using the 2D multi-views of a 3D object as graph nodes to construct view-graphs. Experiments show that View-GCN obtains the best 3D shape classification results.
Given that it can be challenging for networks to directly exploit edge features in irregularly distributed incomplete point clouds, this study introduces a multi-view-based method for point cloud completion, and designs a convolutional neural network with an encoder-decoder architecture, comprising (1) multi-view-based boundary feature point extraction and (2) point cloud generation based on the encoder-decoder structure. In the first stage, the point cloud is projected in multiple directions. When a 3D point cloud is projected onto a plane, overlapping regions of higher density easily arise, which increases the computational cost. Therefore, a new boundary extraction method is used to sample each projection. This method eliminates the overlap caused by projection, and makes the network focus on characteristic profile information. In the second stage of the point projection network (PP-Net), an encoder-decoder structure is designed based on the multi-directional projection of the point cloud. The encoder extracts global features and combines them with the profile features from the projections and the boundary feature points in different directions, fusing them into a feature vector; a point cloud with fine profiles is then generated by the decoder. In addition, a joint loss that combines the distance loss of multi-directional projections of the point cloud with an adversarial loss is proposed to make the output point cloud more evenly distributed and closer to the ground truth.
The main contributions of the study follow.

1. A multi-view-based method using encoder-decoder architecture is proposed to complete the point cloud, which is performed through projections in multiple directions of an incomplete point cloud.
2. For the projection stage, a boundary feature extraction method is proposed, which can eliminate the overlap caused by projection and make the network focus on the characteristic profile information.
3. A new joint loss function is designed to combine the projected loss with adversarial loss to make the output point cloud more evenly distributed and closer to the ground truth.

Data Preprocessing
Point cloud data generated from a subset of the ShapeNet [23] dataset were used to train the network model. The subset contains 13 object types from ShapeNet: airplane, skateboard, car, chair, table, lamp, pistol, guitar, bag, cap, mug, laptop, and motorbike. There are 14,473 models in total; 11,705 are used for training, and 2768 are used for testing. The ground truth was obtained by sampling 2048 points from each point cloud. As shown in Figure 1, an incomplete point cloud is obtained by deleting a certain number of points around a random center point. In addition, the incomplete point cloud is randomly generated in real time during each training run, meaning that the missing parts of the same model differ in each iteration, thereby significantly enhancing the robustness of the network. When compared with other methods, this study used point clouds with 25% missing data for training and testing. Note that no augmentation such as rotation or translation was applied to the training dataset; the proposed network is nevertheless robust to these operations because of the embeddings provided by the PointNet and FoldingNet [48] modules.
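As a concrete illustration, the following is a minimal sketch of this real-time hole generation, assuming the complete cloud is an N × 3 NumPy array of 2048 sampled points; the function name and interface are illustrative, not taken from a released implementation.

```python
import numpy as np

def make_incomplete(points, missing_ratio=0.25, rng=None):
    """Delete the points nearest to a randomly chosen center point,
    producing an incomplete cloud and the missing ground-truth region.

    points: (N, 3) array of a complete, normalized point cloud.
    """
    rng = rng or np.random.default_rng()
    n_missing = int(len(points) * missing_ratio)
    center = points[rng.integers(len(points))]        # random center point
    dist = np.linalg.norm(points - center, axis=1)    # distance to center
    order = np.argsort(dist)
    # The nearest n_missing points are removed; the rest form the partial input.
    return points[order[n_missing:]], points[order[:n_missing]]
```

Because the center is redrawn every time this is called, each training iteration sees a differently damaged version of the same model, which is the augmentation effect described above.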

Network Structure Overview
Most existing deep learning point cloud completion models first generate a rough frame based on the input global features, and then refine the frame to obtain a complete point cloud. There are two main problems with this approach: (1) only the global features of the point cloud are used in the encoding process, while the local features are ignored; and (2) during the decoding process, the generalization used to complete the point cloud also averages away the unique structure of the model. A multi-view-based point completion network with an encoder-decoder architecture is designed to solve these two problems. This network takes multiple projections of the point cloud as input and directly generates a complete point cloud. The projections are taken as input to ensure that both the global features of the point cloud and the multi-directional boundary features are utilized, and the complete point cloud is generated directly to avoid the loss of local information caused by a refinement process. The network structure is illustrated in Figure 2. The entire network comprises four basic modules: the projection boundary extractor (PBE; Section 2.3), the multiresolution encoder (MRE; Section 2.4), the FBD (Section 2.5), and the discriminator (Section 2.6). The PBE is the first stage of the two-stage projection-to-shape pipeline. It extracts the boundary feature points of the multi-directional projections of the point cloud as the input of the encoder. The MRE and FBD form the second stage. They take the boundary feature points from the first stage as input; the MRE extracts a 1792-dimensional feature vector, which then serves as the input of the decoder module. The decoder generates the predicted point cloud through two consecutive folding operations; to ensure that the existing point cloud structure is not destroyed, only the missing part of the point cloud is generated. To optimize the network parameters, a joint loss function is designed (Section 2.6), divided into two parts: adversarial loss and multi-directional projection distance loss. The point cloud is input into the discriminator module to obtain the adversarial loss. The discriminator is trained so that its output for the real point cloud is as close to 1 as possible and its output for the predicted point cloud is as close to 0 as possible; simultaneously, the other modules are trained so that the predicted point cloud drives the discriminator output as close to 1 as possible. The two are trained alternately to make the generated point cloud more realistic. The multi-directional projection distance loss is defined as the chamfer distance (CD) error between the multi-directional projections of the output point cloud and those of the ground truth, and it is used to optimize the overall shape of the complete point cloud.

PBE
The PBE is used to project the point cloud in multiple directions and extract feature points. The PBE is divided into three stages: projection transformation, overlap elimination, and boundary extraction. In the first stage, projection transformation is used to project the incomplete 3D point cloud in different directions. In the second stage, overlap elimination is used to eliminate overlapping effects caused by projecting. In the third stage, boundary extraction is used to extract the boundary feature points of each projection.
In the first stage, a multi-view-based method is used to map the point cloud from 3D space to 2D planes. As shown in Figure 3a, the projection planes are generated automatically by the program. The 3D point cloud coordinates are recorded in a spatial rectangular coordinate system, and the point cloud is projected onto three planes: xoy, yoz, and zox. The point cloud is projected after being rotated by 0°, 30°, and 60° about the x-, y-, and z-coordinate axes, yielding nine projections. In the second stage, because projecting a point cloud onto a plane easily produces overlapping areas, and regions of differing density increase the computational cost and affect feature extraction, farthest point sampling (FPS) is used to downsample each projection and eliminate the effect of overlap. FPS is the sampling strategy used in PointNet++, which can obtain a good set of skeleton points from a point cloud.
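A compact sketch of these first two stages follows, assuming NumPy arrays. The pairing of rotation angles with axes and the coordinate-dropping convention are one plausible reading of the description above, and `farthest_point_sampling` is a plain greedy FPS rather than the PointNet++ implementation.

```python
import numpy as np

def rotate(points, axis, deg):
    """Rotate an (N, 3) cloud by `deg` degrees about one coordinate axis."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    mats = {'x': np.array([[1, 0, 0], [0, c, -s], [0, s, c]]),
            'y': np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]]),
            'z': np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])}
    return points @ mats[axis].T

def farthest_point_sampling(points, k):
    """Greedy FPS: repeatedly add the point farthest from those selected."""
    idx = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        idx.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[idx]

def nine_projections(points, k):
    """Rotate by 0/30/60 degrees, project onto xoy / yoz / zox by dropping
    one coordinate, and FPS-downsample each projection to k points
    (k = 2N/3 in the experiments, see Section 3.1)."""
    planes = ((0, 1), (1, 2), (2, 0))                 # xoy, yoz, zox
    projs = []
    for deg, axis in zip((0, 30, 60), 'xyz'):
        rotated = rotate(points, axis, deg)
        for cols in planes:
            projs.append(farthest_point_sampling(rotated[:, cols], k))
    return projs                                      # nine (k, 2) arrays
```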
In the third stage, boundary extraction is used to extract the boundary feature points. As shown in Figure 3e, a chair described by only 341 boundary points still captures the shape of the chair, with a more even distribution of points. To extract the boundary feature points of the downsampled projection, this research proposes boundary recognition based on the adjacency count, i.e., the number of points within a certain distance of a given point in the point cloud. This distance is determined by multiplying a hyperparameter α by the average distance between all points in the point cloud. The number of points around boundary points was found to be generally smaller than that around nonboundary points. As shown in Figure 3b, the group of points with the fewest adjacencies in the point cloud is selected as the boundary points. As shown in Figure 3c-e, the boundary of the chair projection is extracted; the figure shows that this method can extract both the peripheral boundary and the hollow backrest boundary.
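The adjacency-count criterion can be sketched as follows, assuming a (k, 2) projection array; taking the radius as α times the mean of all pairwise distances is our reading of the description, and the function name is illustrative.

```python
import numpy as np

def extract_boundary(proj, n_boundary, alpha=0.5):
    """Select the n_boundary points with the fewest neighbors within a
    radius r = alpha * (mean pairwise distance) as boundary points."""
    dist = np.linalg.norm(proj[:, None, :] - proj[None, :, :], axis=-1)
    k = len(proj)
    off_diag = dist[~np.eye(k, dtype=bool)]           # exclude self-distances
    radius = alpha * off_diag.mean()
    counts = (dist < radius).sum(axis=1) - 1          # neighbors within radius
    return proj[np.argsort(counts)[:n_boundary]]      # sparsest points first
```

Interior points of a dense projection accumulate many neighbors within the radius, while points on the silhouette (or around holes such as the backrest) have few, which is why sorting by adjacency count isolates the boundary.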

MRE
Notably, the results of point cloud repair and completion should be unaffected by rotation or translation of the input shape. Among current deep learning methods, the PointNet encoder effectively solves the problems of rotation and disorder of the point cloud input. However, PointNet extracts only high-level feature information, and does not effectively use the low- and mid-level features that contain rich local information. To fully extract the input information, this study introduces a combined multi-layer perceptron (CMLP) in the encoding stage. As shown in Figure 4, the structure of each layer of the encoder is the same as that of the PointNet encoder; it comprises two parts: a parameter-sharing multi-layer perceptron (MLP) and a maximum pooling layer. Different layers of the MLP encode each point into different dimensions (64-128-256-512-1024), and the outputs of the last three layers are max-pooled and concatenated to obtain a 1792-dimensional feature vector. To fully utilize the input incomplete point cloud, the input of the network is nine projections of size N/6 × 2. The nine projections are fed to the encoder to obtain nine combined latent vectors F_i, where i = 1, ..., 9, each representing the features extracted from one projection of the point cloud. All F_i are then concatenated, forming a latent feature map M with a size of 1792 × 9 (i.e., nine vectors, each of size 1792). An MLP (9-1) is then used to integrate the latent feature map into a final feature vector V of size 1792.
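A PyTorch sketch of this CMLP encoder follows; the layer widths and the 1792-dimensional output match the description above, while details such as the placement of batch normalization are assumptions.

```python
import torch
import torch.nn as nn

class CMLPEncoder(nn.Module):
    """Shared per-point MLP (64-128-256-512-1024); the max-pooled outputs of
    the last three layers are concatenated into a 256+512+1024 = 1792-dim
    latent vector, as described above."""

    def __init__(self, in_dim=2):                     # projections are (N/6) x 2
        super().__init__()
        dims = [in_dim, 64, 128, 256, 512, 1024]
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Conv1d(dims[i], dims[i + 1], 1),
                          nn.BatchNorm1d(dims[i + 1]),
                          nn.ReLU())
            for i in range(5))

    def forward(self, x):                             # x: (B, in_dim, n_points)
        feats = []
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        pooled = [f.max(dim=2).values for f in feats[-3:]]
        return torch.cat(pooled, dim=1)               # (B, 1792)

# The nine projections give nine vectors F_i; stacked into a 1792 x 9 feature
# map, they are fused to a single 1792-dim V by the MLP (9-1) across views.
```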

FBD
The decoder structure of the PP-Net is based on the FoldingNet [48] decoder, which duplicates the encoded 512-dimensional codeword and concatenates it with a 2D grid; the completed point cloud is generated by two consecutive folding operations. As FoldingNet notes, a folding operation is equivalent to a "transformation" such as deforming, cutting, or stretching, which can fold a 2D plane into the target 3D shape, and the feature codeword can store the required "transformation." A two-stage plane-to-point-cloud decoding structure based on FoldingNet is used to generate the predicted point cloud. The first stage generates a 2D square plane with uniform grid points. In the second stage, folding operations are applied to fold the plane obtained in the first stage into the predicted point cloud. Figure 5 shows the network details of the decoder architecture. Before the folding operations, to match the output of the encoder with the input of the decoder, the feature vector V generated by the encoder is input into an MLP to obtain the 512-dimensional codeword that feeds the folding operations. Then, two consecutive folding operations are used to help restore the lost shape and structure. The folding operation in FoldingNet is implemented with an MLP, because the activation functions in the MLP provide nonlinear transformations that can simulate 3D space transformations such as folding and stretching; the MLP therefore has sufficient expressive capability to effectively simulate most transformation operations. Specifically, the first stage generates a square plane of uniform grid points with size M × 2. Here, M is a square number close to the number of missing points; for example, when the number of missing points is 512, M is 576. In the second stage, two consecutive folding operations are performed. First, an M × 512 codeword matrix is obtained by repeating the 512-dimensional feature codeword M times. Then, the grid points and the codeword matrix are concatenated to form an M × 514 matrix, and a three-layer MLP performs the first folding operation to generate an M × 3 intermediate point cloud. Next, the codeword matrix and the intermediate point cloud are concatenated to form an M × 515 matrix, and the second folding operation is performed to obtain the final M × 3 point cloud. The PP-Net thus includes two consecutive folding operations: the first folds the 2D grid into 3D space, and the second folds within 3D space. Together, these two operations generate the missing point cloud data, and the folding operations reduce the number of network parameters and accelerate network training.
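The following PyTorch sketch mirrors this two-fold decoder. The concatenation sizes (M × 514 and M × 515) and the folding MLP widths (512-512-3, as reported in Section 3.1) follow the text; the grid range and other details are assumptions.

```python
import torch
import torch.nn as nn

class FoldingDecoder(nn.Module):
    """Two consecutive folding operations, after FoldingNet: a 512-dim
    codeword is tiled over M grid points; fold 1 maps grid+codeword
    (M x 514) to an intermediate M x 3 cloud; fold 2 maps codeword+points
    (M x 515) to the final M x 3 cloud."""

    def __init__(self, code_dim=512, grid_side=24):
        super().__init__()
        self.M = grid_side ** 2                        # e.g. 576 for 512 missing points
        lin = torch.linspace(-1.0, 1.0, grid_side)
        grid = torch.stack(torch.meshgrid(lin, lin, indexing='ij'), dim=-1)
        self.register_buffer('grid', grid.reshape(self.M, 2))
        def fold_mlp(in_dim):
            return nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, 512), nn.ReLU(),
                                 nn.Linear(512, 3))
        self.fold1 = fold_mlp(code_dim + 2)
        self.fold2 = fold_mlp(code_dim + 3)

    def forward(self, codeword):                       # codeword: (B, 512)
        B = codeword.size(0)
        code = codeword.unsqueeze(1).expand(B, self.M, -1)   # (B, M, 512)
        grid = self.grid.unsqueeze(0).expand(B, -1, -1)      # (B, M, 2)
        mid = self.fold1(torch.cat([grid, code], dim=2))     # first fold: 2D -> 3D
        return self.fold2(torch.cat([code, mid], dim=2))     # second fold: 3D -> 3D
```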

Loss Function
A joint loss function was designed to generate a more realistic point cloud with fine boundary profiles. It contains two parts: (1) multi-directional projection distance loss and (2) adversarial loss. Multi-directional projection distance loss optimizes the distance between prediction and ground truth to generate a point cloud with fine profiles. Concurrently, adversarial loss compares the difference between the predicted point cloud and ground truth to make the prediction result more realistic.

Multi-Directional Projection Distance Loss
Owing to the disordered nature of discrete point cloud data, the loss function should be insensitive to the order of the sampling points. Fan et al. [49] proposed two permutation-invariant metrics for measuring the distance between unordered point clouds: the CD and the Earth Mover's Distance (EMD). In practical applications, the EMD is time consuming to compute and requires the two point clouds to have the same size; therefore, the CD was selected to calculate the loss.
The CD computes, for each point in one point cloud, the shortest distance to the other point cloud, and then averages these distances over all points:

$d_{CD}(S_1, S_2) = \frac{1}{|S_1|} \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \frac{1}{|S_2|} \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2$ (1)

It measures the average closest distance between the predicted point cloud and the ground truth, and contains two terms: (1) the CD from the predicted point cloud to the ground truth and (2) that from the ground truth to the predicted point cloud. The first term pulls the predicted point cloud closer to the ground truth, and the second term forces the predicted point cloud to cover the ground truth. The PP-Net uses projections in each direction of the point cloud to assist in optimizing the network parameters, and the multi-directional projection distance loss is composed of four terms weighted by the hyperparameter β:

$L_{pro} = d_{CD}(Y_{pre}, Y_{gt}) + \beta \left[ d_{CD}(Y_{pre}^{xoy}, Y_{gt}^{xoy}) + d_{CD}(Y_{pre}^{yoz}, Y_{gt}^{yoz}) + d_{CD}(Y_{pre}^{xoz}, Y_{gt}^{xoz}) \right]$ (2)

The first term calculates the squared distance between the predicted points $Y_{pre}$ and the ground truth of the missing region $Y_{gt}$. The remaining terms calculate the squared distances between the predicted points ($Y_{pre}^{xoy}$, $Y_{pre}^{yoz}$, $Y_{pre}^{xoz}$) and the ground truth ($Y_{gt}^{xoy}$, $Y_{gt}^{yoz}$, $Y_{gt}^{xoz}$) on the three projection planes.
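For concreteness, a PyTorch sketch of the CD and of our reading of Equation (2) follows; `projection_loss` and its argument layout are illustrative names, and the plane index pairs assume points stored as (x, y, z) columns.

```python
import torch

def chamfer_distance(a, b):
    """Symmetric CD for batches a: (B, N, D), b: (B, M, D): mean squared
    nearest-neighbor distance in both directions, as in Equation (1)."""
    d = torch.cdist(a, b) ** 2                        # (B, N, M) squared distances
    return d.min(dim=2).values.mean(dim=1) + d.min(dim=1).values.mean(dim=1)

def projection_loss(y_pre, y_gt, beta=0.2):
    """Equation (2): 3D CD plus beta-weighted CDs of the xoy / yoz / xoz
    projections, obtained by dropping one coordinate."""
    loss = chamfer_distance(y_pre, y_gt)
    for cols in ([0, 1], [1, 2], [0, 2]):             # xoy, yoz, xoz planes
        loss = loss + beta * chamfer_distance(y_pre[..., cols], y_gt[..., cols])
    return loss.mean()
```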

Adversarial Loss
The adversarial loss of the PP-Net is based on that of PF-Net. First, define F(·) := FBD(MRE(·)), which maps the partial input X to the missing point cloud. The discriminator D(·) is then used to distinguish the predicted missing region F(X) from the true missing region Y. The discriminator differs from the MRE in that it uses a serial MLP (64-64-128-256). The outputs of the last three layers are max-pooled to obtain feature vectors f_i of sizes 64, 128, and 256 for i = 1, 2, 3, respectively. The three vectors are concatenated into a latent vector F of size 448, which is passed through fully connected layers (256, 128, 16, 1); finally, a sigmoid classifier is used to obtain the predicted value. The adversarial loss, as in PF-Net, is defined as

$L_{adv} = \sum_{i=1}^{S} \left[ \log D(y_i) + \log\left(1 - D(F(x_i))\right) \right]$ (3)

where $x_i \in X$, $y_i \in Y$, $i = 1, \ldots, S$, and S is the dataset size of X and Y. Both F and D are optimized alternately using the Adam optimizer during training. Following the GAN framework, the discriminator is trained so that its output for the real point cloud is as close to 1 as possible and its output for the predicted point cloud is as close to 0 as possible; concurrently, the predicted point cloud generated by the PP-Net drives the output of the discriminator as close to 1 as possible. The two are trained alternately to make the generated point cloud more realistic.
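A PyTorch sketch of this discriminator follows; the layer sizes (64-64-128-256 shared MLP, 448-dimensional latent vector, 256-128-16-1 fully connected head with a sigmoid) follow the description above, while the normalization choice is an assumption.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Per-point MLP (64-64-128-256); max-pooled outputs of the last three
    layers form a 64+128+256 = 448-dim vector, passed through FC layers
    (256-128-16-1) and a sigmoid to score realism."""

    def __init__(self):
        super().__init__()
        dims = [3, 64, 64, 128, 256]
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Conv1d(dims[i], dims[i + 1], 1),
                          nn.BatchNorm1d(dims[i + 1]),
                          nn.ReLU())
            for i in range(4))
        self.fc = nn.Sequential(nn.Linear(448, 256), nn.ReLU(),
                                nn.Linear(256, 128), nn.ReLU(),
                                nn.Linear(128, 16), nn.ReLU(),
                                nn.Linear(16, 1), nn.Sigmoid())

    def forward(self, pts):                           # pts: (B, 3, n_points)
        feats = []
        for layer in self.layers:
            pts = layer(pts)
            feats.append(pts)
        f = torch.cat([f.max(dim=2).values for f in feats[-3:]], dim=1)
        return self.fc(f)                             # (B, 1) realism score
```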

Joint Loss
A new joint loss function was designed to train the network; it comprises two parts: the multi-directional projection distance loss and the adversarial loss. The multi-directional projection distance loss measures the difference between the real and predicted point clouds in the missing area. The adversarial loss attempts to make the point cloud more realistic by optimizing the encoder and decoder.

$L = \lambda_{com} L_{pro} + \lambda_{adv} L_{adv}$ (4)

where $L_{pro}$ represents the multi-directional projection distance loss, $L_{adv}$ represents the adversarial loss, and $\lambda_{com}$ and $\lambda_{adv}$ represent their respective weights, with $\lambda_{com} + \lambda_{adv} = 1$.

Experiment and Result Analysis
This section first introduces the environment and parameters used to train the completion network, and then quantitatively and qualitatively evaluates the PP-Net against existing point cloud completion methods. These methods are used to complete actual examples of point clouds for comparison, and their completion results are visualized.

Experimental Implementation Details
To make the proposed PP-Net converge quickly during training, the mean of the sampling point coordinates of the incomplete and complete point cloud models is normalized to zero, and the coordinates of each sampling point are scaled to (−1, 1). PyTorch is used to implement the proposed network. All network modules are trained alternately using the Adam optimizer, with an initial learning rate of 0.0001 and a batch size of 25. Batch normalization and ReLU activation units are used in the MRE and discriminator, whereas only ReLU activation units are used in the FBD.
In the data preprocessing, complete point cloud data are read in and the incomplete point cloud is generated in real time during each training iteration. In the projection boundary extraction, the number of projection points is set to 2N/3, where N is the size of the incomplete point cloud. The boundary takes 1/4 of the projection points, i.e., N/6 points, and the hyperparameter α is 0.5; nine projections of size N/6 × 2 are thus obtained. In the MRE, the network uses a five-layer PointNet encoder whose output feature sizes are 64, 128, 256, 512, and 1024. The network processes the nine projections separately, and concatenates the max-pooled outputs of the last three layers to obtain a 9 × 1792-dimensional feature map. Finally, the feature vector V is obtained through a three-layer MLP (9-1). In the first stage of the FBD, the decoder generates M × 2 grid points, where M is set to a square number close to the number of missing points; for example, if 512 points are missing, M is set to 576 (24 × 24). The grid points are then arranged into an M × 2 matrix. In the second stage of the FBD, before the folding operations, to match the output of the encoder with the input of the decoder, the feature vector V (of size 1792) generated by the encoder is input into a three-layer MLP (with output dimensions of 1792, 1792, and 512 per layer) to obtain a 512-dimensional codeword as the decoder input. Then, two consecutive folding operations are performed to obtain the final predicted point cloud; the MLP output sizes of both folding operations are 512, 512, and 3. In the joint loss, the hyperparameter β of the multi-directional projection distance loss is 0.2, the weight λ_com of the multi-directional projection distance loss is 0.95, and the weight λ_adv of the adversarial loss is 0.05.
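Tying these settings together, the following is a minimal sketch of one alternating training step. Here `generator` stands for the full completion pipeline (projection/boundary preprocessing plus MRE and FBD), `projection_loss` is the sketch from the loss-function section above, and all names are illustrative rather than taken from a released implementation.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_step(generator, discriminator, opt_g, opt_d,
               x_partial, y_missing, lam_com=0.95, lam_adv=0.05):
    """One alternating step: update D (real -> 1, predicted -> 0), then
    update the generator with L = lam_com * L_pro + lam_adv * L_adv."""
    # Discriminator step (generator output detached so only D updates).
    y_pred = generator(x_partial).detach()
    opt_d.zero_grad()
    real = discriminator(y_missing.transpose(1, 2))   # (B, n, 3) -> (B, 3, n)
    fake = discriminator(y_pred.transpose(1, 2))
    d_loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
    d_loss.backward()
    opt_d.step()
    # Generator step: fool D while matching the multi-directional projections.
    opt_g.zero_grad()
    y_pred = generator(x_partial)
    fake = discriminator(y_pred.transpose(1, 2))
    g_loss = (lam_com * projection_loss(y_pred, y_missing)
              + lam_adv * bce(fake, torch.ones_like(fake)))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
# opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
```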

Evaluation Standard
The point cloud completion accuracy over the 13 categories of the dataset is used to evaluate the performance of the model. The evaluation used in this study contains two types of errors: the predicted point cloud (Pred) → ground truth (GT) error and the ground truth (GT) → predicted point cloud (Pred) error, both of which have been used in other papers [50,51].
The Pred→GT error calculates the CD from the predicted point cloud to the ground truth, which represents the difference between the predicted point cloud and ground truth.
The GT→Pred error calculates the CD from the ground truth to the predicted point cloud, which represents the extent to which the predicted point cloud covers the real point cloud. The error of a complete point cloud arises from changes to the original points and from the prediction error of the missing points. Because only the missing part of the point cloud is output, the original part of the shape is unchanged. To ensure a fair evaluation, the Pred→GT and GT→Pred errors are compared on the missing point cloud. The smaller the two errors, the more similar the complete point cloud generated by the model is to the ground truth, and the better the model performs.
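Both one-directional errors can be computed as below, following the squared-distance CD used earlier; batched (B, N, 3) tensors are assumed.

```python
import torch

def pred_to_gt_error(pred, gt):
    """Pred->GT: mean squared distance from each predicted point to its
    nearest ground-truth point (how accurate the prediction is)."""
    return (torch.cdist(pred, gt) ** 2).min(dim=2).values.mean().item()

def gt_to_pred_error(pred, gt):
    """GT->Pred: mean squared distance from each ground-truth point to its
    nearest predicted point (how well the prediction covers the GT)."""
    return (torch.cdist(gt, pred) ** 2).min(dim=2).values.mean().item()
```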

Experimental Results
After the data were generated, the proposed completion network was verified on the ShapeNet-based dataset. Figure 6 shows part of the shape completion results. For each point cloud model, the first column shows the input point cloud, the second column shows the output of the completion network, and the third column shows the ground truth. The high-quality point cloud predicted by the PP-Net matches the partial input well.

Figure 6. Visualization of partial point cloud completion results. "Input" represents the input incomplete point cloud, "PP_Net" represents the completion result of the network, and "GT" represents the ground truth.

Table 1 shows the average value of the 13-category point cloud completion accuracy of several classic point cloud completion methods (details are in Section 3.4). In the table, the Pred→GT error (left side) represents the difference between the predicted point cloud and the ground truth, and the GT→Pred error (right side) represents the extent to which the predicted point cloud covers the ground truth. The PP-Net has advantages in both errors, indicating that the proposed method is effective. The PP-Net encodes the multi-directional projections of an incomplete point cloud into a 1792-dimensional feature that represents both the global features of the 3D shape and the multi-directional boundary features. To verify its robustness to different degrees of missing data, the network parameters were adjusted to train it to repair point clouds with missing ratios of 25%, 50%, and 75%. Figure 7 and Table 2 show the performance of the network on the test set. Figure 7 shows that, even for a large missing area, the network can still fully identify and repair the outline of the overall point cloud. Table 2 shows that, for predicted point clouds generated at different degrees of incompleteness, the error between the predicted point cloud and the ground truth remains essentially unchanged, which demonstrates the robustness of the proposed network to varying degrees of missing information. To further verify this robustness, the network was trained to complete point clouds with missing regions at multiple locations. The results are shown in Figure 8: the network still correctly predicts the missing point cloud, while the error remains essentially unchanged.

Comparison with Other Methods
To verify the merits of the proposed method, three existing strong baseline point cloud completion methods were selected for comparison with the PP-Net. Like the PP-Net, all three are trained with an encoder-decoder structure. All methods were trained and tested on the same dataset for a quantitative comparative analysis.
L-GAN [37]: L-GAN is the first point cloud completion method based on deep learning, which also uses an encoder-decoder structure, specifically, a PointNet-based encoder and simple fully connected decoder in the decoding module.
PCN [39]: This is the most well-known method for point cloud completion. It provides good results, and is one of the best performing methods for point cloud completion. Similar to the PP-Net, PCN uses an FBD to output the final result.
PF-Net [36]: PF-Net employs a CMLP based on PointNet, which concatenates the features extracted by the MLP to obtain the feature vector; the encoder of the PP-Net is inspired by this CMLP. In its decoding module, PF-Net proposes a coarse-to-fine three-stage point cloud completion method.
The results are presented in Table 3. Comparing the completion results across the 13 object categories, the proposed method (PP-Net) outperforms the existing methods in both the Pred→GT and GT→Pred errors in 6 of the 13 categories, namely, airplane, car, laptop, motorbike, pistol, and skateboard. In four categories (cap, bag, table, and lamp), PP-Net is better than the existing methods in one of the two errors. In the remaining three categories (chair, guitar, and mug), the completion results are not dominant. The completion result is mainly affected by three factors: (1) whether the object is symmetrical, (2) whether subtle fine structures are present, and (3) whether there is occlusion. The PP-Net projects the point cloud in various directions; for symmetrical objects, the missing structure can be inferred from the projections. Objects such as airplanes, cars, laptops, motorbikes, pistols, and skateboards are symmetrical in at least one direction, so good results can be obtained. The shape of a guitar with a sound hole is not necessarily symmetrical, which affects the completion result. The decoder of the PP-Net is based on a folding decoder, and it is difficult to deform the grid into subtle fine structures; because such structures exist in bags, tables, chairs, and mugs, their completion is affected. A disadvantage of multi-view-based methods is that information loss is inevitable when projecting complex structures. Most lamps are equipped with lampshades; during projection, the structural information of the lamp cannot be extracted, which affects the completion result. In general, however, the PP-Net achieved better results in several categories while also leading in the average error over all categories. In Figure 9, the output point clouds generated by the abovementioned methods, all from the test set, are visualized. Compared with the other methods, the PP-Net prediction shows clear boundaries, a more complete level of recovery, and finer profiles. In (1), (5), and (9), the outputs of the other methods are blurred in the fine profile. In (3) and (8), the outputs of the other methods fail to generate a reasonable shape. In (6), (7), and (8), there is a certain deviation in the outlines of the other methods. The PP-Net also inherits an advantage of PF-Net: only the missing parts are output, so the hollows and backrests are properly filled in (2) and (4). To summarize, the proposed approach focuses more on boundaries and produces finer profiles.

Discussion
This section presents three sets of comparative experiments designed to analyze several design choices of the network structure: (1) using versus not using boundary extraction, (2) grid point folding versus projection folding, and (3) using the joint loss versus using only the CD loss between two point clouds.

Boundary Extraction Analysis
In the projection boundary extraction module, a new boundary extraction algorithm is proposed to extract boundary points, which reduces the computational cost while retaining the boundary information of the point cloud and makes the network focus on structural features. To prove that the proposed method is effective, the boundary extraction module was removed and the projections were input directly into the encoder; the generated result was then compared with the result obtained with boundary extraction. The results are shown in Figure 10. The points in the upper half of the red box are dense; these points represent the borders of the chair. As shown in Figure 3d, the borders are more likely to overlap during projection. The uneven distribution of points during feature extraction leads to an uneven distribution of points in the generated result. In the lower part of the red box, part of the chair legs was not generated. Through this comparison, it can be concluded that boundary extraction makes the network focus on the boundary features while eliminating the influence of overlap. A quantitative comparison was performed on chairs, and the results are shown in Table 4: the boundary extraction method significantly reduces the error, which proves its effectiveness.

Plane Folding Analysis
Both FoldingNet and PCN adopt the strategy of concatenating features with a 2D point grid. By visualizing the experimental results, it was found that the edges of the generated point cloud geometry are extremely smooth when using these methods. In fact, the original idea of the PP-Net was to fold a projection of the point cloud instead. Quantitative comparison experiments were performed on chairs: one variant folds grid points, and the other folds the projection. The results are listed in Table 5. Notably, the GT→Pred error using grid point folding is smaller, implying that the completed point cloud covers the ground truth to a higher degree, because the grid points are folded from an entire plane and the coverage is wider. The Pred→GT error using projection folding is smaller; this error represents the difference between the predicted and real point clouds, and because the projection records the profile information of the point cloud, it can generate a predicted point cloud closer to the real one. Quantitatively, the overall results of the two variants are almost the same; the difference appears in the visualizations, shown in Figure 11. Compared with projection folding, the point cloud generated by 2D grid point folding is uniformly distributed within the red box. The experimental results show that concatenating the features with a 2D point grid can improve the quality of the completed point cloud.

Table 5. Pred→GT and GT→Pred errors obtained with grid points and with the projection as decoder input.

Loss Function Analysis
The PP-Net uses a joint loss function that combines the multi-directional projection distance loss and the adversarial loss to optimize the network parameters, making the profile of the point cloud finer and closer to the ground truth. To prove that this method is effective, the conventional CD loss between two point clouds was used for comparison; the results are shown in Figure 12. Without the constraints of the multi-directional projection distance loss and the adversarial loss, the edge of the point cloud in the red box is blurred. Quantitative comparative experiments were performed on chairs: one variant used the joint loss, and the other used the conventional loss. The results are listed in Table 6. The joint loss mainly optimizes the Pred→GT error, which represents the difference between the predicted point cloud and the ground truth. Because the joint loss includes the multi-directional projection loss and the adversarial loss, the former makes the profile of the predicted point cloud closer to the true value and the latter makes the predicted point cloud more realistic; both optimize the predicted point cloud, thus reducing the Pred→GT error.

Table 6. Pred→GT and GT→Pred errors obtained with the joint loss and with the CD loss as the loss function.

Conclusions
This study proposes a new network, PP-Net, for the task of point cloud shape completion. It directly processes the raw input point cloud, tolerating a certain amount of noise, without any voxelization or structural assumptions. The PP-Net uses a multi-view-based method to directly generate fine point clouds from projections in various directions of the point cloud. The multi-view projection combines global features and multi-directional boundary features as input to the encoder. The MRE of the PP-Net can extract low-, mid-, and high-level features. For the decoder, the PP-Net uses folding operations to make the distribution of the generated point cloud more uniform. Further, the combination of the multi-directional projection distance loss and the adversarial loss guides the continuous optimization of the network; finally, a more realistic point cloud with fine profiles is obtained. The experimental results showed that the PP-Net achieves good results and is robust to missing regions at different positions and to different degrees of incompleteness. The good performance of the PP-Net across many categories indicates its wide applicability in the field of remote sensing, such as the repair and completion of photogrammetric models in urban basic information mapping and the optimization of 3D shapes in the construction of smart city databases.
However, the completion network is occasionally unable to recover these subtle fine structures. Potential reasons for this are that these structures have small surface areas; this makes the feature extraction more difficult for the encoder, and makes it difficult for the decoder to deform a 2D grid into subtle fine structures. Future work will need to consider methods for improving the feature extraction of these fine structures by combining their local geometric features.