PU-MFA: Point Cloud Up-Sampling via Multi-Scale Features Attention

Recently, research using point clouds has been increasing with the development of 3D scanner technology. With this trend, the demand for high-quality point clouds is growing, but obtaining them remains costly. Consequently, with the recent remarkable development of deep learning, point cloud up-sampling, which uses deep learning to generate high-quality point clouds from low-quality ones, has attracted considerable attention. This paper proposes a new point cloud up-sampling method called Point cloud Up-sampling via Multi-scale Features Attention (PU-MFA). Inspired by prior studies that reported good performance in generating high-quality dense point sets using multi-scale features or attention mechanisms, PU-MFA merges the two through a U-Net structure. In addition, PU-MFA adaptively uses multi-scale features to refine the global features effectively. PU-MFA was compared with other state-of-the-art methods on several evaluation metrics in experiments using the PU-GAN dataset, a synthetic point cloud dataset, and the KITTI dataset, a real-scanned point cloud dataset. In these experiments, PU-MFA generated higher-quality dense point sets than the other state-of-the-art methods in both quantitative and qualitative evaluation, supporting the effectiveness of the proposed method. The attention map of PU-MFA was also visualized to show the effect of multi-scale features.


Introduction
A point cloud is one of the most popular formats for accurately representing 3D geometric information in robotics and autonomous vehicles. Recently, the number of studies using point clouds has been increasing with the development of 3D scanners, such as LiDAR [1,2]. Along with this trend, there is an increasing demand for high-quality point clouds that are low-noise, uniform, and dense. However, the high cost of collecting high-quality point clouds remains problematic. Therefore, point cloud up-sampling, which generates a low-noise, uniform, and dense point set from noisy, non-uniform, and sparse point sets, is a topic of considerable interest.
Similar to learning-based image super-resolution studies [3,4], various learning-based point cloud up-sampling studies [5,6,7] show better performance than traditional point cloud up-sampling studies [8,9]. Intuitively, the image super-resolution tasks and the point cloud up-sampling tasks are similar. However, unlike the image super-resolution tasks, which process images in a regular format, the point cloud up-sampling tasks process an irregular format and require additional consideration. First, the up-sampled point set should be uniformly distributed and dense. Second, the up-sampled point set should represent the details of the target 3D mesh surface well [10].
A typical learning-based point cloud up-sampling method consists of a feature extractor and an up-sampler. In addition, most studies use multi-scale features or attention mechanisms. PU-Net [11], 3PU [12], and PU-GCN [7] extract multi-scale features from sparse point sets. These studies reported excellent performance, but the final feature produced by the feature extractor is limited in that the details of the sparse point set are diluted, because the output of each layer is used as the input of the next layer. Dis-PU [13], PU-EVA [6], and PU-Transformer [5] achieved good performance using the self-attention mechanism to learn long-range dependencies between points. However, the attention mechanism can only draw on limited information in these methods because the key, query, and value of the self-attention mechanism are generated from the same input.
Focusing on these limitations, this paper proposes PU-MFA, a novel method to fuse multi-scale features and attention mechanisms. PU-MFA solves point cloud up-sampling through an attention mechanism that uses an adaptive feature for each layer. The contributions of this research are as follows:
• This paper proposes a point cloud up-sampling method of U-Net structure using Multi-scale Features (MFs) adaptively to Global Features (GFs).
• Global Context Refining Attention (GCRA), a structure for effectively combining MFs and attention mechanisms, is proposed. To the best of the authors' knowledge, this is the first MultiHead Cross-Attention (MCA) mechanism proposed in point cloud up-sampling.
• This study demonstrates the effect of MFs by visualizing the attention map of GCRA in ablation studies.
This method was compared with various state-of-the-art methods using the Chamfer Distance (CD), Hausdorff Distance (HD), and Point-to-Surface (P2F) evaluation metrics for the PU-GAN [10] and the KITTI [14] datasets. The proposed method performed better in these comparisons, which confirmed its effectiveness.
Related Work

Optimization-Based Point Cloud Up-Sampling
Various optimization-based studies have been performed to generate a dense set of points from a sparse set. Alexa et al. solved up-sampling by inserting new points into the Voronoi diagram of the local tangential space computed based on the moving-least-squares error [8]. Lipman et al. explained the up-sampling using the Locally Optimal Projection (LOP) operator [9]. In that study, the points were re-sampled using the $L_1$ norm. Huang et al. up-sampled a noisy and non-uniform set of points using an improved, weighted LOP [15]. Later, Huang et al. proposed an advanced method called Edge-Aware Re-sampling of a set of points (EAR). The EAR first re-samples away from the edges and then uses edge-aware up-sampling to resolve the up-sampling [16].

Learning-Based Point Cloud Up-Sampling
With the successful performance of learning-based image super-resolution, many studies have proposed learning-based point cloud up-sampling methods.
As with image analysis, many studies have used MFs in point cloud up-sampling. PU-Net [11], the first attempt at deep learning for point cloud up-sampling, showed good performance by extracting MFs through hierarchical feature learning and interpolation based on the framework of PointNet++ [17]. 3PU [12] performed well using MFs via an Intra-Level Dense connection and Inter-Level Skip connection. PU-GCN [7] uses MFs extracted by Inception DenseGCN. In that study, Inception DenseGCN could effectively extract MFs with an InceptionNet-inspired structure [18].
Because the self-attention mechanism can learn long-range dependencies, it is used in various point cloud up-sampling studies. In PU-GAN [10], generators are trained using discriminators that apply a self-attention mechanism. PUGeo-Net [19] showed good performance by using self-attention in Feature Recalibration. PU-EVA [6] showed successful up-sampling performance using an EVA Expansion Unit with the mechanism. Dis-PU [13] performed well using the Local Refinement Unit with self-attention applied to the generated point set. PU-Transformer [5], which applied the transformer structure for the first time in point cloud up-sampling, uses Shifted Channel MultiHead Self-Attention to show state-of-the-art performance.

Problem Description
The problem of generating a low-noise, uniform, and dense point set $Q = \{q_i\}_{i=1}^{rN}$ was addressed using the Ground Truth (GT) point set $D = \{d_i\}_{i=1}^{rN}$ and the sparse point set $S = \{s_i\}_{i=1}^{N}$, where $N$ is the input patch size and $r$ is the up-sampling ratio. Figure 1 shows the problem description of this study.

Method
This method consists of a Multi-scale Feature Extractor (MFE), Global Context Refiner (GCR), Coarse Point Generator (CPG), and Self-Attention Block (SAB). As shown in Figure 2, MFE extracts MFs, the adaptive features used in GCR. GCR uses MFs to refine GFs adaptively and finally produces $Q_\Delta$, where $Q_\Delta$ is defined as in equation (1), $Q'$ is the coarse point set generated by CPG, and $\oplus$ is an element-wise sum.

$$Q = Q' \oplus Q_\Delta \tag{1}$$

Figure 2: Illustration of the proposed framework. Here, 3 is the coordinate dimension, and H is the depth of the layer. In Multi-scale Feature Extractor (MFE) and Global Context Refiner (GCR), C is the channel, and K is the expansion ratio. In Coarse Point Generator (CPG) and Self-Attention Block (SAB), C is the channel and K is the expansion ratio.

Multi-scale Feature Extractor
Because the GFs extracted from the coarse point set $Q'$ via SAB are features of a point set in which the geometric information of the original input $S$ is diluted, MFE extracts MFs from $S$ using Point Transformer (PT) [20], an advanced point cloud analysis technique. As shown in Figure 2, the MFE consists of $H$ PTs, and the set of point-wise features extracted from the $h$-th PT is $F_h \in \mathbb{R}^{N \times K^{h-1}C}$. MFs are the set of $F_h$ extracted from all layers of the MFE. The extracted MFs and $F_h$ are formulated as in equation (2), where $f_i^h$ is a point-wise feature extracted from the $h$-th PT.

$$\mathrm{MFs} = \{F_h\}_{h=1}^{H}, \qquad F_h = \{f_i^h\}_{i=1}^{N} \tag{2}$$

Point Transformer
PT consists of two elements. The first is the K-Nearest Neighbor (KNN), and the second is the Vector Self-Attention (VSA) mechanism. At the $h$-th PT, the point-wise feature $f_i^{h-1}$ of point $s_i \in S$ is updated to $f_i^h$ through VSA, which uses $s_i$ and $patch_i$ as the inputs. The $patch_i$ is generated through KNN using $s_i$ as the input. This operation works on all points in $S$, updating the point-wise feature of all points [20]. This is formulated in equation (3), where patch_size is the size of K in KNN.
Inspired by this operation, this study considers $patch_i$ to be analogous to the kernel of a CNN. In a CNN, even if the kernel size is fixed, a deeper layer has a wider receptive field. Likewise, even if the patch size of the KNN in PT is fixed, the deeper the layer, the wider the range of points with which $s_i$ can interact. Figure 3 is an example with a KNN patch size of four. In Figure 3 (a), when the $h$-th PT updates $f_i^{h-1}$ to $f_i^h$, VSA is performed on $patch_i$, which is composed of $s_i$ and its incidental points. In Figure 3 (b), the $h$-th PT updates the feature of every incidental point by performing VSA on the patch of each of those points. In Figure 3 (c), the $(h+1)$-th PT updates $f_i^h$ to $f_i^{h+1}$ by performing VSA using $patch_i$, as the $h$-th PT does. However, the $(h+1)$-th PT operates with a wider receptive field than the $h$-th PT because the features of its incidental points have already been updated by the $h$-th PT. This operation allows the MFE to extract the MFs effectively.
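The widening of the receptive field with depth can be checked with a small sketch. The code below is illustrative only and is not the authors' implementation: it ignores the attention weights entirely and simply tracks which input points can influence each point after a number of layers that all use the same fixed KNN patch. The function names and the toy point set on a line are assumptions made for this example.

```python
import math

def knn_indices(points, k):
    """For each point, return the indices of its k nearest points (itself included)."""
    result = []
    for p in points:
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        result.append(order[:k])
    return result

def receptive_fields(points, k, depth):
    """Track, layer by layer, the set of input points that can influence each point."""
    neighbours = knn_indices(points, k)
    fields = [{i} for i in range(len(points))]  # before any layer: only the point itself
    history = []
    for _ in range(depth):
        # A point's new feature depends on the current features of its fixed patch,
        # so its receptive field is the union of the fields of its patch members.
        fields = [set().union(*(fields[j] for j in neighbours[i]))
                  for i in range(len(points))]
        history.append(fields)
    return history

# Toy point set on a line, with a KNN patch size of four as in Figure 3.
points = [(float(i), 0.0, 0.0) for i in range(12)]
history = receptive_fields(points, k=4, depth=3)
sizes = [len(layer[6]) for layer in history]  # receptive-field size of point 6 per layer
print(sizes)
```

The patch itself never changes between layers; only the information carried by the patch members grows, which is the same argument made for a fixed-size CNN kernel.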

Global Context Refiner
Because GCR and MFE form a U-Net [21] structure, GCR is composed of $H$ GCRAs. GCRA effectively refines GFs by querying MFs, which provide adaptive geometric information to each layer. As shown in Figure 2, the $(H-h+1)$-th GCRA generates $RGC_{H-h+1}$ using $F_h$ as a query and $RGC_{H-h}$ as a pool. However, $\mathbb{R}^{N \times 3r}$ was used for $RGC_H$ instead, to prevent $RGC_H$ from becoming $\mathbb{R}^{N \times C/K}$. After refining the GFs, a linear layer was used to perform the transformation. PixelShuffle was then used to generate $Q_\Delta \in \mathbb{R}^{rN \times 3}$.

Global Context Refining Attention
Inspired by Skip-Attention [22], which acts as a communicator between the encoder and decoder, GCRA uses MFs and GFs to apply the MCA mechanism. In various studies, self-attention mechanisms are used to extract the features of point sets or to generate up-sampled point sets [5,10]. However, the self-attention mechanism can only draw on limited information because its key, query, and value are generated from the same input. With this limitation in mind, the GCRAs across the $H$ levels use the GFs $\in \mathbb{R}^{N \times KC}$ as the pool (key and value) and the MFs as the queries, progressively refining the GFs through MCA. GCRA consists of MCA [23], Batch Normalization (BN) [24], and Feed Forward. As shown in Figure 4, applying MCA to the query and pool gives an output of shape $\mathbb{R}^{N \times F_p}$. The pool is then refined by adding the MCA output to it. BN is applied after the addition for stable training. Feed Forward transforms the output of the BN and produces a Refined Global Context (RGC) $\in \mathbb{R}^{N \times F_o}$.
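The core of this step, cross-attention followed by the residual addition to the pool, can be sketched as follows. This is a minimal single-head sketch under stated assumptions, not the GCRA implementation: the projection matrices are random placeholders rather than learned weights, the multi-head split, BN, and Feed Forward are omitted, and all function names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Placeholder weights or features; a trained model would supply real values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def cross_attention_refine(query, pool, w_q, w_k, w_v):
    """Queries come from `query`; keys and values come from `pool`.
    Returns pool + attention output (N x F_p) and the N x N attention map."""
    q = matmul(query, w_q)   # N x d
    k = matmul(pool, w_k)    # N x d
    v = matmul(pool, w_v)    # N x F_p
    d = len(q[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    attended = matmul(weights, v)
    refined = [[p + a for p, a in zip(p_row, a_row)]
               for p_row, a_row in zip(pool, attended)]
    return refined, weights

rng = random.Random(0)
N, F_q, F_p, d = 5, 4, 6, 3
mfs = random_matrix(N, F_q, rng)  # stands in for one level of MFs (the query)
gfs = random_matrix(N, F_p, rng)  # stands in for the GFs or previous RGC (the pool)
refined, attn = cross_attention_refine(
    mfs, gfs,
    random_matrix(F_q, d, rng), random_matrix(F_p, d, rng), random_matrix(F_p, F_p, rng),
)
```

Because the values are projected to width $F_p$, the attention output can be added directly to the pool, which matches the $\mathbb{R}^{N \times F_p}$ shape stated for the MCA output.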

Figure 4: Illustration of Global Context Refining Attention (GCRA). $F_p$ is the pool input channel, $F_q$ is the query input channel, and $F_o$ is the output channel.

Coarse Point Generator
CPG generates the coarse point set $Q'$. In CPG, PT [20] and PixelShuffle [25,5] generate $S_\Delta$ from $S$, where $S_\Delta$ is defined as in equation (4). The structure of CPG consists of four layers, like the structure of the 3PU's Feature Extraction Unit [12]. As shown in Figure 2, to turn the final output into 3D coordinates, PT was first used to expand the features and then to reduce them gradually. Subsequently, PixelShuffle generates 3D coordinates using those features. $Q'$ is generated through the element-wise sum of the generated $S_\Delta$ and $\mathrm{duplicate}(S, r) \in \mathbb{R}^{rN \times 3}$. This process is formulated as equation (4).

$$Q' = \mathrm{duplicate}(S, r) \oplus S_\Delta \tag{4}$$
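The duplication and element-wise sum in this step can be illustrated with a short sketch. It is not the CPG itself: the PT layers are replaced by placeholder feature values, and the ordering used here (each point repeated r times consecutively, each N × 3r feature row split into r consecutive triples) is an assumption chosen so that every offset lands next to the point it was predicted from.

```python
def duplicate(points, r):
    """Repeat every point r times consecutively: N x 3 -> rN x 3."""
    return [list(p) for p in points for _ in range(r)]

def shuffle_to_points(features, r):
    """Reshape N x 3r features into rN x 3 coordinates (periodic-shuffle style)."""
    return [row[3 * j: 3 * j + 3] for row in features for j in range(r)]

def coarse_points(sparse, offset_features, r):
    """Element-wise sum of duplicate(S, r) and the offsets S_delta."""
    s_delta = shuffle_to_points(offset_features, r)
    return [[a + b for a, b in zip(p, d)]
            for p, d in zip(duplicate(sparse, r), s_delta)]

r = 4
S = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
# Placeholder for the network output: one row of 3r values per input point.
features = [[0.01 * (i + 1)] * (3 * r) for i in range(len(S))]
Q_coarse = coarse_points(S, features, r)
```

Predicting offsets relative to duplicated input points means that a zero offset already reproduces the input, so the generator only has to learn small displacements.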

Self-Attention Block
Inspired by self-attention, which learns long-range dependency [23], MultiHead Self-Attention (MSA) was used to extract the GFs from the coarse point set $Q'$. As shown in Figure 2, the shape of $Q'$ was changed from $\mathbb{R}^{rN \times 3}$ to $\mathbb{R}^{N \times 3r}$, and the coordinates of $Q'$ were used as features of the original point set $S$. The GFs $\in \mathbb{R}^{N \times KC}$ were then extracted using the reshaped $Q'$ as the input to the MSA.

Datasets
In these experiments, all methods were trained using the widely used PU-GAN [10] dataset and evaluated using the PU-GAN dataset and the KITTI [14] dataset. The PU-GAN dataset is a synthetic point cloud dataset produced from 147 3D meshes, and the KITTI dataset is a real-scanned point cloud dataset collected using LiDAR. The training phase used 120 3D meshes from the PU-GAN dataset. As in the patch-based up-sampling approach, all patches were generated via Poisson disk sampling after converting the original mesh to a point cloud. The sampling resulted in 24,000 input-output pairs. In the evaluation phase, 27 3D meshes from the PU-GAN dataset were converted into point clouds to test synthetic point cloud up-sampling, and the real-scanned point cloud up-sampling test was performed using the KITTI dataset. When evaluating synthetic and real-scanned point cloud up-sampling, the generated patches should cover the entire point set. After merging the up-sampled patches, the up-sampled point set was reconstructed by farthest point sampling. More details can be found in the PU-GAN study [10]. The dataset was downloaded from https://github.com/liruihui/PU-GAN.
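Farthest point sampling, used above to reduce the merged patches to the final point set, can be sketched as follows. This is a plain greedy version written for illustration, with a hypothetical function name and a toy input; the evaluation code of PU-GAN may differ in details such as the choice of the starting point.

```python
import math

def farthest_point_sampling(points, m, start=0):
    """Greedily pick m indices; each new point is the one farthest from those already chosen."""
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen

# Toy stand-in for a merged set of overlapping up-sampled patches.
merged = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
picked = farthest_point_sampling(merged, 3)
```

Because overlapping patches produce more points than required in the overlap regions, a sampler that favours well-separated points helps keep the final set uniform.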

Loss Function
In most point cloud reconstruction methods, CD is used as the loss function [26,22,27]. However, it was confirmed empirically that the Density-Aware Chamfer Distance, which extends CD to account for the uniformity of the point set, showed good performance. Therefore, the total loss was formulated as equation (5), where α is linearly interpolated from 0.1 to 1 during training.
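The linear interpolation of α can be written as a small schedule function. This is a sketch only: whether α is updated per epoch or per iteration is not stated here, so the step granularity, the function name, and its arguments are assumptions.

```python
def alpha_schedule(step, total_steps, start=0.1, end=1.0):
    """Linearly interpolate alpha from `start` at step 0 to `end` at the last step."""
    t = step / max(total_steps - 1, 1)   # training progress
    t = min(max(t, 0.0), 1.0)            # clamp to [0, 1]
    return start + (end - start) * t

# Example: alpha over 100 epochs, assuming one update per epoch.
alphas = [alpha_schedule(epoch, 100) for epoch in range(100)]
```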

Metric
This study evaluated the method using CD, HD, and P2F metrics, as in previous studies [13,6,5]. CD measures the point-wise similarity between a set of GT points and a set of predicted points, and HD measures the outliers in a set of predicted points relative to a set of GT points. P2F measures the similarity between the original mesh and the predicted point set and thus the quality of the predicted point set. The parameter complexity was also assessed by counting the number of parameters. For all metrics, a lower number means better performance.
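For reference, CD and HD between two point sets can be computed with a brute-force sketch like the one below. Conventions for these metrics vary between codebases; this version assumes a symmetric CD built from mean squared nearest-neighbour distances and a symmetric HD, which may not match the exact normalization of the evaluation code used in the experiments.

```python
import math

def nearest_distances(src, dst):
    """Distance from each point in src to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def chamfer_distance(pred, gt):
    """Symmetric CD: mean squared nearest-neighbour distance in both directions."""
    a = nearest_distances(pred, gt)
    b = nearest_distances(gt, pred)
    return sum(d * d for d in a) / len(a) + sum(d * d for d in b) / len(b)

def hausdorff_distance(pred, gt):
    """Symmetric HD: the largest nearest-neighbour distance in either direction."""
    return max(max(nearest_distances(pred, gt)), max(nearest_distances(gt, pred)))

gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 2.0)]  # last point is an outlier
cd = chamfer_distance(pred, gt)
hd = hausdorff_distance(pred, gt)
```

In this toy case the single outlier dominates HD, while CD averages its contribution over all predicted points, which is why HD is read as an outlier-sensitive metric.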

Comparison Methods
To validate the proposed method, it was compared with three state-of-the-art methods: Dis-PU [13], PU-EVA [6], and PU-Transformer [5]. For a fair comparison, all methods were implemented using PyTorch [28] version 1.7.0 on Ubuntu 20.04 and trained in the same environment with an Intel i9-10980XE CPU and an NVIDIA TITAN RTX.

Implementation Details
All methods in the experiments were trained with a batch size of 64 for 100 epochs, and the Adam [29] optimizer with a learning rate of 0.0001 was used. The patch size of KNN used in PT was set to 20, as in PU-Transformer [5]. Rotation, scaling, random perturbation, and regularization were applied to the training dataset, as in previous studies [11,10]. The up-sampling ratio r was four and the input patch size N was 256. The CPG's C and K were 32 and 8, respectively. For MFE and GCR, C and K were 16 and 4, respectively. The layer depth of MFE and GCR, H, was four. The head number of MCA and MSA was set to eight, as in the previous study [23].
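As a worked example of what these settings imply, the sketch below evaluates the MFE feature width N × K^(h−1)C for each layer and the number of points produced per patch. The helper name is hypothetical, and the calculation only restates the shape formula given for the MFE with the values listed above.

```python
def mfe_channels(c, k, depth):
    """Channel width K**(h-1) * C of the point-wise features at each MFE layer h = 1..H."""
    return [k ** (h - 1) * c for h in range(1, depth + 1)]

N, r = 256, 4         # input patch size and up-sampling ratio
C, K, H = 16, 4, 4    # MFE/GCR channel, expansion ratio, and layer depth

channels = mfe_channels(C, K, H)
output_points = r * N  # points generated for each input patch
print(channels, output_points)
```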

Results on 3D Synthetic Datasets
Table 1 lists the quantitative performance comparisons for ×4 and ×16 up-sampling. The ×4 up-sampling increased 2,048 points to 8,192 points. The ×16 up-sampling increased 512 points to 8,192 points by applying the ×4 up-sampling twice. As shown in Table 1, the present method showed good performance compared to the other state-of-the-art methods. Although the present method is not the most efficient in parameter complexity, its parameter count is reasonable relative to its performance.
Figure 5 presents the visualization result of ×4 up-sampling, and Figure 6 is the visualization result of ×16 up-sampling. Figures 5 (b), (c), and (d) show the sets of points representing the bird's leg, the space between the kitten's body and tail, the statue's leg, and the camel's hoof with noisy or unclear boundaries. In contrast, Figure 5 (e) shows low noise and clear boundaries. In Figure 6 (b) and (d), the chair back does not represent the original shape well, and (c) maintains the shape to some extent, but there is considerable noise. On the other hand, Figure 6 (e) has relatively little noise and represents the original shape well.

Results on Real-scanned Datasets
Dis-PU, PU-EVA, PU-Transformer, and the present method were evaluated using the KITTI dataset for ×4 up-sampling. Figure 7 shows the ×4 up-sampling results. In Figure 7 (b), (c), and (d), the boundary between the window and the door of the vehicle was unclear. In contrast, in Figure 7 (e), generated by the present method, the boundary was clearer.

Ablation Study
This method was evaluated by performing various ablation studies using the PU-GAN dataset.

Effect of Components
To demonstrate the effectiveness of each contribution, the ablation study was divided into four cases. Case 1 was a structure using GCR, CPG, and SAB, with the MultiHead Attention (MHA) of GCR and SAB consisting of self-attention with one head. Case 2 changed Case 1 to eight heads. Case 3 added MFE to Case 2 and used GCRA composed of MCA, where the query of every GCRA was $F_4$, the final output of MFE. Case 4 was PU-MFA. As shown in Table 2, all contributions affected the method performance.

Multi-scale Features Attention Analysis
By visualizing the attention maps of all GCRAs, it was confirmed that the GCRAs of GCR with H = 4 refined the GFs by adaptively using the MFs extracted from receptive fields of various sizes. Figure 8 shows the results visualized by choosing three attention heads in the GCRA and selecting, from each head, the 30 points in S with the highest attention scores. To verify that the MFs operated adaptively, Figure 8 (b) visualizes the attention map of Case 3 in Table 2, which does not use the MFs. As shown in Figure 8 (a), in the low-layer GCRA, an attention map was formed for a wide range of points in a point set, and in the high-layer GCRA, an attention map was formed for a relatively narrow range of points. On the other hand, in Figure 8 (b), a wide range of attention maps was formed regardless of the level in the hierarchy. This phenomenon confirmed that PU-MFA uses an adaptive point feature for each layer of the GCRA.

Effect of Noise
Table 3 lists the ×4 up-sampling results of Dis-PU [13], PU-EVA [6], PU-Transformer [5], and the present method using the PU-GAN dataset with various levels of noise added. The effect of noise was evaluated by adding different levels of Gaussian noise N(0, noise level) to the set of input points. As shown in Table 3, the proposed method was the most robust across the noise levels. As shown in Figure 9, the boundary between the fingers became blurred in the dense sets of points generated by the other state-of-the-art methods as the noise level increased. In contrast, the boundary between the fingers was maintained in the dense set of points generated by the present method.
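The noise injection used in this test can be sketched as follows. The text does not state whether the noise level in N(0, noise level) is a standard deviation or a variance; the sketch assumes a standard deviation, and the function name and seed argument are hypothetical.

```python
import random

def add_gaussian_noise(points, noise_level, seed=None):
    """Add zero-mean Gaussian noise with standard deviation `noise_level` to every coordinate."""
    rng = random.Random(seed)
    return [tuple(c + rng.gauss(0.0, noise_level) for c in p) for p in points]

clean = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 0.0, 1.0)]
noisy = add_gaussian_noise(clean, noise_level=0.01, seed=0)
```

Fixing the seed keeps the perturbed inputs identical across the compared methods, so differences in the results reflect the methods rather than the noise draw.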

Conclusion
This paper proposed PU-MFA, a point cloud up-sampling method of U-Net structure that combines multi-scale features and an attention mechanism. One of the most significant differences from previous point cloud up-sampling methods is that PU-MFA uses multi-scale features adaptively and effectively through fusion with the cross-attention mechanism. In addition, to the best of the authors' knowledge, PU-MFA is the first method to apply the cross-attention mechanism to point cloud up-sampling. Various experiments were performed on PU-MFA and other state-of-the-art methods using the PU-GAN and KITTI datasets. In these experiments, PU-MFA showed better performance than the other state-of-the-art methods. In addition, the ablation study showed that multi-scale features are very useful in PU-MFA for generating high-quality point sets because the receptive field size is chosen adaptively for each layer.
Despite its successful performance, PU-MFA cannot cope with an arbitrary up-sampling ratio. A method that can respond to an arbitrary up-sampling ratio is planned as future work to overcome this limitation.

Figure 1: Illustration of an overview of the proposed method.

Figure 3: Illustration of KNN and VSA in PT.

Figure 7: Visualization result of ×4 up-sampling of the KITTI dataset.
Figure 8: Visualization of attention map generated using MFs as a query in GCR with H = 4. (a) GCRA attention maps using the MFs. (b) GCRA attention maps not using the MFs.

Table 1: Comparing the quantitative evaluation of ×4 and ×16 up-sampling with the state-of-the-art methods.

Table 3: Quantitative evaluation results of the noise effects using the PU-GAN dataset (×4 up-sampling at various noise levels; CD in units of $10^{-3}$).