PointSCNet: Point Cloud Structure and Correlation Learning Based on Space Filling Curve-Guided Sampling

Geometrical structures and the internal local region relationship, such as symmetry, regular array, junction, etc., are essential for understanding a 3D shape. This paper proposes a point cloud feature extraction network named PointSCNet, to capture the geometrical structure information and local region correlation information of a point cloud. The PointSCNet consists of three main modules: the space-filling curve-guided sampling module, the information fusion module, and the channel-spatial attention module. The space-filling curve-guided sampling module uses Z-order curve coding to sample points that contain geometrical correlation. The information fusion module uses a correlation tensor and a set of skip connections to fuse the structure and correlation information. The channel-spatial attention module enhances the representation of key points and crucial feature channels to refine the network. The proposed PointSCNet is evaluated on shape classification and part segmentation tasks. The experimental results demonstrate that the PointSCNet outperforms or is on par with state-of-the-art methods by learning the structure and correlation of point clouds effectively.


Introduction
Point clouds are a ubiquitous form of 3D shape representation and are suitable for countless applications in computer graphics due to their accessibility and expressiveness. The points are captured from the surfaces of objects by equipment such as 3D scanners, Light Detection and Ranging (LiDAR), or RGB-D cameras, or sampled from other 3D representations [1]. While containing rich information about the surface, structure, and shape of 3D objects, point clouds are unordered and unstructured, unlike images, which are arranged on regular pixel grids. Hence, although many classical deep neural networks have shown tremendous success in image processing, there are still many challenges when it comes to deep learning methods for point clouds [2].
To overcome these incompatibilities, an intuitive idea is to transform the point cloud into a structured representation. Earlier multi-view methods project the 3D object into multiple view-wise images to fit 2D image processing approaches [3][4][5][6][7]. On the other hand, volumetric methods voxelize the point cloud into a regular 3D grid representation and adopt extensions of 2D networks, such as the 3D Convolutional Neural Network (CNN), for feature extraction [8]. Moreover, some subsequent voxel-based studies introduce certain data structures (such as octrees) to reorganize the input shape [9][10][11]. While achieving impressive performance at the time, these methods have some inevitable shortcomings, such as losing 3D geometric information during 2D projection or incurring high computational and memory costs when processing voxels. Against this backdrop, research on directly consuming raw point clouds via end-to-end networks has become increasingly popular. The well-known PointNet [12] and the subsequent PointNet++ [13] are the pioneering works of direct point cloud processing based on deep learning. The introduction of a symmetric function in these networks adapts to the inherent characteristics of 3D point sets. Inspired by PointNet and PointNet++, many later studies adopt this idea for point feature extraction or encoding to achieve the permutation invariance of point clouds [14][15][16][17].
The basic methodology of these point-based networks is extracting point-wise high-dimensional information and then aggregating a local or global representation of the point cloud for downstream tasks. Follow-up studies based on this idea have demonstrated that a hierarchical structure with a subset abstraction procedure is effective for point cloud reasoning. It has been shown that sampling central subsets of input points is essential for hierarchical structures [18,19]. However, the most popular sampling and grouping methods, Farthest Point Sampling (FPS) and K-Nearest Neighbor (KNN), are based exclusively on low-dimensional Euclidean distance, without sufficient consideration of the semantically high-level correlations of points and their surrounding neighbors.
In the real world, there are inherent correlations between local regions of 3D objects, especially for Computer Aided Design (CAD) models or industrially manufactured products [20], such as the symmetric wing design of an airplane, the regular array of wheels of a car, or the distinct structure of the collar, sleeves, and body of a shirt. These geometric correlations of local regions play a crucial role in 3D object understanding and are significant for typical point cloud processing tasks such as shape classification and part segmentation.
Besides, in the procedure of high-dimensional information extraction, a basic and effective approach is to use a shared Multi-Layer Perceptron (MLP) or 1D CNN to project the input features into a high-dimensional space. Inspired by applications of the attention mechanism in image processing, it can be inferred that, similar to image processing, information from critical local areas and feature channels of the point cloud has more impact on specific tasks.
Based on the issues above, this paper proposes a point cloud feature extraction network, namely PointSCNet, which captures the global structure and local region correlations of the point cloud for shape classification and part segmentation tasks. As shown in Figure 1, a space filling curve guided sampling module is proposed to choose key points that represent geometrically significant local regions of the point cloud. Then, an information fusion module is designed to learn the structure and correlation information between those local regions. Moreover, a channel-spatial attention module is adopted for the final point cloud feature refinement.
The main contributions of this paper are summarized as follows:
• An end-to-end point cloud processing network, namely PointSCNet, is proposed to learn structure and correlation information between local regions of a point cloud for shape classification and part segmentation tasks.
• The idea of the space filling curve is adopted for point sampling and local sub-cloud generation. Specifically, points are encoded and sorted by Z-order curve coding, which gives the points a meaningful geometric ordering.
• An information fusion module is designed to represent local region correlation and shape structure information. The information fusion is achieved by correlating the local and structure features via a correlation tensor and skip connection operations.
• A channel-spatial attention module is adopted to learn the significant points and crucial feature channels. The channel-spatial attention weights are learned for the refinement of the point cloud feature.

Related Work
This paper uses a deep learning method to extract point cloud features with structure and correlation information. In this section, recent research in areas highly related to our work, including traditional point cloud processing, point-wise embedding, point cloud structure reasoning, and attention in point cloud processing, is briefly summarized and analyzed.

Traditional point cloud processing methods
One of the biggest challenges in processing point clouds is dealing with unstructured data. Early methods are mostly based on indirect representation conversion. Some methods convert the point cloud to structured data, such as octrees or kd-trees [21], to reduce the difficulty of analysis. Another classical kind of method converts the point cloud to voxel models. Voxel-based methods [3,[22][23][24] use 3D convolution, a direct extension of image processing operations to point clouds. Their advantage is that they preserve spatial relationships well at high voxel resolution, but they are computationally very expensive; if the voxelization resolution is reduced, the geometric information that the voxels can represent is significantly lost. FPNN [25] and Vote3 [26] proposed special methods to deal with the sparsity problem, but they still cannot handle large-scale point cloud data well. Therefore, it is quite difficult to achieve real-time performance while balancing accuracy and computational cost, and these traditional methods inevitably lead to the loss of geometric information. This paper uses a point-wise feature extraction method to avoid both the high cost of voxel-based methods and the information loss of low-resolution voxelization.

Point-wise embedding
PointNet [12] is the breakthrough work for deep learning based direct point cloud processing. Its groundbreaking max-pooling symmetric function solves the problem of point cloud disorder: MLP layers extract point-wise features, and max-pooling aggregation obtains the global features of the point cloud. Then, PointNet++ [13] proposed a multi-layer sampling&grouping method to improve PointNet. Many later works on point cloud processing [27][28][29][30][31] followed this idea of point-wise and hierarchical point feature extraction. However, the feature extraction in PointNet ignores geometric structure information and the potential relationships between local regions. Therefore, in this paper, the points are first embedded based on the idea of PointNet++, and a space filling curve guided downsampling method and an information fusion method are proposed to learn the structure and correlation information of the point cloud.

Point cloud structure reasoning
As an extension of point-wise feature learning, various methods have been proposed to reason about the structure of points. DGCNN [32] captures features between point neighborhoods through graph convolution, extracting point cloud structure information from the topological relationships between points. MortonNet [19] proposed an unsupervised way to learn the local structure of point clouds. In PCT [33], the KNN method is adopted to extract features between point neighborhoods. SRN [14] concatenates structural features and position codings between local sub-clouds, and the multi-scale features extracted by this method improve PointNet++ [13]. However, these methods mainly pay attention to the relationships between local regions and ignore the relationship between the local regions and the global shape. In this paper, a more effective structure reasoning method is designed to capture the correlation between local regions and the shape structure.

Attention in point cloud processing
Due to the advancement of attention mechanism based methods in many deep learning applications [18,[34][35][36], the attention mechanism meets the demand of dealing with unstructured data and is well suited to point cloud processing [33,37,38]. Point Transformer [38] and Point Cloud Transformer [33] set precedents for the application of the Transformer [36] in point cloud processing and achieved state-of-the-art performance. The adoption of the attention mechanism for point clouds mainly aims at exploring the relations between points and enhancing the feature representation of attended points. Inspired by this idea, a channel-spatial attention module [39] is designed for feature refinement by enhancing key points and crucial feature channels.

Method
As shown in Figure 2, the proposed PointSCNet takes the original point set P ∈ R^{N×C} as input, where C is the number of feature channels. After a regular Sampling&Grouping [13] block, we obtain the sampled point set of N′ points with the original spatial position information, denoted as X_position ∈ R^{N′×3}, and the embedded sampled point set with C′-dimensional features, in which each point represents the information of the surrounding points within a certain radius, denoted as X_embedding ∈ R^{N′×C′}. Then X_embedding and X_position are sent to a Z-order sampling module for further sampling based on the points' geometric relations. The sampled point sets, which contain the shape structure and local region correlation information, are denoted as X′_{Z-order} ∈ R^{N″×C′} and X_{Z-order} ∈ R^{N″×3}. After that, an information fusion module is designed to establish the correlation between each local sub-cloud and the entire point cloud for shape structure and local region correlation learning. Finally, the point cloud feature is forwarded to a channel-spatial attention module for feature refinement.
The pipeline of the classification and segmentation modules is similar to PointNet++ [13]. The dimension of the local point cloud features is first increased to 1024, and then a pooling aggregation function is adopted to obtain the global feature X_global ∈ R^{1×1024}. For the shape classification task, after being fed into fully connected layers, the dimension of the global feature is reduced to 1 × k as the output of PointSCNet, where k is the number of classes. For the part segmentation task, the output is the segmentation result N × k′ obtained by up-sampling the global feature X_global, where k′ is the number of part classes.
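For concreteness, the classification head described above reduces to a global max-pooling step followed by fully connected projection. The following is a minimal NumPy sketch under our own assumptions: a single fully connected layer stands in for the stacked fully connected layers of the actual network, and the function name is hypothetical.

```python
import numpy as np

def classification_head(point_feats, W_fc):
    """Max-pool N' x 1024 point features into a 1 x 1024 global feature,
    then project it to k class logits with a fully connected layer."""
    x_global = point_feats.max(axis=0)   # aggregate over all points
    return x_global @ W_fc               # (k,) class logits

rng = np.random.default_rng(0)
feats = rng.standard_normal((256, 1024))                  # local point features
logits = classification_head(feats, rng.standard_normal((1024, 40)))
```

With an identity projection the head simply returns the pooled global feature, which makes the two-step structure easy to verify.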

Initial Sampling&Grouping
PointSCNet first takes the original point cloud data as input. A series of points X_position ∈ R^{N′×3} is sampled via FPS, and the ball query method is used to get all points within a radius of each sampled point, i.e., d(X_r, X_position) < r, where X_position ∈ R^{N′×3} are the points sampled by FPS, X_r ∈ R^{N_r×3} are the points around X_position, and d() is the Euclidean distance. These points are encoded into a high-dimensional space through MLPs and aggregated to the sampled points via the aggregation function Pooling() to get X_embedding ∈ R^{N′×C′}. The aggregation can be denoted as
X_embedding = Pooling(Conv(Concat(X_r))),
where X_embedding ∈ R^{N′×C′} are the encoded point features, the max-pooling function is used for the Pooling() operation, the Concat() function represents point feature concatenation, and the Conv() function is the 1D convolution operation.
After this procedure, the feature information of neighboring points has been aggregated into every sampled point in X_embedding.
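As an illustration of this sampling&grouping step, the following is a minimal NumPy sketch, not the authors' implementation: the function names and the rule of padding under-populated balls with the nearest point are our own assumptions.

```python
import numpy as np

def farthest_point_sample(points, n_samples, seed=0):
    """FPS: iteratively pick the point farthest from the set chosen so far."""
    n = len(points)
    chosen = np.zeros(n_samples, dtype=int)
    chosen[0] = int(np.random.default_rng(seed).integers(n))
    min_dist = np.full(n, np.inf)
    for i in range(1, n_samples):
        # distance of every point to the most recently chosen center
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        min_dist = np.minimum(min_dist, d)
        chosen[i] = int(np.argmax(min_dist))
    return chosen

def ball_query(points, centers, radius, k):
    """Gather up to k neighbors within `radius` of each center,
    padding with the nearest point when a ball holds fewer than k."""
    groups = []
    for center in centers:
        d = np.linalg.norm(points - center, axis=1)
        idx = np.where(d < radius)[0]
        if len(idx) == 0:
            idx = np.array([int(np.argmin(d))])
        idx = idx[np.argsort(d[idx])][:k]        # nearest first, truncate to k
        groups.append(np.concatenate([idx, np.full(k - len(idx), idx[0])]))
    return np.stack(groups)
```

The grouped neighbor coordinates and features would then pass through shared MLPs and max-pooling to produce X_embedding, as described above.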

Z-order Curve Guided Sampling Module
The principle of a space filling curve is to use a continuous curve to pass through all points in the space, so that each point corresponds to a position code. After the FPS based sampling&grouping, the Z-order curve coding function is adopted to further down-sample the local sub-clouds X_embedding to obtain local regions with semantically high-level correlations.
After the Z-order encoding, the 3D position coordinates of the local sub-cloud are mapped to the 1D feature space, as shown in Figure 3. The locality of the original points is well preserved due to the nature of the Z-order curve, which means that direct Euclidean neighbors in 1D tend to also be neighbors in 3D. After the points are encoded and sorted, equally spaced points are sampled, as shown in Figure 4.
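A minimal sketch of Z-order (Morton) coding and the equally spaced sampling described above. This is not the paper's code: the quantization of coordinates to a 2^10 grid and the helper names are our own assumptions.

```python
import numpy as np

def morton_code(points, bits=10):
    """Map 3D points to 1D Z-order (Morton) codes by bit interleaving.

    Coordinates are min-max normalized to a 2**bits grid, then the bits of
    x, y, z are interleaved so that nearby codes tend to be nearby in 3D.
    """
    p = points - points.min(axis=0)
    scale = p.max(axis=0)
    scale[scale == 0] = 1.0
    grid = np.floor(p / scale * (2**bits - 1)).astype(np.uint64)

    codes = np.zeros(len(points), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            bit = (grid[:, axis] >> np.uint64(b)) & np.uint64(1)
            codes |= bit << np.uint64(3 * b + axis)
    return codes

def z_order_sample(points, n_samples):
    """Sort points by Morton code, then take equally spaced points."""
    order = np.argsort(morton_code(points))
    return order[np.linspace(0, len(points) - 1, n_samples).astype(int)]

rng = np.random.default_rng(0)
pts = rng.random((1024, 3)).astype(np.float32)
idx = z_order_sample(pts, 64)    # 1024 -> 64 points, as in the paper
```

Sampling at equal intervals along the sorted curve spreads the kept points over the whole shape while preserving the curve's locality.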

Information fusion of local feature and structure feature
After obtaining the Z-order based sampled point cloud, the local sub-cloud features and the structure features are correlated to learn the shape structure and local region correlation information. As shown in Figure 5, a correlation tensor of size N′ × N″ × 2C is developed to evaluate the correlation between the local sub-cloud features of size N′ × C and the structure features of size N″ × C. The generation of the correlation tensor can be formalized as
Fusion(X, Y)_{ij} = Concat(X_i, Y_j),
where X ∈ R^{n×c} and Y ∈ R^{m×c} have the same number of feature channels, and X_i ∈ X and Y_j ∈ Y are single points in the respective point sets. The Concat() function concatenates the feature channels of X_i and Y_j. Then 2D convolution layers are designed to obtain the structure and local correlation of the point cloud, as shown in Figure 2. After the information fusion, a point cloud feature X_{C&S} ∈ R^{N′×C′} containing structure and correlation information is extracted. This process can be formalized as
X_{C&S} = H(P_structure, P_position) = Pooling(Relu(g(Concat(P_structure, P_position)))),
where g() is the Conv2d function and Concat() is the concatenation operation. Finally, as shown in Figure 2, X_{C&S}, X_embedding and X_position are fused together into the fusion feature X_fusion via skip connections, which can be formalized as
X_fusion = Concat(X_{C&S}, X_embedding, X_position),
where X_embedding ∈ R^{N′×C} represents the local point cloud features, X_{C&S} ∈ R^{N′×C} represents the skeleton structure features, X_position ∈ R^{N′×3} represents the point position features, and the Concat() function represents feature dimension concatenation of points.
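The pairwise construction of the correlation tensor can be sketched as follows. This is a NumPy toy version under stated assumptions: a random 1×1 projection stands in for the learned 2D convolution layers, and the dimensions match the paper's settings (N′ = 256, N″ = 64, C = 192).

```python
import numpy as np

def correlation_tensor(X, Y):
    """Fusion(X, Y): concatenate every local feature X_i with every
    structure feature Y_j along the channel axis, giving (n, m, 2c)."""
    n, c = X.shape
    m, _ = Y.shape
    Xe = np.broadcast_to(X[:, None, :], (n, m, c))
    Ye = np.broadcast_to(Y[None, :, :], (n, m, c))
    return np.concatenate([Xe, Ye], axis=-1)     # (n, m, 2c)

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 192))   # local sub-cloud features, N' x C
Y = rng.standard_normal((64, 192))    # structure features, N'' x C
T = correlation_tensor(X, Y)          # correlation tensor, N' x N'' x 2C

# Stand-in for the Conv2d + ReLU + Pooling step described in the text:
# a random 1x1 projection, ReLU, then max pooling over the N'' axis.
W = rng.standard_normal((384, 192))
feat = np.maximum(T @ W, 0).max(axis=1)   # X_{C&S}-like feature, N' x C'
```

Pooling over the N″ axis summarizes, for each local region, its strongest responses against all skeleton points, which is the correlation the module is after.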

Points channel-spatial attention module
As shown in Figure 6, a channel-spatial attention module, which runs the channel attention block and the spatial attention block in parallel, is adopted to strengthen PointSCNet's ability to capture the most important points and feature channels. In the channel attention block, the point cloud feature is aggregated by max-pooling and average-pooling operations and then forwarded to convolution layers, respectively. The convolution layers are designed to first reduce the feature dimension and then raise it for better feature extraction. The outputs of the convolution layers are summed and activated to learn the weight of each feature channel. The channel attention block is formalized as
Channel(X) = σ(Conv(Max(X)) + Conv(Avg(X))), (7)
where X ∈ R^{N′×C′}, and Max() and Avg() represent the max-pooling and average-pooling functions. In the spatial attention block, the features are fed to MLPs with shared weights, and then the information on each channel is aggregated through a batch normalization layer and a pooling layer to obtain the spatial position attention weights. The spatial attention block is formalized as
Spatial(X) = Pooling(BN(MLP(X))). (8)
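The two attention branches can be sketched in NumPy as below. This is an illustrative version only: the squeeze ratio, the sigmoid gating, the weight scaling, and the per-feature normalization standing in for BN are our assumptions, not the paper's exact design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(X, W1, W2):
    """sigma(Conv(Max(X)) + Conv(Avg(X))): pool over points, squeeze the
    channel dimension through W1, expand back through W2, then gate."""
    def mlp(v):
        return np.maximum(v @ W1, 0) @ W2
    return sigmoid(mlp(X.max(axis=0)) + mlp(X.mean(axis=0)))   # (C,) weights

def spatial_attention(X, W):
    """Pooling(BN(MLP(X))): shared per-point projection, a crude
    normalization standing in for BN, then pooling over channels."""
    h = X @ W                                  # shared MLP on each point
    h = (h - h.mean(0)) / (h.std(0) + 1e-5)    # batch-norm-like step
    return sigmoid(h.max(axis=1))              # (N,) per-point weights

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 192))             # N' x C' point features
W1 = 0.05 * rng.standard_normal((192, 24))      # squeeze: C' -> C'/8
W2 = 0.05 * rng.standard_normal((24, 192))      # expand back to C'
cw = channel_attention(X, W1, W2)
sw = spatial_attention(X, 0.05 * rng.standard_normal((192, 64)))
refined = X * cw[None, :] * sw[:, None]         # refined point features
```

Running both branches in parallel lets the module reweight channels and points independently before the refined feature is passed on.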

Experiments
In this section, quantitative and qualitative experiments are designed to demonstrate the performance of the proposed PointSCNet. First, the network is evaluated on the shape classification and part segmentation tasks. Then, additional quantitative analyses of the network are presented. Moreover, further visualization experiments are performed to demonstrate the ability of PointSCNet qualitatively. Finally, an ablation study is designed to show the effectiveness of each module of PointSCNet.

Implementation Details
The development environment is Ubuntu 18.04 + CUDA 11.1 + PyTorch 1.8.0, and the hardware is a single RTX 3080 GPU. In the classification task, we randomly sample 1024 points as the input of PointSCNet, and 2048 points in the segmentation task. The training hyperparameters are: batch size 24, 200 epochs, and an initial learning rate of 1 × 10^-3, with the learning rate decaying by a factor of 0.9 every 20 epochs. The optimizer is Adam with a weight decay rate of 1 × 10^-4. In Z-order sampling, the number of sampled points is set to 64. The loss is the cross entropy between the ground-truth labels and the predicted values. We set the number of local sub-clouds N′ = 256, the number of local sub-cloud feature channels C = 192, and the number of skeleton sub-clouds N″ = 64.
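The step decay described above can be written as a small helper, a sketch of the stated schedule rather than the authors' training code:

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.9, step=20):
    """Step schedule: multiply the initial rate by 0.9 every 20 epochs."""
    return base_lr * decay ** (epoch // step)
```

For example, the rate stays at 1e-3 for epochs 0-19, drops to 9e-4 at epoch 20, and so on down the run.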

Shape Classification on ModelNet40
The shape classification experiment is performed on the ModelNet40 [23] dataset, the most commonly used dataset for training point cloud classification networks. The dataset has 9,843 training samples and 2,468 test samples, belonging to 40 shape classes.
Table 1. Comparison with state-of-the-art methods on the ModelNet40 classification dataset. The column "Acc" means overall accuracy (%). All results quoted are taken from the cited papers. "xyz" in the Input column means the 3D coordinates of points and "nr" means normals.
PointNet++ [13] is the pioneering work on hierarchical point cloud feature extraction, capturing multi-scale local structure with hierarchical layers. It aggregates local features through a simple max-pooling operation without using their structural relationships. DGCNN [40] and SRN [14] are both classical methods for learning the structural relations of point clouds. DGCNN [40] simply concatenates the feature relationships of local sub-clouds in different dimensions, and the captured structural relationship features cannot fully represent the structure of the point cloud. SRN [14] adopts the regular FPS based sampling&grouping method to obtain local point clouds and simply concatenates the point positions and geometric features to capture the structural relationships. PointSCNet uses the space filling curve to sample the points that characterize the point cloud structure, and then processes them through a specially designed feature fusion module to explore the correlations between local regions and the structure of the point cloud. The performance is significantly improved compared with these baseline methods.

Part Segmentation on ShapeNet
The part segmentation experiment is performed on the ShapeNet [50] dataset, and the quantitative results are shown in Table 2. By capturing the skeleton structure features of the point cloud, PointSCNet significantly outperforms PointNet [12], PointNet++ [13], and SRN [14], and its performance is particularly outstanding in some specific classes, as shown in the table. Figure 7 shows the visualization of the part segmentation results of PointSCNet.
Table 2. The performance of the part segmentation task on ShapeNet. The metric is part-average Intersection-over-Union (IoU, %). All results quoted are taken from the cited papers.

Additional Quantitative Analyses
The number of model parameters indirectly reflects the training speed of the network. Our PointSCNet adopts the space filling curve guided sampling strategy to capture a few points that represent the local regions and structure of the point cloud, which reduces the number of model parameters. PointSCNet achieves outstanding classification accuracy with relatively few model parameters, as shown in Table 3. For PointNet++ [13], its multi-layer sampling structure introduces redundant information and slows down the training speed. Figure 8 shows that the loss curve of PointSCNet decreases more rapidly compared with PointNet++. The PCT [33] network uses the Transformer structure repeatedly to capture the structural relationship characteristics of the point cloud; hence it has excessive parameters and its convergence speed is slow. SRN [14] adopts the regular FPS based sampling&grouping method to obtain sub-regions and stacks duplicated SRN modules, which also leads to a large number of parameters and a slow convergence speed.

Method        #Params   Acc (%)
PointNet++ [13]  1.748M   91.9
SRN [14]         3.743M   91.5
DGCNN [40]       1.811M   92.9
NPCT [33]        1.36M    91.0
SPCT [33]        1.36M    92.0

Additional Visualization Experiments
The heat map of points with high response to PointSCNet is shown in Figure 9. The points are colored according to their response to the network, and those with higher response are colored darker. The darker points of the mug display the model structure. The darker points of the airplane mainly gather on one side of the symmetry axis, which indicates the symmetry of the airplane model. According to the visualization results, points with higher response either present the structure of a point cloud or show the geometric and locational interactions of local regions, which proves that the points sampled by the Z-order sampling module represent meaningful geometric local regions and that the information fusion block extracts the structure and correlation information effectively. Figure 10 shows the performance of PointSCNet in feature extraction. By using t-SNE [52] to reduce the dimension of the high-dimensional features to 2D, the classification ability of the network is visualized as shown in the figure. It can be seen that most of the classes are divided into separate clusters. For some classes with similar point cloud structures, such as tables and stools, which are close in semantic space, PointSCNet can still distinguish them precisely.

Ablation Study
A set of ablation studies is designed to test the impact of the critical components of our network, including the Z-order sampling block (Section 3.2), the structure and correlation information fusion module (Section 3.3), and the channel-spatial attention block (Section 3.4). The ablation strategies and results are shown in Table 4. It can be found that all the critical components of PointSCNet improve the network performance. The Z-order sampling block and the C&S module provide obvious improvements. The convergence speed is slow when only the information fusion module is used. When all three modules are used at the same time, the model training speed is greatly improved and the highest accuracy on the classification task is achieved, which further proves the importance of each module.

Conclusion
In this paper, a point cloud processing network named PointSCNet is proposed to learn the shape structure and local region correlation information based on space filling curve guided sampling. Different from most existing methods that use FPS for down-sampling, which only utilizes low-dimensional Euclidean distance, the proposed space filling curve guided sampling module uses the Z-order curve for sampling to explore high-level correlations of points and local regions. The features of the sampled points are fused in the proposed information fusion block, in which the shape structure and local region correlation are learned. Finally, the channel-spatial module is designed to enhance the features of key points. Quantitative and qualitative experimental results demonstrate that the proposed PointSCNet learns the point cloud structure and correlation effectively and achieves superior performance on shape classification and part segmentation tasks. The idea of structure and correlation learning can be adopted for related vision tasks beyond 3D point processing. Hence, in the future, we plan to optimize our network and apply the method to more vision scenarios [53][54][55].

Figure 1 .
Figure 1. Learning structure and correlation on a point cloud based on space filling curve guided sampling. The columns from left to right are the original input point cloud, the points sampled by the Z-order space filling curve, and the point cloud heat map based on the response of points to the proposed network, respectively.

Figure 2 .
Figure 2. Model architecture of PointSCNet: The original point cloud is fed to a sampling&grouping block. Then a Z-order sampling block is designed for the further generation of local regions. After the sampled point cloud features are extracted, the feature fusion module learns the structure and correlation information. At last, the point cloud feature is forwarded to the PointCSA block, which is based on the channel-spatial attention mechanism, to get the refined feature for classification and segmentation.

Figure 3 .
Figure 3. The point cloud structure obtained by sampling 1024 points from the original point cloud using the Z-order space filling curve.

Then, the point set with N″ points and C′-dimensional features, denoted as X′_{Z-order} ∈ R^{N″×C′}, and the point set with N″ points and 3D coordinates, denoted as X_{Z-order} ∈ R^{N″×3}, are sampled. The final sampled point set represents the global structure and local correlation of the original point set.

Figure 4 .
Figure 4. Sampling strategy based on Z-order curve sorting. Equally spaced points are sampled; the spacing is set to 3 in the figure.

Figure 5 .
Figure 5. The correlation tensor is designed for the evaluation of the correlation between the local features and the structure features. N′ and N″ represent the numbers of points sampled via FPS and the Z-order sampling block, and C is the number of feature channels of the points.

Figure 6 .
Figure 6. Points channel-spatial attention module: the point features are fed to the channel-spatial module to capture the most important points and feature channels. In the channel attention module, channel weights are obtained via the two aggregation functions and convolution layers. In the spatial attention module, spatial weights are obtained via shared MLPs.

Figure 7 .
Figure 7. Results of our PointSCNet on the part segmentation.
The ShapeNet dataset covers 55 common object categories with approximately 51,300 3D models. The part segmentation task is performed on the ShapeNet part segmentation dataset with 16,880 models and 16 categories. The 3D models are divided into 14,006 training point clouds and 2,874 test point clouds, where each point is associated with a point-wise label for the segmentation task. In the part segmentation task, we randomly sample 2048 points as the original input of PointSCNet. The quantitative results of PointSCNet and some classical state-of-the-art methods are shown in Table 2.

Figure 9 .
Figure 9. Heat map for points with high response to PointSCNet.

Figure 10 .
Figure 10. Visualization results of t-SNE on the ModelNet40 dataset.

Table 3 .
The performance of PointSCNet on the ModelNet40 dataset for the classification task.

Figure 8. Experiment on the drop speed of the loss curve.

In the table model, both the model structure and the repetitively arrayed table legs are emphasized. The points with high response appear on the circular rim of the bowl.

Table 4 .
The strategies and results of the ablation studies. "ZS" represents the Z-order curve guided sampling block, "C&S" represents the structure and correlation information fusion module, and "AM" is the channel-spatial attention module. "✓" represents presence and "×" represents absence. "ToBestAcc" is the minimum number of epochs for PointSCNet to achieve its highest accuracy in the training phase.