Learning Polynomial-Based Separable Convolution for 3D Point Cloud Analysis

Shape classification and segmentation of point cloud data are two of the most demanding tasks in photogrammetry and remote sensing applications, which aim to recognize object categories or point labels. Point convolution is an essential operation when designing a network on point clouds for these tasks, which helps to explore 3D local points for feature learning. In this paper, we propose a novel point convolution (PSConv) using separable weights learned with polynomials for 3D point cloud analysis. Specifically, we generalize the traditional convolution defined on the regular data to a 3D point cloud by learning the point convolution kernels based on the polynomials of transformed local point coordinates. We further propose a separable assumption on the convolution kernels to reduce the parameter size and computational cost for our point convolution. Using this novel point convolution, a hierarchical network (PSNet) defined on the point cloud is proposed for 3D shape analysis tasks such as 3D shape classification and segmentation. Experiments are conducted on standard datasets, including synthetic and real scanned ones, and our PSNet achieves state-of-the-art accuracies for shape classification, as well as competitive results for shape segmentation compared with previous methods.


Introduction
With the development of 3D sensors, point clouds are becoming an important data type in applications such as autonomous driving, archaeology, robotics, augmented reality [1][2][3]. For these applications, shape classification and segmentation are two of the fundamental research topics, which aim to automatically recognize 3D object categories or predict point labels [4][5][6][7], and they are also the topics of our work. However, the processing of a point cloud is an intractable problem with significant challenges [4,5], i.e., the irregular and orderless properties of a point cloud make it impossible to directly apply Convolutional Neural Networks (CNNs) to them. In Figure 1, we present two objects from ScanObjectNN [8] represented by point clouds. As shown in the figure, the points are orderless, and they are irregularly distributed. Furthermore, there are noisy points from the background and holes in the point clouds. These factors all cause difficulties for the processing of a point cloud. In our work, we focus on the processing of irregular and orderless point clouds, and we aim to extract effective point features with a novel point convolution for object categorization and point cloud segmentation.
To process the irregular and orderless 3D point cloud for object category recognition or point label prediction, numerous deep learning methods on 3D data have been proposed in recent years. Inspired by the significant success of CNNs on 2D images, some works first convert the point cloud to grid data and then apply CNNs to these regular data. These methods can be broadly divided into voxel-based and view-based methods. Voxel-based methods [9,10] convert 3D data to a collection of voxels and then design networks on the regular 3D voxels as in 2D images, while view-based methods [11,12] represent 3D data with images rendered from multiple views and then take the rendered images as input. These methods have achieved impressive performances on various 3D tasks such as shape classification and retrieval. However, both need to convert raw point cloud data to voxels or images, which brings additional computational cost, and they also suffer from the computational complexity brought about by 3D voxel or multi-image representations. Moreover, voxelizing the point cloud usually leads to loss of shape detail and data sparsity, and the view-based methods depend heavily on camera positions to capture the geometric details of a shape. Therefore, algorithms that work directly on the original point cloud, i.e., point-based methods, have recently become an active research field. The original point cloud contains rich geometric and semantic information, so it is easier for algorithms to realize shape recognition or scene perception. Previous works have demonstrated the advantages and successes of point-based methods for 3D shape analysis tasks such as classification, retrieval, segmentation and detection [4][5][6][7][13][14][15].
For point-based methods, an essential challenge in processing the irregular and orderless point cloud is that it is infeasible to apply standard CNNs directly to point clouds. Numerous works have been proposed to generalize CNNs and design point convolution operations that are adapted to point clouds. Many methods first update local pointwise features and then aggregate them by a max-pooling operation to capture the features with the strongest activation, without leveraging local structure [4,5,13,14]. Some works, such as [7,16,17], convert local point data to a regular representation online in the network and then apply traditional convolution to the converted data. The authors of [18,19] design point convolution with the help of regular representations by kernel points or weights. In addition to these strategies, many works design convolution with customizable spatial filters based on point coordinates or relations between points [6,20,21]. We present a more detailed description of point-based methods in the Related Work section. Although these point-based deep learning methods have made remarkable progress in the past years, they still face difficulty in designing an effective convolution operation for feature learning, especially in designing convolution filters that are adaptive to irregular point clouds with noise and holes. Most of them perform better in the analysis of synthetic 3D objects, such as computer-aided design (CAD) data, which are complete, well-segmented, and noise-free, while their performances drop when operating on real scanned data [8]. We think that those drops result from the limited representation capability of their convolution, because the shape implied in irregular points is difficult to capture.
In our work, to deal with the irregular and orderless point clouds, we present an intuitive method to achieve a more precise approximation of ideal convolution kernels. We propose a novel point convolution, i.e., Polynomial-based Separable Convolution (PSConv), to process points, with the convolution constructed based on polynomials. This design benefits from the expressive power and approximation ability of the polynomials. Compared with previous methods, this polynomial-based strategy can better capture local shape geometry. With our PSConv as the basic layer, we further propel it with a novel and efficient strategy by a separable formulation. This separable formulation can significantly reduce the parameter size and computational cost, making it capable of building a multilayer deep convolutional network on 3D point clouds. The primary contributions of this work are summarized as follows.
Firstly, we design a novel point convolution to extract pointwise features, with the convolution kernels constructed based on polynomials of the transformed local point coordinates. Considering that the polynomials can approximate any smooth function, our convolution kernels can approximate ideal convolution kernels and capture the local geometric information hidden behind the unstructured points.
Secondly, we propose a separable formulation for our convolution on a 3D point cloud. A naive application of our proposed point convolution would bring about huge computational cost. With this separable formulation, the parameter size and computational complexity are significantly reduced. This separable convolution is efficient to apply, which makes it possible to build a deep convolutional network on 3D point clouds.
Thirdly, with our PSConv, we design a hierarchical architecture, i.e., PSNet, for 3D point cloud classification and segmentation tasks. Our PSNet achieves better or competitive performances compared with state-of-the-art methods on a standard synthetic dataset and scanned real-world dataset. For example, it achieves 93.1% OA for classification on ModelNet40 [9] and 86.2% IoU for segmentation on ShapeNet Part [22], and it also achieves the best shape classification accuracy on ScanObjectNN [8].
The rest of this paper is organized as follows. In Section 2, the literature on point-based deep learning methods is reviewed. In Section 3, we introduce the proposed method in detail. In Section 4, we evaluate our method on standard datasets, with a presentation, comparison and discussion of the results. Section 5 is the conclusion.

Convolution on 3D Point Cloud
As a data type, a 3D point cloud is irregular and orderless, so the traditional CNNs that work on regular data such as images cannot be directly utilized. To deal with 3D objects and extract pointwise descriptors directly on 3D point cloud data, various methods have been proposed.
One general strategy for point cloud analysis is to work directly on the 3D points by first updating the pointwise features and then pooling them with a max-pooling operation across points. PointNet [4] pioneers these works by first designing a multi-layer perceptron (MLP) shared among points to extract pointwise descriptors and then applying max-pooling to aggregate these point features into a global shape descriptor, which is finally sent to Fully-Connected (FC) layers and a Softmax operation for shape label prediction. PointNet++ [5] advances PointNet by applying it on local points to extract point features and then gradually coarsening the shape with the Farthest Point Sampling technique. Successive local PointNet and coarsening operations are applied such that a hierarchical architecture is derived. The point features of the coarsest shape are aggregated with max-pooling, and the shape label is finally predicted with FC layers and a Softmax operation. Considering that PointNet [4] and PointNet++ [5] learn point features with MLPs, more methods have been proposed to improve them with various feature learning strategies. RSCNN [13] first learns to reweight point features with reweighting vectors learned by a shared MLP on local geometric relations, and then the reweighted features are max-pooled and updated with another MLP. With Farthest Point Sampling, it constructs a hierarchical architecture similar to PointNet++ and predicts shape labels with FC layers. DGCNN [14] first finds local neighbors by distance in feature space, from which edges are computed. These edges are sent to an MLP to learn features in its EdgeConv layer, and the output features of the last EdgeConv layer are aggregated globally with max-pooling or average-pooling to form a global descriptor, which is used to generate classification scores.
These methods first update local point features and then utilize a symmetric operation such as max-pooling to aggregate them, which can deal with the irregular and orderless properties of point clouds. However, the max-pooling operation pools all pointwise features into a single feature, which may ignore some detailed features encoded for the points.
Another common strategy is to design point operations by converting local irregular and orderless points to a regular representation inside the network, which is similar to voxel-based and view-based methods, so that traditional convolution can be utilized on these regular data. PointCNN [7] first updates local point features by MLP and then learns an X-transformer based on local point coordinates, which is utilized to reorder local point features. These reordered features are taken as regular data, on which spatial 1D convolution can be conducted. SPLATNet [16] first interpolates input features onto a permutohedral lattice, then designs convolution over this regular lattice, whose signal is finally mapped back to points. The work of [17] proposes Tangent Convolution by first projecting local surface geometry onto a tangent plane around every point. This yields a set of tangent images, and every tangent image is treated as a regular 2D grid that supports planar convolution. There are also some works that design local point convolution with the help of discrete representations, by which they can design convolution on a fixed number of discrete points or kernels. KPConv [18] defines convolution weights by kernel points, which are applied to the input points close to them, and their locations are continuous in space and can be learned by the network. InterpConv [19] utilizes discrete kernel weights and interpolates point features to neighboring kernel-weight coordinates by an interpolation function. PointGrid [23] proposes a convolutional network that incorporates a constant number of points within a grid cell. A-CNN [24] specifies regular ring-shaped structures and directions in the computation. 3DmFV [25] utilizes the generalized Fisher Vector to achieve a fixed-size representation of a possibly variable number of points in the cloud.
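The role of the symmetric pooling step can be illustrated with a minimal NumPy sketch. It is not the PointNet implementation: the single linear layer, its random weights, and the toy point cloud are assumptions chosen only to show that a shared per-point transform followed by max-pooling gives the same descriptor for any ordering of the points.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(points, w, b):
    """Apply the same linear layer + ReLU to every point (each row of `points`)."""
    return np.maximum(points @ w + b, 0.0)

def global_descriptor(points, w, b):
    """Max-pool pointwise features over the point axis (a symmetric operation)."""
    return shared_mlp(points, w, b).max(axis=0)

points = rng.normal(size=(128, 3))   # toy point cloud with 128 points
w = rng.normal(size=(3, 16))         # hypothetical layer weights
b = rng.normal(size=16)              # hypothetical layer bias

desc = global_descriptor(points, w, b)
# Shuffling the point order must not change the pooled descriptor.
desc_shuffled = global_descriptor(points[rng.permutation(128)], w, b)
```

The pooled descriptor keeps only one value per channel, which is also why fine per-point detail can be lost, as noted above.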
These works try to convert local irregular points into regular formation or represent local irregular data with discrete formation, such that traditional CNNs can be employed. However, they suffer from converting raw 3D point clouds to new representations, which may be inefficient and lose geometric details of the raw point clouds.
Alternatively, some works try to generalize and learn convolution filters that are adaptive to the irregular 3D point cloud data, which are then directly utilized to conduct convolution on point clouds. These works are the most related to ours. SpiderCNN [6] first extracts k-nearest neighbors (KNN) points for every point in the shape and then designs the convolution filters as a product of a weight vector and a Taylor expansion of local point coordinates. Then, these convolution filters are employed to conduct convolution on local point features. PointWeb [20] learns convolution kernels as impact functions employed with MLP on the feature differences, which are utilized to first reweight the feature differences and then sum them up as the output of their local operator. PointConv [21] also extends traditional convolution by parameterizing a family of filters, and they treat convolution filters as non-linear functions (MLP) of the local coordinates of 3D points. These convolution filters are used for convolution on point features. The updated features are finally added up as the output of their point convolution. These works generalize traditional convolution on regular data and define point convolution for the irregular and orderless points. They focus on designing point convolution kernels based on local geometry or relations, such that they can extract point features with the help of local shape information. However, they face the challenge of designing effective and expressive kernels, which is of great importance for feature extraction. For our work, we also design point convolution kernels and aim to propose an effective solution for this challenge by learning adaptive kernels. However, we advance them and realize this idea based on polynomials of local transformed point coordinates, which benefit from the approximation and expression abilities of polynomials. Experiments also prove the efficacy of our strategy for point cloud analysis.

Separable Convolution
To reduce parameter size and computational cost, many works design their algorithms with the help of separable convolution [26] to construct lightweight architectures, which have been successfully applied to mobile networks [27,28]. As a special kind of spatial separable convolution, the Fast Fourier Transform (FFT) rapidly converts a signal from its original domain to a representation in the frequency domain by factorizing (separating) the discrete Fourier transform matrix into a product of sparse factors. The work of [29] decomposes the 3D filters into three 1D kernels that work in different directions separately. On the other hand, the depthwise separable convolution consists of a depthwise convolution and a pointwise convolution, and it was first utilized in neural network design in the work of [30]. The depthwise convolution is a spatial convolution performed independently over each channel of an input, and the pointwise convolution is in fact a 1 × 1 convolution, which projects the output of the depthwise convolution onto a new channel space. The depthwise separable convolution is a computationally efficient factorized form of the standard convolution, and it is employed as the most critical ingredient in many efficient CNN architectures such as ShuffleNet [31] and MobileNetV2 [32]. Both the spatial separable convolution and the depthwise separable convolution are efficient to conduct and have achieved impressive performances.
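The saving from the depthwise separable factorization can be made concrete by counting weights. The sketch below is plain arithmetic and ignores biases; the kernel size and channel counts are illustrative assumptions, not values from any of the cited networks.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution mixing channels
    return depthwise + pointwise

k, c_in, c_out = 3, 64, 128       # illustrative sizes (assumption)
n_standard = standard_conv_params(k, c_in, c_out)          # 73728
n_separable = depthwise_separable_params(k, c_in, c_out)   # 8768
```

For these sizes the factorized form needs roughly one eighth of the weights of the standard convolution.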
Separable convolution is also modified and utilized for 3D point cloud analysis to accelerate computational speed [7,21,33]. In the work of PointCNN [7], they adopt the depthwise separable convolution as a key step in their proposed convolution on 3D point clouds to reduce both parameter number and computational cost. Specifically, they first update the point features in feature space with MLP and then aggregate them spatially with standard 1D convolution. For PointConv [21], they reformulate their point convolution by reducing it to two standard operations, i.e., matrix multiplication and 1 × 1 convolution, for efficiency. For the method of SegGCN [33], the proposed fuzzy kernel is separated into the depthwise and pointwise operations to make their convolution more efficient. They firstly apply the discrete kernels to depthwise convolutions alone, following which pointwise convolution is readily achieved with 1 × 1 convolutions. For our work, we also utilize the idea of separable convolution. We do not explicitly split the convolution into depthwise and pointwise ones but advance it by separating the convolution kernels into a flexible and adaptive combination. Using this strategy, we significantly reduce the parameter size and computational cost. This efficient point convolution is also effective, as shown in the experiments.

Method
In this section, we introduce our PSConv and PSNet in detail, with the pipeline presented in Figure 2. PSConv is our proposed convolution defined on a local point cloud, and we further propose the separable formulation of our PSConv to reduce parameter size and computational complexity, as shown in Figure 3. With our PSConv as the basic layer, we construct our PSNet with a hierarchical architecture, which can be employed for 3D point cloud classification and segmentation tasks. Refer to Section 3.2 for the details of the separable PSConv.

PSConv
We aim to define an effective convolution on a 3D point cloud that directly operates on local points to extract point features. The key idea of our approach is to define a set of customizable convolution filters based on polynomials of transformed local point coordinates. Specifically, we first linearly transform the coordinates of the local point cloud and then compute their high-order powers, whose polynomials are learned and utilized as convolution kernels for the convolution on point clouds. The pipeline of PSConv is shown in Figure 2a, and now we introduce it in detail.
Without loss of generality, we take local 3D points $\{p_k\}_{k=1}^{K}$ with input features $F = \{f_{ki}\} \in \mathbb{R}^{K \times D_{in}}$, where $k$, $i$ are indices for point and feature channel, respectively. Note that $\{p_k\}_{k=1}^{K}$ denotes the centralized point coordinates obtained by subtracting the coordinate of the center point $p_1$, and we sort them by increasing distance to $p_1$. $K$ is the number of local points, and we select the local k-nearest neighbor (KNN) points of the center point $p_1$. For our PSConv, we first conduct a linear transform on the coordinates of the local points $p_k = (x_k, y_k, z_k)$ and obtain

$$P = \left[\, a_l x_k + b_l y_k + c_l z_k + d_l \,\right]_{l=1}^{L}, \tag{1}$$

where $P = \{p_{kl}\} \in \mathbb{R}^{K \times L}$, $[\cdot]$ means the concatenation operation, and $\{a_l, b_l, c_l, d_l\}_{l=1}^{L}$ are the parameters to learn. To better explore the clues hidden behind $P$ and take advantage of polynomials to approximate the ideal filters adaptively, we further introduce high-order powers of these linearly transformed points, i.e., computing the $m$-order powers of $P$ as

$$\tilde{P} = \left[\, P^{m} \,\right]_{m=1}^{M}, \quad \text{i.e.,} \quad \tilde{p}_{klm} = (p_{kl})^{m}. \tag{2}$$

We take $\tilde{P} \in \mathbb{R}^{K \times L \times M}$ as the basic element to construct our convolution filters, which are in fact a set of polynomials with learned combination parameters. Specifically, based on $\tilde{P}$, we define the filters $G = \{g_{kij}\} \in \mathbb{R}^{K \times D_{in} \times D_{out}}$ of our PSConv as

$$G = \mathrm{Conv}[L, M](\tilde{P}, \Phi), \quad \text{i.e.,} \quad g_{kij} = \sum_{l=1}^{L} \sum_{m=1}^{M} \tilde{p}_{klm}\, \varphi_{klmij}, \tag{3}$$

where $\Phi = \{\varphi_{klmij}\}$ are the parameters to learn. $\mathrm{Conv}[L, M](\tilde{P}, \Phi)$ means a convolution with kernel width and height $L$ and $M$, respectively, whose input is $\tilde{P}$ and whose kernels to learn are $\Phi$. Note that $g_{kij}$ is a polynomial, and after network training, the learned parameters $\Phi$, guided by the downstream task and loss function, approximate the ideal convolution filters. Based on the learned convolution kernels $G$, we finally conduct convolution on the input point feature $F$ and get the output of PSConv $\hat{F} = \{\hat{f}_j\} \in \mathbb{R}^{D_{out}}$ defined as

$$\hat{f}_j = \sum_{k=1}^{K} \sum_{i=1}^{D_{in}} g_{kij}\, f_{ki}. \tag{4}$$

Our PSConv is a novel convolution on a 3D point cloud with learned filters $G$ based on a linear transform followed by a polynomial non-linearity over local points.
Compared with the traditional convolutions that take non-linear transforms like ReLU and sigmoid to generate convolution filters in a limited range of values, our polynomial-based formulation can flexibly learn convolution filters with an unrestricted range of values, because polynomials can theoretically approximate any smooth function [34]. As a layer defined on a point cloud, PSConv can be utilized for feature learning and inserted into any network for 3D point cloud analysis tasks.
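The full PSConv described in this subsection can be sketched in NumPy for a single local neighborhood. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `psconv_full`, the packing of the linear-transform parameters into a matrix `A` and a vector `d`, the random weights, and the feature sizes `D_in`, `D_out` are all hypothetical, and the ReLU after the linear transform and the BN on the output mentioned in the training details are omitted.

```python
import numpy as np

def psconv_full(points, feats, A, d, Phi):
    """
    points: (K, 3) centered local coordinates
    feats:  (K, D_in) input point features
    A, d:   (3, L) and (L,), holding the linear-transform parameters {a_l, b_l, c_l} and {d_l}
    Phi:    (K, L, M, D_in, D_out) polynomial combination weights
    """
    M = Phi.shape[2]
    P = points @ A + d                                             # linear transform, (K, L)
    P_pow = np.stack([P ** m for m in range(1, M + 1)], axis=-1)   # powers 1..M, (K, L, M)
    G = np.einsum("klm,klmij->kij", P_pow, Phi)                    # polynomial filters, (K, D_in, D_out)
    return np.einsum("kij,ki->j", G, feats)                        # output feature, (D_out,)

rng = np.random.default_rng(0)
K, L, M, D_in, D_out = 16, 10, 3, 8, 4          # D_in, D_out are illustrative
points = rng.normal(size=(K, 3))
points = points - points[0]                      # center on the first point
feats = rng.normal(size=(K, D_in))
A = rng.normal(size=(3, L))
d = rng.normal(size=L)
Phi = rng.normal(size=(K, L, M, D_in, D_out)) * 0.01

out = psconv_full(points, feats, A, d, Phi)
```

Each filter entry is a polynomial of degree at most `M` in the transformed coordinates, while the output stays linear in the input features.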

Separable Formulation of PSConv
The bottleneck of directly conducting our convolution on a point cloud as in Equation (4) is the parameter size and computational cost. In this subsection, we propose an efficient strategy, i.e., a separable formulation of PSConv, to reduce the parameter size and computational complexity. A simple pipeline of our separable PSConv is presented in Figure 3, and now we introduce it in detail.
In Equations (3) and (4), the output feature dimension of PSConv is decided by $\Phi$, and the full PSConv layer can be written as

$$\hat{f}_j = \sum_{k=1}^{K} \sum_{i=1}^{D_{in}} \left( \sum_{l=1}^{L} \sum_{m=1}^{M} \tilde{p}_{klm}\, \varphi_{klmij} \right) f_{ki}, \tag{5}$$

where $\Phi = \{\varphi_{klmij}\} \in \mathbb{R}^{K \times L \times M \times D_{in} \times D_{out}}$ is a five-dimensional weight tensor to learn, which also brings about huge computational cost. To reduce the parameter size of our point convolution, inspired by separable convolution (e.g., FFT), we constrain each element of this weight tensor to be a product of elements from two smaller tensors $\bar{\Phi} = \{\bar{\varphi}_{kij}\} \in \mathbb{R}^{K \times D_{in} \times D_{out}}$ and $\tilde{\Phi} = \{\tilde{\varphi}_{lmi}\} \in \mathbb{R}^{L \times M \times D_{in}}$, with the original element of $\Phi$ separated by

$$\varphi_{klmij} = \bar{\varphi}_{kij}\, \tilde{\varphi}_{lmi}.$$

Note that with this separation, the parameter size is significantly reduced. By this assumption, we can rewrite Equation (5) as

$$\hat{f}_j = \sum_{k=1}^{K} \sum_{i=1}^{D_{in}} \bar{\varphi}_{kij}\, f_{ki}\, h_{ki}, \tag{6}$$

where $h_{ki} = \sum_{l=1}^{L} \sum_{m=1}^{M} \tilde{p}_{klm}\, \tilde{\varphi}_{lmi}$. Based on the separable formulation of Equation (6), we further introduce a non-linear transform on $h_{ki}$ and define our separable formulation of PSConv as

$$\hat{f}_j = \sum_{k=1}^{K} \sum_{i=1}^{D_{in}} \bar{\varphi}_{kij}\, \hat{h}_{ki}, \tag{7}$$

where $\hat{h}_{ki}$ is the Hadamard product of $f_{ki}$ and $\beta(h_{ki})$, i.e., $\hat{h}_{ki} = f_{ki}\, \beta(h_{ki})$. $\beta$ is a non-linear transform composed of Batch Normalization (BN) and ReLU operations.
To present the process clearly, we illustrate the pipeline of our separable PSConv layer in Figure 3. Compared with the full convolution defined in Section 3.1, our separable PSConv only needs to learn the parameters $\bar{\Phi} = \{\bar{\varphi}_{kij}\} \in \mathbb{R}^{K \times D_{in} \times D_{out}}$ and $\tilde{\Phi} = \{\tilde{\varphi}_{lmi}\} \in \mathbb{R}^{L \times M \times D_{in}}$, which are far fewer. The computational complexity of separable PSConv with Equation (7) is $O(KLMD_{in})$, which is significantly lower than that of the original PSConv in Equation (5), whose computational complexity is $O(KLMD_{in}D_{out})$. The separable PSConv can be taken as a basic layer to construct a network, and we use it to design our hierarchical PSNet.
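The separable formulation can be sketched in NumPy and checked against the full convolution. This is a minimal sketch under stated assumptions: the function and variable names are hypothetical, the weights are random, and the non-linearity `beta` is reduced to a plain ReLU because Batch Normalization needs batch statistics that a single neighborhood does not provide. With `beta` set to the identity, the sketch reproduces the full convolution whose weights are the product of the two smaller tensors.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def psconv_separable(P_pow, feats, Phi_bar, Phi_tilde, beta=relu):
    """
    P_pow:     (K, L, M) powers of the linearly transformed coordinates
    feats:     (K, D_in) input point features
    Phi_bar:   (K, D_in, D_out) first factor of the separated weights
    Phi_tilde: (L, M, D_in) second factor of the separated weights
    """
    h = np.einsum("klm,lmi->ki", P_pow, Phi_tilde)     # polynomial response per point and channel
    h_hat = feats * beta(h)                            # Hadamard product with the features
    return np.einsum("kij,ki->j", Phi_bar, h_hat)      # output feature, (D_out,)

rng = np.random.default_rng(0)
K, L, M, D_in, D_out = 16, 10, 3, 8, 4                 # D_in, D_out are illustrative
P_pow = rng.normal(size=(K, L, M))
feats = rng.normal(size=(K, D_in))
Phi_bar = rng.normal(size=(K, D_in, D_out))
Phi_tilde = rng.normal(size=(L, M, D_in))

# Full five-dimensional weights implied by the separation assumption.
Phi_full = np.einsum("kij,lmi->klmij", Phi_bar, Phi_tilde)
out_full = np.einsum("klm,klmij,ki->j", P_pow, Phi_full, feats)

# With an identity beta the separable form matches the full form exactly.
out_sep_linear = psconv_separable(P_pow, feats, Phi_bar, Phi_tilde, beta=lambda x: x)
# With the ReLU the layer becomes the non-linear separable variant.
out_sep = psconv_separable(P_pow, feats, Phi_bar, Phi_tilde)
```

The five-dimensional tensor `Phi_full` is built here only for the comparison; the separable layer never materializes it.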
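The reduction in parameter size follows directly from the tensor shapes given above. The sketch below counts weights for the PSConv settings reported later (K = 16, L = 10, M = 3); the channel sizes `D_in = D_out = 64` are an assumption made only for this illustration.

```python
def full_psconv_params(K, L, M, D_in, D_out):
    """Size of the five-dimensional weight tensor of the full PSConv."""
    return K * L * M * D_in * D_out

def separable_psconv_params(K, L, M, D_in, D_out):
    """Combined size of the two smaller tensors of the separable PSConv."""
    return K * D_in * D_out + L * M * D_in

K, L, M = 16, 10, 3
D_in, D_out = 64, 64              # illustrative channel sizes (assumption)

n_full = full_psconv_params(K, L, M, D_in, D_out)        # 1966080
n_sep = separable_psconv_params(K, L, M, D_in, D_out)    # 67456
```

For these sizes the separable layer stores roughly 29 times fewer weights than the full layer.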

PSNet
In this subsection, we introduce how to use PSConv as the basic layer to build our hierarchical PSNet for 3D point cloud analysis. In Figure 2b, we present the pipeline of our PSNet, which consists of several stages and sampling operations as well as MLP and max-pooling. Specifically, in every stage, we first update pointwise features with the help of local KNN points, i.e., we apply MLP and max-pooling operations within local KNN points to extract pointwise descriptors. Note that the MLP and max-pooling are basic operations in PointNet++ [5]. With the updated features, we then apply four consecutive PSConv layers to strengthen the point descriptors, and their output features are concatenated as the output of this stage.
With the basic stage described above, we construct our hierarchical PSNet as presented in Figure 2b. Given the 3D shape, we first apply one basic stage (Stage 1) to extract point features and then coarsen the shape by Farthest Point Sampling with a point sampling rate of 25%, followed by another basic stage (Stage 2) with the same structure as Stage 1 to update point features. More stages and sampling operations can be added to form a deeper hierarchical architecture. In our PSNet, we use two stages.
Our PSNet can be applied to 3D point cloud analysis tasks such as classification and segmentation. For shape classification, after the last stage, we further apply one shared MLP to all the point features and then max-pool them to form a shape descriptor, which is finally fed to the last MLP followed by a Softmax operation for shape category prediction. For shape segmentation, after the last stage, we further apply one shared MLP to all the point features and then max-pool them to form a global shape descriptor, which is then propagated from sparse points to dense points gradually based on the distances between points. This feature propagation (FP) is also a basic component of PointNet++ [5], which consists of feature interpolation (FI) and MLP operations. We finally predict point labels with MLP and Softmax operations. Cross-entropy loss is applied to our PSNet for both shape classification and segmentation tasks. In Table A1 of Appendix A, we list the details of our network, such as the parameter size and architecture of every stage.
For PSNet, we take 1024 points for both shape classification and segmentation tasks. For PSConv, we set the parameters as L = 10, M = 3, K = 16. We add ReLU after the linear transform of Equation (1), and we use BN and ReLU on the output of PSConv. When training our PSNet, the Adam optimizer is utilized, with the initial learning rate, epoch number, and batch size set to 0.001, 250, and 32, respectively. The learning rate is exponentially decayed with a decay rate of 0.7 and a decay step of 200,000. We utilize the data augmentation strategy of [6] to train our network. That is, for point cloud classification, the point cloud is randomly rotated about the up-axis, and the position of each point is jittered by Gaussian noise with zero mean and 0.01 standard deviation. For segmentation, we only add the jitter noise. Clean data are utilized for testing in both classification and segmentation tasks.
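The coarsening step can be sketched with a simple greedy Farthest Point Sampling routine in NumPy. This is an illustrative sketch rather than the implementation used in PSNet: the function name, the choice of the first point, and the random toy shape are assumptions. The 25% rate applied to 1024 input points gives 256 sampled points.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, start=0):
    """Greedily pick `n_samples` indices so that each new point is the one
    farthest from all points selected so far."""
    n = points.shape[0]
    selected = np.empty(n_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)            # squared distance to the nearest selected point
    current = start
    for s in range(n_samples):
        selected[s] = current
        dist = np.sum((points - points[current]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, dist)
        current = int(np.argmax(min_dist))   # farthest remaining point
    return selected

rng = np.random.default_rng(0)
shape_points = rng.normal(size=(1024, 3))    # toy shape with 1024 points
idx = farthest_point_sampling(shape_points, n_samples=1024 // 4)
coarse_points = shape_points[idx]            # 256 points kept for the next stage
```

This version costs O(N) per selected point; practical pipelines usually run the same loop on the GPU.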
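The distance-based propagation from sparse to dense points can be sketched as inverse-distance weighted interpolation over the nearest sparse points, which is the scheme used by the feature propagation of PointNet++. The sketch is written under assumptions: the function name, the use of three neighbors with squared-distance weights, and the toy data are illustrative and are not confirmed details of PSNet.

```python
import numpy as np

def interpolate_features(dense_xyz, sparse_xyz, sparse_feats, k=3, eps=1e-8):
    """Propagate features from sparse points to dense points using
    inverse squared-distance weights over the k nearest sparse points."""
    d2 = ((dense_xyz[:, None, :] - sparse_xyz[None, :, :]) ** 2).sum(-1)   # (N, S)
    nn_idx = np.argsort(d2, axis=1)[:, :k]                                 # (N, k)
    nn_d2 = np.take_along_axis(d2, nn_idx, axis=1)                         # (N, k)
    w = 1.0 / (nn_d2 + eps)
    w = w / w.sum(axis=1, keepdims=True)                                   # normalized weights
    return (sparse_feats[nn_idx] * w[..., None]).sum(axis=1)               # (N, C)

rng = np.random.default_rng(0)
dense_xyz = rng.normal(size=(256, 3))      # toy dense point set
sparse_xyz = dense_xyz[:64]                # toy sparse subset
sparse_feats = rng.normal(size=(64, 32))   # features on the sparse points

dense_feats = interpolate_features(dense_xyz, sparse_xyz, sparse_feats)
```

A dense point that coincides with a sparse point receives essentially that point's feature, because its weight dominates the normalization.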
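The augmentation and learning-rate schedule described above can be sketched as follows. Several details are assumptions that the text does not fix: the up-axis is taken to be the y-axis, the decay is applied continuously rather than in a staircase, and the decay step is counted in the same unit as the `step` argument.

```python
import numpy as np

def rotate_about_up_axis(points, angle):
    """Rotate an (N, 3) point cloud about the y-axis (assumed to be the up-axis)."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    return points @ R.T

def jitter(points, rng, sigma=0.01):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to every coordinate."""
    return points + rng.normal(0.0, sigma, size=points.shape)

def augment_for_classification(points, rng):
    """Random rotation about the up-axis followed by jitter."""
    angle = rng.uniform(0.0, 2.0 * np.pi)
    return jitter(rotate_about_up_axis(points, angle), rng)

def learning_rate(step, base_lr=0.001, decay_rate=0.7, decay_step=200_000):
    """Exponentially decayed learning rate (continuous form, an assumption)."""
    return base_lr * decay_rate ** (step / decay_step)

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))
rotated = rotate_about_up_axis(cloud, 0.5)
augmented = augment_for_classification(cloud, rng)
```

For segmentation, only `jitter` would be applied, in line with the description above.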

Results
In this section, we first present the datasets and evaluation methods in Section 4.1 and then briefly introduce the compared methods in Section 4.2. The experimental results are shown and discussed in Sections 4.3-4.5, with a further discussion in Section 4.6. We also present ablation studies in Section 4.7 to show the effect of our design.

Datasets and Evaluation Methods
We apply our model to two fundamental 3D point cloud analysis tasks: shape classification and segmentation. For shape classification, we conduct experiments on the synthetic ModelNet40 [9] dataset and the scanned real-object ScanObjectNN [8] dataset. We also evaluate our model on the ShapeNet Part [22] dataset for the shape segmentation task. We list below the details and experiment setting for each dataset.
ModelNet40 [9]. It contains 12,311 CAD objects from 40 categories. We use the official split with 9843 shapes utilized for training and 2468 shapes for testing. We present several objects from this dataset in Figure 4a.
ScanObjectNN [8]. There are 2902 scanned real-world 3D objects in this dataset categorized into 15 classes. We use the standard split in our experiment, i.e., 80% and 20% of the data are utilized for training and testing, respectively. We utilize three variants, i.e., ScanObjectNN-Vanilla, ScanObjectNN-Background, and ScanObjectNN-PB_T50_RS, to evaluate our method. The Vanilla and Background variants contain the ground truth object and the object with background points, respectively. The ScanObjectNN-PB_T50_RS variant contains objects with a translation that randomly shifts up to 50% of the object size, as well as rotation and scaling transforms. Sample objects of these variants are shown in Figure 4b-d, respectively. The results on this dataset are taken from its official website.
ShapeNet Part [22]. This dataset contains 14,006/2874 training/test synthetic shapes from 16 categories of objects, with each point annotated with a label from 50 parts in total. We present several objects from this dataset in Figure 4e, where points with different colors represent points with different part labels.
For the synthetic ModelNet40 and ShapeNet Part datasets, the categories are highly imbalanced, which poses a challenge to all methods including ours, and different shapes of the same category (e.g., first two columns in Figure 4a) may have significantly different appearances. Different shapes (e.g., first two columns in Figure 4e) may have divergent numbers of part labels. For the real scanned ScanObjectNN datasets, point clouds are noisy, as shown in Figure 4b-d, and the objects have geometric distortions, such as holes, which are extremely challenging to recognize.
Evaluation Methods. For shape classification, the results are evaluated by Overall Accuracy (OA) and mean per-class accuracy (mACC), i.e., the percentage of correctly classified shapes over all shapes and the mean classification accuracy over all categories. For the shape segmentation task, we report the Intersection-over-Union (IoU) accuracy averaged across all part classes, which measures the overlap between correct predictions and ground truth labels. These measures are widely utilized for 3D shape classification and segmentation tasks [4,5,7,9,14,22].
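The three measures can be sketched in a few lines of NumPy. This is an illustrative sketch, not the official evaluation scripts: the function names and the toy labels are assumptions, and counting a part label that is absent from both prediction and ground truth as IoU 1 is a common convention adopted here as an assumption.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Fraction of correctly classified shapes over all shapes."""
    return float(np.mean(y_true == y_pred))

def mean_class_accuracy(y_true, y_pred):
    """Per-class accuracy averaged over the classes present in y_true."""
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in classes]))

def mean_iou(y_true, y_pred, labels):
    """Intersection-over-Union per label, averaged over `labels`."""
    ious = []
    for c in labels:
        inter = np.sum((y_true == c) & (y_pred == c))
        union = np.sum((y_true == c) | (y_pred == c))
        ious.append(1.0 if union == 0 else inter / union)   # absent label counts as 1 (assumption)
    return float(np.mean(ious))

# Toy labels, used both as shape categories and as point labels for illustration.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0])

oa = overall_accuracy(y_true, y_pred)            # 4 of 6 correct
macc = mean_class_accuracy(y_true, y_pred)       # mean of 2/3, 1 and 0
miou = mean_iou(y_true, y_pred, labels=[0, 1, 2])  # mean of 1/2, 2/3 and 0
```

On imbalanced data OA and mACC can differ noticeably, as in this toy case where the rare class 2 is never predicted correctly.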
The compared methods are similar to ours in that they are also based on the design of point convolution as well as a hierarchical architecture. We compare with them in order to demonstrate the effectiveness and novelty of our method. Among these methods, SpiderCNN [6], PointWeb [20] and PointConv [21] are the three works most related to ours, which first design convolution kernels and then conduct point convolution on a point cloud.

Shape Classification on ModelNet40
We first evaluate PSNet for shape classification on the ModelNet40 [9] dataset. We compare our PSNet with state-of-the-art methods and present the classification accuracies in Table 1. Our PSNet taking 1024 points as input achieves the best OA and mACC results among the compared methods, and it achieves better accuracies even compared with those methods using 5000/6800 points. These comparisons verify that our proposed network is effective for the point cloud classification task.
Note that the baseline of our network is PointNet++ [5], whose basic components are MLP, max-pooling and sampling operations, as in our PSNet. PSNet achieves 93.1% OA, which is 2.4% higher than PointNet++. This comparison demonstrates the effectiveness of our PSConv layer for local point feature extraction.
Furthermore, compared with these methods that design convolution kernels, including SpiderCNN [6], PointWeb [20] and PointConv [21], our method performs better with at least 0.6% higher OA accuracy, and this proves the efficacy of our polynomial-based strategy for the learning of convolution kernels. This higher accuracy can be explained by the effectiveness of the polynomials because they are more flexible and can approximate any smooth functions theoretically. When being utilized in local point convolution, they can capture the geometric information hidden behind the unstructured points.

Shape Classification on ScanObjectNN
We further apply our PSNet to the ScanObjectNN-Vanilla, ScanObjectNN-Background and ScanObjectNN-PB_T50_RS datasets and report classification results in Table 2. Compared with state-of-the-art methods, PSNet achieves the highest accuracies on all three datasets for both OA and mACC, which demonstrates the efficacy of our PSNet for the analysis of scanned real-world objects. Compared with the baseline method PointNet++ [5], our PSNet gains 2.3%, 4.3% and 4.3% higher OA and 2.2%, 3.5% and 3.2% higher mACC, respectively, on these three datasets. These comparisons show that our PSConv layer is effective for learning point features.

[Table 2. Columns: Measure, Dataset, PointNet [4], PointNet++ [5], SpiderCN…]
We also present per-class accuracies on the three variants of the ScanObjectNN dataset in Tables A2-A4 of the Appendix, respectively, where our PSNet also performs the best in many categories and outperforms the compared methods.
Considering that the ScanObjectNN dataset consists of scanned real-world objects with noise and geometric distortions, these results and comparisons all demonstrate the effectiveness of our method in real data analysis tasks and in real-world applications.

Shape Segmentation on ShapeNet Part
We finally apply PSNet to the 3D point cloud segmentation task on the ShapeNet Part [22] dataset to predict point labels. We present the IoU accuracies in Table 3. As shown in the table, our PSNet achieves better performance than most of the methods, and it is also competitive with KPConv [18], which takes about 2300 points as input, whereas ours takes 1024 points. We also present per-class IoU in this table, and PSNet performs the best on categories such as chair, knife and rocket.

Compared with the baseline method PointNet++ [5], our PSNet achieves 1.1% higher mean IoU on this dataset, which demonstrates the effectiveness of our PSConv layer for point feature extraction. Compared with the works that design point convolution kernels, such as SpiderCNN [6] and PointConv [21], our polynomial-based method performs better.

Table 3. Shape segmentation IoU on the ShapeNet Part dataset (in %). Our PSNet achieves competitive accuracies, and it also performs best on the categories of chair, knife, rocket and table.

Method Mean Aero Bag Cap Car Chair Earph. Guitar Knife Lamp Laptop Motor Mug Pistol Rocket Skate Table
In Figure 5, we show the segmentation results of several objects in the ShapeNet Part dataset together with the corresponding ground truth labels. In the last column of every box, we also show the incorrectly predicted labels, highlighted in dark blue. As illustrated in the figure, our predicted labels are reasonable and close to the ground truth, and the incorrectly labeled points lie mainly near the connection of two parts, where the part boundary is indistinct and hard to predict.

Figure 5. Shape segmentation results. In every box, we present the object with ground truth labels and our predicted labels in the first and second columns, respectively. We also highlight the wrongly predicted points with dark blue color, which are mainly located near the connection of two parts in one shape.

Ablation Study
In this subsection, we conduct an ablation study on PSNet to justify the effects of our network design, including the effect of linear transform and power operation, the effect of polynomials in the PSConv layer, and the effect of layer number and stage number. We take the baseline PSNet with one stage consisting of four layers of PSConv, which achieves 92.8% OA accuracy on ModelNet40 [9].
Effect of linear transform and power operation. To verify the effect of the linear transform and power operation in our PSConv layer, we present in Table 4 the results of PSNet without the linear transform and without the power operation in the PSConv layer, i.e., PSNet-noL-trans and PSNet-noPower. Their accuracies are 92.4% and 92.3%, respectively, which are lower than the 92.8% accuracy of our full PSNet model. These comparisons show the necessity of both the linear transform and the power operation in the design of a PSConv layer.

Effect of polynomial. In the design of our PSConv layer, the convolution filters are learned based on polynomials as in Equations (1)-(3). To verify their effect, we present the results of PSNet with the polynomial replaced by other operations, namely Linear transform (L-trans.), ReLU, Sigmoid (Sig.), Tanh, Leaky-ReLU (L-ReLU), Exp and FC (followed by BN and ReLU) [36], each followed by a traditional convolution to learn filters as in Equation (3). We present the results in Table 5. Our method achieves the best performance among the compared variants, showing that our polynomial-based strategy is more effective than one based on traditional non-linear operations.

Table 5. Ablation study on polynomial in PSConv layer. We report shape classification accuracy on ModelNet40 (OA in %). Sig. denotes Sigmoid operation, L-trans. is short for linear transform. Our design with polynomials achieves the best accuracy.

Effect of PSConv layer number and stage number. In our PSNet, we take four sequential PSConv layers in one stage and two stages in PSNet. To show the effect of the PSConv layer number and stage number, we present classification results of PSNet with different layer numbers in one stage in Figure 6a and with different stage numbers in Figure 6b. As demonstrated in the figure, PSNet with four layers of PSConv and two stages achieves the highest accuracies. More layers and more stages do not yield higher accuracies in our PSNet design.

Robustness to noise. To assess the robustness of our PSNet, we train PSNet on the ModelNet40 training dataset and test it on the ModelNet40 test data with various levels of noise. We add zero-mean Gaussian noise with different standard deviations (Std) to each point independently (coordinates lie within a unit ball). The OA accuracies are presented in Figure 7. As shown in the figure, our PSNet remains robust under noise with a Std of 0.01. In this figure, we also present the accuracy curves of PointNet [4], PointNet++ [5] and SpiderCNN [6] for comparison. The performance of all of these methods, including ours, drops as the noise level increases. Compared with the other methods, our PSNet performs better at noise levels of 0.01, 0.05, 0.10 and 0.30.
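The noise protocol described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the helper names are hypothetical, the normalization scheme (centering and scaling by the farthest point) is one common way to place coordinates within a unit ball, and a random cloud stands in for a ModelNet40 test shape.

```python
import numpy as np

def normalize_to_unit_ball(points):
    """Center the cloud and scale it so the farthest point lies on the unit sphere."""
    centered = points - points.mean(axis=0, keepdims=True)
    radius = np.linalg.norm(centered, axis=1).max()
    return centered / radius

def add_gaussian_noise(points, std, rng):
    """Perturb every coordinate independently with zero-mean Gaussian noise."""
    return points + rng.normal(loc=0.0, scale=std, size=points.shape)

rng = np.random.default_rng(0)
# Hypothetical stand-in for one test shape with 1024 points.
cloud = normalize_to_unit_ball(rng.uniform(-1.0, 1.0, size=(1024, 3)))

mean_displacement = {}
for std in (0.01, 0.05, 0.10, 0.30):  # noise levels named in the text
    noisy = add_gaussian_noise(cloud, std, rng)
    # Average distance each point moved; a trained classifier would be
    # evaluated on `noisy` at this step.
    mean_displacement[std] = float(np.linalg.norm(noisy - cloud, axis=1).mean())
```

Because the shape fits in a unit ball, a Std of 0.30 moves points by a large fraction of the object radius, which is consistent with the accuracy drop reported for all methods at high noise levels.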

Discussion
In this subsection, we systematically compare our method with the related point cloud networks in methodology to analyze the distinctive characteristics, novelties and explanation of the effectiveness of our approach.
We first compare our method with the baseline method PointNet++ [5], whose basic components are MLP, max-pooling and sampling operations, as in our network. Our PSNet differs from PointNet++ in its additional PSConv layers. As shown in Sections 4.3-4.5, PSNet achieves significantly higher accuracies for both shape classification and segmentation tasks on various standard datasets. These increased accuracies are mainly attributed to our PSConv layer. This comparison indicates that the PSConv layer is an effective operation for point feature extraction, which benefits shape analysis tasks.
We next compare our separable PSConv with the full formulation of PSConv and with previous separable convolutions. For the full formulation of PSConv, we need to learn the parameter $\Phi \in \mathbb{R}^{K \times L \times M \times D_{in} \times D_{out}}$. For our separable PSConv, in contrast, we only need to learn $\Phi = \{\phi_{kij}\} \in \mathbb{R}^{K \times D_{in} \times D_{out}}$ and $\Phi = \{\phi_{lmi}\} \in \mathbb{R}^{L \times M \times D_{in}}$, with significantly fewer parameters. The computational complexity is $O(KLMD_{in}D_{out})$ for the full PSConv, while it is reduced to $O(KLMD_{in})$ for our separable PSConv. With our separable PSConv layer, 3D shape analysis can be conducted efficiently with a remarkably lower parameter size and computational complexity. We would like to highlight that our separable formulation is different from the previous spatial separable convolutions such as FFT and [29] as well as the depthwise separable convolution [30][31][32]. We do not explicitly split the convolution spatially or split it into depthwise and pointwise ones; instead, we separate the convolution kernels into a flexible and adaptive combination. This separable formulation for constructing the point convolution may offer a new strategy for the design of separable convolutions.
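To make the parameter saving concrete, the short sketch below counts the learnable entries implied by the tensor shapes stated above. The function names and the example sizes (K, L, M, D_in, D_out) are hypothetical values chosen for illustration and are not taken from the paper's network configuration.

```python
def full_psconv_params(K, L, M, d_in, d_out):
    """Entries of the full kernel tensor of shape K x L x M x D_in x D_out."""
    return K * L * M * d_in * d_out

def separable_psconv_params(K, L, M, d_in, d_out):
    """Entries of the two separable factors: K x D_in x D_out plus L x M x D_in."""
    return K * d_in * d_out + L * M * d_in

# Hypothetical layer sizes, for illustration only.
K, L, M, d_in, d_out = 16, 3, 3, 64, 128

full_count = full_psconv_params(K, L, M, d_in, d_out)
separable_count = separable_psconv_params(K, L, M, d_in, d_out)
reduction = full_count / separable_count

print(f"full: {full_count}, separable: {separable_count}, ratio: {reduction:.2f}")
```

For these sizes the first factor dominates the separable count, so the saving is close to a factor of L x M.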
We finally compare our polynomial-based transform with traditional transforms for the convolution kernel learning of our PSConv layer. Compared with traditional transforms [36] such as Linear transform, ReLU, Sigmoid, Tanh, Leaky-ReLU, Exp and FC, polynomials can theoretically approximate any smooth function with an unrestricted range of values. With the polynomial-based transform, our PSConv layer can better explore the local geometric information hidden behind the irregular local points. The advantage of our polynomial-based strategy for convolution kernel learning is also supported by the results in Section 4.6, e.g., PSConv based on polynomials achieves at least 0.26% higher OA classification accuracy than PSConv based on the other transforms.
In summary, our approach is well motivated by the polynomial approximation of convolution kernels, and its separable formulation substantially reduces the network parameter size as well as the computational complexity. Compared with previous convolutions, our approach based on the above innovations achieves advantageous performance for shape classification and competitive accuracy for shape segmentation.
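The approximation property invoked above can be illustrated in one dimension with a small NumPy experiment. This sketch only demonstrates the general behavior of polynomial fitting on a bounded interval; it is not the PSConv kernel formulation, and the target function and helper name are hypothetical choices. The target takes values in [-3, 3], outside the output range of bounded activations such as Sigmoid or Tanh.

```python
import numpy as np
from numpy.polynomial import Polynomial

def max_fit_error(func, degree, num_samples=200):
    """Least-squares fit a polynomial of the given degree on [-1, 1]
    and return the largest absolute error over the sample points."""
    x = np.linspace(-1.0, 1.0, num_samples)
    poly = Polynomial.fit(x, func(x), deg=degree)
    return float(np.max(np.abs(poly(x) - func(x))))

def target(x):
    """A smooth function whose range exceeds (-1, 1)."""
    return 3.0 * np.sin(2.0 * x)

# Approximation error for increasing polynomial degree.
errors = {degree: max_fit_error(target, degree) for degree in (1, 3, 5, 7)}
for degree, error in errors.items():
    print(f"degree {degree}: max abs error {error:.2e}")
```

The error shrinks quickly as the degree grows, which is the behavior the argument above relies on when it describes polynomials as flexible kernel parameterizations.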

Conclusions
With the development of 3D sensors, shape classification and segmentation are two major tasks for the application of 3D point clouds. Designing an effective and efficient point convolution is necessary for feature extraction, which is the target of our work.
In this paper, we first design a novel point convolution, i.e., PSConv, on a 3D point cloud. It is designed based on polynomials of transformed local point coordinates. The polynomial-based kernels with learned parameters are able to approximate ideal convolution kernels with the guidance of loss function by network training. Compared with previous methods, our polynomial-based strategy can better capture the local geometric shape information. To reduce the parameter size and computational cost, we further construct a separable formulation of the PSConv layer. The separable PSConv can be efficiently applied while retaining efficacy, making it capable of building a multi-layer deep convolutional network on 3D point clouds. With PSConv as a basic layer, we design the hierarchical PSNet for point analysis. We evaluate it on standard synthetic and real scanned datasets, and it achieves state-of-the-art results for shape classification. It also has competitive performance for point cloud segmentation tasks.
However, there are limitations to our method that need further exploration. Firstly, with the reduction of parameter size in our separable formulation of PSConv, the representation ability may be reduced, which inspires us to design it with more flexibility and capability in future work. Secondly, although PSNet has achieved the highest accuracy in the real scanned ScanObjectNN dataset, with the increase of noise level, the performance drops. This phenomenon is also observed in the experiment on the ModelNet40 dataset, and this is also the challenge for all the point-based methods. To overcome this difficulty, a more stable and effective strategy for point convolution should be designed. Thirdly, PSConv is designed to operate on a local point cloud, and we design PSNet for 3D object analysis. However, for the applications in indoor and outdoor scenes, a more effective network architecture on large-scale point clouds is essential.
For our future work, it is worthwhile to consider how to address these limitations. To improve the capability of our separable PSConv layer, we plan to introduce a multi-head strategy when separating the parameters, which may better balance the trade-off between computation cost and representation ability. To improve the stability of our method, designing a robust polynomial-based point convolution that explicitly handles outliers is one of our future research directions. To apply the PSConv layer to large-scale point clouds, we would like to incorporate our PSConv into mainstream image convolution network architectures, such as ResNet [37] and DenseNet [38]. Furthermore, we are also interested in applying our PSNet to other applications, such as detection, completion and registration, to explore the potential of our PSConv layer.

We present the details of our network for shape classification and segmentation tasks in Table A1, such as the parameter size and architecture in every stage. Note that feature propagation is only utilized for the shape segmentation task.

We present the per-class classification results on the ScanObjectNN-Vanilla, ScanObjectNN-Background and ScanObjectNN-PB_T50_RS datasets in Tables A2-A4, respectively, where our PSNet also performs the best in many categories. Our PSNet performs best on 7/9/10 sub-categories, more than the second-best method, PointCNN, which attains the highest score on 6/7/2 sub-categories. Furthermore, the mean accuracies of PSNet across all sub-categories are 1.0%/0.1%/3.5% higher than those of PointCNN. Compared with other methods, our method also has better performance.

Table A2. Per-class accuracies (in %) on the ScanObjectNN-Vanilla dataset. PSNet achieves the highest mACC accuracy. It also performs better than all of the compared methods for the categories of box, cabinet, chair, and sink, etc.

Table A3. Per-class accuracies (in %) for ScanObjectNN-Background. PSNet achieves the highest mACC accuracy. It also performs better than all of the compared methods for the categories of box, cabinet, desk, display, shelf, and toilet, etc.

Method mACC Bag Bin Box Cabinet Chair Desk Display Door Shelf Table Bed Pillow Sink Sofa Toilet
Table A4. Per-class accuracies (in %) for ScanObjectNN-PB_T50_RS. PSNet achieves the best mACC accuracy. It also performs better than all of the compared methods for the categories of box, desk, display, door, shelf, table, bed, pillow, and sofa, etc.