Mixed Feature Prediction on Boundary Learning for Point Cloud Semantic Segmentation

Abstract: Existing point cloud semantic segmentation approaches do not perform well on details, especially in boundary regions. However, supervised-learning-based methods depend on costly manual annotations for performance improvement. In this paper, we bridge this gap by designing a self-supervised pretext task applicable to point clouds. Our main innovation lies in the mixed feature prediction strategy during the pretraining stage, which facilitates point cloud feature learning with boundary-aware foundations. Meanwhile, a dynamic feature aggregation module is proposed to regulate the range of the receptive field according to the neighboring pattern of each point. In this way, more spatial details are preserved for discriminative high-level representations. Extensive experiments across several point cloud segmentation datasets, including ShapeNet-part, ScanNet v2, and S3DIS, verify the superiority of the proposed method. Furthermore, transfer learning on point cloud classification and object detection tasks demonstrates the generalization ability of our method.


Introduction
With the remarkable advancements in computer vision, artificial intelligence, and sensor technology, 3D vision has received considerable attention. Semantic segmentation based on point clouds has been widely adopted in remote sensing [1,2], autonomous driving [3,4], smart cities [5], agricultural resource assessment [6], and other fields. However, existing methods do not segment details well, such as point cloud boundaries, which seriously limits practical applications. Recently, the transformer paradigm has provided powerful feature learning capabilities through its architecture. Nevertheless, the performance of supervised-learning-based methods depends on costly, high-precision manual labeling of datasets. Furthermore, the generalization ability of such methods across different applications remains a major challenge.
Recent progress in deep learning has extended self-supervised pretraining to language and image processing tasks [7,8]. By designing an auxiliary pretext task, models are first pretrained to obtain representations better suited to downstream tasks. The mask and predict (MAP) strategy has been phenomenally successful in pretraining tasks [9][10][11][12]. Nevertheless, most of these approaches rely on reconstruction [13][14][15] or generation [16], which is challenging for point cloud data due to its unordered nature. Another critical issue in point cloud analysis is feature aggregation: since not every member of a local region should share the same weight, adopting kernels with rigid sizes for point feature aggregation is inherently limited.
In this paper, we propose a novel framework to address the aforementioned challenges from three aspects. First, we design a pretext task aimed at delivering a boundary-aware model for point cloud semantic segmentation. Using a high-pass filter for point clouds, the pretraining task regresses the target sharp features of the point clouds from mixed boundary features. As shown in Figure 1, we first locate the boundary points and then swap their features with those of their farthest local neighbors. By using mixed boundaries to predict sharp features, we encourage the model to become boundary-aware in the pretraining stage. Second, we develop a dynamic feature aggregation (DFA) module to select discriminative neighbors with consideration of spatial information. This module raises the proportion of detailed information fed into the pooling module; consequently, the latent representation is spatially more precise. Third, a boundary-label consistent loss is introduced to evaluate the predicted boundaries based on the segmentation results, jointly optimized with the cross-entropy loss. We experimentally verify the proposed model on different semantic segmentation datasets, where it produces more accurate results in boundary regions than competing methods. Furthermore, we transfer our network to point cloud classification (on the ModelNet40 and ScanObjectNN datasets) and object detection (on the SUNRGB-D dataset). The results demonstrate the applicability of our method to general 3D processing tasks.
To summarize, the major contributions of this work are as follows:
• We design a mixed feature prediction task for point cloud semantic segmentation to pretrain the model to be boundary-aware.
• A dynamic feature aggregation module is proposed to perform point convolutions with adaptive receptive fields, which introduces more spatial details to the high-level feature representations.
• Experimental results validate the enhancement our method brings to boundary regions in semantic segmentation predictions. In addition, the integrated feature representations learned by our method transfer well to other point cloud tasks such as classification and object detection.

Point Cloud Semantic Segmentation Methods
Point cloud semantic segmentation is an indispensable step in point cloud processing. Its principle is to divide point cloud data into several nonintersecting subsets according to the characteristic properties of the point clouds. Traditional point cloud segmentation methods include edge-based, region-based, and model-fitting-based methods [17][18][19], which only obtain relatively coarse segmentation results. Recently, with the development of data-driven deep learning algorithms, end-to-end methods have been proposed to analyze point clouds. As a pioneering learning-based work, PointNet [20] provided a network architecture suitable for processing raw point clouds directly. To further exploit local features, PointNet++ [21] divided the input at different scales. However, the aggregation of global features remained an intractable question. KPConv [22] employed irregular convolution operators to map pointwise features to predefined kernel points and carried out convolution operations with regular kernel points. Recently, transformer-based methods [23][24][25] with large numbers of parameters were introduced to explore long-range relations among points. In this work, we propose a self-supervised pretraining method on point clouds, providing an essential initialization for semantic segmentation tasks.

Self-Supervised Learning
Self-supervised pretraining has become increasingly popular due to its effectiveness in many research fields, such as natural language processing [7,8], target detection [26], pose estimation [27], and 3D reconstruction [28]. BERT [7] introduced a masked language modeling (MLM) strategy to mask and recover input tokens. MAE [12] encoded incomplete patches with an autoencoder and reconstructed the original image through a lightweight decoder. Li et al. [29] used multiview relevance to generate supervision signals for training a 2D-to-3D pose estimator. Recently, several works [24,30,31] proposed using self-supervised pretraining techniques for 3D point clouds. Sauder et al. [32] presented a pretraining method based on rearranging permuted point clouds. Analogous to jigsaw puzzles in images [33], Eckart et al. [30] designed a pretext task that partitions point clouds to fit a latent Gaussian mixture model. Inspired by the BERT style in NLP, Point-BERT [11] first converted point clouds to discrete tokens via a dVAE and then performed pretraining by predicting the masked point tokens. However, most existing pretraining tasks for point clouds are generative or contrastive, which is unfavorable for downstream tasks such as semantic segmentation that require more geometric features to reflect local information.

Boundary Learning in Segmentation
In 2D image processing, the problems of blurred target boundaries and inaccurate predictions in segmentation tasks can be effectively mitigated by jointly performing boundary detection and semantic segmentation. Li et al. [34] decoupled edge and body areas by extracting the high-frequency and low-frequency components of the image, and performed distinct optimizations on the different groups of pixels. Zhen et al. [35] designed a pyramid context module that shared semantic information for joint task learning, refining details along object boundaries. However, incorporating boundary features into point branches remained a tough problem for point clouds. EC-Net [36] detected point cloud boundaries with a deep architecture and accomplished fine-grained 3D reconstruction with sharp features preserved. Jiang et al. [37] constructed a hierarchical graph to progressively embed edge features with the point features. JSENet [38] proposed a two-stream fully convolutional network to settle the edge detection and segmentation problems at the same time. CBL [31] explored boundary areas with contrastive learning and enhanced performance on different baselines. However, due to their sparse distribution and variable scales, boundary features remained difficult to capture adequately.

Overview
In this section, we present the overall framework of our model on boundary learning for the point cloud semantic segmentation task. Given point clouds denoted as $P \in \mathbb{R}^{N \times 3}$, we first detected boundaries and mixed the features between the boundary points and the interior points. With this boundary-blurred outcome fed into the model, we directly regressed the sharp features of the point clouds. After this self-supervised pretext task, the pretrained model was applied to the downstream task and fine-tuned with task-specific decoders. In particular, to better preserve detailed information in high-level feature propagation, we adopted dilated grouping and hybrid feature encoding strategies. In addition, a global label-consistent loss was introduced to ensure segmentation correctness around boundary regions. Figure 2 illustrates the detailed pipeline of our boundary-aware model.
Figure 2. An overview of the proposed boundary-aware architecture for point cloud semantic segmentation. We first carry out a mixed feature prediction (MFP) task on ShapeNet to learn prior knowledge. Then, we fine-tune the network on downstream tasks based on the shared parameters. The boundary detector serves as an auxiliary supervised signal to maintain the correctness of the segmentation.

Boundary Detector
The boundary regions of point clouds share a distinct spatial distribution characteristic: most neighboring points lie in only some directions. For a query point $P_i$ in the point cloud, we first sought its nearest neighbors $N_i = \{ p_j \mid j = 1, 2, \cdots, k \}$ through a k-dimensional (k-d) tree. Considering the density of each local region, we then computed the coordinates of the shape center with the following equation:

$$c(N_i) = \frac{\sum_{j=1}^{k} S(x_j, y_j, z_j)\, p_j}{\sum_{j=1}^{k} S(x_j, y_j, z_j)} \tag{1}$$

where $k$ is the number of local neighbors, $S(x, y, z)$ is an inverse density function provided by kernel density estimation, and $p_j$ denotes the spatial coordinates of the local neighbors. By calculating the $L_2$ distance between the shape center and the centroid point, we could identify the marginal regions. Considering the density variation of local regions, the resolution $r(N_i)$ was defined as the minimum distance between a centroid point and its neighboring points:

$$r(N_i) = \min_{p_j \in N_i} \| P_i - p_j \|_2 \tag{2}$$

We set a threshold value $\lambda$ to extract the boundary points in marginal regions, as defined in Equation (3):

$$\| c(N_i) - P_i \|_2 > \lambda \cdot r(N_i) \tag{3}$$
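The detector can be summarized in a few lines of code. The following is a minimal NumPy/SciPy sketch; the Gaussian kernel density estimate standing in for S(x, y, z) and the function names are our illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def detect_boundary_points(points, k=20, lam=8.0):
    """Return a boolean mask marking boundary points of an (N, 3) array."""
    tree = cKDTree(points)
    dists, idx = tree.query(points, k=k + 1)   # k neighbors plus the point itself
    dists, idx = dists[:, 1:], idx[:, 1:]      # drop the self-match

    # Inverse-density weights S(.) via a simple Gaussian kernel density
    # estimate (an assumption; any KDE-based inverse weighting fits the text).
    bandwidth = dists.mean()
    density = np.exp(-dists ** 2 / (2 * bandwidth ** 2)).sum(axis=1)
    weights = 1.0 / density[idx]               # (N, k) inverse densities
    weights /= weights.sum(axis=1, keepdims=True)

    # Density-weighted shape center of each neighborhood, as in Equation (1).
    centers = (weights[..., None] * points[idx]).sum(axis=1)

    # Resolution r(N_i): minimum distance from a point to its neighbors.
    resolution = dists.min(axis=1)

    # Equation (3): boundary points have a shape center displaced by more
    # than lambda times the local resolution.
    return np.linalg.norm(centers - points, axis=1) > lam * resolution
```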

Mixed Feature Prediction Task
Motivated by MAE [12] and Point-BERT [11], we sought to extend self-supervised pretraining to point cloud segmentation. Nevertheless, it is challenging to apply the mask-and-model strategy to unorganized point cloud data. Unlike the regular arrangement of two-dimensional images, a decoder for point clouds needs to recover the coordinate positions of the masked points. Moreover, there is no well-defined local partitioning with which to build a shape vocabulary for point clouds. Point-BERT [11] adds an external tokenizer for the pretraining task, but the tokenizer requires extra training in the workflow. Another common way to obtain local embeddings for point clouds is to use basic sampling and grouping operations. Nevertheless, sampling introduces randomness for the same input, which increases the difficulty of reconstructing the masked point clouds for the decoder.
To solve the above problems, we adopted a mix-and-predict strategy, as shown in Figure 3, to generate a pretext task for pretraining. Specifically, we first utilized a boundary detector to extract and mask boundary points. Then, we found the K surrounding neighbors of each point, forming a continuous local region. Inspired by the PointCutMix [39] technique, we randomly selected sampling points and swapped their features with those of their farthest local neighbors. Finally, this boundary-mixed coarse input was fed into the encoder module. As for the decoder, we selected the sharp features of the point clouds as the prediction target. In this way, the encoder was encouraged to capture sharp features from the coarse input formed by inner points. Moreover, for the same set of point cloud targets, the sharp edge features were explicit regardless of the variability of the sampling results. The spatial coordinate information was also preserved, and no additional learning was required for the decoder.
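A minimal sketch of the mix step is given below, reusing the detector output from the previous subsection. The 90% mixing ratio follows the ablation reported later; the function names and the exact sampling procedure are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def mix_boundary_features(points, features, boundary_mask, k=20, mix_ratio=0.9):
    """Swap the features of sampled boundary points with those of their
    farthest neighbors inside the local K-neighborhood."""
    mixed = features.copy()
    boundary_idx = np.flatnonzero(boundary_mask)
    n_mix = int(len(boundary_idx) * mix_ratio)
    chosen = np.random.choice(boundary_idx, n_mix, replace=False)

    tree = cKDTree(points)
    _, idx = tree.query(points[chosen], k=k + 1)
    farthest = idx[:, -1]        # farthest point within each local region

    # Swapping blurs the boundary; the encoder must recover sharp features.
    mixed[chosen], mixed[farthest] = features[farthest].copy(), features[chosen].copy()
    return mixed
```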

Figure 3. Illustration of the mixed feature prediction task. We compare our mixed feature prediction task with the masked autoencoding workflow in the 2D image domain. The upper row shows the process of training an autoencoder by reconstructing the masked patches from the unmasked parts of the input image. The lower row describes the main idea of an analogous method applicable to point clouds: an encoder is trained to learn sharp features of point clouds from mixed boundary feature embeddings.
The goal of our mixed feature prediction task was to boost the ability of our model to recognize boundaries. Through a high-pass filter, we obtained the sharp features of the point clouds. This geometric information directly served as a continuous supervision signal to pretrain our model. After feature aggregation through an MLP, we computed the $L_2$ distance between the predicted sharp features and the original sharp features as in Equation (4). The pretext task performed self-supervised learning by locating boundaries, which helped improve the training of the model with limited data.

$$L_{MFP} = \frac{1}{N} \sum_{i=1}^{N} \left\| F_{s\_pred}^{(i)} - F_{s\_tgt}^{(i)} \right\|_2 \tag{4}$$

where $N$ represents the number of points, and $F_{s\_pred}$ and $F_{s\_tgt}$ represent the predicted sharp features and the original target sharp features, respectively.
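Equation (4) amounts to one line of PyTorch; the sketch below assumes (N, C) feature tensors.

```python
import torch

def mfp_loss(f_pred, f_tgt):
    """Mean L2 distance between predicted and target sharp features (Eq. (4))."""
    return torch.norm(f_pred - f_tgt, p=2, dim=-1).mean()
```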

High-Pass Filter
The sharp features describe the contour areas and corner sets of point clouds, reflecting local structural information that carries semantic meaning. In this paper, we regarded the point cloud as an undirected graph signal $G = (V, A)$, where $V$ represents the set of nodes and $A$ represents the relationships between nodes. The relation between two nodes was defined as:

$$a_{i,j} = \sigma\left( T\left( \| x_i - x_j \|_2 \right) \right) \tag{5}$$

where $x_i$ and $x_j$ are the feature vectors of nodes $i$ and $j$, $\| \cdot \|_2$ represents the $L_2$ norm distance, $T$ is a nonlinear function which we implemented as a shared multilayer perceptron (MLP), and $\sigma$ is the activation function, for which we adopted a LeakyReLU in the practical experiments. Hence, $A = [a_{i,j}] \in \mathbb{R}^{N \times N}$ is the adjacency matrix of graph $G$. In this way, we could resort to spectral graph theory to analyze disordered point clouds. The Laplace matrix of the graph $G$ was obtained as follows:

$$L = D - A \tag{6}$$

where $D$ is a diagonal matrix consisting of the degrees of each node in $G$. To eliminate the influence of scale variation, we normalized the Laplace matrix as follows:

$$\tilde{L} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}} = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \tag{7}$$

After an eigendecomposition as in Equation (8), the eigenvalues represented the different frequency components of the graph signal:

$$\tilde{L} = U \Lambda U^{\top} \tag{8}$$
where $\Lambda$ represents the frequency response of the point clouds. In general, the eigenvalues are sorted in descending order as $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$. By designing an attention mechanism, we could update the node aggregations by focusing on high-frequency information to generate the sharp features. In practice, we used the Haar-like filter [40] $h(\Lambda)$, a special case of the Chebyshev polynomial approximation of a GCN. As shown in Equation (9), the frequency response of $h(\Lambda)$ is $1 - \lambda_i$; since $1 - \lambda_i \leq 1 - \lambda_{i+1}$, a signal passing through this filter is amplified in the high-frequency regions and suppressed in the low-frequency regions, achieving the effect of a high-pass filter:

$$h(\Lambda) = I - \Lambda = \mathrm{diag}\left( 1 - \lambda_1, 1 - \lambda_2, \cdots, 1 - \lambda_n \right) \tag{9}$$

As shown in Figure 4, the target sharp features of the input point clouds $P$ for the pretraining task were obtained as $h(\Lambda)P$. In effect, the sharp features reflect the variation of each node relative to its neighbors.
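A dense NumPy sketch of the filter is shown below. Following the text, the adjacency comes from pairwise distances; for illustration we replace the learned MLP T with the identity and use a Gaussian kernel in place of the learned activation, so those parts are assumptions.

```python
import numpy as np

def sharp_features(points, sigma=0.1):
    """Return the high-pass-filtered coordinates (I - A) P of an (N, 3) cloud."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    A = np.exp(-dist ** 2 / (2 * sigma ** 2))      # affinity from distances
    np.fill_diagonal(A, 0.0)

    # Symmetric normalization keeps the spectrum bounded, as in Equation (7).
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1) + 1e-8)
    A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Haar-like filter of Equation (9): response 1 - lambda_i amplifies the
    # high-frequency (boundary and corner) components of the signal.
    return points - A_norm @ points
```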

Dynamic Feature Aggregation
For most point cloud processing baselines, high-level features contain limited spatial information due to the inevitable pooling operations. To preserve more spatial details, a straightforward solution is to collect more spatial information before pooling. In this work, we first clustered the point clouds using boundary-based location information for separate encoding. Then, we adopted three grouping strategies with different dilated ratios to perform feature aggregation. Finally, the boundary regions, which contain the most detailed information, fused hybrid grouping results across the largest set of dilated ratios.

Spatial-Based Clustering
Points located at different positions in the point cloud have different importance; therefore, it is inefficient to perform the same computation for every point. To adequately exploit the feature relations between local regions, we divided the input feature map into three branches based on the boundary detector in Section 3.2.1, denoted as boundary regions, cross regions, and interior regions. After extracting the boundary points, we counted the number $b_i$ of boundary points surrounding each point $P_i$. The cluster criterion is given as follows:

$$\mathrm{cluster}(P_i) = \begin{cases} \text{boundary}, & b_i \geq \mu \\ \text{cross}, & 0 < b_i < \mu \\ \text{interior}, & b_i = 0 \end{cases} \tag{10}$$

where $\mu$ is a threshold parameter. The threshold parameter $\mu$ is determined by calculating the 3D chamfer distance between the points from the boundary cluster and the points extracted by the boundary detector as follows:

$$d_{CD}(P, Q) = \frac{1}{N_p} \sum_{p \in P} \min_{q \in Q} \| p - q \|_2^2 + \frac{1}{N_q} \sum_{q \in Q} \min_{p \in P} \| q - p \|_2^2 \tag{11}$$

where $P$ and $Q$ represent the points from the boundary cluster and the points extracted by the boundary detector, respectively, and $N_p$ and $N_q$ represent the numbers of points in $P$ and $Q$.
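The chamfer criterion of Equation (11) used to tune µ can be written directly; the brute-force NumPy sketch below is quadratic in the number of points (a k-d tree would be preferable at scale).

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric 3D chamfer distance between (Np, 3) and (Nq, 3) point sets."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1)   # (Np, Nq)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```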

Hybrid Feature Encoder
Searching the K nearest neighbors is a commonly used grouping method for each sampling point. However, a rigid selection criterion cannot suitably reflect the geometric and structural properties of point clouds and is redundant or even inefficient, especially in flat regions. To solve this problem, we put forward a dilated neighbor selection strategy for point grouping. As shown in Figure 5, a wider range of points was considered by applying our n predefined dilated searching rates. In this process, we obtained K nearest neighbors for each dilated ratio. As the dilated ratio increased, the neighborhood range of each point increased accordingly.
To ensure the permutation invariance of the point cloud processing network, pooling modules are necessary in point feature encoders. Unfortunately, such operations result in the loss of spatial information, which is harmful to pointwise prediction for segmentation. To alleviate this issue, we enriched the components of the boundary region features before pooling. The overall structure of our proposed DFA module is shown in Figure 6. The module consists of three branches corresponding to the boundary, cross, and interior regions, respectively. According to the distinct feature distributions of the different regions, we adopted three kinds of dilated ratios for reasonable neighbor grouping. In particular, we stacked all of the neighbor points with dilated ratios of 1, 2, and 4 for the boundary regions, neighbor points with dilated ratios of 1 and 2 for the cross regions, and a dilated ratio of 4 for the interior regions. Dilated grouping not only integrates multiscale information but also augments the diversity of spatial detail information. After employing an MLP to build feature maps F, we fed the outcomes to the max-pooling block and obtained a comprehensive latent representation.
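The grouping rule can be sketched as below. Implementing dilation by querying K·r neighbors and keeping every r-th one is a common strategy and our assumption here; the paper does not spell out this detail.

```python
import numpy as np
from scipy.spatial import cKDTree

def dilated_knn(points, queries, k=20, ratio=1):
    """Return (M, k) neighbor indices with dilation `ratio`: query k * ratio
    neighbors, then stride through them."""
    tree = cKDTree(points)
    _, idx = tree.query(queries, k=k * ratio)
    return idx[:, ::ratio]

def group_for_region(points, queries, region):
    """Hybrid grouping per region type: boundary stacks ratios {1, 2, 4},
    cross {1, 2}, interior {4}, as in the DFA module."""
    ratios = {"boundary": (1, 2, 4), "cross": (1, 2), "interior": (4,)}[region]
    return np.concatenate([dilated_knn(points, queries, k=20, ratio=r)
                           for r in ratios], axis=1)
```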

Figure 6. The architecture of the proposed dynamic feature aggregation (DFA) module. Note that KNN denotes a dilated ratio of 1, 2KNN denotes a dilated ratio of 2, and 4KNN denotes a dilated ratio of 4. $N_1$, $N_2$, and $N_3$ denote the numbers of boundary points, cross points, and interior points, respectively, with $N_1 + N_2 + N_3 = N$.

Loss Function
To better constrain the boundary regions in the segmentation results, we introduced a label-consistent loss on the final prediction. Given the predicted category of each point, we obtained the boundary prediction by analyzing the categories of its neighboring points. Specifically, for each point, we first counted the number of neighbors of each category within a certain range and then calculated the proportion of neighbors whose category differed from that of the point. If this proportion was greater than a preset ratio, the point was identified as a boundary point; otherwise, it was not. As the boundary detector results could be regarded as supervision information, we computed the chamfer distance between the label boundaries $b_i$ and the predicted boundaries $\hat{b}_j$:

$$L_{BLC} = \frac{1}{N_b} \sum_{i} \min_{j} \| b_i - \hat{b}_j \|_2^2 + \frac{1}{N_{\hat{b}}} \sum_{j} \min_{i} \| \hat{b}_j - b_i \|_2^2 \tag{12}$$

Finally, we refined the boundary regions based on this global correction, and the final loss was

$$L = L_{cross\text{-}entropy} + \theta L_{BLC} \tag{13}$$

where $\theta$ is the loss weight.
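The label-based boundary test described above can be sketched as follows; the neighborhood size and the preset ratio are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def boundaries_from_labels(points, labels, k=16, ratio=0.3):
    """Mark a point as a predicted boundary if more than `ratio` of its k
    neighbors carry a different semantic label."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)
    neighbor_labels = labels[idx[:, 1:]]                    # (N, k)
    frac_diff = (neighbor_labels != labels[:, None]).mean(axis=1)
    return frac_diff > ratio

# L_BLC is then the chamfer distance (Equation (12)) between these predicted
# boundary points and the boundary detector's output, weighted by theta in
# the total loss of Equation (13).
```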

Experiments
In this section, we introduce the implementation details of our approach. We first provide the setup of the pretraining scheme and then evaluate the performance of our model on downstream tasks. The evaluation metrics and the corresponding comparisons with other state-of-the-art works are described in detail. Our network was implemented in PyTorch, with two parallel Nvidia RTX 2080 Ti GPUs employed for training.

Pretraining Setups
ShapeNet [41] was selected for pretraining our network; it contains 57,448 models across 55 categories. We split the dataset into a training set and a validation set, and each model was sampled to 1024 points. In the pretraining phase, we employed the AdamW [42] optimizer with an initial learning rate of 0.001 and a weight decay of 0.3. The model was trained for 600 epochs with a batch size of 64. For the boundary point detector, the optimal result for Equation (3) was achieved at λ = 8 on ShapeNet, and the clustering threshold parameter µ in Equation (10) was set to 12 in our experiments.
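For concreteness, the stated recipe maps onto the following PyTorch setup; the placeholder encoder stands in for the actual network and is an assumption.

```python
import torch

encoder = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU())  # placeholder
optimizer = torch.optim.AdamW(encoder.parameters(), lr=0.001, weight_decay=0.3)
# Pretraining (loop not shown): 600 epochs, batch size 64, 1024 points per model.
```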

Evaluation Metrics
We selected the overall accuracy (OA), mean accuracy (mAcc), and mean IoU (mIoU) as the evaluation metrics for the downstream tasks on point clouds. OA is the proportion of correctly predicted points out of the total number of points. mAcc is the mean of the per-category prediction accuracies. mIoU is the mean of the per-category IoUs, where IoU is the ratio of the intersection to the union of the ground truth and prediction sets. The calculation formulas for the evaluation metrics are as follows:

$$OA = \frac{\sum_{i=1}^{N} P_{ii}}{\sum_{i=1}^{N} \sum_{j=1}^{N} P_{ij}} \tag{14}$$

$$mAcc = \frac{1}{N} \sum_{i=1}^{N} \frac{P_{ii}}{\sum_{j=1}^{N} P_{ij}} \tag{15}$$

$$mIoU = \frac{1}{N} \sum_{i=1}^{N} \frac{P_{ii}}{\sum_{j=1}^{N} P_{ij} + \sum_{j=1}^{N} P_{ji} - P_{ii}} \tag{16}$$

where $N$ is the total number of categories, $P_{ij}$ represents the number of points that belong to category $i$ but are classified as category $j$, $P_{ii}$ represents the number of points correctly predicted in category $i$, and $P_{ji}$ represents the number of points that belong to category $j$ but are classified as category $i$.
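All three metrics derive from an N × N confusion matrix, as in the short sketch below (which assumes every category appears at least once).

```python
import numpy as np

def segmentation_metrics(P):
    """P[i, j] counts points of category i predicted as category j."""
    P = P.astype(float)
    tp = np.diag(P)
    oa = tp.sum() / P.sum()                                    # Equation (14)
    macc = np.mean(tp / P.sum(axis=1))                         # Equation (15)
    miou = np.mean(tp / (P.sum(axis=1) + P.sum(axis=0) - tp))  # Equation (16)
    return oa, macc, miou
```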

Part Segmentation on ShapeNet-Part Dataset
The ShapeNet-part dataset [43] comprises 16,881 3D models covering 16 categories. Each category is divided into 2 to 6 parts, resulting in 50 parts in total. The dataset is split into training, validation, and testing sets containing 12,137, 1870, and 2874 models, respectively. It is practically challenging due to the imbalanced distribution of model categories and of parts within each category. The results of the proposed method and some current state-of-the-art methods on the ShapeNet-part dataset are presented in Table 1.

Semantic Segmentation on ScanNet v2 Dataset
ScanNet v2 [48] comprises 1613 3D indoor scans reconstructed from 2.5 million RGB-D views, with 21 semantic classes. The dataset contains 3D coordinate information and label information in a mesh format. A public train/validation/test split of 1201/312/100 is provided. Following previous work [49], we randomly sampled 8192 points from each room for training and tested over the entire scene. In the training phase, we employed the AdamW [42] optimizer with an initial learning rate of 0.001 and a weight decay of 0.3. The model was trained for 300 epochs with a batch size of 32.
As shown in Table 2, we compared the classwise mean intersection over union (mIoU) of our method with some outstanding networks. Our network ranked third after Mix3D [42] and O-CNN [50]. Mix3D uses an out-of-context data augmentation technique and reaches state-of-the-art scores on top of a voxel-based method [51], at the cost of high computational complexity. O-CNN [50] is also a fully voxel-based solution that leverages an octree structure. Notably, our method outperformed the other point-based methods by a large margin. JSENet [38] and CBL [31] are boundary-aware methods for point clouds; our method achieved performance improvements of 5.9% and 5.3% over them, respectively.
Table 2. Quantitative comparison with state-of-the-art methods on the ScanNet v2 dataset. Bold denotes the best performance.

Semantic Segmentation on S3DIS Dataset
S3DIS [53] refers to the Large-Scale 3D Indoor Spaces dataset captured by a Matterport scanner; it contains 271 rooms from six different areas. Each point in a scene is labeled with one of 13 semantic categories (chair, table, wall, etc.). We divided the rooms into 1 m × 1 m blocks and randomly sampled 4096 points for each block. Following the experimental setup of [54], we selected Area 5 for testing and the other areas for training by default. In the training phase, we employed the AdamW [42] optimizer with an initial learning rate of 0.0001 and a weight decay of 0.3. The model was trained for 300 epochs with a batch size of 32.
Officially, the scenes from Area 5 of S3DIS were used for testing. The qualitative improvements of our method in boundary regions are highlighted by red dotted circles in Figure 7. Notice that our method performed well on categories with explicit boundaries such as ceiling, floor, wall, and window. Moreover, our method not only segmented boundaries precisely but also improved overall performance by a large margin over the baseline. As shown in Table 3, the overall accuracy (OA), mean accuracy, and mean intersection over union (mIoU) were used to compare the performance of our method with some recent remarkable networks. Overall, our method presented compelling segmentation results with the help of our boundary-aware network. To assess the generality of the experimental results, we also conducted a sixfold cross-validation (Table 4) by changing the testing area in turn. Our method outperformed previous methods with leading results on overall accuracy and mean intersection over union (mIoU).

Object Classification
Point cloud classification is an important problem in 3D scene understanding. We evaluated the transfer learning performance of our network from the ShapeNet dataset to the ModelNet40 dataset. ModelNet40 [59] is built from 12,311 objects covering 40 categories, with 9843 objects for training and 2468 objects for testing. In the training phase, we employed the AdamW [42] optimizer with an initial learning rate of 0.001 and a weight decay of 0.5. The model was trained for 200 epochs with a batch size of 64.
To verify the effectiveness of our method in feature representation for point clouds, we transferred our network to the point cloud classification task. In particular, after training our network on the ShapeNet dataset, we froze the parameters of the encoder and trained a linear SVM classifier on ModelNet40 (a sketch of this linear-probe protocol is given at the end of this subsection). The overall accuracy (OA) results are shown in Table 5. Since the encoder and the SVM were trained on different datasets, this experiment demonstrates the generalization ability of our network. Moreover, we also compared against other supervised methods, as shown in Table 6. Initialized with the proposed pretraining method, our method achieved outstanding accuracy.

Table 5. Linear SVM classification results on ModelNet40 (overall accuracy, %).

Method | OA (%)
[61] | 83.3
Latent GAN [62] | 85.7
MRTNet-VAE [63] | 86.4
FoldingNet [14] | 88.4
PointCapsNet [13] | 88.9
GraphTER [15] | 89.1
Ours | 89.3

We also extended our method to a real-world dataset, ScanObjectNN [67]. This dataset provides a more challenging setup than ModelNet40, considering background and occlusions in realistic scenarios. It contains 2902 objects across 15 categories. The results of our proposed method and some existing excellent methods on the ScanObjectNN dataset are presented in Table 7.
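The linear-probe protocol referenced above can be sketched as follows; `encoder` and the data loaders are placeholders, and the global feature shape is an assumption.

```python
import numpy as np
import torch
from sklearn.svm import LinearSVC

@torch.no_grad()
def extract_features(encoder, loader, device="cuda"):
    """Run the frozen encoder over a loader of (B, N, 3) point batches."""
    encoder.eval()
    feats, labels = [], []
    for points, y in loader:
        f = encoder(points.to(device))        # assumed (B, C) global features
        feats.append(f.cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# Usage (datasets and loaders omitted):
# train_f, train_y = extract_features(encoder, train_loader)
# test_f, test_y = extract_features(encoder, test_loader)
# svm = LinearSVC().fit(train_f, train_y)
# print("OA:", svm.score(test_f, test_y))
```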

Few-Shot Classification
To evaluate the generalization performance of our model, we conducted few-shot classification experiments on the ModelNet40 dataset. We followed the standard "K-way N-shot" configuration for data generation. In the training phase, we randomly selected K categories from the training set with N samples for each category; these K × N samples constituted a support set for the model. Then, a batch of samples was selected from the remaining data of these categories as the query set to evaluate the model. We ran 10 different rounds with the same settings and report the mean result. In Table 8, we compare the result of our model with some other state-of-the-art approaches under four different data conditions. Ours-rand denotes the proposed network trained from scratch. Our method outperformed the Point Transformer by 0.8%/1.2%/0.6%/0.9% across the four settings, and pretraining itself contributed gains of 0.6%/0.8%/0.4%/0.7% over training from scratch, demonstrating the effectiveness of our self-supervised pretraining method.
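Episode generation under the "K-way N-shot" setting can be sketched as below; the query-set size per class is an assumption.

```python
import random
from collections import defaultdict

def sample_episode(dataset, k_way=5, n_shot=10, n_query=20):
    """dataset: iterable of (sample, label). Returns support and query sets."""
    by_class = defaultdict(list)
    for sample, label in dataset:
        by_class[label].append(sample)
    classes = random.sample(list(by_class), k_way)
    support, query = [], []
    for c in classes:
        items = random.sample(by_class[c], n_shot + n_query)
        support += [(x, c) for x in items[:n_shot]]   # K x N support samples
        query += [(x, c) for x in items[n_shot:]]     # held-out query samples
    return support, query
```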

Object Detection
Object detection is a classical task in the field of computer vision. Different from image recognition, 3D object detection must provide not only the identification and classification of the objects present in the point clouds but also their locations via minimal 3D bounding boxes. A 3D bounding box is the smallest cuboid that encloses a target object in the real 3D world; theoretically, it has nine degrees of freedom: three for position, three for rotation, and three for size. The SUNRGB-D [69] dataset is a popular, densely annotated dataset for object detection, and average precision (AP) is commonly used to evaluate models on it. We combined our pretrained model with the state-of-the-art detection framework VoteNet. The AP 25 and AP 50 results are listed in Table 9; a 5.8% improvement on AP 25 indicates that our pretraining method provided benefits over training from scratch.

Boundary Extraction Strategies
As the proposed pretraining task requires accurate boundary points as a prerequisite, we compared the performance of the boundary extractor in our model with two other edge detection methods [76,77], referred to as the discontinuity-based (D+SC) method and the eigenvalue analysis (EA) method, respectively. The ground truth boundary points were generated with MeshLab. A quantitative analysis was performed using precision and recall. As shown in Figure 8, as the recall rate increased, the precision of the D+SC [76] and EA [77] methods tended to decrease. In contrast, our method maintained consistently higher precision and recall, providing more precise and complete boundary proposals.

Effectiveness of the MFP Pretraining
In Table 10, we compare our pretraining method for network initialization with some recent methods, including OcCo and Point-BERT, as well as training from scratch. We conducted experiments on ScanNet v2 and S3DIS with the same fine-tuning strategy for each dataset. The self-supervised pretraining significantly raised the supervised-from-scratch baseline thanks to the geometric priors learned from the boundary regions. In addition, our mixed feature prediction (MFP) pretraining method avoids the time and space overhead of training a dVAE-based tokenizer for point clouds. We adopted random masking and compared the influence of different mixing ratios for boundary sampling points in Table 11. In contrast to the masking ratios common in NLP and computer vision (15% to 40%), we surprisingly found that a higher mixing ratio resulted in better performance on point cloud tasks. The reason lies in the discontinuous nature of unorganized point clouds. The optimal mixing ratio was eventually confirmed as 90% in our experiments.

Effectiveness of the DFA Module
The dynamic feature aggregation module adopts an adaptive dilated KNN search for points at different positions. By combining different dilated ratios, detailed information is gathered across flexible receptive fields. Table 12 shows different arrangements for the feature extraction of the boundary regions on ScanNet v2 and S3DIS. Notably, features from both small and large receptive fields were critical for boundary learning.
Our network was trained on 2048 points with the number of neighbors K set to 20. We conducted experiments with values of the threshold parameter µ varying from 0 to 20; the results are presented in Figure 9. As µ increased, the number of points assigned to boundary regions gradually decreased. To obtain an evenly distributed clustering result, we eventually selected µ = 12 on ShapeNet for pretraining.

Effectiveness of the BLC Loss
The boundary label-consistent (BLC) loss serves as an implicit auxiliary supervision signal for point features. It focuses on interclass regions and differentiates the point features of each region via boundary information. After experimental testing, the weight θ of the BLC loss was set to 0.2 in our implementation. We conducted an ablation study, whose results are shown in Table 13. Equipped with the BLC loss, gains of 0.6% and 0.5% were achieved on the ScanNet v2 and S3DIS datasets, respectively.

Complexity Analysis
Finally, we compared the model complexity of our method with that of other point cloud segmentation methods. For fairness, all experiments were performed with two parallel RTX 2080 Ti GPUs on the ModelNet40 dataset. As shown in Table 14, PointNet++ [21] and DGCNN [44] are time-consuming, a major reason being that both adopt the computationally expensive farthest point sampling (FPS) operation. KPConv [22] is difficult to extend to large scenes due to its high computational and memory cost. The Point Transformer [23] suffers from a large number of parameters on account of the heavy computation required by its self-attention layers. Our method requires fewer parameters while maintaining competitive results. Thanks to the proposed dynamic feature aggregation (DFA) method, we employ different point convolution strategies according to the spatial properties of the point clouds, allowing the proposed model to extract long-range geometric dependencies in less time. Furthermore, we used a simple random sampling operation together with efficient shared MLPs to aggregate features, which was especially beneficial for reducing inference time. We followed the standard point-based processing design to build our network architecture; the detailed configuration is shown in Table 15.

Conclusions
Semantic segmentation technology plays a key role in extracting valuable content from large quantities of 3D data for better scene understanding. In this paper, we presented a simple and efficient framework focusing on boundary regions for the point cloud semantic segmentation task. Through a mixed feature prediction task, the model acquires accurate boundary perception during pretraining. To capture local information more efficiently, we further proposed a dynamic feature aggregation (DFA) module to search for the best-fitting neighbors under different receptive fields. In addition, a novel boundary label-consistent loss was integrated into our network to ensure the boundary smoothness of the segmentation results. Overall, our proposed self-supervised learning method achieved results comparable with fully supervised learning in semantic segmentation tasks, avoiding the high cost of manual labeling for point clouds. On this basis, we can take advantage of large-scale raw data for training, greatly improving the applicability and robustness of neural network models in domains such as autonomous driving, augmented reality, robotics, and medical treatment. In the future, we will explore methods for integrating information from multisource data: the texture features contained in hyperspectral remote sensing images are conducive to understanding large-scale real-world point cloud scenarios.