Powerful Sample Reduction Techniques for Constructing Effective Point Cloud Object Classification Models
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- This paper presents a promising approach to improving 3D point cloud model training. The paper has clearly identified the challenge of large data volumes in 3D point clouds and the need for efficient downsampling.
- Using ModelNet40 provides a standard benchmark, and building upon PointNext, a known improved architecture, is a good starting point.
- Specific Down Sampling Method: The well-defined contribution focuses on an improved APES method with density consideration and adaptive K-value. The focus on edge point retention is also a good angle.
- The 15% reduction in training time and the improvement in accuracy are concrete metrics that demonstrate the effectiveness of your method.
- How do you determine the optimal K-value for the neighbor search? Is it a fixed value, or is it dynamically adjusted? A discussion of the impact of different K-values on performance and a justification for your chosen method would be valuable. Perhaps explore a range of K-values and show the impact on performance.
- How is the density of each point calculated? Is there a specific radius used? Similar to the K-value, how sensitive are the results to the density calculation method?
- How does your method compare to other state-of-the-art down-sampling techniques for point clouds? It would be beneficial to have a brief discussion of related work and a comparison table.
- What is the computational overhead of calculating the density and adapting the K-value? Is it negligible compared to the overall training time reduction?
- Detailed Comparison with Original APES: While you mention capturing edge points more effectively, providing a more detailed comparison (qualitative and quantitative) between your improved APES and the original APES would strengthen your claims. Show examples of how your method preserves edge points better. Metrics like edge point recall or precision could be helpful.
- In this aspect, I would like to suggest that the authors refer to the following paper in their revised version.
- Kyaw, P.P., Tin, P., Aikawa, M., Kobayashi, I., Zin, T.T. (2025). Cow’s Back Surface Segmentation of Point-Cloud Image Using PointNet++ for Individual Identification. In: Pan, JS., Zin, T.T., Sung, TW., Lin, J.CW. (eds) Genetic and Evolutionary Computing. ICGEC 2024. Lecture Notes in Electrical Engineering, vol 1321. Springer, Singapore. https://doi.org/10.1007/978-981-96-1531-5_20
Author Response
Dear Editors and Reviewers,
The authors sincerely thank the editor and reviewers for their valuable comments and suggestions, which have greatly improved the quality of this manuscript. We have carefully revised the manuscript based on the feedback received. Below, we provide detailed responses to each comment. To assist the reviewers in their evaluation, we have highlighted significant changes (yellow background) made in the revised version of the manuscript.
- This paper presents a promising approach to improving 3D point cloud model training. The paper has clearly identified the challenge of large data volumes in 3D point clouds and the need for efficient downsampling.
- Using ModelNet40 provides a standard benchmark, and building upon PointNext, a known improved architecture, is a good starting point.
- Specific Down Sampling Method: The well-defined contribution focuses on an improved APES method with density consideration and adaptive K-value. The focus on edge point retention is also a good angle.
- The 15% reduction in training time and the improvement in accuracy are concrete metrics that demonstrate the effectiveness of your method.
Reply to Comments 1-4: Thank you for your thoughtful and positive comments.
- How do you determine the optimal K-value for the neighbor search? Is it a fixed value, or is it dynamically adjusted? A discussion of the impact of different K-values on performance and a justification for your chosen method would be valuable. Perhaps explore a range of K-values and show the impact on performance.
Reply: Thank you for your suggestion. Please see Section 3.3. DAKNN Formula (p. 8) and Section 4.6. K Value Experiment (p. 15).
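As a purely illustrative aid for this response, a minimal NumPy sketch of one way a density-adaptive K could be chosen is given below. It assumes K is interpolated between K_min = 8 and K_max = 32 from a normalized kernel-density estimate of each point; the exact DAKNN formulation is the one defined in Section 3.3 of the manuscript, and the function names here (estimate_density, adaptive_k, adaptive_knn) and the density-to-K mapping are illustrative assumptions only.

```python
import numpy as np

def estimate_density(points, bandwidth=0.1):
    """Gaussian kernel density estimate for each point (brute force, O(N^2))."""
    sq_dist = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2)).mean(axis=1)

def adaptive_k(points, k_min=8, k_max=32, bandwidth=0.1):
    """Map normalized local density to a per-point neighborhood size K in [k_min, k_max]."""
    density = estimate_density(points, bandwidth)
    norm = (density - density.min()) / (density.max() - density.min() + 1e-12)
    # Assumption: sparser regions receive a larger K; the manuscript's mapping may differ.
    return np.round(k_max - norm * (k_max - k_min)).astype(int)

def adaptive_knn(points, k_min=8, k_max=32, bandwidth=0.1):
    """Return, for each point, the indices of its own K_i nearest neighbors."""
    sq_dist = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dist, np.inf)        # exclude the point itself
    order = np.argsort(sq_dist, axis=1)      # neighbors sorted by distance
    ks = adaptive_k(points, k_min, k_max, bandwidth)
    return [order[i, :k] for i, k in enumerate(ks)]

if __name__ == "__main__":
    cloud = np.random.rand(1024, 3).astype(np.float32)
    neighbors = adaptive_knn(cloud)
    print(len(neighbors), neighbors[0].shape)  # 1024 neighbor lists, first of size K_0
```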
- How is the density of each point calculated? Is there a specific radius used? Similar to the K-value, how sensitive are the results to the density calculation method?
Reply: Thank you for your feedback. Please see Section 4.3. Experiment in Kernel Density Estimation (p. 13).
- How does your method compare to other state-of-the-art down-sampling techniques for point clouds? It would be beneficial to have a brief discussion of related work and a comparison table.
Reply: Thank you for your reminder. The following text and Table 8 have been added to Section 4.8, Comparison with Transformer-Based Architectures, of the revised manuscript (p. 16).
We compare our method with state-of-the-art point cloud classification models based on Transformer architectures, including Point-BERT and Point Transformer. The core of Point Transformer lies in its use of vector self-attention within local regions, combined with a learnable position encoding. This design effectively captures both local and global relationships.
Point-BERT, on the other hand, segments point clouds into local regions and encodes them into discrete codes using a discrete variational autoencoder (dVAE). It then performs self-supervised pretraining through Masked Point Modeling (MPM), where the model learns to reconstruct local geometric structures from partially masked inputs. This approach effectively captures both local and global features of point clouds, significantly improving performance in classification and few-shot learning tasks, and shows strong transferability.
The table below shows the classification results on the ModelNet40 dataset. Point-BERT achieves the highest overall accuracy, 93.8%, when using 8k points, yet with only 1024 points our model already reaches 93.57%, nearly on par. Compared with Point Transformer, our model is only 0.13% lower in accuracy, while Point Transformer uses 4.9 million parameters and our model only 4.5231 million, nearly 400,000 fewer.
Table 8. Comparison with Transformer-Based Architectures
Method | Architecture | Input points | Overall Acc.
Point Transformer | Transformer | - | 93.70%
[ST] Point-BERT (1k) | Transformer-based | 1024 | 93.20%
[ST] Point-BERT (8k) | Transformer-based | 8192 | 93.80%
PointNext-s + APES + DAKNN | CNN + DAKNN/APES | 1024 | 93.57%
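As a side illustration of the vector self-attention idea mentioned above, a simplified PyTorch sketch is shown below. It is a generic approximation of per-channel attention over a local neighborhood with a learned relative positional encoding, not Point Transformer's official implementation; the class and parameter names are ours.

```python
import torch
import torch.nn as nn

class VectorSelfAttention(nn.Module):
    """Simplified vector self-attention over a fixed local neighborhood (illustrative only).

    Per-channel attention weights are produced from (q_i - k_j + positional encoding)
    and applied elementwise to the values, roughly following the idea described above.
    """
    def __init__(self, dim=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.pos_enc = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.weight_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feats, xyz, neighbor_idx):
        # feats: (N, C), xyz: (N, 3), neighbor_idx: (N, K) indices of local neighbors
        q = self.to_q(feats)                            # (N, C)
        k = self.to_k(feats)[neighbor_idx]              # (N, K, C)
        v = self.to_v(feats)[neighbor_idx]              # (N, K, C)
        rel_pos = xyz[:, None, :] - xyz[neighbor_idx]   # (N, K, 3)
        pos = self.pos_enc(rel_pos)                     # (N, K, C)
        attn = torch.softmax(self.weight_mlp(q[:, None, :] - k + pos), dim=1)
        return (attn * (v + pos)).sum(dim=1)            # (N, C)

if __name__ == "__main__":
    N, K, C = 1024, 16, 64
    xyz, feats = torch.rand(N, 3), torch.rand(N, C)
    neighbor_idx = torch.cdist(xyz, xyz).topk(K, largest=False).indices  # (N, K)
    print(VectorSelfAttention(C)(feats, xyz, neighbor_idx).shape)        # (1024, 64)
```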
- What is the computational overhead of calculating the density and adapting the K-value? Is it negligible compared to the overall training time reduction?
Reply: Thank you for your valuable reminder. The computational overhead of calculating the density and adapting the K value is not negligible compared with the overall training-time reduction. Please see Table 9 and Table 10 below, which are included in Section 4.9, Comparison of time at different K values, of the revised manuscript (p. 16).
Table 9. Comparison of time at different K values
K Value | Epoch | Time | Accuracy
8 | 600 | 08:14:58 | OA: 92.63, mAcc: 90.15
8 | 300 | 04:05:24 | OA: 92.79, mAcc: 90.13
16 | 300 | 04:18:08 | OA: 92.67, mAcc: 90.35
32 | 300 | 04:49:58 | OA: 92.71, mAcc: 89.90
32 | 600 | 10:05:54 | OA: 92.38, mAcc: 89.54
Adaptive (8 ~ 32) | 300 | 05:25:30 | OA: 92.91, mAcc: 90.12
Adaptive (8 ~ 32) | 600 | 10:55:54 | OA: 92.50, mAcc: 89.30
Adaptive (8 ~ 16) | 300 | 05:14:27 | OA: 92.63, mAcc: 90.43
Table 10. The percentage increase in calculation time for different K values
Epoch | Adaptive K | K = 8 | K = 16 | K = 32
300 | 5.3 hr | 4.10 hr | 4.30 hr | 4.80 hr
600 | 10.9 hr | 8.25 hr | - | 10.10 hr
- Detailed Comparison with Original APES: While you mention capturing edge points more effectively, providing a more detailed comparison (qualitative and quantitative) between your improved APES and the original APES would strengthen your claims. Show examples of how your method preserves edge points better. Metrics like edge point recall or precision could be helpful.
Reply: Thank you for your valuable feedback. Because the ModelNet10/40 point cloud datasets do not provide ground-truth labels for object edge points, it is not possible to determine objectively whether the detected edge points are correct, so a quantitative analysis of edge-point recall or precision cannot be conducted. However, a subjective evaluation of Figure 9 by five observers (three graduate students and two professors; one female and four male) indicates that the edge points detected by our method are preserved noticeably better than those produced by the original APES.
- In this aspect, I would like to suggest that the authors refer to the following paper in their revised version.
Kyaw, P.P., Tin, P., Aikawa, M., Kobayashi, I., Zin, T.T. (2025). Cow’s Back Surface Segmentation of Point-Cloud Image Using PointNet++ for Individual Identification. In: Pan, JS., Zin, T.T., Sung, TW., Lin, J.CW. (eds) Genetic and Evolutionary Computing. ICGEC 2024. Lecture Notes in Electrical Engineering, vol 1321. Springer, Singapore. https://doi.org/10.1007/978-981-96-1531-5_20
Reply: Thank you for your suggestion. The revised manuscript now includes this paper as reference 9. The description for reference 9 can be found on page 3.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
How does the proposed downsampling strategy compare with state-of-the-art transformer-based methods for 3D point cloud classification, such as Point-BERT or Point Transformer?
The use of DAKNN for dynamic neighborhood selection is interesting. Has there been an evaluation of how different density estimation techniques (e.g., adaptive kernel density estimation) impact classification accuracy?
In Table 5, the results show that adding point cloud rotation improved accuracy, which contrasts with findings in the PointNext paper. Could this be due to specific characteristics of the APES method? Would additional analysis, such as visualizing feature importance maps, clarify the underlying reason?
The study focuses on the ModelNet40 dataset, which consists of clean, synthetic data. How well would the proposed approach generalize to real-world, noisy point clouds, such as those from LiDAR scans or RGB-D sensors? Has any robustness evaluation been conducted on such datasets?
Given that self-attention mechanisms are computationally expensive, especially in high-resolution point clouds, could further efficiency improvements be made?
The paper discusses the impact of subsampling techniques but does not explore their influence on adversarial robustness. Would incorporating adversarial training or robustness testing against perturbations (e.g., occlusions, noise) offer additional insights into model reliability?
The integration of self-attention mechanisms for downsampling aligns with broader research in sensor fusion and multi-modal learning. Related works on attention-based sensor fusion, such as [DOI:10.48550/arXiv.2112.11224], explore how attention can improve feature extraction across modalities. Please cite it.
The proposed method involves significant architectural adjustments, but does not discuss potential trade-offs in hardware deployment. Would there be any limitations when implementing this approach on resource-constrained edge devices, such as embedded GPUs?
Author Response
Dear Editors and Reviewers,
The authors sincerely thank the editor and reviewers for their valuable comments and suggestions, which have greatly improved the quality of this manuscript. We have carefully revised the manuscript based on the feedback received. Below, we provide detailed responses to each comment. To assist the reviewers in their evaluation, we have highlighted significant changes (green background) made in the revised version of the manuscript.
- How does the proposed downsampling strategy compare with state-of-the-art transformer-based methods for 3D point cloud classification, such as Point-BERT or Point Transformer?
Reply: Thank you for your suggestion. We have added the following brief description and Table 8 to Section 4.8, Comparison with Transformer-Based Architectures, of the revised manuscript (p. 16).
We compare our method with state-of-the-art point cloud classification models based on Transformer architectures, including Point-BERT and Point Transformer. The core of Point Transformer lies in its use of vector self-attention within local regions, combined with a learnable position encoding. This design effectively captures both local and global relationships.
Point-BERT, on the other hand, segments point clouds into local regions and encodes them into discrete codes using a discrete variational autoencoder (dVAE). It then performs self-supervised pretraining through Masked Point Modeling (MPM), where the model learns to reconstruct local geometric structures from partially masked inputs. This approach effectively captures both local and global features of point clouds, significantly improving performance in classification and few-shot learning tasks, and shows strong transferability.
The table below shows the classification results on the ModelNet40 dataset. Point-BERT achieves the highest overall accuracy, 93.8%, when using 8k points, yet with only 1024 points our model already reaches 93.57%, nearly on par. Compared with Point Transformer, our model is only 0.13% lower in accuracy, while Point Transformer uses 4.9 million parameters and our model only 4.5231 million, nearly 400,000 fewer.
Table 8. Comparison with Transformer-Based Architectures
Method | Architecture | Input points | Overall Acc.
Point Transformer | Transformer | - | 93.70%
[ST] Point-BERT (1k) | Transformer-based | 1024 | 93.20%
[ST] Point-BERT (8k) | Transformer-based | 8192 | 93.80%
PointNext-s + APES + DAKNN | CNN + DAKNN/APES | 1024 | 93.57%
- The use of DAKNN for dynamic neighborhood selection is interesting. Has there been an evaluation of how different density estimation techniques (e.g., adaptive kernel density estimation) impact classification accuracy?
Reply: Thank you for your reminder. The following passage and Table 11 have been included in Section 4.10, Adaptive Kernel Density Experiment, of the revised manuscript (p. 17).
We experimented with adaptive kernel density estimation (adaptive KDE), whereas the original paper used standard KDE. Adaptive KDE selects a different bandwidth for each data point, adjusting automatically based on the local data density. In denser regions, a smaller bandwidth is used (for higher resolution), while in sparser regions, a larger bandwidth is applied (to avoid overfitting). This method captures sharp variations in data distribution more effectively, but it comes with a higher computational cost.
Based on the current experimental results, the adaptive bandwidth has almost no impact on accuracy compared with the fixed bandwidth used in the original method.
Table 11. Adaptive Kernel Density Experiment
Bandwidth | Epoch | Time | Accuracy
0.1 | 200 | - | OA: 92.63, mAcc: 90.30
0.1 | 600 | 10:55:54 | OA: 92.50, mAcc: 89.30
Adaptive (0.1 ~ 0.2) | 200 | 03:42:08 | OA: 92.59, mAcc: 90.50
Adaptive (0.1 ~ 0.2) | 600 | 11:01:28 | OA: 92.18, mAcc: 90.12
Adaptive (0.12 ~ 0.17) | 200 | 03:54:54 | OA: 92.59, mAcc: 89.44
Reference: Point Cloud Denoising and Feature Preservation: An Adaptive Kernel Approach Based on Local Density and Global Statistics
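To make the adaptive-bandwidth comparison above concrete, a minimal NumPy sketch of a variable-bandwidth (adaptive) Gaussian KDE is shown below. It assumes each point's bandwidth is chosen inside a range such as the 0.1-0.2 used in Table 11 via a fixed-bandwidth pilot estimate; this is an illustrative implementation, not the one used in the manuscript, and the function name adaptive_kde is ours.

```python
import numpy as np

def adaptive_kde(points, h_min=0.1, h_max=0.2):
    """Variable-bandwidth Gaussian KDE: denser regions get h_min, sparser regions h_max."""
    sq_dist = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)  # (N, N)
    # Pilot density with a fixed bandwidth, used only to choose per-point bandwidths.
    pilot = np.exp(-sq_dist / (2.0 * h_min ** 2)).mean(axis=1)
    norm = (pilot - pilot.min()) / (pilot.max() - pilot.min() + 1e-12)
    h = h_max - norm * (h_max - h_min)            # denser point -> smaller bandwidth
    # Final density: each evaluation point uses its own bandwidth.
    density = np.exp(-sq_dist / (2.0 * h[:, None] ** 2)).mean(axis=1)
    return density, h

if __name__ == "__main__":
    cloud = np.random.rand(1024, 3).astype(np.float32)
    density, bandwidths = adaptive_kde(cloud)
    print(density.shape, bandwidths.min(), bandwidths.max())
```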
- In Table 5, the results show that adding point cloud rotation improved accuracy, which contrasts with findings in the PointNext paper. Could this be due to specific characteristics of the APES method? Would additional analysis, such as visualizing feature importance maps, clarify the underlying reason?
Reply: Thank you for your reminder. This is because APES places more emphasis on selecting edge points during sampling: it computes a standard deviation for each point to identify edge points. In contrast, PointNext uses Farthest Point Sampling (FPS), which aims to distribute the sampled points evenly in space rather than specifically targeting edge points. The core idea behind data augmentation is that the semantics should remain unchanged after the transformation; as long as the object's outline remains intact, rotation does not alter its semantics. With FPS, however, rotation can potentially cause semantic changes (for example, after rotation the point cloud of one object may become spatially similar to that of another object, resulting in similar point distributions). By focusing on edge-point sampling, APES can therefore exploit rotational data augmentation effectively, enabling the model to learn invariance to these irrelevant changes and improving its generalization ability.
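As a purely illustrative aid to the explanation above, the sketch below computes a simplified per-point "edge score" as the standard deviation of softmax-normalized neighbor weights over each point's k nearest neighbors and keeps the highest-scoring points. This geometric proxy merely stands in for APES's attention-based score; it is not the manuscript's implementation, and the function names are hypothetical.

```python
import numpy as np

def edge_scores(points, k=16):
    """Simplified per-point edge score: std of softmax weights over the kNN distances.

    Points whose neighborhood weights vary strongly tend to lie near edges or corners;
    this is only a stand-in for the attention-based standard deviation used by APES.
    """
    sq_dist = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dist, np.inf)                 # ignore self-distance
    nn_dist = np.sort(sq_dist, axis=1)[:, :k]         # k nearest squared distances
    weights = np.exp(-nn_dist)
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over the neighborhood
    return weights.std(axis=1)

def sample_edge_points(points, n_sample=512, k=16):
    """Downsample by keeping the n_sample points with the highest edge scores."""
    order = np.argsort(-edge_scores(points, k))
    return points[order[:n_sample]]

if __name__ == "__main__":
    cloud = np.random.rand(1024, 3).astype(np.float32)
    print(sample_edge_points(cloud).shape)            # (512, 3)
```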
- The study focuses on the ModelNet40 dataset, which consists of clean, synthetic data. How well would the proposed approach generalize to real-world, noisy point clouds, such as those from LiDAR scans or RGB-D sensors? Has any robustness evaluation been conducted on such datasets?
Reply: Thank you for your valuable suggestion. The authors have previously surveyed several point cloud datasets, including 3DMatch, KITTI Odometry, ModelNet10/40, DeformingThings4D, and 4DMatch. The 3DMatch dataset consists of 8 indoor scenes, each containing approximately 50-60 point clouds. KITTI Odometry includes 22 street scenes, of which only 11 have ground truth data. ModelNet10/40, DeformingThings4D, and 4DMatch are all synthetic datasets. DeformingThings4D and 4DMatch are both animations. ModelNet10/40 is the largest of these datasets, with each point cloud representing a single object along with its ground truth, which is advantageous for evaluating experimental performance. Therefore, this paper utilizes ModelNet10/40 as the research dataset.
In addition, following the reviewer's suggestion, the authors added artificial offsets and noise to each point cloud in the dataset. These noisy point clouds were then used to test and evaluate the robustness of the proposed algorithm. The experimental results and discussion, namely the following passages and tables, have been included in Section 4.11 of the revised manuscript (p. 17).
We conducted two types of noise tests on ModelNet40 to evaluate the model’s generalization and robustness under unseen disturbances. First, we performed outlier and noise injection experiments by randomly adding noise points (ranging from 1% to 20% of the original point cloud) to simulate real-world interference. Results showed that when noise was below 19%, the overall accuracy (OA) remained nearly unchanged, and even at 20% noise, the OA only slightly decreased, staying above 92%. With adversarial training, the model maintained its performance even under 20% noise, demonstrating strong robustness. This is largely attributed to the N2P sampling mechanism in APES, which uses a self-attention-based relation graph to assess the relevance between each point and its neighbors. Since outliers have low relational scores due to distant neighbors, they are unlikely to be sampled, effectively reducing the impact of noise.
Next, we simulated point cloud shift scenarios by randomly offsetting 1% to 10% of the points with varying magnitudes (from 0.01 to 0.1) to test the model's sensitivity to geometric disturbances. The results showed that when the offset magnitude was below 0.06, each 0.01 increase led to only a minor drop in OA. Even with 10% of the points shifted, the model maintained over 90% OA as long as the offset did not exceed 0.03. However, when the shift became larger, accuracy dropped significantly, likely because the disruption of the point cloud structure altered edge point locations and neighborhood relationships, thereby weakening the model's recognition ability.
For the offset test, we compare primarily against our original, unperturbed results (OA: 92.91, mAcc: 90.12). In the manuscript, results with OA ≥ 92 and mAcc ≥ 89 are highlighted in red text, and a yellow background marks the red-text result with the highest noise ratio. The detailed results are shown in Table 12. Due to space limitations, some results have been removed from Table 12.
For the noise experiments, we introduced noise by randomly adding 1% to 20% noisy points. Models were trained using data with 5%, 10%, 15%, and 20% noise levels, and tested on data with noise levels ranging from 0% to 20%. The general results are shown below. The "average" refers to the mean performance across all noise levels from 1% to 20%. Due to space limitations, we only present the results for 0%, 5%, 10%, 15%, and 20% in Table 13.
Table 12. Compare the results after offset noise (Original results: OA: 92.91, mAcc: 90.12)
Noise Ratio | Metric | Offset 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.10
1% | OA | 93.03 | 92.46 | 92.87 | 92.42 | 92.26 | 92.30 | 92.26 | 92.22 | 92.50 | 92.54
1% | mAcc | 90.60 | 89.04 | 90.00 | 89.30 | 89.40 | 89.20 | 89.03 | 89.63 | 90.03 | 89.59
2% | OA | 92.50 | 92.10 | 92.26 | 92.26 | 92.14 | 91.25 | 91.09 | 91.21 | 91.45 | 91.33
2% | mAcc | 89.87 | 89.37 | 89.82 | 89.40 | 89.53 | 88.47 | 88.24 | 87.95 | 88.72 | 88.41
4% | OA | 92.46 | 92.26 | 92.06 | 91.61 | 91.17 | 91.29 | 90.68 | 89.71 | 89.34 | 89.30
4% | mAcc | 89.82 | 89.28 | 89.29 | 88.69 | 87.93 | 88.05 | 86.93 | 87.06 | 85.79 | 85.06
7% | OA | 92.30 | 92.10 | 91.37 | 90.92 | 90.19 | 89.87 | 88.37 | 87.52 | 86.26 | 85.05
7% | mAcc | 89.14 | 89.28 | 88.00 | 87.60 | 85.95 | 85.64 | 83.96 | 81.86 | 81.90 | 79.00
10% | OA | 92.59 | 91.73 | 90.92 | 89.99 | 88.98 | 87.88 | 86.51 | 84.24 | 82.09 | 79.54
10% | mAcc | 89.81 | 88.92 | 87.60 | 85.85 | 83.80 | 83.33 | 80.06 | 78.06 | 74.20 | 71.91
Table 13. Compare the results after adding noise (Original results: OA: 92.91, mAcc: 90.12)
Testing SNR | Metric | Training SNR 0% | 5% | 10% | 15% | 20%
0% | OA | 92.71 | 93.23 | 93.15 | 93.11 | 92.87
0% | mAcc | 89.90 | 90.84 | 90.29 | 91.01 | 90.09
5% | OA | 92.14 | 92.95 | 93.64 | 93.19 | 92.38
5% | mAcc | 89.52 | 90.51 | 90.72 | 90.90 | 89.86
10% | OA | 92.83 | 92.83 | 93.31 | 93.07 | 92.91
10% | mAcc | 90.17 | 89.58 | 90.49 | 90.87 | 90.05
15% | OA | 92.75 | 92.54 | 92.59 | 93.27 | 93.27
15% | mAcc | 90.49 | 90.05 | 89.23 | 90.73 | 90.75
20% | OA | 91.98 | 92.63 | 93.07 | 92.38 | 92.59
20% | mAcc | 89.86 | 89.80 | 90.03 | 89.76 | 89.90
Avg. | OA | 92.55 | 92.88 | 93.00 | 92.97 | 92.77
Avg. | mAcc | 90.11 | 90.12 | 90.02 | 90.51 | 90.29
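For concreteness, a minimal sketch of the two perturbations described before Tables 12 and 13 (outlier injection and random point offsets) follows. The uniform bounding-box sampling for noise points and the unit-direction offsets are illustrative assumptions, not necessarily the exact procedure used to produce the tables, and the function names are ours.

```python
import numpy as np

def add_outlier_points(points, ratio=0.05, rng=None):
    """Append ratio*N noise points sampled uniformly inside the cloud's bounding box."""
    rng = np.random.default_rng() if rng is None else rng
    n_noise = int(round(ratio * len(points)))
    lo, hi = points.min(axis=0), points.max(axis=0)
    noise = rng.uniform(lo, hi, size=(n_noise, points.shape[1]))
    return np.concatenate([points, noise], axis=0)

def offset_points(points, ratio=0.05, magnitude=0.05, rng=None):
    """Shift a random ratio of the points along random unit directions scaled by `magnitude`."""
    rng = np.random.default_rng() if rng is None else rng
    out = points.copy()
    n_shift = int(round(ratio * len(points)))
    idx = rng.choice(len(points), size=n_shift, replace=False)
    direction = rng.normal(size=(n_shift, points.shape[1]))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True) + 1e-12
    out[idx] += magnitude * direction
    return out

if __name__ == "__main__":
    cloud = np.random.rand(1024, 3).astype(np.float32)
    noisy = add_outlier_points(cloud, ratio=0.10)              # add 10% outlier points
    shifted = offset_points(cloud, ratio=0.10, magnitude=0.05)  # shift 10% of points by 0.05
    print(noisy.shape, shifted.shape)                           # (1126, 3) (1024, 3)
```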
- Given that self-attention mechanisms are computationally expensive, especially in high-resolution point clouds, could further efficiency improvements be made?
Reply: Thank you for your valuable suggestion. We agree that the computational cost of self-attention is an important and challenging issue; however, improving the self-attention mechanism itself is beyond the scope of this study. We plan to investigate this topic in future work.
- The paper discusses the impact of subsampling techniques but does not explore their influence on adversarial robustness. Would incorporating adversarial training or robustness testing against perturbations (e.g., occlusions, noise) offer additional insights into model reliability?
Reply: Thank you for your feedback. Please see the reply to Comment 4.
- The integration of self-attention mechanisms for downsampling aligns with broader research in sensor fusion and multi-modal learning. Related works on attention-based sensor fusion, such as [DOI:10.48550/arXiv.2112.11224], explore how attention can improve feature extraction across modalities. Please cite it.
Reply: Thank you for your suggestion. This paper is now referenced as number 20 in the revised manuscript. The description for reference 20 can be found on page 4.
- The proposed method involves significant architectural adjustments, but does not discuss potential trade-offs in hardware deployment. Would there be any limitations when implementing this approach on resource-constrained edge devices, such as embedded GPUs?
Reply: Thank you for pointing out this issue. This paper aims to develop a novel and effective approach for classifying point cloud objects. Deploying the approach on hardware requires a sufficient budget for the necessary equipment, and the funding for the authors' current research project is insufficient to purchase such hardware. If additional research funding is secured in the future, the authors plan to deploy the approach on resource-constrained edge devices such as embedded GPUs.
Author Response File: Author Response.pdf