Article
Peer-Review Record

Local Contextual Attention for Enhancing Kernel Point Convolution in 3D Point Cloud Semantic Segmentation

Appl. Sci. 2025, 15(17), 9503; https://doi.org/10.3390/app15179503
by Onur Can Bayrak * and Melis Uzar
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 5 August 2025 / Revised: 26 August 2025 / Accepted: 28 August 2025 / Published: 29 August 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper introduces the adaptation of a Local Contextual Attention (LCA) mechanism for the KPConv network, which reweights kernel coefficients based on local feature similarity within the spatial proximity domain.

 

  1. Is the addition of Local Contextual Attention sufficient as the innovation point of this article?

  2. In the "4. Experiments: Setup, Metrics, and Comparison" section, the subsection numbering skips 4.2; having only "4.1", "4.3", and "4.4" is unreasonable.

  3. In "Table 2. Semantic segmentation results on STPLS3D [44] as of August 2025", the best performer in the comparison is a 2019 model; it is recommended to compare with the latest models.

  4. In the ablation experiment on the STPLS3D dataset, why are the results for the different point cloud data so different?

  5. In the "References" section, the format is not uniform; some years are bolded and some are not.

Author Response

This paper introduces the adaptation of a Local Contextual Attention (LCA) mechanism for the KPConv network, which reweights kernel coefficients based on local feature similarity within the spatial proximity domain.

  1. Is the addition of Local Contextual Attention sufficient as the innovation point of this article?
    Answer: Many thanks for your constructive suggestion. We have added the innovation of LCA between lines 117–134.

  2. In the "4. Experiments: Setup, Metrics, and Comparison" section, the subsection numbering skips 4.2; having only "4.1", "4.3", and "4.4" is unreasonable.
    Answer: Thank you for your attention. We have corrected the subsection numbering.

  3. In "Table 2. Semantic segmentation results on STPLS3D [44] as of August 2025", the best performer in the comparison is a 2019 model; it is recommended to compare with the latest models.
    Answer: Many thanks for your suggestion. Prior to submission, we had consulted the benchmark’s website. Following your recommendation, we included additional studies that used this dataset. However, no existing work in the literature has employed all three training configurations: (i) real, (ii) synthetic, and (iii) real+synthetic. Therefore, we selected relevant works and inserted them into the appropriate rows of Table 2.

  4. In the ablation experiment on the STPLS3D dataset, why are the results for the different point cloud data so different?
    Answer: Many thanks for your constructive question. Point clouds inherently suffer from uneven density, incompleteness, non-uniformity, and noise. LCA is particularly suitable for real-world point clouds. Moreover, based on Reviewer-2’s suggestion, we implemented Z-Score standardization for feature alignment, which improved classification performance by approximately 5% mIoU (a minimal illustrative sketch of such standardization is given after these responses). We discussed this issue in the Discussion section, between lines 451–460.

  5. In the "References" section, the format is not uniform; some years are bolded and some are not.
    Answer: Thank you for your valuable observation. We have corrected the formatting in the References section.
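
For illustration, a minimal Python sketch of the Z-Score standardization mentioned in the answer to comment 4 is given below. It is not the authors' exact implementation: the array shapes are placeholders, and it assumes per-channel statistics computed over the combined training features.

```python
import numpy as np

def zscore_standardize(features, eps=1e-8):
    """Standardize per-point features (e.g., color or intensity channels)
    to zero mean and unit variance, computed channel-wise over the training set."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

# Hypothetical example: align real and synthetic feature distributions
# before real+synthetic training (N points x C feature channels).
real_feats = np.random.rand(1000, 3).astype(np.float32)
synthetic_feats = np.random.rand(1000, 3).astype(np.float32)
combined = np.concatenate([real_feats, synthetic_feats], axis=0)
combined_std = zscore_standardize(combined)
```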

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript introduces a Local Contextual Attention (LCA) mechanism integrated into the Kernel Point Convolution (KPConv) framework for 3D point cloud semantic segmentation. The LCA block adaptively reweights kernel coefficients based on local feature similarity while preserving KPConv’s geometric fidelity, aiming to improve boundary delineation and recognition of small or imbalanced classes. Experiments on three challenging benchmarks (STPLS3D, Hessigheim3D, Toronto3D) show substantial gains, including up to 20% mIoU improvement on STPLS3D, 16% mF1 improvement on Hessigheim3D, and 15% mIoU improvement on Toronto3D over baseline KPConv and other state-of-the-art methods. The novelty lies in combining a lightweight attention mechanism with distance-weighted kernels, achieving competitive accuracy without significant computational overhead. However, I recommend Major Revision for the following reasons.

 

  1. The theoretical motivation and novelty of LCA compared to existing attention-enhanced KPConv variants need clearer articulation. While the mathematical formulation is presented, the conceptual advancement over similar local attention mechanisms remains insufficiently highlighted. The authors should expand the related work section to include a detailed comparison with prior contextual or attention modules in point-based networks, explicitly stating what is unique in LCA’s design, such as its computational efficiency, locality preservation, or robustness to data sparsity.
  2. The performance gap observed when training with synthetic data, as discussed in Section 5, is a significant limitation. Rather than only identifying potential remedies as future work, the paper should provide experimental evidence supporting these claims. For example, the authors could implement and evaluate preliminary domain adaptation techniques, feature normalization strategies, or hybrid training approaches to demonstrate that such methods can indeed reduce the gap between synthetic and real-world performance.
  3. The error analysis provided is limited and does not sufficiently explain failure patterns across classes. A more systematic class-wise error evaluation should be conducted for each dataset, potentially including confusion matrices and performance drop comparisons for specific class pairs. This analysis should be followed by a discussion of targeted improvements, such as the integration of boundary-aware loss functions, multi-scale context aggregation, or shape priors, which could directly address recurrent misclassification issues such as Fence–Building or Car–Building confusion.

Author Response

This manuscript introduces a Local Contextual Attention (LCA) mechanism integrated into the Kernel Point Convolution (KPConv) framework for 3D point cloud semantic segmentation. The LCA block adaptively reweights kernel coefficients based on local feature similarity while preserving KPConv’s geometric fidelity, aiming to improve boundary delineation and recognition of small or imbalanced classes. Experiments on three challenging benchmarks (STPLS3D, Hessigheim3D, Toronto3D) show substantial gains, including up to 20% mIoU improvement on STPLS3D, 16% mF1 improvement on Hessigheim3D, and 15% mIoU improvement on Toronto3D over baseline KPConv and other state-of-the-art methods. The novelty lies in combining a lightweight attention mechanism with distance-weighted kernels, achieving competitive accuracy without significant computational overhead. However, I recommend Major Revision for the following reasons.

  1. The theoretical motivation and novelty of LCA compared to existing attention-enhanced KPConv variants need clearer articulation. While the mathematical formulation is presented, the conceptual advancement over similar local attention mechanisms remains insufficiently highlighted. The authors should expand the related work section to include a detailed comparison with prior contextual or attention modules in point-based networks, explicitly stating what is unique in LCA’s design, such as its computational efficiency, locality preservation, or robustness to data sparsity.
    Answer: Thank you very much for your constructive comment. The gaps of current attention-based networks and the uniqueness of LCA’s design are presented between lines 117–134.

  2. The performance gap observed when training with synthetic data, as discussed in Section 5, is a significant limitation. Rather than only identifying potential remedies as future work, the paper should provide experimental evidence supporting these claims. For example, the authors could implement and evaluate preliminary domain adaptation techniques, feature normalization strategies, or hybrid training approaches to demonstrate that such methods can indeed reduce the gap between synthetic and real-world performance.
    Answer: We sincerely appreciate your suggestion. Following your idea, we implemented Z-Score normalization during the training of real+synthetic point clouds on the STPLS3D dataset. We observed approximately a 5% improvement in both mIoU and Overall Accuracy. The results are provided in Table 2 and discussed between lines 269–278.

  3. The error analysis provided is limited and does not sufficiently explain failure patterns across classes. A more systematic class-wise error evaluation should be conducted for each dataset, potentially including confusion matrices and performance drop comparisons for specific class pairs. This analysis should be followed by a discussion of targeted improvements, such as the integration of boundary-aware loss functions, multi-scale context aggregation, or shape priors, which could directly address recurrent misclassification issues such as Fence–Building or Car–Building confusion.
    Answer: Thank you for this valuable recommendation, which has greatly enhanced our work. We added confusion matrices in Figures 6, 8, and 10, and discussed them both in the figure captions and in the Discussion section, between lines 461–481.
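
As a minimal illustration of the class-wise error evaluation mentioned in the answer to comment 3, the sketch below derives per-class IoU from a confusion matrix. It is a generic NumPy example, not the evaluation code used in the paper, and assumes integer labels in the range [0, num_classes).

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix
    (rows: ground truth, columns: prediction) from per-point labels."""
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def per_class_iou(conf):
    """IoU_c = TP_c / (TP_c + FP_c + FN_c), read off the confusion matrix."""
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    return tp / np.maximum(tp + fp + fn, 1)

# Toy example with 3 classes; mIoU is the mean of the per-class IoUs.
gt = np.array([0, 0, 1, 2, 2, 1])
pred = np.array([0, 1, 1, 2, 0, 1])
conf = confusion_matrix(pred, gt, num_classes=3)
iou = per_class_iou(conf)
# Large off-diagonal entries for specific class pairs (e.g., Fence vs. Building)
# indicate the recurrent confusions discussed in the paper's Discussion section.
```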

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript introduces a Local Contextual Attention (LCA) block to enhance KPConv for 3D point-cloud semantic segmentation, showing notable improvements on STPLS3D, Hessigheim3D, and Toronto3D with minimal claimed computational overhead. Reported gains include +20% mIoU on STPLS3D (real-only), high mF1 on Hessigheim3D, and state-of-the-art mIoU on Toronto3D.

  • The bibliography is up to date through 2025 and references recent KPConv/transformer work (PGFormer, HyperG-PS). However, including the latest studies on synthetic-to-real/domain adaptation and lightweight attention methods for MLS/UAV data would strengthen the background and contextualize the contribution.
  • In Eq. (6), W1 and b1 are repeated; it should instead be W2 ReLU(W1 u + b1) + b2. Please correct this and explicitly state the values for d_hidden and the output dimensions.
  • The decay rate “0.11/100 per epoch” is unclear. Specify the exact scheduling method (cosine, step with gamma, or exponential decay factor) and describe the random seed policy.
  • The abstract claims negligible increases in memory and computation. To support this, include parameters, FLOPs, peak VRAM usage, and per-scene latency comparisons against KPConv.
  • The dataset splits are clearly described (STPLS3D areas, Toronto3D L001/3/4 → L002), which is good. Please clarify whether the baseline models were retrained using your setup or if their results were taken directly from public leaderboards.

Author Response

The manuscript introduces a Local Contextual Attention (LCA) block to enhance KPConv for 3D point-cloud semantic segmentation, showing notable improvements on STPLS3D, Hessigheim3D, and Toronto3D with minimal claimed computational overhead. Reported gains include +20% mIoU on STPLS3D (real-only), high mF1 on Hessigheim3D, and state-of-the-art mIoU on Toronto3D.

  • The bibliography is up to date through 2025 and references recent KPConv/transformer work (PGFormer, HyperG-PS). However, including the latest studies on synthetic-to-real/domain adaptation and lightweight attention methods for MLS/UAV data would strengthen the background and contextualize the contribution.
    Answer: Thank you very much for your constructive comment. Please check lines 87–100 and 108–134.

  • In Eq. (6), W1 and b1 are repeated; it should instead be W2 ReLU(W1 u + b1) + b2. Please correct this and explicitly state the values for d_hidden and the output dimensions.
    Answer: Thank you very much for your careful review. We have corrected the equation and explicitly stated the values of d_hidden and the output dimensions. Please check lines 173–181. A minimal sketch of the corrected two-layer form is given after these responses.

  • The decay rate “0.11/100 per epoch” is unclear. Specify the exact scheduling method (cosine, step with gamma, or exponential decay factor) and describe the random seed policy.
    Answer: Thank you for pointing out the ambiguity regarding the learning rate schedule. We apologize for the lack of clarity. In our implementation, we employed an exponential decay strategy for the learning rate. An illustrative sketch of such a schedule is given after these responses.

  • The abstract claims negligible increases in memory and computation. To support this, include parameters, FLOPs, peak VRAM usage, and per-scene latency comparisons against KPConv.
    Answer: Thank you for raising this important point. We have added the relevant details between lines 433–439 and included Table 6.

  • The dataset splits are clearly described (STPLS3D areas, Toronto3D L001/3/4 → L002), which is good. Please clarify whether the baseline models were retrained using your setup or if their results were taken directly from public leaderboards.
    Answer: Thank you for your valuable comment. The baseline results reported in our work were not retrained using our setup. Instead, they were directly taken from public leaderboards and previously published academic papers. This was done to ensure consistency with existing benchmark comparisons and to provide a fair and transparent evaluation of our proposed method against widely accepted reference performances. We have revised the manuscript accordingly; please check lines 238–240.
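
As a minimal illustration of the corrected Eq. (6) form W2 ReLU(W1 u + b1) + b2 discussed in the second comment above, the PyTorch sketch below implements a two-layer MLP. The dimensions used here are placeholders, not the d_hidden and output sizes stated in the revised manuscript.

```python
import torch
import torch.nn as nn

class TwoLayerMLP(nn.Module):
    """Computes W2 * ReLU(W1 * u + b1) + b2, i.e., the corrected two-layer form."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)   # W1, b1
        self.fc2 = nn.Linear(d_hidden, d_out)  # W2, b2

    def forward(self, u):
        return self.fc2(torch.relu(self.fc1(u)))

# Placeholder dimensions for illustration only; the actual d_hidden and
# output dimension are those reported in the revised manuscript (lines 173-181).
mlp = TwoLayerMLP(d_in=64, d_hidden=32, d_out=16)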
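Similarly, the exponential learning-rate decay mentioned in the third answer above can be sketched as follows; the model, optimizer, and decay factor are illustrative placeholders, not the settings used for training in the paper.

```python
import torch

# Hypothetical model and optimizer; only the scheduler choice is the point here.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Exponential decay: lr_epoch = lr_0 * gamma ** epoch (gamma is illustrative).
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(100):
    # ... one training epoch would run here ...
    optimizer.step()
    scheduler.step()
```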

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have satisfactorily modified their manuscript according to my previous criticisms. Therefore, I recommend the publication of this manuscript.

Author Response

Comments 1: The authors have satisfactorily modified their manuscript according to my previous criticisms. Therefore, I recommend the publication of this manuscript.

Answer 1: We appreciate your positive and constructive comments. Best regards.

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript introduces a Local Contextual Attention (LCA) block into KPConv for 3D semantic segmentation, reporting notable improvements on STPLS3D, Hessigheim3D, and Toronto3D with only a modest computational cost. The contribution is interesting and relevant, but a few points could further strengthen the work:

  • The related work section has been updated with recent KPConv/transformer approaches and domain adaptation literature (2021–2025), which provides good context. To make it more practical, you could briefly connect MLS/UAV datasets with the cited DA strategies, helping readers see how these techniques apply in real scenarios.
  • Many of the baseline results are taken directly from leaderboards or prior publications rather than being retrained. To ensure a fair “apples-to-apples” comparison, it would be valuable to retrain KPConv and one or two strong baselines within your own pipeline. This would also allow you to report consistent comparisons for gains, latency, and VRAM usage.
  • In the real+synthetic setting, performance drops below KPConv unless normalization is applied. It would help to include an ablation study exploring the impact of hyperparameters such as neighborhood radius, kernel-point count, and d_hidden. This could clarify the sensitivity of your method and explain the underlying causes of this behavior.

Author Response

Comment 1: The related work section has been updated with recent KPConv/transformer approaches and domain adaptation literature (2021–2025), which provides good context. To make it more practical, you could briefly connect MLS/UAV datasets with the cited DA strategies, helping readers see how these techniques apply in real scenarios.

Answer 1: Thanks to your suggestions, readers will better understand the functionality of DA techniques. Please see lines 111–117.

Comment 2: Many of the baseline results are taken directly from leaderboards or prior publications rather than being retrained. To ensure a fair “apples-to-apples” comparison, it would be valuable to retrain KPConv and one or two strong baselines within your own pipeline. This would also allow you to report consistent comparisons for gains, latency, and VRAM usage.

Answer 2: Thank you for your constructive feedback. We retrained the KPConv and Point Transformer networks on the utilized datasets (STPLS3D, Hessigheim3D, and Toronto3D). The results are reported in Tables 2, 3, and 4 for these datasets, respectively, and the network complexity parameters are provided in Table 7. Although we obtained similar results for STPLS3D and Toronto3D, we were not able to fully reproduce the results reported in the previous literature. We explained the reasons and discussed this issue in detail. Please see lines 525–535.

Comment 3: In the real+synthetic setting, performance drops below KPConv unless normalization is applied. It would help to include an ablation study exploring the impact of hyperparameters such as neighborhood radius, kernel-point count, and d_hidden. This could clarify the sensitivity of your method and explain the underlying causes of this behavior.

Answer 3: We greatly appreciate your comment. Following your suggestions, we benchmarked the parameter effects of the real+synthetic training configuration on STPLS3D. For this purpose, we conducted 36 training and testing experiments, both with and without normalization. Please see Table 6 and lines 447–480 and 516–524.
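
For illustration, a parameter sweep of the kind described in Answer 3 could be organized as in the sketch below. The factor levels and the helper loop body are hypothetical; the actual grid and its results are those reported in Table 6 of the revised manuscript.

```python
from itertools import product

# Purely illustrative factor levels (not the values used in the paper),
# chosen only so that the product yields 36 configurations.
radii = [1.0, 1.5, 2.0]          # neighborhood radius (m)
kernel_points = [10, 15, 20]     # number of KPConv kernel points
d_hidden = [16, 32]              # hidden width of the LCA MLP
normalization = [False, True]    # Z-Score standardization on/off

configs = list(product(radii, kernel_points, d_hidden, normalization))
print(len(configs))  # 3 * 3 * 2 * 2 = 36 training/testing runs

for r, k, d, norm in configs:
    # A hypothetical train_and_evaluate(r, k, d, norm) call would stand in
    # for the actual pipeline on the real+synthetic STPLS3D configuration.
    pass
```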
