GMAFNet: Gated Mechanism Adaptive Fusion Network for 3D Semantic Segmentation of LiDAR Point Clouds
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This manuscript presents a well-structured study with clear innovations and sufficiently validated experiments. However, several aspects of the writing and content organization could be further improved:
1. Abstract. Some expressions can be more concise to avoid redundancy. For example, phrases such as “Among them, the multimodal feature extraction network uses…” could be simplified. In addition, technical details like “multi-scale feature extraction” are described with excessive granularity for an abstract. The use of transition words can also be diversified to improve fluency.
2. Introduction. The background section would benefit from additional citations. Several statements currently lack sufficient referencing to existing literature, and strengthening this part would enhance the academic rigor of the introduction.
3. Related Work. It is recommended to add a concluding paragraph to summarize the limitations of the existing approaches and explicitly highlight how the proposed method addresses these gaps. This will improve the logical connection between related work and the contribution of this paper.
4. Experiments. Including more visual comparison results would provide intuitive evidence of the improvements achieved. Moreover, the discussion section should address the limitations of the proposed method, such as the trade-off between model complexity and performance.
Author Response
Dear Reviewers,
We are pleased to resubmit the revised version of our paper entitled "GMAFNet: Gated Mechanism Adaptive Fusion Network for 3D Semantic Segmentation of LiDAR Point Clouds" and sincerely request your consideration for publication in Electronics. We highly appreciate the time you have dedicated to reviewing our paper, and all your comments have provided valuable guidance for our revisions. We have fully considered all the review comments, carefully revised the paper, and the revised content has been highlighted at the corresponding positions in the revised manuscript.
We have replaced all figures and tables in the revised paper with higher-resolution versions. In addition, regarding the visualization results of Figure 7, we have uploaded them to the supplementary materials for the reviewers and other readers to access.
Comments 1: Abstract. Some expressions can be more concise to avoid redundancy. For example, phrases such as "Among them, the multimodal feature extraction network uses..." could be simplified. In addition, technical details like "multi-scale feature extraction" are described with excessive granularity for an abstract. The use of transition words can also be diversified to improve fluency.
Response 1: Thank you very much for your suggestions. Excessive technical details do indeed affect the fluency of the manuscript. We have revised lines 19-24 of the manuscript.
Comments 2: Introduction. The background section would benefit from additional citations. Several statements currently lack sufficient referencing to existing literature, and strengthening this part would enhance the academic rigor of the introduction.
Response 2: We have added new references in the Introduction and rephrased the existing literature (revised sections are lines 39-51) to enhance academic rigor.
Comments 3: Related Work. It is recommended to add a concluding paragraph to summarize the limitations of the existing approaches and explicitly highlight how the proposed method addresses these gaps. This will improve the logical connection between related work and the contribution of this paper.
Response 3: We have added a new paragraph to summarize the limitations of existing methods and the advantages of our approach. The revision is located at lines 160-175.
Comments 4: Experiments. Including more visual comparison results would provide intuitive evidence of the improvements achieved. Moreover, the discussion section should address the limitations of the proposed method, such as the trade-off between model complexity and performance.
Response 4: We have added figures and tables for experimental results and have re-discussed the strengths and limitations of the method in the method comparison section. The revision is located at lines 483-493.
Thank you again for pointing out the deficiencies in our discussion. These comments have significantly enhanced the rigor and persuasiveness of the paper!
Reviewer 2 Report
Comments and Suggestions for Authors
Abstract
Please specify which kind of lidar data is used (mobile, static, airborne or terrestrial), as well as the resolution of the images used. As there are plenty of AI classification algorithms suggested in the literature, it is important to quantify the obtained accuracy.
Introduction
There are comments within the paper which mention that the presented version is not the last version.
Figures and tables should be cited in the text before illustrating them.
Figure 1, plenty of colours are used to distinguish different classes, unfortunately, that makes the figure vague. How did you calculate the ground truth? Please replace Figure 1 with a clear figure.
Please highlight the novelty and the contribution of the paper.
Method
In Figure 2, the input and the output data are not clear.
Line 166: the term “data processing” is a general term, please replace it with an accurate expression.
Line 165: what is the reason for adopting “an optimized multimodal gated attention mechanism” and not other solutions?
Line 166: the 4 stages are not clear in Figure 2.
Can you please explain more about Equation 1, and how we can get the orientation elements of images to calculate the projection?
Please define all functions and parameters illustrated in Figure 4 in the figure caption. Please check all figures.
Please check that all parameters in the suggested equations are defined and cite their references; please check all equations.
Experimental Analysis
Please don’t put two section titles consecutively; separate them by transition paragraph(s). The suggested approach flowchart can be put in this section.
Datasets
You provided general explanations about the used data set which is unacceptable. Please provide a table to show the characteristics of input data sets.
Table 3: the column titles need to be defined in the caption.
Figure 7 is vague.
Conclusion
Please discuss the limitations and whether the obtained accuracy is significant.
Author Response
Dear Reviewers,
We are pleased to resubmit the revised version of our paper entitled "GMAFNet: Gated Mechanism Adaptive Fusion Network for 3D Semantic Segmentation of LiDAR Point Clouds" and sincerely request your consideration for publication in Electronics. We highly appreciate the time you have dedicated to reviewing our paper, and all your comments have provided valuable guidance for our revisions. We have fully considered all the review comments, carefully revised the paper, and the revised content has been highlighted at the corresponding positions in the revised manuscript.
We have replaced all figures and tables in the revised paper with higher-resolution versions. In addition, regarding the visualization results of Figure 7, we have uploaded them to the supplementary materials for the reviewers and other readers to access.
Abstract
Comments 1: Abstract. Please determine which kind of lidar data (mobile, static, airborne or terrestrial). Also, the resolution of used images. As there are plenty of AI classification algorithms suggested in the literature, it is important to quantify the obtained accuracy.
Response 1: Thank you very much for your suggestions. We have emphasized the type of lidar we used (line 13), and the resolution of the images is 480×320 pixels, which are randomly cropped and fed into the network. There is a detailed introduction in Section 3.3 of the paper.
Introduction
Comments 2: There are comments within the paper which mention that the presented version is not the last version.
Response 2: This statement was a remnant from an early draft and has been completely removed.
Comments 3: Figures and tables should be cited in the text before illustrating them.
Response 3: We have uniformly adjusted the order of figures and tables to ensure they are cited before being displayed.
Comments 4: Figure 1, plenty of colours are used to distinguish different classes, unfortunately, that makes the figure vague. How did you calculate the ground truth? Please replace Figure 1 with a clear figure.
Response 4: The ground truth calculation source for Figure 1 is the official manual annotations of SemanticKITTI, which has been added to the figure caption (line 102). Regarding the contribution of this paper, we have rephrased our work to highlight the focus of our research (lines 78-95).
Comments 5: Please highlight the novelty and the contribution of the paper.
Response 5: A new paragraph has been added at the end of the Introduction, where we have rephrased our work to highlight the focus of our research (lines 78-95).
Method
Comments 6: In Figure 2, the input and the output data are not clear.
Response 6: Figure 2 has been redrawn with annotations for input and output.
Comments 7: Line 166: the term "data processing" is a general term, please replace it with an accurate expression.
Response 7: We have removed this statement and provided a detailed explanation (lines 193-194).
Comments 8: Line 165: what is the reason for adopting "an optimized multimodal gated attention mechanism" and not other solutions?
Response 8: A detailed explanation of the advantages of the gated attention mechanism has been added to the text (lines 160-174).
Comments 9: Line 166: the 4 stages are not clear in Figure 2.
Response 9: Figure 2 has been revised, with different background colors in the figure representing different stages.
Comments 10: Can you please explain more about Equation (1), and how we can get the orientation elements of images to calculate the projection.
Response 10: The derivation of Equation (1) has been completed and can be found in Section 3.1 of the main text (lines 193-194). The intrinsic (K) and extrinsic (T) parameters are directly taken from the official KITTI calibration files and do not require additional calculation. The source of the data and the reference have been mentioned in the text.
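As an illustration of the projection discussed in Response 10, the following minimal sketch assumes a standard pinhole model in which the extrinsic matrix T maps LiDAR coordinates to the camera frame and the intrinsic matrix K maps camera coordinates to pixels. The function name project_points and the toy values of K and T are hypothetical; they are not taken from the manuscript or from the KITTI calibration files, and the manuscript's Equation (1) may differ in detail.

```python
import numpy as np

def project_points(points_lidar, K, T):
    """Project N x 3 LiDAR points to pixel coordinates (pinhole model, assumed).

    K: 3 x 3 camera intrinsics; T: 4 x 4 LiDAR-to-camera extrinsics.
    Returns (uv, depth, valid); valid marks points in front of the camera.
    """
    points_lidar = np.asarray(points_lidar, dtype=float)
    n = points_lidar.shape[0]
    homog = np.hstack([points_lidar, np.ones((n, 1))])  # N x 4 homogeneous points
    cam = (T @ homog.T).T[:, :3]                        # N x 3 in the camera frame
    depth = cam[:, 2]
    valid = depth > 1e-6                                # drop points behind the camera
    pix = (K @ cam.T).T                                 # N x 3, not yet normalized
    uv = np.full((n, 2), np.nan)
    uv[valid] = pix[valid, :2] / depth[valid, None]     # perspective division
    return uv, depth, valid

# Toy calibration (hypothetical values, identity extrinsics).
K = np.array([[100.0, 0.0, 240.0],
              [0.0, 100.0, 160.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
pts = np.array([[1.0, 0.5, 2.0],    # in front of the camera
                [0.0, 0.0, -1.0]])  # behind the camera
uv, depth, valid = project_points(pts, K, T)
print(uv, valid)
```

Each valid row of uv gives the pixel that a LiDAR point falls on, which is the point-to-pixel correspondence needed before 2D and 3D features can be paired.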
Comments 11: Please define all functions and parameters illustrated in Figure 4 in the figure caption. Please check all figures.
Response 11: The caption for Figure 4 has been completed, and all other figures have been checked. All symbols are defined either in the figure captions or at their first appearance in the main text.
Comments 12: Please check defining all parameters in the suggested equations and cite their references, please check all equations.
Response 12: All parameters in Equations (1)-(9) have been defined at their first appearance and missing references have been added.
Experimental Analysis
Comments 13: Please don't put two section titles consecutively, separate them by transition paragraph(s). The suggested approach flowchart can be put in this section.
Response 13: A transitional paragraph has been added before Section 4.1 to explain the structure of Chapter 4 (lines 412-415).
Datasets
Comments 14: You provided general explanations about the used dataset which is unacceptable. Please provide a table to show the characteristics of input datasets.
Response 14: A new table, "Introduction to the SemanticKITTI and nuScenes Datasets," has been added.
Comments 15: Table 3: the column titles need to be defined in the caption.
Response 15: We have explained Table 4 in lines 479-482.
Comments 16: Figure 7 is vague.
Response 16: Due to document limitations, the image was displayed with poor clarity. It has been replaced with a clearer version.
Thank you again for pointing out the deficiencies in our discussion. These comments have significantly enhanced the rigor and persuasiveness of the paper!
Reviewer 3 Report
Comments and Suggestions for Authors
The paper proposes an improved model for point cloud segmentation. The novelty mainly stems from connecting already known modules in a new way, targeting some of the presumed issues with existing models. The results show small improvements in segmentation accuracy, although some of the recent work is not included in comparison. The main problem with the paper is section 3, which does a poor job of clearly and coherently presenting the proposed model structure. For this reason, I recommend substantial improvements to be made to the paper before it is ready for publication. Below are section-by-section comments and suggestions.
Introduction:
- The introduction describes the importance and applications of point cloud semantic segmentation, but abruptly ends with a list of highlights of the proposed architecture. The motivation for proposing yet another neural network model for solving a well-researched problem is not properly explained. Which are the limitations of existing state-of-the-art methods that the proposed model overcomes? How do individual components, mentioned in the list, address these issues? Some challenges are described later, but should be described here to provide the context for this study.
- Lines 58-62: The novelty of the proposed model seems to be mainly in adding different existing modules to the 2DPASS segmentation model, such as RepGhost in the 2D extraction branch. The text "We introduce the 2D feature extraction network (RepGhost)" indicates that RepGhost is a new architecture developed in this study, but is not. The same is true for PV-RCNN.
Related work: Recently, several new neural network models have been proposed for fusion-based PC semantic segmentation. Many of them use attention and different mechanisms for improving feature extraction. The authors should include the most recent works, at least those reporting results on the SemanticKITTI and nuScenes datasets used in this study. For instance, how do modules introduced in GMAFNet relate to similar modules in existing studies, including:
- APPFNet: Adaptive point-pixel fusion network for 3D semantic segmentation with neighbor feature aggregation by Wu et al., 2024, claims even higher mIoU on both datasets
- Multi-scale sparse convolution and point convolution adaptive fusion point cloud semantic segmentation method, Bi et al. 2025
- A Cross-Modal Attention-Driven Multi-Sensor Fusion Method for Semantic Segmentation of Point Clouds, Shi et al., 2025, claims state-of-the-art on nuScenes
- Lines 111-114: The authors state that there are some remaining challenges of fusion-based PC segmentation, but the reasoning should be supported by better arguments. What makes the authors conclude that fine-grained image features are lost, and what does it mean that "insufficient representation of point cloud geometry persist"? How does the proposed solution of "a multimodal feature extraction network" differ fundamentally from existing solutions and specifically address the loss of fine-grained image features? Or how does the gated attention differ from what has been used in related work?
Section 3 is central to understanding the contributions of the paper to the architecture of the segmentation model. However, it introduces quantities that are never explained, or are used inconsistently. The flow of information through the network is only vaguely explained in the text, which makes true reproduction of experiments impossible. The authors should improve the description by adding equations, equipping figures with notation used in the text, and adding more details about the inputs/outputs of individual model components. Longer paragraphs should be split into smaller logical units.
Section 3.2:
- Line 158: "in the process of fusing the two types of features, the phenomenon of information redundancy is neglected, causing the model to overfit the noise" - what makes the authors think that overfitting to noise is the problem? Adding noise is sometimes actually used as a regularization technique.
- Line 169: how is disparity computed? Some key equations should be provided for reader's convenience.
- The notation used in the textual description in lines 173-190 should also be used in Figure 2.
- Line 183: is "VAS module" a typo?
Section 3.3:
- Figure 4 is not referenced in the text, as the leftover comment already explains.
Section 3.4:
- Figure 5 is not clear - what do the connections between VSA and the tensors in the upper branch represent? How are the tensors from upper and lower branch combined, what are their dimensions at each step of the way?
- The description in lines 231-238 is also ambiguous. Relation to Figure 5 should be made by adding quantity names in the figure.
- Line 237: "The features of each key point include voxel features from different layers and the original point cloud feature" - what is meant by "include", is concatenation used?
- Without equations, it is difficult to follow the flow of information through the network. The reader will need to read the PV-RCNN paper first to understand this section, in which case it will probably be redundant.
- The purpose and function of the RoI grid pooling module is not clear. How are the mentioned proposals obtained? What are the RoIs used for? Equation 3 is not informative, because the meaning of variables is not explained.
Section 3.5:
- There is some grammar problem in line 304.
- It is not clear where the quantities in equation 4 come from, because this notation is used for the first time here. In section 3.4, the authors used F^S and F^V, how are they related to F^le and F^V here?
- Grammar problem again in line 308.
- In Figure 4, the element-wise multiply only has one input. The quantities mentioned in the text should be added to the figure. Minor point - the sigmoid curve is not correct.
- It is not clear how the training of the gating module is integrated with the training of the whole network. The module uses its own loss, so how does that combine with the segmentation loss?
- Further, from the module description, it is not clear how the outputs S^2D3D and S^3D are used downstream. In Section 3.2, it is stated that the gating module generates F_fusion, but later this quantity is not mentioned anymore.
- Eq. 9 would require S^2D3D and S^3D to be probability distributions, e.g. after applying a softmax. What is the goal of computing the D_KL between two score vectors?
- The reasoning for the teacher-student concept described in 323-328 is unclear. Why do you want one to follow the other?
Section 4:
- The structure of the SemanticKITTI dataset is poorly described. What are "sequences"? What does sequence 10 have to do with object classes to be emphasized in the same line 339? The information in lines 337 and 339 regarding sequence 10 conflicts.
- Description of nuScenes dataset lacks information about the number of classes, and how was the data divided into training, validation and test sets.
- I am missing information about the training configuration (hyperparameter values) and how the fine-tuning was done.
- Line 383: "GMAFNet exhibits excellent segmentation accuracy" is a bit of an overstatement, because the differences from other models are quite small. Same thoughts about "outstanding performance" in the next line.
- Line 386: Table 3 is not "analysis of algorithmic complexity".
- I recommend the authors to analyze how different added components affect the training time.
- Figure 7 is not referenced in the text
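The Section 3.5 comment on Eq. 9 above can be illustrated with a minimal sketch: a KL divergence is only well defined between probability distributions, so raw score vectors are first normalized with a softmax. The score values, the variable names s_3d and s_2d3d, and the direction of the divergence are assumptions chosen for illustration; they are not taken from the manuscript.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of raw class scores into a probability distribution."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two distributions over the same classes."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# Hypothetical per-point class scores from the two branches.
s_3d = np.array([2.0, 0.5, -1.0])
s_2d3d = np.array([1.5, 1.0, -0.5])

p_3d = softmax(s_3d)
p_2d3d = softmax(s_2d3d)
kl = kl_divergence(p_3d, p_2d3d)  # direction is an assumption
print(kl)
```

Applied to unnormalized scores, the same formula can yield negative or undefined values, which is why the normalization step matters.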
Author Response
Dear Reviewers,
We are pleased to resubmit the revised version of our paper entitled "GMAFNet: Gated Mechanism Adaptive Fusion Network for 3D Semantic Segmentation of LiDAR Point Clouds" and sincerely request your consideration for publication in Electronics. We highly appreciate the time you have dedicated to reviewing our paper, and all your comments have provided valuable guidance for our revisions. We have fully considered all the review comments, carefully revised the paper, and the revised content has been highlighted at the corresponding positions in the revised manuscript.
We have replaced all figures and tables in the revised paper with higher-resolution versions. In addition, regarding the visualization results of Figure 7, we have uploaded them to the supplementary materials for the reviewers and other readers to access.
Regarding your questions, please see the attachment.
Thank you again for pointing out the deficiencies in our discussion. These comments have significantly enhanced the rigor and persuasiveness of the paper!
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The paper looks much better.
Author Response
Dear Reviewers, Greetings! First and foremost, we would like to express our sincere gratitude to you for devoting your valuable time to the meticulous review of our manuscript entitled "GMAFNet: Gated Mechanism Adaptive Fusion Network for 3D Semantic Segmentation of LiDAR Point Clouds" and providing constructive comments and suggestions. Your professional insights have not only helped us identify omissions in our research but also played a pivotal guiding role in enhancing the quality of this paper. We have carefully revised and improved the manuscript in response to each of your comments. Thank you again for your guidance.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have improved the paper, but some minor issues persist:
- Line 154: "The term 'insufficient point cloud geometric representation' refers to the fact ..." is the answer to my comment from first review, but cannot be used directly in the text, because the term has not been used before.
- Line 228: The authors explain that the disparity is obtained from Eq. (1), but the equation defines the mapping from the point cloud to the plane, it does not explain the disparity computation. What do the authors mean by "disparity"?
- Figure 2: The authors' response 6 states that Figure 2 has been extended, but there is still no labeling of quantities in the figure. For instance, which arrows represent F_3D^S, F_3D^V, F_2D, F_3D or F_fusion?
- Equation 2 is wrong. Also, what do the authors mean by " for a point cloud Pi=(xi,yi,zi), the corresponding voxel block ..." - the equation describes mapping a single point to a single voxel?
- Line 305: The meaning of S_i-1 is not clear.
- In Figure 4, two different notations are used to describe dimensions, one with commas and the other with multiplication signs. Please unify.
- Figure 5: Response 11 states that "Annotations have been added to the figure", but only dimensions were added, not quantity notation (F_3D ...)
- The text in line 351 is not a sentence.
- What is r in Eq. 6?
- My previous comment 15 was mostly ignored, and nothing has changed in Figure 6. The element-wise multiply requires two inputs, the quantity notation should be added to the arrows. I also cannot find the changes mentioned in Response 15 in the text.
- In Table 1, the names of columns 2-4 do not correspond to the table content. The table caption is also not very good - why "introduction"?
- Why were the reported results for new methods (APPFNet and Bi et al.) not included in the Tables?
- It is difficult to find significant differences between the methods in Figure 7. What do the red frames denote in the images - they don't seem to enclose anything meaningful?
Author Response
Dear Reviewers, Greetings! First and foremost, we would like to express our sincere gratitude to you for devoting your valuable time to the meticulous review of our manuscript entitled "GMAFNet: Gated Mechanism Adaptive Fusion Network for 3D Semantic Segmentation of LiDAR Point Clouds" and providing constructive comments and suggestions. Your professional insights have not only helped us identify omissions in our research but also played a pivotal guiding role in enhancing the quality of this paper. We have carefully revised and improved the manuscript in response to each of your comments, and the detailed revisions are reported as follows:
Comments 1: - Line 154: "The term 'insufficient point cloud geometric representation' refers to the fact ..." is the answer to my comment from first review, but cannot be used directly in the text, because the term has not been used before.
Response 1: Thank you very much for your comment. The phrase "insufficient point cloud geometric representation" had indeed not been introduced earlier in the manuscript, and we have removed this expression (lines 148-158).
Comments 2: - Line 228: The authors explain that the disparity is obtained from Eq. (1), but the equation defines the mapping from the point cloud to the plane, it does not explain the disparity computation. What do the authors mean by "disparity"?
Response 2: The term "disparity" here refers to the depth or positional difference between the 2D image coordinates obtained through projection and their corresponding 3D points, which is used to establish the correspondence between image pixels and point cloud points. We have revised the explanation in the manuscript (lines 225-234).
Comments 3: - Figure 2: The authors' response 6 states that Figure 2 has been extended, but there is still no labeling of quantities in the figure. For instance, which arrows represent F_3D^S, F_3D^V, F_2D, F_3D or F_fusion?
Response 3: Thank you for pointing out this issue. We have added the corresponding annotation in the figure.
Comments 4: - Equation 2 is wrong. Also, what do the authors mean by "for a point cloud Pi=(xi,yi,zi), the corresponding voxel block ..." - the equation describes mapping a single point to a single voxel?
Response 4: Thank you for pointing out our mistake. The error in Equation (2) herein was a typo, and it has been corrected. Regarding your question: what is described here is the mapping from a single point to a single voxel.
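To make the single-point-to-single-voxel mapping in Response 4 concrete, here is a minimal sketch of one common formulation: the voxel index is the floor of the point's offset from the grid origin divided by the voxel size. The corrected Equation (2) is not reproduced in this record, so the formula, the function name point_to_voxel, and the grid parameters are assumptions.

```python
import numpy as np

def point_to_voxel(point, origin, voxel_size):
    """Map one point (x, y, z) to the integer index of the voxel containing it."""
    point = np.asarray(point, dtype=float)
    origin = np.asarray(origin, dtype=float)    # minimum corner of the voxel grid
    size = np.asarray(voxel_size, dtype=float)  # voxel edge lengths per axis
    return np.floor((point - origin) / size).astype(int)

# Hypothetical grid: origin at (0, -1, 0), voxels of 0.5 x 0.5 x 0.1.
idx = point_to_voxel([1.25, -0.4, 0.05],
                     origin=[0.0, -1.0, 0.0],
                     voxel_size=[0.5, 0.5, 0.1])
print(idx)
```

Several points can share the same index, which is how a voxel comes to aggregate the points that fall inside it.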
Comments 5: - Line 305: The meaning of S_i-1 is not clear.
Response 5: S_i-1 denotes the set of the first i-1 key points, and the corresponding explanation has been supplemented in the manuscript (lines 307-310).
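The definition in Response 5 is consistent with farthest point sampling, the keypoint selection used in PV-RCNN: the i-th key point is the point farthest from the already selected set S_i-1. Whether the manuscript uses exactly this rule is an assumption, and the function name and toy points below are hypothetical.

```python
import numpy as np

def farthest_point_sampling(points, n_keypoints, start=0):
    """Select keypoint indices so that each new one is farthest from those chosen so far."""
    points = np.asarray(points, dtype=float)
    selected = [start]  # S_1: the first keypoint, chosen arbitrarily
    # min_dist[j] = distance from point j to its nearest keypoint in the current set
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(1, n_keypoints):
        nxt = int(np.argmax(min_dist))  # farthest point from S_{i-1}
        selected.append(nxt)
        d = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)  # update distances for the enlarged set
    return selected

pts = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [10.0, 0.0, 0.0],
                [5.0, 0.0, 0.0]])
keypoints = farthest_point_sampling(pts, 3)
print(keypoints)
```

The sketch assumes n_keypoints does not exceed the number of input points.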
Comments 6: - In Figure 4, two different notations are used to describe dimensions, one with commas and the other with multiplication signs. Please unify.
Response 6: Thank you for pointing out this issue. We have added the corresponding annotation in the figure.
Comments 7: - Figure 5: Response 11 states that "Annotations have been added to the figure", but only dimensions were added, not quantity notation (F_3D ...)
Response 7: Thank you for pointing out this issue. We have added the corresponding annotation in the figure.
Comments 8: - The text in line 351 is not a sentence.
Response 8: We have revised and replaced the corresponding sentences in the manuscript (lines 357-360).
Comments 9: - What is r in Eq. 6?
Response 9: Thank you for your reminder. The symbol r denotes the distance threshold, and the corresponding explanation has been supplemented in the manuscript (lines 357-360).
Comments 10: - My previous comment 15 was mostly ignored, and nothing has changed in Figure 6. The element-wise multiply requires two inputs, the quantity notation should be added to the arrows. I also cannot find the changes mentioned in Response 15 in the text.
Response 10: Thank you for your valuable comments. We have adopted every suggestion proposed by the reviewer. We apologize for missing some of them in the previous revision due to an oversight. We have redrawn Figure 6 and added the corresponding annotations to it. Additionally, we have reorganized Section 3.4 to supplement the explanation of PV-RCNN.
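The point about the element-wise multiply needing two inputs can be sketched as follows: a gate computed with a sigmoid is one operand, and the feature tensor it modulates is the other. This is a generic gated-fusion sketch under stated assumptions, not the gating module of GMAFNet; the function gated_fusion, the residual form, and all shapes and weights are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f_2d, f_3d, w, b):
    """Fuse N x C image and point features with a learned gate (generic form, assumed)."""
    # Operand 1 of the element-wise multiply: the gate, with values in (0, 1).
    gate = sigmoid(np.concatenate([f_2d, f_3d], axis=-1) @ w + b)  # N x C
    # Operand 2: the 2D features being modulated; the result is added to the 3D features.
    return f_3d + gate * f_2d

rng = np.random.default_rng(0)
f_2d = rng.normal(size=(4, 8))  # hypothetical per-point image features
f_3d = rng.normal(size=(4, 8))  # hypothetical per-point LiDAR features
w = np.zeros((16, 8))           # zero weights give a gate of exactly 0.5
b = np.zeros(8)
fused = gated_fusion(f_2d, f_3d, w, b)
print(fused.shape)
```

In a diagram of such a module, both arrows entering the multiply node would be labeled, one with the gate and one with the feature it scales.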
Comments 11: - In Table 1, the names of columns 2-4 do not correspond to the table content. The table caption is also not very good - why "introduction"?
Response 11: It was an oversight on my part. The title has been revised to "Dataset Statistics Used in Experiments".
Comments 12: - Why were the reported results for new methods (APPFNet and Bi et al.) not included in the Tables?
Response 12: Thank you for your reminder. We have added the experimental results of APPFNet in Table 2. Due to the time constraints for revising the manuscript, we regret that we are unable to supplement the experimental results of Bi et al. within the deadline.
Comments 13: - It is difficult to find significant differences between the methods in Figure 7. What do the red frames denote in the images - they don't seem to enclose anything meaningful?
Response 13: The red boxes in the images indicate incorrect regions, where the model fails to recognize objects accurately. This issue is mainly reflected in the relatively large errors in the edge areas of objects.
We would like to express our sincere gratitude to you again for taking the time to review our manuscript.
