MFSM-Net: Multimodal Feature Fusion for the Semantic Segmentation of Urban-Scale Textured 3D Meshes
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This article presents a clear and well-structured framework. The authors propose a three-stage process for the semantic segmentation of 3D mesh models, comprising 2D texture feature extraction, 3D feature extraction, and a bridge-view-based feature alignment method (an illustrative pipeline sketch follows the suggestions below). This approach significantly enhances the accuracy of semantic segmentation. Notably, the bridge-view-based feature alignment method introduces an innovative technique that ensures the deep network learns sufficient features. To further enhance the completeness of the article, the following suggestions are offered:
1. It is recommended to split a distinct "Discussion" section out of the current "Conclusions" section, since Sections 4.1, 4.2, and 4.3 primarily discuss the reasons behind the obtained results. The "Conclusions" section itself would then benefit from more in-depth reflections on potential extensions of this work, such as whether additional 2D feature inputs from other viewpoints are necessary.
2. The authors are encouraged to include an analysis of the reasons for any unsuccessful results, which would help identify areas for improvement.
3. Consider adding an ablation study to validate the importance of each of the three proposed steps. This could be integrated with Section 3.2 for a more comprehensive discussion.
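For orientation, here is a minimal sketch of how such a three-stage pipeline could be wired together, assuming a PyTorch-style interface. All names here (gather_face_features, ThreeStageSketch, the backbone arguments) are hypothetical placeholders, not the authors' implementation:

```python
import torch
import torch.nn as nn

def gather_face_features(feat_2d, pixel_to_face, num_faces):
    """Average-pool 2D features over all pixels rendered from each face.
    pixel_to_face holds one face index per pixel (-1 = background)."""
    b, c, h, w = feat_2d.shape
    flat_feat = feat_2d.permute(0, 2, 3, 1).reshape(-1, c)       # (B*H*W, C)
    flat_idx = pixel_to_face.reshape(-1)                         # (B*H*W,)
    valid = flat_idx >= 0
    sums = torch.zeros(num_faces, c).index_add_(
        0, flat_idx[valid], flat_feat[valid])
    counts = torch.zeros(num_faces).index_add_(
        0, flat_idx[valid], torch.ones_like(flat_idx[valid], dtype=feat_2d.dtype))
    return sums / counts.clamp_min(1).unsqueeze(-1)              # (F, C)

class ThreeStageSketch(nn.Module):
    """Outline only: (1) 2D texture feature extraction, (2) 3D geometric
    feature extraction, (3) a bridge-view step that aligns and fuses
    both modalities per mesh face."""

    def __init__(self, backbone_2d, backbone_3d, fused_dim, num_classes):
        super().__init__()
        self.backbone_2d = backbone_2d   # e.g. a CNN over texture images
        self.backbone_3d = backbone_3d   # e.g. a point network over face geometry
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, texture_images, geometry, pixel_to_face, num_faces):
        feat_2d = self.backbone_2d(texture_images)               # (B, C2, H, W)
        feat_3d = self.backbone_3d(geometry)                     # (F, C3)
        face_2d = gather_face_features(feat_2d, pixel_to_face, num_faces)
        fused = torch.cat([face_2d, feat_3d], dim=-1)            # (F, C2+C3)
        return self.classifier(fused)                            # per-face logits
```

The sketch also makes suggestion 3 concrete: each of the three stages is an isolable component that an ablation study could disable or replace independently.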
Author Response
We sincerely thank Reviewer 1 for the valuable feedback. We have carefully revised the manuscript accordingly, and all changes are marked in the attached annotated version.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
In this manuscript, the authors propose multimodal feature extraction for semantic segmentation, based on 3D geometric structures and 2D high-resolution texture images. Even though valuable results based on a well-designed framework are provided, some questions need to be considered before publication:
- What does the acronym MFSM stand for? Please define it at first use.
- In Table 2, KPConv [42] is the most competitive of the conventional approaches against the proposed model. Please add a sufficient discussion of [42] to the related work in the Introduction. Also, Table 2 compares the OA of the proposed method and KPConv; however, the OA values of 94.0 and 93.3 are not very different. Please additionally explain these similar OAs (a worked illustration of OA versus per-class scores follows this review).
- Regarding editorial updates, the authors should revise the manuscript more carefully with respect to abbreviations, spacing, and the label alignment style for figures.
Careful editorial updates are required.
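On the reviewer's point about the close OA values, a short illustration of why overall accuracy alone can understate differences between segmentation methods. The confusion matrices below are made up purely for illustration and have no connection to the paper's results:

```python
import numpy as np

def overall_accuracy(cm):
    """OA = correctly classified samples / all samples."""
    return np.trace(cm) / cm.sum()

def per_class_iou(cm):
    """IoU_c = TP_c / (TP_c + FP_c + FN_c) for each class c."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    return tp / (tp + fp + fn)

# Toy confusion matrices (rows = ground truth, cols = prediction).
# Both methods reach a high OA, but method B is far weaker on the
# rare class 2, which only per-class IoU / mIoU reveals.
cm_a = np.array([[900,  10,  5],
                 [ 10, 880, 10],
                 [  5,  10, 70]])
cm_b = np.array([[910,   5,  0],
                 [  8, 885,  7],
                 [ 40,  25, 20]])
for name, cm in [("A", cm_a), ("B", cm_b)]:
    print(name, f"OA={overall_accuracy(cm):.3f}",
          "IoU=", np.round(per_class_iou(cm), 3),
          f"mIoU={per_class_iou(cm).mean():.3f}")
```

In these toy numbers, method A's IoU on the rare class is 0.700 against method B's 0.217, even though the OAs (0.974 vs. 0.955) look comparable. Reporting mIoU or per-class IoU alongside OA, or variance over repeated runs, would address the reviewer's concern about the similar OAs.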
Author Response
We sincerely thank Reviewer 2 for the valuable feedback. We have carefully revised the manuscript accordingly, and all changes are marked in the attached annotated version.
We have carefully revised the Methods section by reorganizing the descriptions for better clarity and coherence. In addition, the entire section has been reviewed and refined by a native English-speaking professor to improve the overall language quality.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The paper proposes a multimodal feature extraction network, based on 3D geometric structures and 2D high-resolution texture images, for the semantic segmentation of textured 3D meshes. The experiments demonstrate the efficiency of the proposed method.
The paper demonstrates the superior quality of multimodal methods, which add redundant techniques and data. Even though the approach is more laborious and expensive, the classification quality is very good.
Reduce the caption text for Fig. 1 and Fig. 5, and move the explanations into the body text of the paper.
What are the four lines in Fig. 16? Please explain them in more detail.
Fig. 2 and Fig. 16 do not explicitly specify what parts (a), (b), (c), and (d) of the figures show.
How do you assign the index to each mesh face? I think that, in Fig. 8, a single example of converting the numerical index into R, G, and B components is enough.
In Section 2.4, it is unclear how the correspondence between pixels and mesh faces is established. Is the assignment manual or automatic? (A sketch of the usual index-to-RGB encoding follows below.)
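For both questions above, the common rendering trick is fully automatic: each face index is packed into a unique 24-bit RGB color, the mesh is rasterized once with flat (non-interpolated, non-anti-aliased) shading, and each rendered pixel is decoded back to a face index, yielding the pixel-to-face correspondence. A minimal sketch of that encoding, assuming fewer than 2^24 faces; this is a generic illustration, not the authors' exact code:

```python
import numpy as np

def index_to_rgb(idx):
    """Pack face indices (0 .. 2**24 - 1) into 8-bit R, G, B channels."""
    r = (idx >> 16) & 0xFF
    g = (idx >> 8) & 0xFF
    b = idx & 0xFF
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def rgb_to_index(rgb):
    """Inverse mapping: decode a rendered color image back to face indices.
    Requires exact colors, so anti-aliasing must be disabled when rendering."""
    rgb = rgb.astype(np.int64)
    return (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]

# Round-trip check for a few indices, including the 24-bit extremes.
idx = np.array([0, 1, 65535, 16777215])
assert np.array_equal(rgb_to_index(index_to_rgb(idx)), idx)
```

Because the mapping is a deterministic bit-packing, a single worked example in Fig. 8 would indeed suffice, which supports the reviewer's suggestion.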
Editing mistakes:
- "efficiency and accuracy"
- "coefficientδ" (missing space before the symbol)
- "as shown in Fig. ??." (unresolved cross-reference)
- (1) and (2) are actually expressions, not equations
- be consistent in using the term "expression" or "formula" for (1)-(4)
- enlarge the text in Figures 1, 3, and 5
Author Response
We sincerely thank Reviewer 3 for the valuable feedback. We have carefully revised the manuscript accordingly, and all changes are marked in the attached annotated version.
Author Response File: Author Response.pdf