Article
Peer-Review Record

Precise Feature Removal Method Based on Semantic and Geometric Dual Masks in Dynamic SLAM

Appl. Sci. 2025, 15(13), 7095; https://doi.org/10.3390/app15137095
by Zhanrong Li 1,2,3,4, Chao Jiang 1,*, Yu Sun 1, Haosheng Su 1 and Longning He 1
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 15 May 2025 / Revised: 13 June 2025 / Accepted: 19 June 2025 / Published: 24 June 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This study proposes a robust dual-mask filtering strategy for visual SLAM that integrates semantic segmentation information with geometric outlier detection.
Compared to traditional methods, this dual-mask collaborative filtering strategy improves the accuracy of dynamic feature removal and enhances the reliability of dynamic object detection, validating its robustness and applicability in complex dynamic environments.

Structure of this paper:
Section 2: reviews the research progress in visual SLAM, focuses on existing dynamic object handling techniques, and analyzes the strengths and weaknesses of these methods and the challenges they face.
Section 3: introduces the design and implementation of the proposed system, including its core ideas, system architecture, and key technologies.
Section 4: evaluates the performance of the proposed method on the KITTI and TUM datasets and compares it with existing methods to demonstrate the system's performance improvements.
Section 5: concludes the main work of the paper, discusses the limitations of the research, and outlines potential future research directions and application prospects.

Processing steps: Easy to understand
1. Image Input: Receive original video frames as the starting point for SLAM processing. 
2. Feature Point Extraction: Use ORB-SLAM3 to extract feature points from each frame. 
3. Semantic Segmentation: Utilize YOLOv11 to perform semantic segmentation, creating masks for potential dynamic objects.
4. Mask Processing: Perform morphological dilation on the generated masks to enhance the capture precision of dynamic object boundaries.
5. Dynamic Feature Point Filtering: Combine a dual-mask strategy with geometric information to effectively filter out dynamic feature points.
6. Pose Optimization: Improve camera localization accuracy through PnP optimization and dynamically update the feature point set to support subsequent frame processing.
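
For concreteness, the six steps above can be read as one per-frame routine. The sketch below is a minimal illustration of such a pipeline, not the authors' implementation; the Ultralytics weight file name, the set of dynamic class IDs, and the dilation kernel size are assumptions made for the example.

```python
# Illustrative per-frame pipeline (assumptions noted above; not the authors' code).
import cv2
import numpy as np
from ultralytics import YOLO

seg_model = YOLO("yolo11n-seg.pt")        # assumed segmentation variant of YOLOv11
orb = cv2.ORB_create(nfeatures=2000)      # ORB features, as used by ORB-SLAM3

def process_frame(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Step 2: feature point extraction
    keypoints, descriptors = orb.detectAndCompute(gray, None)

    # Step 3: semantic segmentation -> union mask of potentially dynamic classes
    result = seg_model(frame, verbose=False)[0]
    mask = np.zeros(gray.shape, dtype=np.uint8)
    if result.masks is not None:
        for m, cls in zip(result.masks.data.cpu().numpy(), result.boxes.cls.cpu().numpy()):
            if int(cls) in {0, 1, 2, 3, 5, 7}:  # assumed COCO ids: person, vehicles, ...
                m = cv2.resize(m.astype(np.uint8), (gray.shape[1], gray.shape[0]),
                               interpolation=cv2.INTER_NEAREST)
                mask |= m

    # Step 4: morphological dilation to cover object boundaries more generously
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
    mask = cv2.dilate(mask, kernel)

    # Step 5: dual-mask filtering -- keep only points outside the dilated dynamic mask
    # (the geometric side, reusing previous-frame outliers, is sketched further below)
    static_kps = [kp for kp in keypoints if mask[int(kp.pt[1]), int(kp.pt[0])] == 0]

    # Step 6: the remaining static correspondences would feed cv2.solvePnPRansac /
    # bundle adjustment for pose optimization (2D-3D matching omitted here).
    return static_kps, mask
```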

Feature of this approach: Clearly explained
This approach re-utilizes feature points identified as outliers in the previous frame as clues for potential dynamic regions, combining them with instance masks provided by semantic segmentation to spatially constrain dynamic objects. 
Once the number of outliers detected in the current frame that fall within the semantic dynamic mask region exceeds a set threshold, these points are considered dynamic features, thus realizing dynamic detection driven by both semantic and geometric cues. 
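
A compact sketch of this decision rule follows. It assumes the geometric outliers of the previous frame are available as pixel coordinates and that the semantic mask of one candidate object is given; the count threshold of 10 is a hypothetical value, not taken from the paper.

```python
import numpy as np

def is_dynamic(prev_outliers: np.ndarray, instance_mask: np.ndarray, threshold: int = 10) -> bool:
    """prev_outliers: (N, 2) pixel coords of last frame's outliers; instance_mask: uint8 mask."""
    if len(prev_outliers) == 0:
        return False
    xs = prev_outliers[:, 0].astype(int)
    ys = prev_outliers[:, 1].astype(int)
    inside = instance_mask[ys, xs] > 0        # outliers that fall within the semantic mask
    return int(inside.sum()) >= threshold     # enough geometric evidence -> dynamic object
```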

(1) There is no problem with the structure of the text and logical development. The structure is very easy to understand.
(2) I would like to see example images from the public KITTI and TUM datasets, together with figures showing the proposed method running on them.
(3) It would be better to compare the real-time performance with other methods.

Author Response

We sincerely appreciate your positive feedback on the logical structure and readability of the manuscript. This confirmation motivates us to further enhance the clarity of experimental results and comparative analyses, as addressed in the responses to the following comments.

Comment 1:

I would like to see example images from the public KITTI and TUM datasets, together with figures showing the proposed method running on them.

Response 1:
We appreciate the reviewer’s valuable suggestion. In response, we have supplemented the manuscript with additional visual and quantitative results to enhance clarity and completeness.

  • For the KITTI dataset (Section 4.1):
    • The original Table 1 has been split into Table 1 and Table 2 to provide more detailed and organized evaluation data.
    • We have added a new table (Table 4) that presents the percentage improvement for each data sequence, highlighting the performance gains of the proposed method.
    • Additional quantitative metrics and their corresponding standard deviations have been included, and discussed in lines 349–360 and 377–381.
  • For the TUM dataset (Section 4.2):
    • Similar improvements have been applied as in the KITTI section.
    • We have included newly added indicator plots (Figure 3 and Figure 5) and trajectory plots (Figure 4 and Figure 6) to visualize the performance across different sequences (e.g., sitting_xyz, walking_rpy).
    • These visual and numerical analyses are discussed in lines 410–419, 426–433, and 439–447.
      We believe these additions provide a clearer illustration of the method’s execution and effectiveness on both public datasets.

Comment 2:

It would be better to compare the real-time performance with other methods.

Response 2:
Thank you for the suggestion. To address this, we have added a new section titled “Section 5: Time Evaluation” in the revised manuscript (lines 468-500).

  • This section specifically evaluates the real-time performance of our proposed method compared to other dynamic SLAM systems, including DS-SLAM and DynaSLAM.
  • The content includes a description of the experimental setup, performance metrics, and comparative results.
  • This enhancement offers a holistic analysis of the balance between real-time performance and computational overhead, thereby reinforcing the practical viability of our approach in real-world applications.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors should address the following weaknesses in the manuscript:

  1. The abstract requires revision regarding the research results. It is necessary to specify the obtained quantitative indicators demonstrating the advantages of the proposed approach, particularly in comparison with other methods or sensors.
  2. The introduction should more clearly highlight the advantages of using SLAM technology compared to alternative sensor solutions, emphasizing its effectiveness within the relevant application context. This will substantiate the necessity of employing this technology to address the problem under consideration.
  3. Presenting the main contributions in the introduction is important and necessary. However, these contributions need to be strengthened by specifying exactly how stability was improved and by how much — at least within general numerical ranges. Additionally, the main contribution section contains many general statements. It is advisable to provide specific numerical improvements, for example, reductions in ATE expressed in percentages or units.
  4. The Related Work section lacks critical analysis and comparison with existing approaches to delineate the scientific novelty of the proposed method clearly. Quantitative metrics such as ATE, RPE, and FPS are missing, which would allow an objective assessment of the effectiveness of the described solutions. The transitions between subsections appear fragmented — it is advisable to add a summarizing analytical comment on the rationale for combining geometric and semantic approaches, as implemented in the proposed work. Furthermore, a brief focus on application areas where dynamic scenes are typical is recommended to justify the relevance of the study. All the above-mentioned descriptions can be effectively presented in a comparison table.
  5. Formula (1) lacks an explanation regarding the choice of the parameter ε. The description of YOLOv11 in the text is inaccurate, as it is portrayed as a pixel-level segmentation model, whereas the standard version only performs bounding box detection and not full instance segmentation.
  6. Despite the comparative testing conducted on two well-known datasets KITTI and TUM RGB-D, the results require expanded analysis of the statistical significance of improvements, since currently only mean values are presented without corresponding standard deviations, confidence intervals, or hypothesis tests. Moreover, it would be useful to supplement the results with trajectory visualizations and ATE/RPE plots for better interpretation of differences between the systems under dynamic conditions.
  7. The results section lacks sufficient analysis of the critical aspect of decision-making time, which is crucial for real-time localization systems. It is recommended that future research evaluate the proposed approach’s performance in terms of processing delay, which would provide a more complete assessment of the method’s suitability for practical application in robotic systems.

 

Comments for author File: Comments.pdf

Author Response

We would like to express our sincere gratitude to you for the thorough and valuable comments. These suggestions have been instrumental in refining our work and enhancing its overall quality. Below, we provide a point-by-point response to each of the concerns raised.

Comment 1:

The abstract requires revision regarding the research results. It is necessary to specify the obtained quantitative indicators demonstrating the advantages of the proposed approach, particularly in comparison with other methods or sensors.

Response 1:
In the revised abstract, we have explicitly included the quantitative results achieved by the proposed method. Specifically, we now present the improvements in ATE and RPE on the KITTI and TUM datasets to highlight its effectiveness compared to baseline methods.

Although the comparison with other types of sensors is not the primary focus of this work, we acknowledge the importance of this aspect. Accordingly, we have addressed this point in the Introduction section to better reflect the advantages of the proposed visual SLAM system in contrast with alternative sensor-based approaches.

Comment 2:

The introduction should more clearly highlight the advantages of using SLAM technology compared to alternative sensor solutions, emphasizing its effectiveness within the relevant application context. This will substantiate the necessity of employing this technology to address the problem under consideration.

Response 2:
We appreciate your valuable feedback. To address this, we have enriched the Introduction section with a detailed comparison between visual SLAM and other common sensor technologies, such as LiDAR, millimeter-wave radar, and satellite navigation systems.

The added content outlines the key advantages of visual SLAM from three main perspectives:

  1. Cost-effectiveness – Visual sensors are generally more affordable and widely available.

  2. Environmental adaptability – Cameras can capture rich texture and lighting information in various environments, including indoor and GPS-denied areas.

  3. Integration with vision-based tasks – Visual SLAM can be seamlessly integrated with object detection, scene understanding, and other high-level computer vision functions.

These aspects support the necessity and suitability of visual SLAM in real-world applications such as autonomous mobile robots and augmented reality systems.
The revised content can be found in lines 26–39 of the updated manuscript.

Comment 3:

Presenting the main contributions in the introduction is important and necessary. However, these contributions need to be strengthened by specifying exactly how stability was improved and by how much — at least within general numerical ranges. Additionally, the main contribution section contains many general statements. It is advisable to provide specific numerical improvements, for example, reductions in ATE expressed in percentages or units.

Response 3:
The stability of the system is evaluated using the standard deviation of error metrics. To reflect this, we have incorporated standard deviation values into the tables and figures presented in the experimental section and provided corresponding analyses (KITTI: lines 349–360, 377–381; TUM: lines 410–439, 426–433). In addition, the Main Contributions section has been revised to include specific numerical improvements, explicitly reporting the percentage reductions in Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) (lines 78-84).
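
For reference, the statistic referred to here (the standard deviation of the trajectory error) can be computed as in the short sketch below; this is a generic evaluation snippet, not code from the manuscript, and it assumes the estimated and ground-truth trajectories are already associated and aligned.

```python
import numpy as np

def ate_stats(est_xyz: np.ndarray, gt_xyz: np.ndarray):
    """est_xyz, gt_xyz: (N, 3) aligned camera positions; returns (RMSE, mean, std) of ATE."""
    errors = np.linalg.norm(est_xyz - gt_xyz, axis=1)    # per-frame translational error
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    return rmse, float(errors.mean()), float(errors.std())
```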

Comment 4:

 The Related Work section lacks critical analysis and comparison with existing approaches to delineate the scientific novelty of the proposed method clearly. Quantitative metrics such as ATE, RPE, and FPS are missing, which would allow an objective assessment of the effectiveness of the described solutions. The transitions between subsections appear fragmented — it is advisable to add a summarizing analytical comment on the rationale for combining geometric and semantic approaches, as implemented in the proposed work. Furthermore, a brief focus on application areas where dynamic scenes are typical is recommended to justify the relevance of the study. All the above-mentioned descriptions can be effectively presented in a comparison table.

Response 4:
To clarify the motivation and rationale behind combining geometric and semantic approaches, we have added a concise analytical summary in the Related Work section. This summary highlights the complementary strengths of both methods and explains how their integration supports the robustness and accuracy of our proposed system. The relevant additions can be found in lines 121–127 and 144–155 of the revised manuscript.

Comment 5:

Formula (1) lacks an explanation regarding the choice of the parameter ε. The description of YOLOv11 in the text is inaccurate, as it is portrayed as a pixel-level segmentation model, whereas the standard version only performs bounding box detection and not full instance segmentation.

Response 5:

  • Regarding the choice of the parameter ε in Equation (1):
    We have added the following explanation to the manuscript (lines 256–260):
    “The parameter ε is empirically set to 50 pixels, a value determined through extensive comparative experiments on the KITTI and TUM RGB-D datasets. This setting achieves a balance between dynamic and static regions: a larger value of ε tends to result in the direct removal of entire dynamic objects, whereas a smaller value tends to eliminate only the dynamic points themselves.”
    In the future, this parameter could be further optimized using adaptive algorithms to enhance the method's generalization across diverse environments (one possible reading of ε is sketched in code after this list).

  • Regarding the inaccurate description of YOLOv11 as a pixel-level segmentation model:
    Thank you for your observation. Earlier versions of YOLO (v1–v3) were purely detection-based; segmentation functionality was introduced around YOLOv5–v7 (2022) and further improved in YOLOv8 (2023) and YOLOv11 (2024). While the standard detection models (e.g., yolov11.pt) only perform bounding box detection, the -seg variants (e.g., yolov11n-seg.pt) are explicitly designed for instance segmentation with pixel-level masks.
    Reference: YOLOv11 Improvements – Ultralytics Documentation
    In our current work, we use the semantic segmentation capability of YOLOv11. We appreciate your suggestion, and in future iterations, we plan to adopt the instance segmentation model to better distinguish between different object instances and further improve system performance.
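
As referenced in the first bullet above, the sketch below shows one possible reading of ε, namely as a pixel-distance margin around the semantic dynamic mask; the interpretation, the helper function, and its inputs are illustrative assumptions, since Formula (1) itself is not reproduced in this review record.

```python
import cv2
import numpy as np

def near_dynamic_mask(points: np.ndarray, mask: np.ndarray, eps: float = 50.0) -> np.ndarray:
    """points: (N, 2) pixel coords; mask: uint8 semantic mask (nonzero = dynamic region)."""
    inverted = (mask == 0).astype(np.uint8)                  # mask pixels become zero
    dist = cv2.distanceTransform(inverted, cv2.DIST_L2, 5)   # distance to nearest mask pixel
    xs = points[:, 0].astype(int)
    ys = points[:, 1].astype(int)
    return dist[ys, xs] <= eps                               # True where a point lies within ε
```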

Comment 6:

Despite the comparative testing conducted on two well-known datasets KITTI and TUM RGB-D, the results require expanded analysis of the statistical significance of improvements, since currently only mean values are presented without corresponding standard deviations, confidence intervals, or hypothesis tests. Moreover, it would be useful to supplement the results with trajectory visualizations and ATE/RPE plots for better interpretation of differences between the systems under dynamic conditions.

Response 6:
We appreciate the reviewer’s valuable suggestion. In response, we have supplemented the manuscript with additional visual and quantitative results to enhance clarity and completeness.

  • For the KITTI dataset (Section 4.1):

    • The original Table 1 has been split into Table 1 and Table 2 to provide more detailed and organized evaluation data.

    • We have added a new table (Table 4) that presents the percentage improvement for each data sequence, highlighting the performance gains of the proposed method.

    • Additional quantitative metrics and their corresponding standard deviations have been included, and discussed in lines 349–360 and 377–381.

  • For the TUM dataset (Section 4.2):

    • Similar improvements have been applied as in the KITTI section.

    • We have included newly added indicator plots (Figure 3 and Figure 5) and trajectory plots (Figure 4 and Figure 6) to visualize the performance across different sequences (e.g., sitting_xyz, walking_rpy).

    • These visual and numerical analyses are discussed in lines 410–419, 426–433, and 439–447.
      We believe these additions provide a clearer illustration of the method’s execution and effectiveness on both public datasets.

Comment 7:

The results section lacks sufficient analysis of the critical aspect of decision-making time, which is crucial for real-time localization systems. It is recommended that future research evaluate the proposed approach’s performance in terms of processing delay, which would provide a more complete assessment of the method’s suitability for practical application in robotic systems.

Response 7:
Thank you for the suggestion. To address this, we have added a new section titled “Section 5: Time Evaluation” in the revised manuscript (lines 468-500).

  • This section specifically evaluates the real-time performance of our proposed method compared to other dynamic SLAM systems, including DS-SLAM and DynaSLAM.

  • The content includes a description of the experimental setup, performance metrics, and comparative results.

  • This enhancement offers a holistic analysis of the balance between real-time performance and computational overhead, thereby reinforcing the practical viability of our approach in real-world applications.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The authors propose a simple approach to enhance the feature point selection capability in dynamic environments. The objective is to reduce the interference of dynamic objects on the simultaneous localization and mapping system, ensuring stable operation in complex and dynamic environments. Experimental results conducted on two benchmark datasets demonstrate that the proposed method improves global trajectory accuracy for various dynamic scenarios.

The paper is of some interest and fits the scope of the journal. Major revision is recommended considering the following comments.

 

  1. From the results presented in Tables 1–3 we may conclude that the proposed approach outperforms some other techniques in terms of identification quality. However, the cost of these improvements is not clear. Please provide CPU time indicators for your approach compared with other techniques.
  2. For the optimization problem (4) please clearly state the constraints for the variables. Which techniques are used to solve the problem (4)?
  3. There are too many abbreviations in the paper which complicates reading and understanding the paper. Some of them (e.g. Intensity Assisted Iterative Closest Point (IAICP), SegNet), are not used any further. Please reduce the number of abbreviations and leave only those abbreviations that are necessary and frequently used.
  4. Lines 51,52. “With geometric consistency constraints, I will validate and optimize the filtered feature points….” Please rewrite.
  5. In Conclusion section please provide a critical analysis of the advantages and limitations of the proposed approach compared with other techniques.

Author Response

We appreciate the reviewer’s detailed evaluation and critical insights. The identified issues—such as computational cost, formulation clarity, abbreviation usage, and conclusion analysis—are well taken. We have thoroughly revised the manuscript to address each point and believe these changes have significantly strengthened the quality and clarity of our work.

Comment 1:

From the results presented in Tables 1–3 we may conclude that the proposed approach outperforms some other techniques in terms of identification quality. However, the cost of these improvements is not clear. Please provide CPU time indicators for your approach compared with other techniques.

Response 1:
Thank you for the suggestion. To address this, we have added a new section titled “Section 5: Time Evaluation” in the revised manuscript (lines 468-500).

  • This section specifically evaluates the real-time performance of our proposed method compared to other dynamic SLAM systems, including DS-SLAM and DynaSLAM.

  • The content includes a description of the experimental setup, performance metrics, and comparative results.

  • This enhancement offers a holistic analysis of the balance between real-time performance and computational overhead, thereby reinforcing the practical viability of our approach in real-world applications.

Comment 2:

For the optimization problem (4) please clearly state the constraints for the variables. Which techniques are used to solve the problem (4)?

Response 2:
The requested revisions have been made, and the updated explanation can be found in the manuscript at lines 281–291.

Comment 3:

There are too many abbreviations in the paper which complicates reading and understanding the paper. Some of them (e.g. Intensity Assisted Iterative Closest Point (IAICP), SegNet), are not used any further. Please reduce the number of abbreviations and leave only those abbreviations that are necessary and frequently used.

Response 3:
We have made the following revisions regarding the use of abbreviations:

  • Removed unnecessary and unused abbreviations:

    • Line 98: Extended Kalman Filtering (EKF) — abbreviation removed.

    • Line 102: Bundle Adjustment (BA) — abbreviation removed.

    • Line 104: a multi-map system (ATLAS) — abbreviation removed.

    • Line 117: Intensity Assisted Iterative Closest Point (IAICP) — abbreviation removed.

  • Revised unused abbreviations:

    • Line 132: “SegNet” was replaced with “a segmentation network” to improve clarity.

  • Reduced redundancy in repeated abbreviations:

    • Repeated mentions of absolute trajectory error (ATE) and relative pose error (RPE) have been trimmed for conciseness.
      We have retained widely known and commonly referenced model or SLAM system names, such as Mask R-CNN, MonoSLAM, PTAM, DynaSLAM, DS-SLAM, DSO, and LSD-SLAM, as they are standard terminology in the field.

Comment 4:

Lines 51,52. “With geometric consistency constraints, I will validate and optimize the filtered feature points….” Please rewrite.

Response 4:
“With geometric consistency constraints, I will validate and optimize the filtered feature points, enhance the system's ability to judge dynamic areas, and support the continuous update of dynamic regions to adapt to ever-changing environments.”
has been changed to
“Geometric consistency constraints are applied to the filtered feature points for validation (via reprojection error checks) and optimization (via bundle adjustment). Notably, the information of outlier feature points is preserved to assist in the ongoing judgment of dynamic regions under ever-changing environments.” (lines 60–64)
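
A minimal sketch of the reprojection-error validation mentioned in the revised sentence is given below; the function, its 2-pixel threshold, and its inputs are illustrative assumptions rather than the authors' implementation.

```python
import cv2
import numpy as np

def reprojection_inliers(points_3d, points_2d, rvec, tvec, K, dist_coeffs, thresh_px=2.0):
    """points_3d: (N, 3) map points; points_2d: (N, 2) observed pixels; rvec/tvec: camera pose."""
    projected, _ = cv2.projectPoints(points_3d, rvec, tvec, K, dist_coeffs)
    err = np.linalg.norm(projected.reshape(-1, 2) - points_2d, axis=1)
    return err < thresh_px    # Boolean inlier mask; the outliers feed the dynamic-region update
```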

Comment 5:

In Conclusion section please provide a critical analysis of the advantages and limitations of the proposed approach compared with other techniques.

Response 5:
We have added a critical analysis in the Conclusion section (lines 526–536). Compared with classical dynamic SLAM methods that introduce extra geometric constraints before feature matching—such as DS-SLAM using optical flow for lightweight checks and DynaSLAM employing depth-based multi-view geometry—our method integrates dynamic point detection directly into the existing optimization process. This avoids additional computational overhead while ensuring reliable outlier filtering, resulting in improved overall efficiency and accuracy.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Accept in present form

Reviewer 3 Report

Comments and Suggestions for Authors

The authors revised the paper significantly considering the referees' comments. I think in its current form the paper can be recommended for publication.
