Review Reports - RGB-Based Staircase Detection for Quadrupedal Robots: Implementation and Analysis

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1. The background section is descriptive rather than analytical — it fails to build a clear scientific argument about the research gap.

2. The claim that their dataset is “unique” is not supported by comparisons with public datasets such as StairNet or the RGB-D Stair Dataset. Even if the dataset is proprietary, the authors must describe how environmental variables such as lighting conditions, stair inclination, and surface texture were controlled.

3. The YOLO references are too general; there is no specific justification for choosing YOLOv11 over alternatives such as YOLOv5, YOLOX, or SSD. The rationale for selecting YOLOv11 should be clearly explained beyond it being the latest version.

4. To strengthen the contribution, perform baseline comparisons with traditional methods (e.g., HOG or simple CNNs) and include an ablation study — particularly evaluating performance without preprocessing, without NMS, and without multi-view input.

5. Provide and explain in detail a visualization of the YOLOv11 architecture or any modifications made, especially since it is described as the “proposed method.”

6. The conclusion is overly optimistic compared to the actual results (notably the low mAP on key subsets) — this should be revised.

7. Add a discussion on the model’s generalization limits and plans for improvement, such as multi-domain fine-tuning.

Author Response

Author's response, paper "YOLO-Based Staircase Detection for Quadrupedal Robot" Sensors (ISSN 1424-8220)

We thank the reviewer for the constructive and detailed comments. Their feedback has been invaluable in enhancing the overall quality of our article, particularly by clarifying the methodology and providing a more precise description and justification of our dataset.

Comment 1: The background section is descriptive rather than analytical — it fails to build a clear scientific argument about the research gap.

Response 1. We have updated the Introduction section (lines 78–89) to build a clearer scientific argument about the research gap. The updated text now emphasizes the limitations of previous studies, identifies the specific problem our work addresses, and articulates how our approach contributes to filling this gap.

Comment 2: The claim that their dataset is “unique” is not supported by comparisons with public datasets such as StairNet or the RGB-D Stair Dataset. Even if the dataset is proprietary, the authors must describe how environmental variables such as lighting conditions, stair inclination, and surface texture were controlled.

Response 2. To support our claim of dataset uniqueness, we have added comparisons with the StairNet and RGB-D Stair Dataset in the Dataset section (lines 252–264). The StairNet dataset contains static, human-held camera images of indoor environments, designed primarily for general scene recognition, and does not include robot-mounted viewpoints, sequential image data, or motion dynamics typical of mobile platforms. Similarly, the RGB-D Stair Dataset provides annotated RGB-D pairs for stair detection, but it focuses on static scenes collected under controlled conditions without movement or locomotion context.

In contrast, our dataset captures sequential RGB data from cameras mounted on a mobile quadruped robot in motion, introducing natural variations in viewpoint, motion blur, and fisheye distortion. Environmental factors such as lighting conditions, stair inclination, and surface texture were deliberately diversified and documented to ensure robustness. The division of the data into multiple sequences enables the assessment of scene difficulty, texture quality, and illumination variability within comparable indoor environments and similar types of stairs. Thus, our dataset complements existing benchmarks by offering a robot-centric, dynamic, and realistic perspective, bridging the gap between controlled visual datasets and practical robotic perception.

Comment 3: The YOLO references are too general; there is no specific justification for choosing YOLOv11 over alternatives such as YOLOv5, YOLOX, or SSD. The rationale for selecting YOLOv11 should be clearly explained beyond it being the latest version.

Response 3: In the revised manuscript (lines 290-295), we have added a clear justification for selecting YOLOv11, supported by recent comparative studies, “The YOLO Framework: A Comprehensive Review of Evolution, Applications, and Benchmarks in Object Detection” and “YOLO Evolution: A Comprehensive Benchmark and Architectural Review of YOLOv12, YOLO11, and Their Previous Versions”, which demonstrate that YOLOv11 provides a strong balance of accuracy, speed, and robustness across diverse object detection tasks. For comparison, the YOLOv8 variant was also included to assess performance differences between recent generation models.

Comment 4: To strengthen the contribution, perform baseline comparisons with traditional methods (e.g., HOG or simple CNNs) and include an ablation study — particularly evaluating performance without preprocessing, without NMS, and without multi-view input.

Response 4. At an early stage of our research, we tested traditional handcrafted descriptors (HOG, SIFT, SURF, ORB), but they provided unsatisfactory results under changing lighting and viewpoint conditions. Therefore, we focused on a dedicated YOLO implementation, which offers better robustness and real-time optimization for embedded on-robot operation.

Previous studies have also reported that handcrafted-descriptor-based detectors (e.g., HOG+SVM) perform significantly worse than deep learning models such as YOLO in similar detection tasks. Notable examples include "Recent Object Detection Techniques: A Survey" (Diwakar & Raj, 2022) and "A Comprehensive Review of Object Detection with Traditional and Deep Learning Methods" (Pagire, Chavali & Kale, 2025). These traditional approaches were tested during preliminary experiments but were not included in the proposed work presented in this paper. In the revised version (lines 171-179), we have clarified this methodological choice and included a brief description of these preliminary tests.

Comment 5: Provide and explain in detail a visualization of the YOLOv11 architecture or any modifications made, especially since it is described as the “proposed method.”

Response 5. In the revised version, we have introduced a detailed description of the YOLOv11 network architecture in the Methods section. Additionally, a schematic diagram illustrating the structure of the proposed YOLOv11-based model has been added as Figure 8 (YOLOv11n architecture general scheme used in the experiments); it presents all main components of the backbone, neck, and detection head modules.

Comment 6: The conclusion is overly optimistic compared to the actual results (notably the low mAP on key subsets) — this should be revised.

Response 6. We have revised the Results section to provide a more detailed interpretation of the findings. Additional information has been added to the Experiments section to clarify that the relatively low mAP score for the stairs-only variant (with only stairs labeled, Table 5) is due to the high difficulty level of this specific dataset. We also included an illustrative figure showing examples where stairs appear as distant objects in the scene and were fully labeled. Such distant and visually challenging instances contribute to the reduced detection performance. Therefore, we additionally reported results for the complete dataset (Table 4), where an important aspect, beyond stair detection accuracy, is minimizing false stair detections.
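One plausible reason why distant stairs are harder to score well, offered here only as an illustrative sketch and not as the analysis performed in the manuscript, is that intersection over union (IoU) is far more sensitive to localization error for small boxes. The box sizes and the 4-pixel offset below are hypothetical values chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def shifted(box, d):
    """Return the box translated by d pixels in both x and y."""
    return (box[0] + d, box[1] + d, box[2] + d, box[3] + d)


# Hypothetical ground-truth boxes: a nearby staircase and a distant one.
near_box = (0, 0, 200, 200)
far_box = (0, 0, 20, 20)

# The same 4-pixel localization error applied to both predictions.
iou_near = iou(near_box, shifted(near_box, 4))
iou_far = iou(far_box, shifted(far_box, 4))

print(f"near: {iou_near:.3f}, far: {iou_far:.3f}")
```

With a matching threshold of 0.5, the prediction for the large box would count as a true positive, while the same pixel error on the small box would count as a miss.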

Comment 7: Add a discussion on the model’s generalization limits and plans for improvement, such as multi-domain fine-tuning.

Response 7. We have added a discussion addressing the generalization aspects of the model (lines 464–470). In the current study, training was intentionally performed on a single sequence as part of the defined experimental scenario. Additionally, we have outlined future improvement plans, including multi-domain fine-tuning based on experiments using data collected from all sequences and additional cameras. The corresponding text has been added to the Discussion section. We believe that this addition more clearly highlights the limitations of the proposed approach and outlines potential directions for its improvement.
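The single-sequence training scenario described above can be sketched as a split by sequence identifier, so that no frame from a held-out sequence reaches training. This is a minimal illustration only: the function name, sequence labels, and file paths are hypothetical and do not come from the manuscript.

```python
from collections import defaultdict


def split_by_sequence(frames, train_sequence):
    """Split (sequence_id, frame_path) pairs into one training sequence
    and a dict of held-out test sequences keyed by sequence id."""
    train = []
    test = defaultdict(list)
    for seq_id, path in frames:
        if seq_id == train_sequence:
            train.append(path)
        else:
            test[seq_id].append(path)
    return train, dict(test)


# Hypothetical frame listing; real sequence names and paths will differ.
frames = [
    ("seq1", "seq1/0001.png"),
    ("seq1", "seq1/0002.png"),
    ("seq2", "seq2/0001.png"),
    ("seq3", "seq3/0001.png"),
    ("seq3", "seq3/0002.png"),
]

train_frames, test_sequences = split_by_sequence(frames, train_sequence="seq1")
print(len(train_frames), sorted(test_sequences))
```

Reporting metrics per held-out sequence, not pooled over all of them, keeps differences in scene difficulty and illumination visible.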

Reviewer 2 Report

Comments and Suggestions for Authors

The authors' efforts in developing and documenting this study, which is clearly written and technically consistent, are appreciated. Below are my recommendations for making the article more relevant, in my opinion.

Comments on the Abstract

The summary is clear and technically quite solid, but I would like to know what new insights it brings in relation to previous studies. The authors should specify what distinguishes this work from previous studies that use YOLO-based or vision-based stair detection. The contribution in this regard is more applied than methodological. It would also be useful to include a brief comparative context for the performance described (mAP values) to highlight the significance of the results.

My recommendations:

Add a brief description of the novelty or main contribution of the research.
Clarify how this work differs from previous studies on staircase detection based on YOLO.
Briefly contextualize the performance results (mAP) in comparison with existing methods.

Since this is the abstract, a brief mention of each point is sufficient.

Comments on the Introduction

I find the introduction quite clear, although it perhaps anticipates too many methodological details and does not explicitly indicate the research gap it aims to fill. The statement that few studies use camera-based detection in mobile or quadrupedal robots does not seem convincing to me, as this approach is well represented in literature. It would be good to make a clearer statement of the contribution in this section.

My recommendations:

Focus on motivation and research gap in relation to the proposed model.
Clarify what specific limitations or improvements this work addresses compared to previous vision-based stair detection methods.
Reconsider or qualify the statement about the scarcity of camera-based approaches by providing references or rephrasing it more cautiously.
Conclude the section with a concise statement of the problem and objectives to guide the reader.

Comments on Related Works

The section provides a comprehensive overview of previous approaches to robotic vision, but in my view, it lacks a critical analysis of how the present work goes beyond these previous works. Traditional and deep learning methods are listed, and they conclude by stating that YOLO, the main method of this article, has already been widely applied in previous studies. This raises the question of what new contribution this research makes.

My recommendations:

Include a comparative and/or critical discussion that identifies the specific shortcomings of previous YOLO-based approaches, thereby highlighting the contribution of the study.
Clarify the added value or novelty of this study in relation to existing work.
Avoid a purely descriptive listing of methods; emphasize what differentiates this research from the reviewed literature.

Comments on the Dataset

The dataset is well described and includes important details about image acquisition, annotation, and camera configuration. It would be beneficial to provide more analytical depth to this section, as it is unclear how this dataset improves upon or differs from existing public datasets for stair detection. Although it is described as “unique”, no comparative evidence or references are provided to support this claim.

My recommendations:

Clarify what makes this dataset distinctive (e.g., perspective, number of sequences, real robot data).
Provide a comparison or justification showing how it complements or surpasses existing datasets.
Emphasize the scientific relevance of the data collected.
Include, if possible, information on the availability of the dataset (public access or link to the repository).

Comments on Methodology

I find the methodology technically clear and well structured, following standard procedures for image-based object detection. I do not see any methodological innovation beyond the routine implementation of YOLO on embedded platforms. The description is procedural, and I cannot discern any analytical discussion or justification of specific design choices (e.g., why YOLOv11n was selected over YOLOv8, or why certain thresholds and parameters were chosen).

My recommendations:

Clarify the reasons for selecting the proposed model, as well as the parameter settings (e.g., YOLOv11n, IoU threshold).
Highlight any adaptations or modifications to the standard YOLO process that may justify its originality.
Focus on what is new or optimized in this implementation.
If the novelty lies in the integration of multiple cameras or in the integrated implementation, explicitly emphasize this as a methodological contribution.

Comments on Results

I find the results quite detailed, including comprehensive tables comparing different versions of YOLO and devices. The section should include a more analytical part. For example, the mAP and F1 values are indicated, but there is no discussion of why certain configurations work better, nor is there any analysis of the possible causes of false detections or the effects of data set imbalance. The differences in performance between sequences and cameras are presented, but this should be explored in greater depth.

My recommendations:

Add an explanatory analysis of performance differences (e.g., between YOLOv8 and YOLOv11, or between front and bottom cameras).
Analyze the impact of dataset imbalance on detection accuracy and false negatives.
Include comparative references or benchmarks to contextualize the reported mAP values.
Avoid purely numerical descriptions; highlight the insights or lessons these results provide for future applications.

Comments on Discussion

The discussion is coherent and relates the results to the broader problem of stair detection in mobile robotics. It needs to be explored in greater depth and given a more critical perspective, as the known limitations of vision-based approaches (illumination, occlusion, computation) are repeated without being directly related to the experimental results obtained. The claim about the uniqueness of the dataset and the methodological contribution is not convincingly supported. A more in-depth comparison with previous studies based on YOLO or RGB alone would be desirable.

My recommendations:

Provide a critical interpretation of the results obtained in relation to previous work.
Recognize and analyze the limitations of the study (e.g., small dataset, limited diversity, manual data collection).
Clarify the practical implications of achieving “soft real-time” performance: what does this mean for actual robotic navigation?
Avoid generic statements; strengthen the discussion by connecting the findings to specific technical or scientific advances.

Comments on Conclusions

The conclusions adequately summarize the study and reiterate the results obtained, but we return to what was indicated in the previous comments: the contribution of the work in relation to previous studies is not expressed. The information provided in the results section is repeated, without adding new interpretative value or identifying future lines of research.

My recommendations:

Reformulate the conclusions to highlight the main contribution and its relevance beyond this case study.
Acknowledge the limitations and constraints of the work.
Add specific and realistic guidance for future work in relation to the limitations identified.
Avoid repeating numerical results; instead, emphasize the new knowledge or capabilities that this work brings.

Overall Assessment

I find the manuscript to be solid and well organized, but its main shortcoming lies in the lack of a clearly defined novelty. It is unclear whether the contribution is methodological or experimental, as the article mainly describes the application of existing techniques without specifying what is new or improved. This uncertainty about the original contribution significantly limits the scientific value of the work.

Author Response

Author's response, paper "YOLO-Based Staircase Detection for Quadrupedal Robot" Sensors (ISSN 1424-8220)

Comment 1: Comments on the Abstract

My recommendations:

Add a brief description of the novelty or main contribution of the research.
Clarify how this work differs from previous studies on staircase detection based on YOLO.
Briefly contextualize the performance results (mAP) in comparison with existing methods.

Since this is the abstract, a brief mention of each point is sufficient.

Response 1: We have revised the abstract to explicitly mention the novelty and main contribution of the paper. Specifically, we added information that the proposed evaluation scenario tests the model’s adaptation from a single training sequence to multiple unseen sequences, highlighting its extension of existing stair detection methods for quadrupedal robots. We also note the high variability in stair appearance due to the robot’s perspective and limited real-time processing capacity, emphasizing the applied contribution of this study.

We added clarification in lines 252–264 that our dataset differs from previous YOLO-based stair detection studies by providing continuous RGB sequences from a mobile quadruped robot, capturing realistic motion, viewpoints, and environmental variations, unlike existing static benchmarks.

Additionally, we have briefly contextualized the performance results (mAP) in comparison with existing methods, as described in lines 428–437 of the manuscript, to clarify the significance of our results relative to prior YOLO-based and vision-based object detection studies.

Comment 2: Comments on the Introduction

My recommendations:

Focus on motivation and research gap in relation to the proposed model.

Clarify what specific limitations or improvements this work addresses compared to previous vision-based stair detection methods.

Reconsider or qualify the statement about the scarcity of camera-based approaches by providing references or rephrasing it more cautiously.

Conclude the section with a concise statement of the problem and objectives to guide the reader.

Response 2: To sharpen the motivation and the research gap in relation to the proposed model, we have added text in lines 78–89 of the Introduction, which discusses the challenges of sequential scenarios for a mobile quadruped robot, real-time perception under changing camera poses, and dynamic indoor environments. The added text also covers the limitations of previous vision-based stair detection methods and the improvements introduced by the proposed approach, including the introduction and assessment of a sequential scenario in which the robot perceives and detects stairs while navigating dynamic indoor environments.

Comment 3: Comments on Related Works

My recommendations:

Include a comparative and/or critical discussion that identifies the specific shortcomings of previous YOLO-based approaches, thereby highlighting the contribution of the study.

Clarify the added value or novelty of this study in relation to existing work.

Avoid a purely descriptive listing of methods; emphasize what differentiates this research from the reviewed literature.

Response 3: We appreciate the reviewer’s valuable comments and recommendations. We fully agree with them and would like to emphasize that our approach, although based on YOLO, is distinguished by the dataset used for training.

Our dataset captures sequential RGB data from cameras mounted on a mobile quadruped robot in motion, introducing natural variations in viewpoint, motion blur, and fisheye distortion. Environmental factors such as lighting conditions, stair inclination, and surface texture were deliberately diversified and documented to ensure robustness. Thus, our dataset complements existing benchmarks by offering a robot-centric, dynamic, and realistic perspective, bridging the gap between controlled visual datasets and practical robotic perception.

We have added a comment in the Dataset section that further emphasizes the elements distinguishing our study (lines 252–264).

Comment 4: Comments on the Dataset

My recommendations:

Clarify what makes this dataset distinctive (e.g., perspective, number of sequences, real robot data).

Provide a comparison or justification showing how it complements or surpasses existing datasets.

Emphasize the scientific relevance of the data collected.

Include, if possible, information on the availability of the dataset (public access or link to the repository).

Response 4: In response, we have expanded the Dataset section to better emphasize the uniqueness and scientific relevance of the collected data. The distinctiveness of our dataset lies in its continuous sequential structure, which captures not only clear instances of stairs but also realistic scenarios where stairs appear as rare objects within a dynamic environment. This makes the dataset particularly valuable and challenging for model training in real-world robotic contexts.

Unlike many existing public datasets, our data were collected directly from a real robot in motion, ensuring natural variations in camera pose and environmental conditions. To the best of our knowledge, there are no comparable datasets offering such continuous sequences that simultaneously represent changes in camera angle, distance to the stairs, and robot trajectory. Previous datasets often provide isolated frames or static viewpoints, whereas ours reflects the true dynamics of onboard perception.

Moreover, our dataset can serve as a benchmark reference for evaluating other datasets and methods, as it includes particularly challenging conditions, such as distant indoor staircases and partially occluded structures visible only in fragments of the RGB image. Importantly, the dataset is based solely on RGB data, while many existing studies rely on depth information, which significantly simplifies the detection task. Thus, our dataset provides a more demanding and realistic test scenario for vision-based stair detection.

Finally, due to ongoing work on the multisensor extension of the dataset, it is currently available upon request by contacting the corresponding author.

Comment 5: Comments on Methodology

My recommendations:

Clarify the reasons for selecting the proposed model, as well as the parameter settings (e.g., YOLOv11n, IoU threshold).

Highlight any adaptations or modifications to the standard YOLO process that may justify its originality.

Focus on what is new or optimized in this implementation.

If the novelty lies in the integration of multiple cameras or in the integrated implementation, explicitly emphasize this as a methodological contribution.

Response 5: In the revised manuscript, we have added a clear justification for selecting YOLOv11, supported by recent comparative studies, “The YOLO Framework: A Comprehensive Review of Evolution, Applications, and Benchmarks in Object Detection” and “YOLO Evolution: A Comprehensive Benchmark and Architectural Review of YOLOv12, YOLO11, and Their Previous Versions”, which demonstrate that YOLOv11 provides a strong balance of accuracy, speed, and robustness across diverse object detection tasks. For comparison, the YOLOv8 variant was also included to assess performance differences between recent generation models.

We have also introduced a detailed description of the YOLOv11 network architecture in the Methods section. A schematic diagram illustrating the structure of the proposed YOLOv11-based model has been added as Figure 8 (YOLOv11n architecture general scheme used in the experiments); it presents all main components of the backbone, neck, and detection head modules. Moreover, supplementary details regarding the network implementation and optimization process have been incorporated into the Methods section to enhance transparency and reproducibility of the experimental setup.

Comment 6: Comments on Results

My recommendations:

Add an explanatory analysis of performance differences (e.g., between YOLOv8 and YOLOv11, or between front and bottom cameras).

Analyze the impact of dataset imbalance on detection accuracy and false negatives.

Include comparative references or benchmarks to contextualize the reported mAP values.

Avoid purely numerical descriptions; highlight the insights or lessons these results provide for future applications.

Response 6: We have addressed these comments by adding a detailed analytical discussion in lines 399–427. The revised text explains the performance differences between YOLOv8 and YOLOv11, as well as between the front and bottom cameras, highlighting factors such as scene visibility, motion blur, and lighting variations. It also discusses the impact of dataset imbalance on detection accuracy and false negatives, and provides contextual insights to interpret the reported mAP values, emphasizing lessons for practical robotic applications.
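To illustrate why class imbalance matters when interpreting such results, the following sketch computes frame-level metrics from a confusion matrix. It is a simplification (mAP itself is computed per box over a range of confidence thresholds), and all counts are hypothetical, not values from the manuscript.

```python
def detection_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical imbalanced sequence: 1000 frames, only 50 contain stairs.
# The detector finds 20 of them, misses 30, and raises 5 false alarms.
metrics = detection_metrics(tp=20, fp=5, fn=30, tn=945)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this hypothetical case accuracy is high because stair-free frames dominate, while recall shows that most staircases were missed; this is why false negatives deserve separate attention on imbalanced sequences.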

Comment 7: Comments on Discussion

My recommendations:

Provide a critical interpretation of the results obtained in relation to previous work.

Recognize and analyze the limitations of the study (e.g., small dataset, limited diversity, manual data collection).

Clarify the practical implications of achieving “soft real-time” performance: what does this mean for actual robotic navigation?

Avoid generic statements; strengthen the discussion by connecting the findings to specific technical or scientific advances.

Response 7: We agree with the reviewer that, in comparison with other datasets commonly used in object detection research, the dataset employed in this study is relatively limited in size. Nevertheless, its collection process was intentionally designed to capture realistic operational conditions of a mobile robot performing stair recognition. From this perspective, the dataset is considered adequate for the specific aims of the study. However, as noted, its moderate diversity may constrain the generalization performance when applied to stair detection from different viewpoints or in outdoor environments. The above considerations have also been incorporated into the Conclusions section.

With regard to the concept of soft real-time in the context of navigation, we understand it as a processing time that is sufficient for control tasks, considering the chosen movement strategy and the kinematics of the robot. Thus, it refers to a practical approach to timing issues in robotic applications.
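One way to make this notion concrete, given here as a hedged sketch and not as the criterion used in the manuscript, is to compare the distance the robot covers during one detection cycle against a tolerated "blind" distance. The speeds, latency, and tolerance below are hypothetical values.

```python
def distance_per_inference(speed_m_s, latency_s):
    """Distance (m) the robot travels while one frame is being processed."""
    return speed_m_s * latency_s


def meets_soft_deadline(speed_m_s, latency_s, max_blind_distance_m):
    """True if the robot moves no farther than the tolerated distance
    between two consecutive detection results."""
    return distance_per_inference(speed_m_s, latency_s) <= max_blind_distance_m


# Hypothetical values: 0.2 s per frame, 0.15 m tolerated between updates.
latency_s = 0.2
max_blind_distance_m = 0.15

slow_ok = meets_soft_deadline(0.5, latency_s, max_blind_distance_m)
fast_ok = meets_soft_deadline(1.0, latency_s, max_blind_distance_m)

print(f"0.5 m/s within budget: {slow_ok}")
print(f"1.0 m/s within budget: {fast_ok}")
```

Under these assumed numbers the same processing time is adequate at the lower walking speed but not at the higher one, which reflects the dependence on movement strategy and robot kinematics mentioned above.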

Comment 8: Comments on Conclusions

My recommendations:

Reformulate the conclusions to highlight the main contribution and its relevance beyond this case study.

Acknowledge the limitations and constraints of the work.

Add specific and realistic guidance for future work in relation to the limitations identified.

Avoid repeating numerical results; instead, emphasize the new knowledge or capabilities that this work brings.

Response 8: We have added a section on the limitations and weaknesses of the study to the Results part of the paper. Additionally, the Discussion section has been expanded with information regarding future research directions. The Conclusions section itself has been supplemented with remarks on processing time in the context of robot navigation tasks.

Comment 9: Overall Assessment

Response 9: We sincerely thank the reviewer for the valuable and constructive comments. All remarks have been carefully analyzed and fully addressed in the revised version of the manuscript. In particular, we clarified the novelty and original contributions of our work to better highlight its methodological and experimental significance.