Review Reports - Detection in Road Crack Images Based on Sparse Convolution

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper applies several new deep learning–based object detection and segmentation techniques for crack detection on pavement images. The following comments are provided:

The literature review does not adequately cover relevant background knowledge or justify the use of the selected methods. The related works on the key technologies adopted in this study—such as ConvNeXt V2, the Sparse Encoding Module, and the Binary Attention Module—should be introduced and discussed in a more integrated and concise way.
The key techniques used in this study (e.g., ConvNeXt module) should be explained in greater detail. The paper should discuss their origins and evolution in related works, explain the rationale for their selection, and provide clearer and more intuitive descriptions in the methodology section.
There is inconsistency in how the proposed technologies and modules are described. For example, the terms listed in the contributions section (Lines 64–76)—including “sparse convolution,” “sparse encoding module,” “random masking strategy,” “lightweight ConvNeXt network as decoder,” “asymmetric encoder–decoder structure,” “binary attention module,” and “channel and spatial attention bridging modules”—differ from those in the abstract, which mention “ConvNeXt V2,” “random mask policy,” “asymmetric coding and decoding structure,” “multi-stage and multi-scale characteristics,” and “spatial attention bridge module.” These should be aligned and presented consistently.
The mention of a single contribution at the end of the literature review (LR; Lines 116–119) appears abrupt and unbalanced. According to the contributions section, this study has multiple novel aspects that are not reflected here. Moreover, the LR does not clearly summarize the current gaps or challenges in existing crack detection methods. A strong LR should (1) identify the research gaps and limitations of existing work, (2) introduce the relevant background, and (3) justify the choice of technologies used in this study.
The comparison study should include more representative and state-of-the-art crack detection models rather than basic architectures such as FCN or U-Net. Please refer to the review paper “Deep Learning for Crack Detection: A Review of Learning Paradigms, Generalizability, and Datasets” (arXiv:2508.10256v2) and include comparisons with some crack detection–specific models.
It is unclear why U-Net performs so poorly in Figure 8. This requires further explanation or validation.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

- The paper does not propose a new algorithm; rather, it adapts an effective masked modeling approach to the task of supervised crack segmentation.

- The methodology logically follows the problem statement, and the experimental results align with the study's goals.

- Questions to clarify:

1. The paper mentions a "random masking strategy" but does not provide specific details. Could the authors clarify how this strategy was designed and how its key parameters were selected and optimized?

2. The training methodology combines concepts from self-supervised learning (masking) with a supervised task, but the workflow is unclear. Is the model trained end-to-end, or does it involve a pre-training phase followed by fine-tuning? If it is trained end-to-end, how is the loss calculated?

3. The encoder utilizes a sparse module, while the decoder is a dense convolutional network. Could the authors explain the transition from the sparse encoder's output to the dense decoder's input more clearly?

4. Could you please clarify the difference between "Linear Attention" and "Attachment Linear Attention" as used in the BAM module? Is "Attachment Linear Attention" a novel variant, and if so, could you provide its mathematical formulation?
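
To illustrate the level of detail that question 1 asks for, the following is a minimal sketch of a generic MAE-style random patch masking step. It is not the authors' design: the function name `random_patch_mask` and the parameters `patch_size`, `mask_ratio`, and `seed` are assumptions chosen for illustration, and they are exactly the kind of values the manuscript would need to state.

```python
import random


def random_patch_mask(height, width, patch_size=32, mask_ratio=0.6, seed=None):
    """Return a 2D grid of booleans; True marks a masked (hidden) patch.

    All parameter values here are hypothetical defaults, not taken from the paper.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")

    rows, cols = height // patch_size, width // patch_size
    num_patches = rows * cols
    num_masked = int(round(num_patches * mask_ratio))

    # Sample patch indices uniformly without replacement.
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))

    return [[(r * cols + c) in masked for c in range(cols)] for r in range(rows)]


if __name__ == "__main__":
    mask = random_patch_mask(224, 224, patch_size=32, mask_ratio=0.6, seed=0)
    print(sum(map(sum, mask)), "of", len(mask) * len(mask[0]), "patches masked")
```

Under this sketch, a complete description of the strategy would report the patch size, the mask ratio, whether the mask is resampled per image and per epoch, and how those values were tuned.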

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Faster R-CNN was initially used in “Autonomous structural visual inspection using region-based deep learning for detecting multiple damage types”. This original paper should be cited.
Overall, the literature review is very weak. Some original papers should be discussed, for example “Deep learning-based structural health monitoring”.
The reported mIoU values are lower than those of state-of-the-art methods on this topic: SDDNet: Real-time crack segmentation; Hybrid pixel-level concrete crack segmentation and quantification across complex backgrounds using deep learning; Efficient attention-based deep encoder and decoder for automatic crack segmentation. These methods reported mIoU values of at least 80% and up to 93%. They should either be introduced in a discussion of the state of the art on this topic or included in the comparison.
What is the processing speed in frames per second (FPS)? It should be reported and discussed together with the input image size.
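
For reference, the mIoU metric discussed above is commonly computed as the mean of the per-class intersection-over-union. The following is a minimal sketch for binary crack segmentation, assuming flattened label sequences with 0 for background and 1 for crack. The function names are illustrative, and whether a given paper averages over both classes or reports crack-class IoU only is an assumption that should be checked before comparing numbers.

```python
def class_iou(pred, target, cls):
    """IoU for one class; returns None if the class is absent from both masks."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return inter / union if union else None


def mean_iou(pred, target, classes=(0, 1)):
    """Mean IoU over the classes that appear in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    scores = [class_iou(pred, target, c) for c in classes]
    scores = [s for s in scores if s is not None]
    if not scores:
        raise ValueError("none of the classes appear in pred or target")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred = [1, 1, 0, 0]    # hypothetical predicted labels
    target = [1, 0, 0, 0]  # hypothetical ground-truth labels
    # Crack IoU = 1/2, background IoU = 2/3, so mIoU = 7/12.
    print(mean_iou(pred, target))
```

Because background pixels dominate most crack images, a two-class mIoU is typically much higher than the crack-class IoU alone, so the averaging convention matters when comparing against the 80% to 93% values cited above.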