CCT Net: A Dam Surface Crack Segmentation Model Based on CNN and Transformer
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper proposes CCT Net, a dam surface crack segmentation model that integrates CNNs and Transformers. The CNN captures fine-grained local features, while the Transformer models long-range dependencies. Trained on a 400-image dam crack dataset, the model achieves strong performance. A few comments are given below:
- Provide more detail on environmental conditions and annotation procedures; consider including wet or underwater cases.
- How does CCT Net handle class imbalance during training beyond the use of the composite loss function? (An illustrative sketch of such a BCE + Dice composite loss is given after this list.)
- Include inference time, parameter count, or FLOPs to assess deployment feasibility.
- Include recent transformer-based SHM work, e.g., https://doi.org/10.1177/14759217231182303, for better context and comparison.
- How does the model perform under severe surface noise (e.g., moss, stains)?
- What regularization strategies were used to prevent overfitting on the small dataset (400 images)?
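For reference on the class-imbalance and composite-loss points above, below is a minimal sketch of a binary cross-entropy plus Dice composite loss of the kind the manuscript describes; the weighting `alpha` and the smoothing constant are illustrative assumptions, not the authors' actual formulation.

```python
import torch
import torch.nn.functional as F

def composite_loss(logits, target, alpha=0.5, smooth=1.0):
    """BCE + Dice composite loss for binary crack segmentation.

    logits: raw model output, shape [B, 1, H, W]
    target: binary ground-truth mask (float), same shape, values in {0, 1}
    alpha:  illustrative weighting between the two terms (assumption)
    """
    # Pixel-wise binary cross-entropy computed on the raw logits
    bce = F.binary_cross_entropy_with_logits(logits, target)

    # Soft Dice loss: overlap-based, so the few crack pixels still
    # contribute meaningfully despite the dominant background class
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * intersection + smooth) / (union + smooth)

    return alpha * bce + (1.0 - alpha) * dice.mean()
```

Because the Dice term is overlap-based, thin cracks that occupy only a small fraction of the image still drive the gradient, which is why such composite losses are commonly paired with imbalanced segmentation data.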
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes a novel deep learning model named CCT Net (Crack CNN-Transformer Net) for the automatic detection and segmentation of surface cracks in concrete dams. The model innovatively combines a CNN encoder and a Transformer encoder, integrating local and global features through a Feature Complementary Fusion Module (FCFM). Using an encoder-decoder architecture and a composite loss function that combines binary cross-entropy loss and Dice loss, it achieves accurate surface crack identification. Trained on 400 crack images, CCT Net demonstrates higher accuracy compared to other models. The results confirm that both the loss function and the FCFM module are essential for improving segmentation performance. The topic of the study is interesting, but there are many issues in this article that need further revision:
(1) Line 42: “NLP” is not used in the following text.
(2) Line 150: grammatical error in “have emerged”.
(3) Figure 1: How is the feature dimension of CNN Block 1 transformed from [B, C, H, W] to [B, C, H/2, W/2]?
(4) Line 199: Why were only 4 of the 5 CNN blocks introduced?
(5) Since the dataset is constructed by cropping dam crack images, would this construction approach be more versatile and applicable beyond dam crack images?
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The proposed CCT Net introduces a hybrid CNN-Transformer model for dam surface crack segmentation, incorporating a Feature Complementary Fusion Module (FCFM) to combine local and global features. While the work presents an interesting architecture, several critical aspects require significant improvement to strengthen the study’s relevance, rigor, and practical applicability. The following points should be addressed thoroughly:
- The hybrid dual-encoder design (CNN + Transformer), combined with the Feature Complementary Fusion Module (FCFM), significantly increases the model’s computational overhead, training time, and inference latency. This complexity limits scalability and makes real-time or edge deployment (e.g., on drones or mobile devices) impractical. The authors should provide a detailed analysis of the model’s computational cost (e.g., FLOPs, inference time, memory footprint) and discuss feasible strategies for optimization, such as model pruning, quantization, knowledge distillation, or lightweight variants of CCT Net. (An illustrative measurement sketch is given after this list.)
- The dataset used consists of only 400 manually annotated images, which lacks sufficient diversity in terms of crack types, lighting, and dam structures. This small scale increases the risk of overfitting and weakens the model’s generalizability.
- The authors should include more detailed descriptions and examples of the types of cracks (e.g., hairline, transverse, diagonal, and alligator cracks), ideally with visual samples to enhance clarity and relevance.
- Furthermore, the use of 400 training epochs is mentioned, but the results lack statistical rigor or variance analysis to justify the model’s stability and convergence. Additional data or validation experiments are needed to support the findings.
- The model is only evaluated against U-Net and Swin Transformer, which limits the comparative scope. Including more recent or efficient segmentation models, such as DeepLabv3+, SegFormer, and Fast-SCNN, would provide a stronger and fairer benchmark. Moreover, the comparison of segmentation performance across different crack types would improve practical insights and strengthen the value of the work.
- The paper’s language and formatting at times feel AI-generated or overly mechanical. The authors are encouraged to revise the tone and flow to make it more natural, human-readable, and academically sound. Smooth transitions, concise explanations, and improved figure captions can enhance clarity.
- Equation (3) and others contain syntax and typographic errors; they should be rewritten using consistent and proper mathematical notation.
- Figures (e.g., model architecture and fusion blocks) need higher visual resolution and clearer annotations to aid interpretation.
- The citations appear disorganized, with abrupt jumps in numbering and inconsistent contextual relevance. The reference list and in-text citations should follow a sequential and ascending order, aligned with the flow of the manuscript. Consistent formatting must be enforced.
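In connection with the computational-cost point above, the sketch below illustrates one way to report parameter count and average inference latency for a PyTorch segmentation model. The model, input size, and iteration counts are placeholders, and FLOPs counting would require an additional profiling tool not shown here.

```python
import time
import torch

def profile_model(model, input_size=(1, 3, 512, 512), warmup=5, runs=20):
    """Report parameter count and average inference latency.

    model:      any torch.nn.Module (placeholder for the segmentation net)
    input_size: dummy input shape; 512x512 crops are an assumption
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()

    # Total learnable parameters
    n_params = sum(p.numel() for p in model.parameters())

    dummy = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):            # warm-up iterations
            model(dummy)
        if device == "cuda":
            torch.cuda.synchronize()       # flush queued GPU kernels
        start = time.perf_counter()
        for _ in range(runs):
            model(dummy)
        if device == "cuda":
            torch.cuda.synchronize()
        latency_ms = (time.perf_counter() - start) / runs * 1000

    print(f"Parameters: {n_params / 1e6:.2f} M")
    print(f"Avg inference latency: {latency_ms:.1f} ms per image")
```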
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The paper is good for publication. I do not have further comments.
