Open Access Article

Peer-Review Record

An Investigation on Prediction of Infrastructure Asset Defect with CNN and ViT Algorithms

Infrastructures 2025, 10(5), 125; https://doi.org/10.3390/infrastructures10050125

by Nam Lethanh¹,*, Tu Anh Trinh¹ and Mir Tahmid Hossain²

Reviewer 1: Anonymous

Reviewer 2: Raihan Rahmat Rabi

Submission received: 17 March 2025 / Revised: 28 April 2025 / Accepted: 6 May 2025 / Published: 20 May 2025

(This article belongs to the Section Infrastructures Inspection and Maintenance)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

-Line 38-39 Bibliographic references are needed to justify these statements.

-Line 42-43 Bibliographic references are needed to justify these statements.

-Line 49-53 Bibliographic references are needed to justify these statements.

-Line 61-62 Provide data on the costs that this entails

-Line 66 Indicate in what areas they have been so useful

-Line 72-73 Indicate which and your relevant application

-Line 84 Give some data about this

-Line 97 Indicate the references and applications of this

-Line 100-101 Indicate examples of application of this with references

-Line 140 Reference is necessary

-Section 2.2 A schematic explaining the process would be useful to make the text clearer.

-Line 181 Error! Reference source not found.

-Line 231-237 Include photos of the research carried out, if there are images of the data collection process, photos, drone used, etc.

Comments on the Quality of English Language

A minor English review is necessary

Author Response

Comment 1

-Line 38-39 Bibliographic references are needed to justify these statements.

-Line 42-43 Bibliographic references are needed to justify these statements.

-Line 49-53 Bibliographic references are needed to justify these statements.

Response 1
Relevant citations have been added to the text.

Comment 2

-Line 61-62 Provide data on the costs that this entails

Response 2
Relevant citations from the WB, ADB and OECD have been added, along with indicative costs.

Comment 3

-Line 66 Indicate in what areas they have been so useful

-Line 72-73 Indicate which and your relevant application

Response 3
This entire paragraph has now been rewritten to indicate areas of usefulness as well as relevant applications. More citations have been added.

Comment 4
-Line 84 Give some data about this

Response 4

This entire paragraph has now been rewritten into 2 paragraphs with relevant cost data and citations.

Comment 5

-Line 97 Indicate the references and applications of this

-Line 100-101 Indicate examples of application of this with references

Response 5:

This entire paragraph has been rewritten and expanded to include more citations and applications as well as examples.

Comment 6

-Line 140 Reference is necessary

Response 6
I ask the reviewer to agree that no reference is needed in this paragraph. Its only purpose is to smooth the transition to the following section; it is not part of the literature review. Thank you.

Comment 7

-Section 2.2 A schematic explaining the process would be useful to make the text clearer.

Response 7:
Since the ViT model has been widely used and documented in the literature, this section provides only a short introduction and does not aim to explain the model in detail; the details can be found in the works cited in the literature review. I ask the reviewer to accept this paragraph as it is. Thank you.

Comment 8

-Line 181 Error! Reference source not found.

Response 8
Corrected

Comment 9

-Line 231-237 Include photos of the research carried out, if there are images of the data collection process, photos, drone used, etc.

Response 9

The drone used for the research is now mentioned; it is a small drone, the DJI Mini 4 Pro. In my opinion, it is not necessary to include pictures of the research team in the paper. I have read all the references, and none of them includes pictures of site investigations.

I hope the reviewer will accept this paragraph as it is. Many thanks.

Reviewer 2 Report

Comments and Suggestions for Authors

Please see the attachment for the reviewer's comments.

Comments for author File: Comments.pdf

Author Response

The manuscript presents an empirical study comparing Convolutional Neural Networks (CNN) and Vision Transformers (ViT) for crack detection in infrastructure assets. The study is relevant to the field of structural health monitoring and asset management, particularly with the increasing role of artificial intelligence (AI) in automating defect detection. The paper can be accepted for publication should the authors address the following comments:

Thank you very much for reviewing the paper. The authors greatly appreciate the reviewer’s insightful comments, which have been duly noted and considered. We would like to kindly request the reviewer’s understanding that the primary aim of this paper is to focus on the practical applications of existing CNN and ViT algorithms in TensorFlow. The authors have intentionally not delved deeply into the technical details of the CNN and ViT models in this paper, as the focus is on demonstrating their applicability in real-world infrastructure monitoring.

1) The study focuses on binary classification (crack vs. no crack). However, real-world applications require classification of crack severities. Do you have plans to extend this research to multi-class classification?

Answer

The author acknowledges that a second phase of the research is currently underway, which aims to extend the analysis by incorporating multiple condition states and conducting a more comprehensive comparative study. This has been clearly outlined as part of the future work in the manuscript. However, the author believes that the current study, focused on binary classification, is valuable in its own right. The results already demonstrate the effectiveness and practical applicability of the models in real-world infrastructure monitoring scenarios, where binary decision-making (e.g., defect present or absent) is often sufficient for maintenance prioritization. Therefore, the author respectfully requests that this paper be considered for publication in its present form, while indicating that further developments will be reported in subsequent work.

2) The paper mentions that images were segmented into 100cm × 100cm squares. How does this segmentation impact model performance? Were different segment sizes tested?

Answer

The author acknowledges that the research team has not yet tested different segment sizes. The primary goal of segmentation is to obtain enough images to build a robust database. Furthermore, at a segment size of 100 cm × 100 cm, the resolution remains high enough to clearly capture and assess the severity of cracks. This segment size provides adequate detail for effective defect detection and analysis.

This is a useful comment. In the upcoming paper on multiple condition states, the author will also examine the effect of the image segment size.
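For illustration, the segmentation step can be sketched as follows. This is a minimal sketch, not the code used in the study: the function names, the ground sampling distance of 0.5 cm per pixel and the 4000 × 3000 pixel image size are assumptions chosen only to make the example concrete. It converts a ground tile size in centimetres to pixels and lists the crop boxes of all full tiles.

```python
def tile_size_px(tile_cm, gsd_cm_per_px):
    """Convert a ground tile size (cm) to pixels, given the ground sampling distance."""
    return max(1, round(tile_cm / gsd_cm_per_px))


def tile_boxes(width_px, height_px, tile_px):
    """Return (left, top, right, bottom) boxes for all full tiles.

    Partial tiles at the right and bottom edges are dropped.
    """
    boxes = []
    for top in range(0, height_px - tile_px + 1, tile_px):
        for left in range(0, width_px - tile_px + 1, tile_px):
            boxes.append((left, top, left + tile_px, top + tile_px))
    return boxes


# Hypothetical values: 0.5 cm per pixel and a 4000 x 3000 px drone image.
side = tile_size_px(100, 0.5)         # pixels spanned by a 100 cm tile
boxes = tile_boxes(4000, 3000, side)  # crop boxes for every full tile
print(side, len(boxes))
```

Changing the first argument of tile_size_px would be one way to test other segment sizes against the same images.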

3) How was the dataset balanced? Did you encounter any issues with class imbalance (e.g., significantly fewer "no crack" images compared to "crack" images)?

Answer
Our database is balanced between no-crack and crack images (50% each). In the next paper, we will consider randomly reducing or increasing the number of images in one class to check the sensitivity of the results to class imbalance.
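A sensitivity check of this kind could start from a sketch like the one below. It is only an illustration under stated assumptions: the function name, the label strings and the 100/100 dataset are hypothetical, and the study's actual data pipeline is not described in this record. The function keeps every image of the other class and a random fraction of the chosen class.

```python
import random


def subsample_class(labels, target_label, keep_fraction, seed=0):
    """Return sorted indices keeping all other classes and a fraction of target_label."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    target_idx = [i for i, y in enumerate(labels) if y == target_label]
    other_idx = [i for i, y in enumerate(labels) if y != target_label]
    n_keep = round(len(target_idx) * keep_fraction)
    kept = rng.sample(target_idx, n_keep)
    return sorted(other_idx + kept)


# Hypothetical balanced dataset: 100 crack and 100 no-crack labels.
labels = ["crack"] * 100 + ["no_crack"] * 100
idx = subsample_class(labels, "no_crack", keep_fraction=0.25)
n_no_crack = sum(1 for i in idx if labels[i] == "no_crack")
print(len(idx), n_no_crack)
```

Retraining on several values of keep_fraction and comparing the resulting metrics would show how strongly the models depend on the 50/50 balance.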

4) The study reports high accuracy for both CNN (95%) and ViT (98%), but accuracy alone can be misleading. Please provide additional performance metrics, such as precision, recall, F1-score, and AUC-ROC.

Answer
Thank you for your insightful question. Indeed, it would be beneficial to include additional performance metrics as suggested. However, in this first phase of the research, we focused on running the models and evaluating the accuracy, which we found to be quite promising. We tested the models across a variety of images and confirmed that the predictions were consistent and reliable. While further metrics may be incorporated in future phases, the current results demonstrate the effectiveness of the models for practical applications.
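For reference, the metrics requested by the reviewer follow directly from the confusion-matrix counts and the predicted scores. The sketch below is illustrative only: the function names are assumptions, and the counts and scores are made-up numbers, not results from the paper. AUC-ROC is computed here as the fraction of (crack, no-crack) pairs in which the crack image receives the higher score, with ties counted as one half.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def auc_roc(labels, scores):
    """AUC-ROC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute AUC-ROC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up counts and scores, used only to show the calculation.
print(binary_metrics(tp=90, fp=10, fn=5, tn=95))
print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Reporting precision and recall alongside accuracy shows whether errors are mostly missed cracks or false alarms, which accuracy alone cannot distinguish.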

5) Did you perform hyperparameter tuning (e.g., grid search or random search) for CNN and ViT? If so, please describe the process.

Answer

Not yet. We will consider this in the upcoming paper mentioned earlier, which will cover multiple condition states.
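As an indication of what such tuning could look like, a random search can be sketched in a few lines. Everything here is hypothetical: the search space values are placeholders, not the settings used in the paper, and toy_score is a stand-in for training a model and returning its validation accuracy.

```python
import random


def random_search(space, evaluate, n_trials, seed=0):
    """Sample n_trials random configurations and return the best one with its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Placeholder search space; these are not the paper's hyperparameters.
space = {"learning_rate": [1e-4, 3e-4, 1e-3], "batch_size": [16, 32, 64]}


def toy_score(params):
    """Stand-in objective; a real run would train and validate a CNN or ViT here."""
    return -abs(params["learning_rate"] - 3e-4) - abs(params["batch_size"] - 32) / 1000


best_params, best_score = random_search(space, toy_score, n_trials=20)
print(best_params, best_score)
```

A grid search would replace the random sampling with a loop over every combination in the space, at a higher cost when the space is large.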

6) In Abstract and Introduction, it is claimed that ViT is "four times more efficient" than CNN, but there is no data to support this. If there is any study to support this claim, add it as a reference.

Answer

We have added references to back up this statement in the literature review. Thank you.

7) The literature review is informative but lacks a comparative discussion on CNN vs. ViT in structural health monitoring. Please include more references regarding the use of artificial intelligence in Structural Health Monitoring such as: a) Effectiveness of Vibration-Based Techniques for Damage Localization and Lifetime Prediction in Structural Health Monitoring of Bridges: A Comprehensive Review; b) Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review; c) A review on deep learning-based structural health monitoring of civil infrastructures.

Answer

We have enhanced the literature review by incorporating additional references and expanding on relevant applications. As this paper does not delve deeply into the technical details of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), we encourage readers to consult other literature for more in-depth discussions of these models. The primary focus of this paper is on the application of existing CNN and ViT algorithms available in TensorFlow, demonstrating their utility in the context of infrastructure monitoring.

8) The confusion matrices for CNN and ViT are presented, but their interpretation is missing. Could you provide a brief discussion on common misclassification patterns?

Answer
We found that there is little to explain about the confusion matrices, as they are self-explanatory. For this reason, we decided to remove the confusion matrix figures.

9) There are several instances where "Error! Reference source not found." appears in the text, indicating missing or broken references.
