Review Reports - Intelligent Detection Method of Defects in High-Rise Building Facades Using Infrared Thermography

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

While the proposed method demonstrates a degree of innovation, its description focuses on structural presentation without sufficient depth of justification. Its true effectiveness and advancement are difficult to ascertain without details.

(1) The literature review is primarily descriptive, lacking critical comparison and synthesis. It fails to delineate the underlying "driving logic" of research evolution. Furthermore, the limitations of current studies are not distilled into concrete scientific questions or technical bottlenecks. Consequently, the proposed solution lacks a direct and compelling connection to the problems previously outlined.

(2) The manuscript does not adequately explain how the post-decoder dual-branch boundary refinement module resolves the issue of feature degradation. Further elaboration is recommended.

(3) The rationale for employing the triple constraint mechanism and the basis for introducing the boundary-aware loss are not clearly established.

(4) Utilizing only 137 images for training and evaluating complex deep learning models is severely inadequate. The 9:1 split results in a minuscule test set of 12 images, making the statistical results highly contingent and casting doubt on the model's generalizability.

(5) The results show that incorporating the "Boundary Feature Optimization Branch" alone leads to lower Accuracy and mIoU than the baseline. This raises the question: would a combination of only the "Boundary-guided Attention Branch" with the boundary loss and superpixel segmentation yield superior performance? The authors' claim of "strong complementarity and synergistic effects" lacks explanation from a feature-level or task-division perspective.

(6) The comparison between DeepLabV3+&YOLOV11m and YOLOV11m does not convincingly demonstrate the contribution of the "two-stage method." It primarily compares pre-processed versus raw data. A comparison with pure segmentation models such as U-Net and the original DeepLabV3+ is recommended.

(7) The assertion that "no similar false positives or missed detections occurring across all models" on de-noised masks is overly absolute and cannot be generalized from the limited examples provided.

(8) The work fails to compare with recent methods specifically designed for infrared image defect segmentation or building defect detection. Comparisons with general-purpose detectors (SSD, Faster R-CNN) are of limited significance.

(9) The conclusions are not rigorously and comprehensively supported by the experimental analysis, which undermines their persuasiveness. They focus selectively on positive outcomes while entirely avoiding critical limitations exposed in the experiments (e.g., small dataset, tiny test set, performance degradation in some model combinations). This omission lacks the comprehensive objectivity expected in scholarly writing.

Comments on the Quality of English Language

The English could be improved to more clearly express the research.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Summary:

In general, the approach in the article, theoretically, may have some advantages compared to others.

However, the experimental evidence is too weak and unrepresentative to support any conclusions.

The formatting does not follow the template, and the English and the clarity of the text are questionable.

1. Only the accuracy, recall and mIoU metrics were used for segmentation.

These metrics do not take into account the imbalance of classes, and it is also not clear whether all classes are segmented at the same level.

Therefore, Precision (and F1, which combines Precision and Recall), as well as other metrics that take class imbalance into account (for example, Dice), are definitely needed here. These metrics must also be reported for each class: since the segmentation classes are imbalanced, an aggregate such as weighted F1 will not reveal that a minority class is underperforming.

For detection, mAP@50 is used, which by itself is insufficient for assessing the quality of the predicted bounding boxes, because it does not take into account the shape of the object itself. mAP@50-95 (which is often reported in YOLO together with mAP@50) is also needed, since it is stricter about the contours and shape of the bounding boxes.

2. Not enough data.

There are only 137 photos of unknown quality, which are split into training and test sets in a 9:1 ratio, leaving just 12 photos for testing. It is not specified on which subset the hyperparameters and the final models were selected, but most likely it was the test set.

These photos go to the input of the model one by one - and therefore on a dataset of 12 photos it is impossible to make any design choices and conclusions that the approach described in the article clearly helps. In addition, it will overfit when selecting hyperparameters for segmentation and detection models, so the final quality assessment is not representative.

Some of the conclusions in the article are drawn from individual photos alone, which is completely unreliable.

3. Neither the code nor the data are published.

4. The template is not used at all. Not all the headers, not all the information from the template, the styles of images, tables and headers are used.

==============

Section 1 - evaluations and design choices

It is not clear why the authors use accuracy and omit other metrics.

12 images per test set is very few - it is difficult to draw any conclusions from this.

For cracks and detachment it is even more difficult because of just 2 examples per class.

This is not enough.
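
One common way to lessen the dependence on a single 12-image test split is k-fold cross-validation, in which every image is used for testing exactly once. The sketch below is only an illustration built on the dataset size quoted above (137 images); the function name `k_fold_indices`, the choice of k = 10, and the seed are assumptions, not the authors' protocol.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and partition them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(n_samples=137, k=10)
for i, test_fold in enumerate(folds):
    # All other folds together form the training set for this round.
    train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    print(f"fold {i}: train={len(train)} test={len(test_fold)}")
```

Metrics would then be reported as the mean and spread over the ten rounds. For classes with very few examples, such as cracks and detachment, a class-stratified variant of the split would be preferable; it is not shown here.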

Table 1: a grid search would be better, since we are looking for the optimal combination, not the optimal value of each parameter. In general, these values were not searched thoroughly enough; it looks as if 5 values were selected as final, and another 5 were added only to show that the selected ones are correct.

It is also not indicated on which set they were searched - if on the test set, then

1 - there is not enough data for design choice,

2 - we will overfit on the test data.
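
The concerns above can be sketched as follows: the search runs over combinations of values (a grid) rather than over each parameter in isolation, and candidates are scored only on a validation split, so that the test images stay untouched until the final evaluation. Everything in the sketch is hypothetical: the parameter names, the value ranges, and `validation_score`, which stands in for training a model and measuring, for example, mIoU on the validation images.

```python
from itertools import product

# Hypothetical search space; names and ranges are placeholders only.
GRID = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [4, 8],
    "boundary_loss_weight": [0.1, 0.2, 0.3],
}

def validation_score(params):
    """Stand-in for 'train with params, then score on the validation split'.

    A toy function with an interaction term, so the best value of one
    parameter depends on the value chosen for another.
    """
    lr = params["learning_rate"]
    bs = params["batch_size"]
    w = params["boundary_loss_weight"]
    return -abs(lr * bs - 8e-3) - abs(w - 0.2)

def grid_search(grid, score_fn):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    candidates = (dict(zip(names, values)) for values in product(*grid.values()))
    return max(candidates, key=score_fn)

best = grid_search(GRID, validation_score)
print(best)
```

Only the single configuration returned by `grid_search` would then be evaluated once on the held-out test set.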

Accuracy, recall and mean IoU are not representative enough. No Precision, F1 used.
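
To illustrate why per-class Precision and F1 matter under class imbalance, the sketch below derives them from a pixel-level confusion matrix. The matrix is toy data with three made-up classes, not results from the manuscript, and the function name `per_class_metrics` is an assumption. Note that for a single class the pixel-level F1 score and the Dice coefficient are the same quantity, 2TP / (2TP + FP + FN).

```python
def per_class_metrics(confusion):
    """Compute precision, recall, F1 (equal to Dice) and IoU for each class.

    confusion[i][j] counts pixels of true class i predicted as class j.
    """
    n = len(confusion)
    results = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # true class c, predicted as another class
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, actually another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        results.append({"precision": precision, "recall": recall, "f1": f1, "iou": iou})
    return results

# Toy 3-class pixel counts (background, wall, defect); not data from the manuscript.
toy = [
    [900, 50, 0],
    [40, 950, 10],
    [5, 25, 20],
]
metrics = per_class_metrics(toy)
overall_accuracy = sum(toy[i][i] for i in range(3)) / sum(map(sum, toy))
for name, m in zip(["background", "wall", "defect"], metrics):
    print(name, {k: round(v, 3) for k, v in m.items()})
print("overall accuracy:", round(overall_accuracy, 3))
```

In this toy case the overall accuracy is 0.935, while the minority "defect" class reaches a recall of only 0.4 and an F1 of 0.5, which an aggregate number alone would hide.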

Conclusions about false positives are drawn only from the images in Figure 8, and not from the entire dataset. There is no metric to evaluate this; the metrics used do not measure false positives.

mAP@50 by itself is insufficient for assessing the quality of the predicted bounding boxes, because it does not take into account the shape of the object itself. mAP@50-95 (which is often reported in YOLO together with mAP@50) is also needed, since it is stricter about the contours and shape of the bounding boxes.
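
The difference between the two metrics can be seen from a single box. mAP@50 counts a prediction as a match when its IoU with the ground truth is at least 0.50, whereas mAP@50-95 averages over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below is a minimal illustration only; the box coordinates are hypothetical and the helper `box_iou` is not taken from any detection library.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# The ten IoU thresholds averaged by mAP@50-95: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

ground_truth = (0, 0, 100, 100)   # hypothetical defect box
prediction = (20, 0, 120, 100)    # same size, shifted 20 px to the right

iou = box_iou(ground_truth, prediction)
passed = [round(t, 2) for t in THRESHOLDS if iou >= t]
print(f"IoU = {iou:.3f}; counts as a match at thresholds {passed}")
```

Here the shifted box has an IoU of about 0.667, so it counts as a correct detection for mAP@50 but fails six of the ten thresholds used by mAP@50-95.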

=====================

Section 2. Links to articles and various descriptions

There is no reference to the original DeepLabV3+ paper; the authors go immediately to applications in various areas.

There are no references to SSD, Faster R-CNN, or YOLO.

Section 1.1 refers to the DeepLab overview. Authors write about their own achievements at the end, which causes confusion, and in general the section is about another topic.

If the section is about that, it should be clearly described and named. The title "1.1 Overview of the DeepLabV3+ Model" does not indicate this at all.

Different hyperparameters are simply given here.

There could be some non-obvious logic behind this, but it is not clearly articulated. It is unclear if other studies and testing of hyperparameters have been done.

It is not clear why references 20 and 21 are here. It is advisable to describe how they themselves investigated this.

There is no reference to SLIC, nor a description of what it is; only a breakdown of the parameters used for this algorithm is given.

The text of the classes on the images is of very poor quality, difficult to impossible to read (figure 8).

In Figure 6 (a) the annotation is not very accurate. Part of the house (air conditioners and balconies) is labeled as sky.

================================

Section 3 - template mismatch

Template not used!

section names do not match the template

figure names do not match the template

tables do not match the template

no discussion section

no code, no dataset

no Author contribution

DOI should be present in all the references

Comments on the Quality of English Language

Use of English and use of ChatGPT / other LLMs.

English should be improved dramatically – some examples are:

exploits -> relies on, uses

"defect features"

“false positives or missed detections”

“for walls under complex backgrounds”

“steel channel”

Key words

“imagin”

“andb1”

“circular void”

“accurac”

“detailed explosion”

“CPU memory”

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Intelligent Detection Method of Defects in High-Rise Building Facades Using Infrared Thermography

This paper presents a novel DL-based method for detecting defects in building facades. The proposed method consists of segmentation followed by classification, based on the DeepLabV3+ and YOLOv11 models, respectively. It also features a sub-pixel refinement module, i.e., Post-Decoder Dual-Branch Boundary Refinement, to mitigate inherent problems in the imaging modality and hence to better handle detailed defects such as cracks and cavities. The method was validated on an infrared thermography dataset of reasonable size and shown to outperform the competing techniques. The novelty of the method is sufficient and suitable for scientific publication. The manuscript is well prepared, although its format can be rectified during the revision. However, some issues need addressing to improve its quality.

The motivation regarding the limitations of DeepLabV3+ is supported by refs. [18] and [19], which clearly are closely related works. Their summary should also be introduced following ref. [17], to direct the focus of DeepLabV3+ on defect detection applications.
Accordingly, the proposed method should in fact be benchmarked against those works [18, 19], in addition to typical baselines and ablation studies.
Please enhance the details in Fig. 1 to highlight the difference between the appearance of the input and output images. Please also provide a sample visualization to demonstrate the ‘feature degradation’, ‘boundary blurring’, and ‘interference due to complex texture’ issues. The authors may move Figs. 6 and 7 here and explain them.
Section 1.2.1: Please give a brief overview of [22] (boundary-aware network), regarding the original contribution and the current adoption and/or modifications. In addition, please also provide the rationale surrounding the empirical weights 0.5, 0.3, and 0.2. The same applies to other parameters presented in Section 4.3.
Section 1.2.2: Similarly, please clarify whether this branch is an original innovation by the authors or has been adopted from another work. In the latter case, please provide relevant citations and an integration statement.
To facilitate reproducibility, please publish the dataset along with its detailed description on a public repository or the MDPI’s data sharing scheme. If the authors are unable to do so, please state their constraints.
In addition to Fig. 8, provide visualizations of selected results to demonstrate the proposed method’s ability to detect defects under various environmental factors (ambient temperature, sunlight intensity and background interference). Compare these visualizations against the baseline method to showcase the proposed model’s effectiveness in resolving the problem.

Comments on Language

The writing is fine. Despite some typos, readers have no trouble understanding its content.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

I think it is difficult to justify the proposed idea based on DeepLab3.
First, DeepLab3 is an old model, and thus Fig.1 needs to be removed.
Then, authors need to compare the proposed method with SOTA Segment Anything3.
Also, authors need to check Table 3.
For example, the Accuracy values for YOLO12x (80%) versus DeepLab3+YOLO12x (70%), and for FasterRCNN (49%) versus DeepLab3+FasterRCNN (42%), need to be checked.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes a two-stage detection framework that integrates an improved DeepLabV3+ model with YOLOV11 for accurate identification of building facade defects in complex infrared scenarios, addressing a topic of clear practical relevance. While the experimental section compares various segmentation and detection models, it lacks comparisons with recent Transformer-based architectures or more advanced lightweight segmentation networks, which somewhat limits the currency and comprehensiveness of the performance evaluation. Additionally, labels in some figures are unclear; it is recommended that these be reviewed and provided with consistent explanations.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This version is much better, but still lacks the important details.

1) Since the mAP@50-95 values are not very high, the results should be accompanied by comments on why the score falls so much compared with the mAP@50 values; the drop shows that the method is not very precise compared with the ‘ideal’ variant.

2) Precision (which is neither accuracy nor recall) is still missing from the text and tables; please add it as an additional metric.

3) The per-class analysis is still absent. This means adding table(s) with the metrics (F1, mAP@50-95, etc.) for each class of defect as table rows (5 defect types in the manuscript). This is important for identifying which classes of defects are harder to detect.

4) If the authors cannot share the code and/or dataset, the paper needs more ‘proof’. At the least, the authors should add a comparison on more data, for example from https://doi.org/10.20944/preprints202510.2032.v1 (which, incidentally, contains much richer data than that prepared by the authors). An alternative approach is also presented in that paper, so it should at least be added to the comparison (‘Discussion’) section.

5) ‘Discussion’ section should precede the ‘Conclusions’, and should include the comparative analysis of the presented method.

6) as to the hyperparameters, it is recommended to provide at least descriptive argumentation / comments for the chosen optimal (?) values, and which value ranges have been examined by authors during the research.

7) some references are still badly formatted (1,4,8,22,24): for example, doi:https://doi.org/10.1016/j.jobe.2024.110122.
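
The malformed pattern quoted in point 7 (a `doi:` label followed by a full resolver URL) can be cleaned up mechanically. The sketch below is one possible approach, not a prescribed tool: the function name `normalize_doi` is hypothetical, and the target form (a single `https://doi.org/` URL) is an assumption that should be checked against the journal's reference style.

```python
import re

# Matches any leading mix of "doi:" labels and doi.org resolver prefixes.
_PREFIX = re.compile(
    r"^\s*(?:(?:doi:\s*)|(?:https?://(?:dx\.)?doi\.org/))+",
    re.IGNORECASE,
)

def normalize_doi(raw):
    """Return a reference's DOI as a single https://doi.org/ URL."""
    bare = _PREFIX.sub("", raw).strip()
    return f"https://doi.org/{bare}"

print(normalize_doi("doi:https://doi.org/10.1016/j.jobe.2024.110122"))
```

Strings that already carry a bare DOI or a clean resolver URL pass through to the same normalized form.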

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

This manuscript was revised according to my comments.

Author Response

Comments 1: This manuscript was revised according to my comments.

Response 1: We sincerely thank the reviewer for the positive assessment and for confirming that the revisions have met your expectations. We are deeply grateful for the time and effort you dedicated to reviewing our work. Your constructive comments and professional insights during the previous round of review were instrumental in improving the quality, clarity, and rigor of this manuscript.