Gray Brick Wall Surface Damage Detection of Traditional Chinese Buildings in Macau: Damage Quantification and Thermodynamic Analysis Method via YOLOv8 Technology
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
In this paper, a wall damage detection method based on the YOLOv8 deep learning model is proposed for the Lingnan gray brick historic buildings in the Macao World Heritage Site. The study provides an automated tool for cultural heritage preservation. However, the article still has the following problems:
- In 1.3, point out the problems faced and clarify the innovation points of the article.
- The data volume is only 375 images, whether it is sufficient to explain the problem or not, I think the sample size is insufficient.
- There are only 38 images (accounting for 10%) in the test set?
- In section 2.2, mark the meanings of CV2, C2f, etc., as well as the size of Conv, in Figure 2.
- The flowchart description in Figure 2 is rather vague. Please use formulas and other methods to describe the relevant steps in detail.
- In the experimental part, there is no comparison with other YOLO series models, and supplementary comparisons are needed to highlight the innovation of this study.
- In the experimental part, the results of the different models in Figures 4, 5, and 6 should be combined in the same figure so that the comparison can be seen more accurately.
- In the results part, the heat map does not explain the corresponding meaning of the shade of color, and a detailed legend needs to be added.
- Overall, this paper has certain engineering significance, but its innovativeness is relatively weak.
Author Response
In this paper, a wall damage detection method based on the YOLOv8 deep learning model is proposed for the Lingnan gray brick historic buildings in the Macao World Heritage Site. The study provides an automated tool for cultural heritage preservation. However, the article still has the following problems:
In 1.3, point out the problems faced and clarify the innovation points of the article.
Response: Thank you for your comments. We have revised the innovation points in Section 1.3.
Previous studies have repeatedly verified that the training set is the basis of model learning and that its quality and quantity directly affect model performance. Researchers have therefore tried to expand the scope of image collection and increase the number of labels when creating datasets. Meanwhile, to improve the identification of damage to historical gray brick buildings in the Macau World Heritage Site and to better estimate subsequent repair costs, this study aims to build an automatic recognition and quantitative detection system for surface damage on historical gray brick buildings in Macau. The system automatically detects eight surface conditions of historical brick walls: crack, damage, missing, vandalism, moss, stain, plant, and intact, to achieve a more efficient, standardized, and sustainable detection mechanism for architectural heritage protection.
The data volume is only 375 images, whether it is sufficient to explain the problem or not, I think the sample size is insufficient.
Response: Thank you for your comments. We fully understand your concern about the sample size. During the data collection phase of this study, images of 162 historical buildings in Macau were systematically collected, and 375 high-quality samples were retained after strict screening for model training. Although the total number of samples is limited, we improved sample diversity and representativeness through multi-dimensional label coverage, fine annotation, data augmentation, and pre-trained model initialization. We also introduced unseen-dataset tests in the heat map and confusion matrix analyses to verify the model's adaptability and generalization ability in complex scenarios. The limitations of the sample size are acknowledged in the paper, and future research will expand the sample size, optimize the label system, and introduce higher-level classification and segmentation strategies to further improve model performance and robustness. Thank you for your valuable suggestions, which we will address in subsequent research.
There are only 38 images (accounting for 10%) in the test set?
Response: Thank you for raising the concern about the size of the test set. The test set indeed contains 38 images, about 10% of the total samples; this split was determined by the limited capacity of the dataset and the need for a reasonable distribution across the training, validation, and test sets. Although the test set is small, we introduced unseen datasets for supplementary testing, thereby verifying the generalization ability and practicality of the model more comprehensively. In addition, we strictly maintained the independence of the data splits to prevent cross-contamination between the training and test sets. The limited sample and test-set sizes are discussed in the paper, and we plan to further expand the sample size and diversity in subsequent research to improve the integrity and robustness of the results. Thank you for your valuable suggestions.
In section 2.2, mark the meanings of CV2, C2f, etc., as well as the size of Conv, in Figure 2.
Response: Thank you for your comments. To keep the figure concise, we will add the following to the figure caption: C2f: CSP-based large residual structure for feature extraction and channel fusion; cv2, cv3: multi-scale detection heads for bounding box regression and classification prediction, respectively; Conv: standard convolutional layer with batch normalization and SiLU activation for feature extraction and downsampling. The corresponding feature map sizes are marked in the figure; for example, (32×32×320) means the feature map after convolution has 32×32 pixels and 320 channels.
The flowchart description in Figure 2 is rather vague. It uses formulas and other methods to describe the relevant steps in detail.
Response: Thank you for your comments. We have further explained the structure and diagram details of YOLOv8, adding the following:
As shown in Figure 2, the YOLOv8 architecture uses a backbone network based on CSPDarknet53. This part extracts feature maps of different spatial scales through a multi-level feature pyramid (P1 to P5), starting from the input 512×512×3 color image, gradually reducing the spatial resolution while increasing the number of feature channels. Specifically, from P1 (256×256×80) to P5 (16×16×640), gray brick surface feature information is extracted from low to high layers through convolution and downsampling operations. The Neck part, as the feature fusion module, uses the C2f structure, which in YOLOv8 replaces the CSPLayer used in YOLOv5. The C2f module contains an input convolution (Conv(c1,2×c)) that divides the input channels into two parts, then achieves dense residual fusion through multiple stacked Bottleneck blocks; finally, the output channels are integrated by a convolution (Conv((2+n)×c,c2)) to improve the perception and expression of multi-layer features. The C2f layers in the Neck output feature maps annotated with sizes such as 64×64×320, 32×32×640, and 16×16×640. Combined with the standard convolution layer (Conv), batch normalization (BN), and the SiLU activation function, this effectively enhances the context awareness of the feature maps. The up- and down-sampling modules (Upsample) and the feature concatenation operation (Concat) further align and fuse features of different scales, improving the robustness of multi-scale features and the overall performance of the model.
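The stride-2 downsampling described above can be sketched in a few lines of Python. The channel widths below are the YOLOv8-x-like values quoted in the text (80 at P1, 640 at P5); the function name is illustrative, not part of the paper's implementation:

```python
# Sketch of the backbone feature-pyramid shapes described above.
# Channel widths follow the YOLOv8-x-like values quoted in the text;
# treat them as illustrative assumptions.

def pyramid_shapes(input_hw=512, channels=(80, 160, 320, 640, 640)):
    """Each pyramid level halves the spatial resolution (stride 2 per level)."""
    shapes = {}
    size = input_hw
    for level, ch in enumerate(channels, start=1):
        size //= 2                      # stride-2 convolution / downsampling
        shapes[f"P{level}"] = (size, size, ch)
    return shapes

shapes = pyramid_shapes()
print(shapes["P1"])  # (256, 256, 80)
print(shapes["P5"])  # (16, 16, 640)
```

This reproduces the sizes annotated in Figure 2: five halvings take the 512×512 input down to 16×16 at P5.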
In the detection head (Head) part, YOLOv8 adopts a decoupled head design that separates the classification and bounding box regression tasks (there is no separate objectness branch). Specifically, the localization branch (cv2 module) consists of three convolution layers (Conv3×3→Conv3×3→Conv1×1) that output the bounding box offsets (with 4×reg_max channels), while the classification branch (cv3 module) adopts a similar structure whose final output has num_classes channels for target classification. In the Head, both branches are applied to each multi-scale feature map (the cv2 and cv3 branches at each scale); the Sigmoid function produces the per-class probabilities, while a Softmax over the discretized box distribution (the DFL bins) produces the bounding box regression output, achieving accurate localization and multi-class discrimination. For the loss functions, YOLOv8 uses CIoU (Complete Intersection over Union) and DFL (Distribution Focal Loss) for the bounding box regression loss and binary cross-entropy for the classification loss, improving small-target detection and the overall detection effect.
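The CIoU term of the regression loss mentioned above can be sketched in plain Python. This is an illustrative computation (function and variable names are ours, not the authors' implementation); the actual loss used in training is 1 − CIoU:

```python
import math

def ciou(box_a, box_b):
    """Complete IoU between two (x1, y1, x2, y2) boxes — illustrative sketch
    of the CIoU term used in the YOLOv8 box-regression loss."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter + 1e-9)
    # Squared center distance over squared diagonal of the enclosing box
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + 1e-9
    # Aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan((bx2 - bx1) / (by2 - by1))
                              - math.atan((ax2 - ax1) / (ay2 - ay1))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return iou - rho2 / c2 - alpha * v

# Identical boxes give CIoU ≈ 1 (zero loss); disjoint boxes give CIoU < 0.
print(round(ciou((0, 0, 10, 10), (0, 0, 10, 10)), 4))  # 1.0
```

Unlike plain IoU, the center-distance and aspect-ratio penalties keep the gradient informative even when predicted and ground-truth boxes do not overlap.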
In addition, to enhance the model visualization expression and engineering practicality, this paper integrates the feature fusion heat map in the simplified path, which clearly shows the response area focused on by the model in feature maps of different scales. The final output result (Output) has a size of 512×512×3, where different color channels represent different prediction categories (such as missing, cracked, contaminated, etc.), to intuitively show the positioning and classification effect of YOLOv8 in gray brick damage detection.
In the experimental part, there is no comparison with other YOLO series models, and supplementary comparisons are needed to highlight the innovation of this study.
Response: Thank you very much for your valuable comments on the experimental part. We understand that you would like the innovation of this study highlighted through comparison with other YOLO series models (such as YOLOv5 and YOLOv7). However, the focus of this paper is to build an automated system for surface damage detection on the gray brick walls of the Macau World Heritage Site based on the YOLOv8 architecture, combined with the specific needs of the architectural field. Since YOLOv8 has been widely recognized in the object detection field for its significant advantages over previous models (YOLOv5, YOLOv7) in speed, accuracy, and architectural optimizations (such as the decoupled head, C2f module, and DFL loss function) (https://doi.org/10.1007/978-981-99-7962-2_39), we chose YOLOv8 directly as the core algorithm for in-depth exploration and application practice in this study, aiming to closely combine the latest deep learning technology with the needs of architectural heritage protection.
In the experimental part, the results of different models in Figure 4, 5 and 6 are combined in the same figure to see the comparison results more accurately.
Response: Thank you for your comments. Figures 4, 5, and 6 have been modified.
In the results part, the heat map does not explain the corresponding meaning of the shade of color, and a detailed legend needs to be added.
Response: Thank you for your comments. These images have been replaced with images with legends added.
Overall, this paper has certain engineering significance, but its innovativeness is relatively weak.
Response: Thank you for your comments. We fully agree with your opinion. We will add the following content to discuss the deficiencies in the research:
This study focuses on the application exploration and engineering practice verification of surface damage identification for the gray brick walls of traditional buildings in Macau based on the existing YOLOv8 model. Compared with innovative research on basic algorithms, the innovation of this paper lies mainly in the implementation of engineering applications, the optimization of detection processes, and the establishment of a multi-dimensional evaluation system. This application research oriented to engineering needs is inevitably subject to the following limitations: (1) Due to the special requirements for real-time performance and adaptability of algorithms in architectural heritage protection, YOLOv8, which balances accuracy and speed, is used as the core detection framework, and no horizontal multi-model comparison or theoretical breakthrough was carried out; (2) Given the limited sample size and the incomplete definition of multi-type labels, the current research focuses on optimizing the application effect and engineering feasibility of existing technologies in specific scenarios, rather than proposing a new theoretical model or algorithm architecture; (3) The diversity and complexity of architectural engineering and cultural heritage protection also limit the depth of algorithm innovation. We will further expand the sample, refine the label system, and explore more innovative solutions in combination with advanced detection algorithms.
We hope this revision meets the requirements.
Reviewer 2 Report
Comments and Suggestions for Authors
The article proposes and substantiates a method for recognizing wall damage based on the YOLOv8 deep learning object detection model. The proposed algorithm effectively solves the problems of low efficiency and low accuracy of manual inspections. The results are interesting, and the method demonstrates its efficiency.
Questions to the authors of the article and remarks.
+ Class distribution.
Could the authors provide information on the distribution of examples across the eight damage categories in the dataset? This information is not provided in the article; it is important for assessing data balance and the potential impact of class imbalance on model quality.
+ Training, validation, and test set partitioning strategy
How was it ensured that the class distribution was maintained across the training, validation, and test sets? The article does not describe exactly how the dataset was partitioned; information on stratification would be useful for ensuring the objectivity of the assessment.
+ Dataset size and overtraining
Given the observed overtraining of the model, how critical is increasing the size of the training set to improve generalization?
Have the authors assessed or speculated on how increasing the number of images (e.g., from ~350 to 500 or 1000) might affect accuracy? The paper mentions retraining, but does not discuss the impact of data volume on model quality.
+ Interclass Confusion and Alternative Classification
Have the authors considered revising the class structure to improve the model's robustness to confusion between similar categories?
Given the confusion found between labels such as "spot" and "damage" or "moss" and "plant," would hierarchical or merged class grouping be useful? The paper mentions interclass confusion, but does not discuss changing the class structure or introducing a hierarchy.
The overall impression is that the article is well written. The questions and comments are aimed at improving the presentability of the article.
Author Response
The article proposes and substantiates a method for recognizing wall damage based on the YOLOv8 deep learning object detection model. The proposed algorithm effectively solves the problems of low efficiency and low accuracy of manual inspections. The results are interesting, and the method demonstrates its efficiency.
Response: Thank you very much for your comments. We have revised the manuscript; please refer to the text marked in red.
Questions to the authors of the article and remarks.
+ Class distribution.
Could the authors provide information on the distribution of examples across the eight damage categories in the dataset? This information is not provided in the article; it is important for assessing data balance and the potential impact of class imbalance on model quality.
Response: Thank you very much for your comments. The relevant content is in Section 2.1.2 of the paper. During the data annotation and construction phase, we finely annotated each image and counted the number of instances of the eight label types (Crack, Damage, Missing, Vandalism, Moss, Stain, Plant, and Intact). Specifically, the 375 high-quality sample images finally selected include about 51 instances of Crack, 57 of Damage, 52 of Missing, 40 of Vandalism, 68 of Moss, 58 of Stain, 49 of Plant, and a number of Intact bricks, thus ensuring that common damage types are covered as diversely and completely as possible with a limited sample size. This distribution information is retained during the model training and testing phases, and the detection performance of the different categories is reflected in the confusion matrix, heat map, and quantitative statistics.
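As a quick illustrative tally of the per-class shares implied by these counts (the Intact class is omitted because the response gives no figure for it; variable names are ours):

```python
# Instance counts quoted in the response above (Intact omitted — no figure given).
counts = {"Crack": 51, "Damage": 57, "Missing": 52, "Vandalism": 40,
          "Moss": 68, "Stain": 58, "Plant": 49}
total = sum(counts.values())
# Percentage share of each class, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(total)           # 375
print(shares["Moss"])  # 18.1 — the largest class share
```

The largest class (Moss, 68) is about 1.7× the smallest (Vandalism, 40), a moderate imbalance that the confusion matrix analysis can help diagnose.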
+ Training, validation, and test set partitioning strategy
How was it ensured that the class distribution was maintained across the training, validation, and test sets? The article does not describe exactly how the dataset was partitioned; information on stratification would be useful for ensuring the objectivity of the assessment.
Response: Thank you very much for your comments. The relevant content is in Section 2.1.3 of the paper. After preliminary classification and statistics of the 375 annotated images, they were divided according to the category distribution using stratified sampling to ensure that each category appears in the training set (303 images), validation set (34 images), and test set (38 images) at a ratio close to the overall distribution of the original dataset. This effectively prevents the model from being biased by large differences in per-category sample sizes during the training, validation, and testing stages, and improves the objectivity and stability of the evaluation results.
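The stratified split described here can be sketched as follows. The ratios and function name are illustrative assumptions (the paper's actual split yields 303/34/38 images); the key point is that each class is shuffled and split separately, so per-class proportions carry over to all three subsets:

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.8, 0.09, 0.11), seed=0):
    """Split sample indices into train/val/test while keeping each class's
    share close to the overall distribution (illustrative sketch)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)            # shuffle within each class
        n = len(members)
        n_train = round(n * ratios[0])
        n_val = round(n * ratios[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]
    return train, val, test

# Toy example: two balanced classes of 50 images each
labels = ["Crack"] * 50 + ["Moss"] * 50
tr, va, te = stratified_split(labels)
print(len(tr), len(va), len(te))
```

Because the split is done per class, a rare class such as Vandalism (40 instances) still appears in all three subsets rather than being absorbed entirely by the training set.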
+ Dataset size and overtraining
Given the observed overtraining of the model, how critical is increasing the size of the training set to improve generalization?
Response: Thank you very much for your concern about the dataset size and overfitting. We fully agree with your point: the model shows some signs of overfitting during training, and this problem is closely related to the limited size of the current training set. As you pointed out, increasing the size of the training set is crucial to improving the generalization ability of the model. This study collected 375 high-quality sample images, of which the training set contains 303 images. Although we compensated for the limited sample size through data cleaning, optimized annotation, and the use of a pre-trained model (YOLOv8_x.pth), and verified the generalization performance on the test set and unseen datasets, we also observed that the detection precision and recall for specific categories (such as Stain and Intact) are insufficient, and that some high-response areas in the heat map do not fully correspond to the actual damage locations, which shows that the model still has a certain tendency to overfit. Therefore, expanding the training set will be an important direction for future research.
Have the authors assessed or speculated on how increasing the number of images (e.g., from ~350 to 500 or 1000) might affect accuracy? The paper mentions retraining, but does not discuss the impact of data volume on model quality.
Response: Thank you very much for your comments. In the current study, we mainly used 375 high-quality images for training, validation, and testing. Although we achieved good detection results under these limited-data conditions (such as strong performance on mAP, F1, and other indicators), we still observed fluctuations in the model's recognition ability for some categories (such as Stain and Intact), and confusion between some categories is visible in the confusion matrix and heat map. We speculate that these phenomena are closely related to the insufficient number of samples, insufficient feature expression, and label distribution bias. In theory, as the amount of data increases, for example by expanding the number of images from the current 375 to 500 or even 1,000, the model would be able to learn richer feature representations, covering more damage types and diverse surface features in real scenes, thereby effectively improving its generalization ability and recognition accuracy. In addition, increasing the amount of data would reduce the sample imbalance of some categories and the model bias and misjudgment caused by too few samples in minority categories.
+ Interclass Confusion and Alternative Classification
Have the authors considered revising the class structure to improve the model's robustness to confusion between similar categories?
Response: Thank you very much for your comments. In this study, we observed that the model had some confusion in identifying certain categories (especially between Stain, Moss and Plant), which was reflected in the confusion matrix analysis and heat map results. The main reasons for this confusion include the similarity in visual features between categories (such as the similar appearance of moss and plant attachments, or the blurred boundary between pollution and minor damage), and the inevitable subjective judgment bias in the annotation stage. In the current study, we used an annotation unit based on complete bricks for damage identification. This strategy simplifies the annotation work and statistical analysis to a certain extent, but it also brings about the problem of incomplete matching of local damage and overall labels. In response to your suggestion, we have clearly pointed out this shortcoming in the discussion section of Chapter 5. In the future, we will consider introducing a more fine-grained annotation strategy, such as a labeling system based on tile level or damage boundary level, so as to more accurately distinguish similar categories.
Given the confusion found between labels such as "spot" and "damage" or "moss" and "plant," would hierarchical or merged class grouping be useful? The paper mentions interclass confusion, but does not discuss changing the class structure or introducing a hierarchy.
Response: Thank you very much for your comments. This study found that the model had some confusion when identifying certain similar categories (such as “Stain” and “Damage”, or “Moss” and “Plant”), reflected in the confusion matrix and heat map results. The main reasons for this confusion are the blurred boundaries of label definitions, the visual similarity of image features, and inconsistencies in local annotation. Although this study tried its best to ensure annotation quality and category balance under the existing category system, it also fully recognizes the impact of the category definition strategy on model performance. To this end, future research will try to introduce a hierarchical classification system: (1) Group the damage types into broad categories (such as distinguishing structural damage from non-structural attachment) and then perform finer-grained classification and recognition within each broad category to reduce the confusion rate between similar categories. (2) Explore merging easily confused categories (such as “Stain” and “Damage”) or adopting a semantically refined annotation strategy to improve the model's ability to distinguish damage types and its generalization performance. (3) Optimize the classification system and improve the labeling strategy, which is expected to further enhance the robustness and accuracy of the model in practical applications, providing more efficient and reliable technical support for the protection of Macao's traditional architectural heritage. Finally, at the model structure and deployment level, we will try more advanced lightweight Transformer or attention-fusion models to improve remote recognition capabilities.
We will also explore the integrated application of functions such as three-dimensional reconstruction and time-series monitoring to create an intelligent platform suitable for historical building inspection, deterioration trend analysis, and digital archive construction, providing technical support and practical examples for the protection of historical and cultural heritage in Macau and the southern coastal cities.
The overall impression is that the article is well written. The questions and comments are aimed at improving the presentability of the article.
Response: Thank you very much for your comments. We sincerely thank you for your time and effort in reviewing this article, which was of great help to us in improving the manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
This article proposes a method for recognizing damage on gray brick walls based on the YOLOv8 model. Relying on 375 annotated images from 162 historical buildings in Macau, the model achieved a mAP of 61.51% and an F1 score of 0.74.
However, this work presents several shortcomings, including:
- The abstract is poorly written and does not convey the study. A well-structured scientific abstract must include:
- A clear context situating the research field and the relevance of the subject,
- The specific research question the study seeks to answer,
- The methodology used, including data, tools, and key steps,
- The overall results, presented concisely and quantitatively if possible,
- Finally, a brief conclusion will be provided highlighting the study’s interest or future perspectives.
- The current introduction has several weaknesses that should be addressed to improve the clarity and scientific rigor of the manuscript. It should:
- Clearly present the research question by discussing existing approaches' limitations and justifying the proposed study's need.
- Explicitly define the objectives and expected results to guide the reader on the potential contributions of the research.
- Highlight the main contributions in a structured bullet-point format.
- Outline the manuscript's structure by briefly describing each section's content.
- The literature review should be clearly separated from the introduction and placed in a dedicated section to review prior works, identify research gaps, and position the authors’ contribution within the state of the art.
- The authors should include a schematic or visual representation of annotated samples to help readers better understand the data annotation process and grasp the nature of the extracted information.
- The authors should present the various model configuration parameters in a table to enhance clarity and readability, and to allow for a quick and structured understanding of the experimental setup.
- The authors should explain the evaluation metrics used in the study by providing their definitions and the associated formulas to ensure a clear understanding of the model’s performance criteria.
Author Response
This article proposes a method for recognizing damage on gray brick walls based on the YOLOv8 model. Relying on 375 annotated images from 162 historical buildings in Macau, the model achieved a mAP of 61.51% and an F1 score of 0.74.
Response: Thank you very much for your comments. We have revised the manuscript; please refer to the text marked in red.
However, this work presents several shortcomings, including:
- The abstract is poorly written and does not convey the study. A well-structured scientific abstract must include:
- A clear context situating the research field and the relevance of the subject,
- The specific research question the study seeks to answer,
- The methodology used, including data, tools, and key steps,
- The overall results, presented concisely and quantitatively if possible,
- Finally, a brief conclusion will be provided highlighting the study’s interest or future perspectives.
Response: Thank you very much for your comments. The revised article abstract is as follows:
The historical Lingnan gray brick buildings in Macau, a World Heritage Site, are facing severe deterioration due to prolonged disrepair, manifesting as cracks, breakages, moss adhesion, and other types of surface damage. These issues threaten not only the structural stability of the buildings but also the conservation of cultural heritage. To address the inefficiencies and low accuracy of traditional manual inspections, this study proposes an automated recognition and quantitative detection method for wall surface damage based on the YOLOv8 deep learning object detection model. A dataset comprising 375 annotated images collected from 162 gray brick historical buildings in Macau was constructed, covering eight damage categories: Crack, Damage, Missing, Vandalism, Moss, Stain, Plant, and Intact. The model was trained and validated using a stratified sampling approach to maintain balanced class distribution, and its performance was comprehensively evaluated through metrics such as mean average precision (mAP), F1 score, and confusion matrices. Results indicate that the best-performing model (Model 3 at the 297th epoch) achieved a mAP of 61.51% and an F1 score of up to 0.74 on the test set, with superior detection accuracy and stability. Heatmap analysis demonstrated the model's ability to accurately focus on damage regions in close-range images, while damage quantification tests showed high consistency with manual assessments, confirming the model's practical viability. Furthermore, a portable, integrated device embedding the trained YOLOv8 model was developed and successfully deployed in real-world scenarios, enabling real-time damage detection and reporting. This study highlights the potential of deep learning technology for enhancing the efficiency and reliability of architectural heritage protection, and provides a foundation for future research involving larger datasets and more refined classification strategies.
- The current introduction has several weaknesses that should be addressed to improve the clarity and scientific rigor of the manuscript. It should:
- Clearly present the research question by discussing existing approaches' limitations and justifying the proposed study's need.
- Explicitly define the objectives and expected results to guide the reader on the potential contributions of the research.
- Highlight the main contributions in a structured bullet-point format.
- Outline the manuscript's structure by briefly describing each section's content.
Response: Thank you very much for your comments. We have summarized the main research objectives:
To improve the identification of damage to historical gray brick buildings in the Macau World Heritage Site and to better estimate subsequent repair costs, this study aims to build an automatic recognition and quantitative detection system for surface damage on historical gray brick buildings in Macau. The system automatically detects eight surface conditions of historical brick walls: crack, damage, missing, vandalism, moss, stain, plant, and intact, to achieve a more efficient, standardized, and sustainable detection mechanism in architectural heritage protection. The researchers explored three core questions: (1) How can the YOLOv8 model be used to build the core technology for identifying these eight types of gray brick surface damage? (2) What are the results of the model's image recognition analysis of gray brick damage types? (3) How can the model be applied in practice?
In the introduction, we included a description of the remaining structure:
The rest of this paper is organized as follows: Section 2 is a literature review, analyzing existing research on machine learning, especially object detection models, in architectural heritage. Section 3 introduces the research field, methods, and processes. Section 4 presents the training process and results of the model. Section 5 takes the traditional gray brick buildings in Macau as an example and analyzes the application results in depth, while also discussing the development of the corresponding recognition device. Section 6 presents the conclusions. Section 7 briefly introduces the invention patents in mainland China corresponding to the current research results.
The literature review should be clearly separated from the introduction and placed in a dedicated section to review prior works, identify research gaps, and position the authors’ contribution within the state of the art.
Response: Thank you very much for your comments. I have separated the literature review from the introduction and revised the subsequent section numbers accordingly.
The authors should include a schematic or visual representation of annotated samples to help readers better understand the data annotation process and grasp the nature of the extracted information.
Response: Thank you very much for your comments. You suggested adding a schematic diagram or annotated sample images to present the construction of the labeled data and the information extraction process more intuitively. We understand your intention: to help readers grasp the logic of data labeling, category definition, and feature extraction through diagrams. However, given the constraints on paper length and figure capacity, we have visualized the research process in Figure 1. In addition, we systematically described the data labeling and sample construction process in the methods and experiments sections, including sample collection, cleaning, labeling standards, and the category definition system, and strove to present the core content of data processing and feature extraction completely and clearly in the text. Although no additional figures of labeled samples were inserted, we believe the detailed, structured process description is sufficient for readers to understand the key steps of data labeling and extraction and their significance for model training.
The authors should present the various model configuration parameters in a table to enhance clarity and readability, and to allow for a quick and structured understanding of the experimental setup.
Response: Thank you very much for your comments. We have added this table in Section 3.1.4.
The authors should explain the evaluation metrics used in the study by providing their definitions and the associated formulas to ensure a clear understanding of the model’s performance criteria.
Response: Thank you very much for your comments. We have added these instructions in Section 3.2. Please refer to the parts in red font.
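Since the formulas added in Section 3.2 are not reproduced in this response, the following is a minimal sketch of the standard object-detection metrics on which scores such as mAP@0.5 are built: box IoU (a prediction counts as a true positive when IoU ≥ 0.5) and per-class precision/recall. The function names are illustrative, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

mAP@0.5 then averages, over all classes, the area under each class's precision-recall curve computed with this 0.5 IoU matching threshold.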
Reviewer 4 Report
Comments and Suggestions for Authors
The abstract is very clear, presenting the general objective and the main impacts of the research.
The introduction contains a contextualization of the basic elements of the research, but it is confusing. The suggestion would be to keep the first and second paragraphs only, and include a third paragraph, which could be the one in section 1.3.
The central aim of the research, which was only mentioned in the abstract, should also be stated again. Do not subdivide the introduction into subsections.
After the Introduction, turn subsection 1.2 into section 2, Literature review. The methodology clearly outlines the stages of the model's analysis.
However, it is not clear how the authors arrived at the studies in Tables 1 and 2.
How were existing studies analyzed and how did they contribute to this research? This needs to be made more explicit in the methodology.
The results present all the tests and simulations carried out very well. Figure 3 could have a better resolution.
Figure 7 also has poor resolution and should be improved. In Figures 13, 14 and 15, the captions inside the figure are not legible, and the image should also be improved.
The conclusion correctly concludes the objective presented in the abstract. The article is relevant and original, and the patent registration is highlighted at the end of the article.
Author Response
The abstract is very clear, presenting the general objective and the main impacts of the research.
Response: Thank you very much for your comments. Taking into account the reviewers' suggestions, we have made appropriate adjustments to the abstract of the article.
The introduction contains a contextualization of the basic elements of the research, but it is confusing. The suggestion would be to keep the first and second paragraphs only, and include a third paragraph, which could be the one in section 1.3.
Response: Thank you very much for your comments. I have separated the introduction and literature review so that we can better emphasize our research objectives in the introduction.
The central aim of the research, which was only mentioned in the abstract, should also be stated again. Do not subdivide the introduction into subsections.
Response: Thank you very much for your comments. I have deleted the secondary heading in the introduction.
After the Introduction, turn subsection 1.2 into section 2, Literature review.
Response: Thank you very much for your comments. I have separated this section into the second chapter as a separate literature review.
The methodology clearly outlines the stages of the model's analysis.
Response: Thank you very much for your comments. First of all, I am very grateful for your affirmation of our research. Taking into account the comments of other reviewers, we have also improved the methodology chapter.
However, it is not clear how the authors arrived at the studies in Tables 1 and 2.
Response: Thank you very much for your comments. Tables 1 and 2 were compiled and summarized from the information reported in the cited literature, and I have indicated the reference numbers of the corresponding studies. This summary aims to show clearly which studies already exist on surface damage of traditional building materials; these are domain-specific, proprietary models, and we try to highlight our unique contributions and innovations from this perspective.
How were existing studies analyzed and how did they contribute to this research? This needs to be made more explicit in the methodology.
Response: Thank you very much for your comments. I have added the corresponding information.
The results present all the tests and simulations carried out very well. Figure 3 could have a better resolution.
Response: Thank you very much for your comments. I have replaced the image with a higher-resolution version. If required for publication, we will also provide the original file to ensure the best presentation.
Figure 7 also has poor resolution and should be improved. In Figures 13, 14 and 15, the captions inside the figure are not legible, and the image should also be improved.
Response: Thank you very much for your comments. I have replaced the images with higher-resolution versions. If required for publication, we will also provide the original files to ensure the best presentation.
The conclusion correctly concludes the objective presented in the abstract. The article is relevant and original, and the patent registration is highlighted at the end of the article.
Response: Thank you very much for your comments. Thank you again for your appreciation of our research and for taking the time to review our article.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
YOLO11 has already been released, and a previous comment asked for a comparison with other YOLO models, but the authors declined. At the same time, the sample size of the dataset did not increase. I do not think this is enough for acceptance. (Also, is it appropriate to include Chinese text in an English-language article?)
Author Response
Thanks to the reviewers for their comments. For the former, we retrained the YOLOv12 model in the latest round of experiments. The details are as follows: (This round of modifications is highlighted in yellow)
4.3 Comparison with YOLOv12
In addition to the original YOLOv8 model, this study also trained and evaluated the newer YOLOv12 model to analyze the performance differences and applicability of the two on this dataset. The parameter settings used in training YOLOv12 are consistent with those of YOLOv8: an input image size of 512×512, a batch size of 2, the SGD optimizer, a cosine-decay learning rate schedule, and 300 training epochs. During training, YOLOv12's early stopping was triggered because performance on the validation set did not improve for several consecutive epochs, so training stopped automatically at epoch 199 (see the YOLOv12_results.csv file in Appendix A for details). Other key settings, such as the data augmentation strategies (mixup, copy-paste, randaugment), the freeze/unfreeze phase division, and the training/validation split ratio, are also consistent (see the args.yaml file in the supplementary material for details).
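The shared training settings described above might appear in an args.yaml file roughly as follows. This is a hypothetical excerpt using Ultralytics-style key names; values not stated in the text (e.g. the early-stopping patience and augmentation strengths) are marked as assumptions, and the authors' actual file in the supplementary material may differ.

```yaml
# Hypothetical args.yaml excerpt (Ultralytics-style keys; illustrative only)
imgsz: 512          # input image size 512x512
batch: 2            # batch size
optimizer: SGD
cos_lr: true        # cosine learning-rate decay
epochs: 300
patience: 50        # early-stopping patience -- value assumed, not reported
mixup: 0.1          # augmentation strengths below are illustrative
copy_paste: 0.1
```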
Figure 13 displays the comparison results. The YOLOv8 model reached its peak performance at epoch 90, with an mAP@0.5 of 0.7128, while YOLOv12 reached its maximum mAP@0.5 of 0.6825 at epoch 95, slightly lower than YOLOv8 overall. The two models show similar mAP curve trends, with YOLOv8 achieving higher detection accuracy and more stable convergence at some stages. This result shows that although YOLOv12, as a newer version, has certain theoretical advantages in feature representation, YOLOv8 still demonstrates stronger adaptability and robustness in the gray brick damage detection task targeted by this study, and in particular maintains detection accuracy better on the small-sample training set.
As for the issue of sample quantity, we honestly acknowledge that the current dataset still comprises 375 annotated images and cannot be further expanded at present. This limitation is not due to ignoring the reviewer's suggestion, but because the survey of local historical buildings in Macau was completed in stages, and it is difficult to obtain new images in the short term. However, we have formulated an expansion plan for the next stage and intend to extend the survey from Macau to other typical Lingnan architectural areas in the Guangdong-Hong Kong-Macao Greater Bay Area (such as Foshan, Guangzhou Liwan, and Zhaoqing) to build a cross-regional, more diverse extended dataset and further enhance the model's generalization ability and practicality. The limited dataset size is a practical constraint of this study; we state it honestly in the discussion section and list data expansion as a key direction for future work.
In addition, the building names are annotated in traditional Chinese because the official languages of Macau are traditional Chinese and Portuguese; English is not among them, and the English names of many buildings are in fact personal translations. We sincerely hope that, modest as our research is, it can give international readers clearer information in this field, so that they can copy the original names into search engines such as Google for further query and browsing. We have previously published with MDPI using similar annotated explanations. Thank you again for your time in reviewing our article.
Reviewer 2 Report
Comments and Suggestions for Authors
I thank the authors for their comprehensive answers and improvements to the text. As I said in the previous review, the paper is interesting and thorough. I propose that the article be accepted.
Author Response
Thank you for your recognition of our research. Thank you again for your time in reviewing our article.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have addressed all my concerns; therefore, I recommend the publication of the article.
Author Response
Thank you for your recognition of our research. Thank you again for your time in reviewing our article.
Reviewer 4 Report
Comments and Suggestions for Authors
Thank you for answering my comments.
Author Response
Thank you for your recognition of our research. Thank you again for your time in reviewing our article.