Investigation of the Transferability of Measured Data for Application of YOLOv8s in the Identification of Road Defects: An SA-Indian Case Study
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper proposes a road damage detection framework using YOLOv8s, evaluating its transferability and domain generalization between datasets collected in South Africa (RDD2024_SA) and India (RDD2022_India). The study explores four training–testing configurations to assess intra- and inter-dataset generalization, with results showing high accuracy when trained and tested within the same domain but significant degradation when tested cross-domain. The work emphasizes the need for domain alignment, data quality, and class balancing for effective deep learning deployment in real-world road maintenance systems.
- The RDD2024_SA dataset contains only 489 images, which is insufficient for a robust deep learning experiment, especially when claiming generalization analysis. The small dataset likely caused the failure in Scenario 4 and overfitting in Scenario 1.
- The study reports single mAP, precision, and recall values without confidence intervals, variance estimates, or repeated trials. There is no statistical evidence to support the observed differences across scenarios.
- While the work focuses on domain transferability, it does not explore or even briefly test simple transfer learning techniques (e.g., fine-tuning the model trained on India data using a subset of SA data).
- The class reduction from six to four categories in Scenario 3 improves results, but the rationale is post hoc and not systematically analyzed. Furthermore, some classes (e.g., D01, D43) are inconsistently annotated between datasets.
- Figures such as the Precision–Recall curves (Figures 10–11) lack error bars, axis scales, or numerical context. Some tables (e.g., Tables 5–7) are difficult to interpret without explicit units or averaging method descriptions.
- Some related works may be missing: [1] Deep-IRTarget: An automatic target detector in infrared imagery using dual-domain feature extraction and allocation; [2] Differential Feature Awareness Network within Antagonistic Learning for Infrared-Visible Object Detection.
Author Response
Comment 1: The RDD2024_SA dataset contains only 489 images, which is insufficient for a robust deep learning experiment, especially when claiming generalization analysis. The small dataset likely caused the failure in Scenario 4 and overfitting in Scenario 1.
Response 1: We acknowledge this limitation and have now explicitly stated it in Section 5.1. We clarified that this dataset serves as a pilot study demonstrating cross-regional feasibility under resource constraints.
Comment 2: The study reports single mAP, precision, and recall values without confidence intervals, variance estimates, or repeated trials. There is no statistical evidence to support the observed differences across scenarios.
Response 2: We agree that statistical reporting strengthens the findings. We have added a “Reporting uncertainty” paragraph in Section 4.2 describing the plan to report mean±SD over repeated runs and bootstrap 95% CIs for mAP/precision/recall at the image level in a future expanded dataset release. For this current submission, we explicitly flag the lack of repeated trials as a limitation and avoid over-interpreting pairwise differences across scenarios.
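For concreteness, a minimal sketch of the planned image-level percentile bootstrap is given below. It assumes per-image metric values (e.g., AP@0.5) have been exported from the evaluation run; the values shown are purely illustrative, not actual results.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def bootstrap_ci(per_image_scores, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean of an image-level metric."""
    scores = np.asarray(per_image_scores, dtype=float)
    n = len(scores)
    # Resample images with replacement and record each resampled mean
    means = np.array([rng.choice(scores, size=n, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Illustrative per-image AP@0.5 values (placeholders)
mean_ap, (ci_lo, ci_hi) = bootstrap_ci([0.91, 0.88, 0.95, 0.79, 0.93])
print(f"mAP@0.5 = {mean_ap:.3f}, 95% CI [{ci_lo:.3f}, {ci_hi:.3f}]")
```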
Comment 3: While the work focuses on domain transferability, it does not explore or even briefly test simple transfer learning techniques (e.g., fine-tuning the model trained on India data using a subset of SA data).
Response 3: We thank the reviewer for this valuable suggestion. Accordingly, we added a new Scenario 5 to test transfer learning. The YOLOv8s model pretrained on RDD2022_India was fine-tuned on 20 % of the RDD2024_SA dataset (seven-class configuration) for 40 epochs at a learning rate of 0.0005, freezing the first 10 backbone layers to retain low-level feature representations.
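For reproducibility, this configuration maps onto the Ultralytics training API roughly as sketched below; the weight and dataset-YAML paths are hypothetical placeholders, not the exact files used in our runs.

```python
from ultralytics import YOLO

# Start from the India-pretrained weights (path is a placeholder)
model = YOLO("runs/detect/rdd2022_india/weights/best.pt")

# Fine-tune on the 20% RDD2024_SA subset; freeze=10 keeps the first
# 10 backbone layers fixed to retain low-level feature representations
model.train(
    data="rdd2024_sa_subset.yaml",  # placeholder dataset config
    epochs=40,
    lr0=0.0005,
    freeze=10,
)

metrics = model.val()  # reports mAP@0.5, precision, recall on the val split
```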
The fine-tuned model achieved mAP@0.5 = 0.862, precision = 1.00, recall = 0.88, and F1 = 0.88 at a confidence threshold of 0.503, far higher than the Scenario 2 cross-domain baseline (mAP = 0.32, precision = 0.36, recall = 0.37). This demonstrates that even limited fine-tuning on a small, locally representative subset effectively restores generalization performance across domains.
We have added a new section titled Scenario 5: Transfer Learning from India to South Africa (with Figures 11a, b) to the revised manuscript, together with a discussion of its engineering implications for sustainable, adaptive AI-based road-maintenance systems.
Comment 4: The class reduction from six to four categories in Scenario 3 improves results, but the rationale is post hoc and not systematically analyzed. Furthermore, some classes (e.g., D01, D43) are inconsistently annotated between datasets.
Response 4: Class balance and consistency: D01 (transverse/linear cracks) and D43 (white line blur) displayed sparse and inconsistent annotations across datasets. To assess robustness under cleaner labels, we evaluated a 4-class taxonomy (D00, D20, D40, D44), which markedly stabilized training and improved mAP (see Table 7 and Figure 9).
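As an illustration, the taxonomy reduction can be implemented as a simple remapping of YOLO-format label files. The class-id assignments below are hypothetical and must match the dataset's actual YAML ordering.

```python
from pathlib import Path

# Hypothetical original ids: 0=D00, 1=D01, 2=D20, 3=D40, 4=D43, 5=D44.
# D01 and D43 are dropped; the remaining four classes are renumbered.
KEEP = {0: 0, 2: 1, 3: 2, 5: 3}

def remap_labels(label_dir: str) -> None:
    """Drop excluded classes and renumber YOLO class ids in place."""
    for txt in Path(label_dir).glob("*.txt"):
        kept = []
        for line in txt.read_text().splitlines():
            cls, *coords = line.split()
            if int(cls) in KEEP:
                kept.append(" ".join([str(KEEP[int(cls)])] + coords))
        txt.write_text("\n".join(kept))
```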
Comment 5: Figures such as the Precision–Recall curves (Figures 10–11) lack error bars, axis scales, or numerical context. Some tables (e.g., Tables 5–7) are difficult to interpret without explicit units or averaging method descriptions.
Response 5: We have revised the figure captions to state the axes, IoU thresholds, and averaging rules, and updated the table captions to specify whether values are per-class or macro-averages and the IoU setting (mAP@0.5). We have also added units (GB) where needed and ensured that all figures are exported as vector graphics where applicable.
Comment 6: Some related works may be missing. [1] Deep-IRTarget: An automatic target detector in infrared imagery using dual-domain feature extraction and allocation [2] Differential Feature Awareness Network within Antagonistic Learning for Infrared-Visible Object Detection.
Response 6: Thank you. We have added the two suggested works on cross-domain object detection to the references in Section 2 to broaden the context on dual-domain feature extraction and antagonistic learning for robust detection. (Full bibliographic entries will be inserted in the reference list upon finalization.)
Reviewer 2 Report
Comments and Suggestions for Authors
The study evaluates a YOLOv8-based approach for road surface damage detection, comparing performance across diverse datasets and scenarios to highlight the impact of data diversity, annotation quality, and environmental factors on model accuracy. However, it should be greatly enhanced to meet the requirements of this journal.
- The YOLOv8 algorithm has been available for many years. The authors should clarify the rationale for selecting this algorithm in the study.
- Some figures (e.g., Figure 6) lack clarity.
- In the introduction of Scenarios, the authors should explicitly state the train/validation/test split ratios or explicitly reference the data partitions from Table 2.
- Section 5.1 fails to specify the lighting conditions and climate factors under which the dataset was collected.
- In Section 5.1, the unit "GP" is incorrect. It should be replaced with "GB" to align with standard data storage terminology.
- In Scenario 2, the authors note that the model detects "white line blur" but fails to detect "alligator cracks." However, no explanation is provided for this discrepancy.
- In Scenario 3, the exclusion of classes D01 and D43 is attributed to "sparse annotations," but no quantitative evidence is provided to support this claim.
- The text contains some unresolved "Reference source not found".
- The phrase "lack of diversity" in the conclusion is vague. The authors should specify what it refers to.
Author Response
Comment 1: The YOLOv8 algorithm has been available for many years. The authors should clarify the rationale for selecting this algorithm in the study.
Response 1: Thank you. We now justify the choice of YOLOv8s explicitly: it offers competitive accuracy–speed trade-offs, an improved head and loss functions over prior YOLO variants, strong small-object performance (relevant to fine crack patterns), and well-maintained tooling that eases reproducibility on modest hardware. We added a short comparison rationale in the Introduction and a note in the Discussion that results are framed as a case study rather than a claim of architectural superiority.
Comment 2: Some figures (e.g., Figure 6) lack clarity.
Response 2: We thank the reviewer for noting that Figure 6 requires improved clarity. We plan to use the MDPI figure-editing service to improve Figure 6 and the other figures in the manuscript.
Comment 3: In the introduction of Scenarios, the authors should explicitly state the train/validation/test split ratios or explicitly reference the data partitions from Table 2.
Response 3: We appreciate this observation. We have now added explicit statements of the train/validation/test split ratios and the corresponding image counts for both datasets (RDD2024_SA and RDD2022_India) in Table 2, and noted that these partitions apply to all scenarios.
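A minimal sketch of a reproducible partition is shown below; the 70/20/10 ratios and directory path are illustrative placeholders only, with the authoritative counts being those listed in Table 2.

```python
import random
from pathlib import Path

random.seed(42)  # fixed seed so the partition is reproducible

images = sorted(Path("rdd2024_sa/images").glob("*.jpg"))  # placeholder path
random.shuffle(images)

# Illustrative 70/20/10 train/val/test ratios
n = len(images)
train = images[: int(0.7 * n)]
val = images[int(0.7 * n): int(0.9 * n)]
test = images[int(0.9 * n):]
```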
Comment 4: Section 5.1 fails to specify the lighting conditions and climate factors under which the dataset was collected.
Response 4: We have added a data-collection note under Section 5.1.1 describing the daylight windows, weather, and season during collection, and their expected influence on texture and contrast.
Comment 5: In Section 5.1, the unit "GP" is incorrect. It should be replaced with "GB" to align with standard data storage terminology.
Response 5: Thank you. The changes have been made to replace 'GP' with 'GB'.
Comment 6: In Scenario 2, the authors note that the model detects "white line blur" but fails to detect "alligator cracks." However, no explanation is provided for this discrepancy.
Response 6: We have added an explanatory paragraph: "white line blur" exhibits high-contrast, contiguous patterns aligned with lane markings (stable spatial frequency), whereas "alligator cracks" present fractured, low-contrast, multi-scale textures with greater intra-class variation and annotation-style differences between datasets. Further explanation of this discrepancy is provided in the Discussion (Section 7).
Comment 7: In Scenario 3, the exclusion of classes D01 and D43 is attributed to "sparse annotations," but no quantitative evidence is provided to support this claim.
Response 7: We thank the reviewer for this important observation. Upon re-evaluating the class-distribution characteristics of the RDD2022_India dataset, we agree that the original wording (“sparse annotations”) did not accurately reflect the underlying data, particularly for class D43. We have therefore revised the manuscript to remove this statement.
Instead, we now clarify that the decision to reduce the class structure from six to four categories was motivated by annotation inconsistencies, intra-class variability, and label noise observed in the full six-class configuration. These issues led to unstable feature learning and near-zero mAP values for D01 and D43 in Scenario 3. The refined four-class configuration (D00, D20, D40, D44) yielded a substantial performance improvement (mAP@0.5 = 0.7755), demonstrating that simplifying the taxonomy and focusing on consistently annotated categories improves model stability and generalization.
The revised text in the Scenario 3 discussion now provides an accurate rationale that aligns with the dataset characteristics and resolves the concern raised by the reviewer.
Comment 8: The text contains some unresolved "Reference source not found".
Response 8: All broken cross-references have been resolved.
Comment 9: The phrase "lack of diversity" in the conclusion is vague. The authors should specify what it refers to.
Response 9: We thank the reviewer for this helpful observation. The phrase “lack of diversity” has now been revised to specify the exact dataset characteristics that contributed to cross-domain degradation. In the updated conclusion, we clarify that the observed performance drop is linked to differences in illumination and weather conditions, camera viewpoint and height, pavement-material appearance, annotation style, and inter-dataset class-distribution characteristics. This provides a precise explanation of the factors that limited generalization. The revised conclusion now reads (excerpt):
“Observed cross-domain degradation is primarily linked to variations in illumination and weather conditions, camera viewpoint, pavement materials, annotation style, and inter-dataset class-distribution differences, all of which affect feature transferability across regions.”
Reviewer 3 Report
Comments and Suggestions for Authors
This paper presents a case study on the identification of road defects by YOLOv8s. The paper has a certain significance and is within the scope of Sustainability. However, the innovation and contribution of this paper are not obvious; the presentation of the paper is currently very preliminary, and it will not be recommended for publication unless thoroughly revised. The main issues are listed below:
- The YOLOv8 network has been widely applied, and it is also common to use machine vision methods for the identification of road defects. Does the paper provide meaningful engineering conclusions based on the analysis results of the specific datasets? This may be the only contribution of this paper.
- The abstract of the paper is too long and needs to be thoroughly rewritten to highlight the innovation and contribution of the paper.
- Why are there two sections of Materials and Methods?
- Why are there two sections of Discussion?
- What does Figure 7 intend to show? What advantages of the proposed method do the authors want to highlight here? Further explanation and discussion are needed.
- The clarity of many figures in the paper is insufficient, especially Figures 7 and 10. It is recommended to replace them with vector graphics.
- It is suggested to compare the results horizontally with SOTA (state-of-the-art) methods, for example, other versions of YOLO networks.
- Lines 562~563, there is an abnormal display here.
- Artificial intelligence algorithms used in engineering are always expected to have a certain degree of interpretability. Is there any consideration of interpretability in this paper? It is suggested to add some discussion and perspective on this point. Some relevant literature can be used as positive and negative support for the discussion, such as: https://doi.org/10.1016/j.engappai.2025.112069
- The language of the paper needs to be improved.
The English of this paper needs moderate revision.
Author Response
Comment 1: The YOLOv8 network has been widely applied, and it is also common to use machine vision methods for the identification of road defects. Does the paper provide meaningful engineering conclusions based on the analysis results of the specific datasets? This may be the only contribution of this paper.
Response 1: We appreciate the reviewer’s insightful question regarding the practical engineering value of our study. The paper now articulates the engineering-oriented conclusions derived from the dataset-specific analyses. In the Discussion and Conclusion, we emphasize that this work provides empirical evidence for how dataset domain, class distribution, and annotation consistency directly influence the deployability and reliability of vision-based road-damage detectors in real operational settings. The main engineering takeaways added are:
- Localized data collection yields high precision but limited transferability, highlighting the need for domain-specific calibration before field deployment.
- Cross-domain degradation stems largely from mismatched illumination, texture, and camera geometry, factors that can be mitigated through data-driven calibration and light transfer-learning adaptation.
- Class restructuring and balancing (reducing from six to four dominant classes) markedly improve model stability and interpretability, offering guidance for practical annotation protocols in future road-condition monitoring campaigns.
These points have been added to the Discussion (Section 7), clarifying that the study contributes concrete, evidence-based guidance for engineers deploying AI-based pavement inspection systems in heterogeneous environments.
Comment 2: The abstract of the paper is too long and needs to be thoroughly rewritten to highlight the innovation and contribution of the paper.
Response 2: We appreciate the reviewer's helpful feedback. The abstract has been shortened to approximately 200 words to foreground the cross-domain problem, the four transferability scenarios, the key finding (strong in-domain but fragile cross-domain performance), and the operational implications. The new version clearly presents the research motivation, methodological design, and major insights regarding dataset generalization and class structuring, and reinforces the relevance of this South Africa–India comparative case study to sustainable, AI-enabled road-maintenance systems.
Comment 3: Why are there two sections of Materials and Methods?
Response 3: Thank you. Section 2 has been renamed "Related Work", and Section 3 is now titled "Materials and Methods".
Comment 4: Why are there two sections of Discussion?
Response 4: We have merged the two duplicate sections into a single Section 7 (Discussion).
Comment 5: What does Figure 7 intend to show? What advantages of the proposed method do the authors want to highlight here? Further explanation and discussion are needed.
Response 5: We thank the reviewer for this important observation. Figure 7 has been clarified as the visual diagnostic result of Scenario 3, where the YOLOv8s model was trained on the RDD2022_India dataset containing six damage classes before reduction to four. The figure provides insight into the distribution, scale, and positional spread of labeled defects, helping to explain the model’s unstable performance in this configuration. Specifically, the upper bar plot shows severe class imbalance, with D01 and D20 containing very few samples compared to D00, D43, and D44. The overlaid bounding-box map and heatmaps show that most annotations are small and spatially clustered, with narrow width–height ratios, indicating limited geometric diversity.
These observations justify the later class consolidation from six to four, which improved model convergence and mAP. The advantage of the proposed YOLOv8s approach is its anchor-free, multi-scale detection head, allowing it to partially learn from this heterogeneous data even under imbalance, something less flexible detectors would struggle with. We have expanded the figure caption and added a short explanatory paragraph in the Discussion.
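For reference, the class-count and box-geometry statistics visualized in Figure 7 can be reproduced from the YOLO-format label files with a short script such as the sketch below; the directory path is a placeholder.

```python
from collections import Counter
from pathlib import Path

def label_stats(label_dir: str):
    """Per-class box counts and normalized width/height lists from
    YOLO-format labels (one 'cls x y w h' row per box)."""
    counts, widths, heights = Counter(), [], []
    for txt in Path(label_dir).glob("*.txt"):
        for line in txt.read_text().splitlines():
            cls, _x, _y, w, h = line.split()
            counts[int(cls)] += 1
            widths.append(float(w))
            heights.append(float(h))
    return counts, widths, heights

counts, widths, heights = label_stats("rdd2022_india/labels/train")  # placeholder
print(counts.most_common())  # exposes the class imbalance shown in Figure 7
```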
Comment 6: The clarity of many figures in the paper is insufficient, especially Figures 7 and 10. It is recommended to replace them with vector graphics.
Response 6: We will make use of the MDPI figure-editing service to improve these figures.
Comment 7: It is suggested to compare the results horizontally with SOTA (state-of-the-art) methods, for example, other versions of YOLO networks.
Response 7: We appreciate this valuable suggestion. Given that the main aim of this study is to evaluate cross-domain transferability rather than to perform architecture benchmarking, we have provided a literature-based contextual comparison instead of additional experiments. A new paragraph has been added to the Discussion section comparing the present YOLOv8s results with representative state-of-the-art studies employing YOLOv4, YOLOv5, and YOLOv7.
Specifically, the comparison references:
- RDD-YOLOv5 (Jiang et al., 2023), which achieved a mAP of 91.48% using transformer-enhanced self-attention;
- YOLOv7 (Pham et al., 2022), which obtained F1-scores of 81.7% and 74.1% on Google Street View road-damage data; and
- YOLOv4 + DeepLabv3 (James, 2021), which demonstrated effective detection and segmentation on small pavement datasets from the Philippines.
These works collectively establish benchmark performance ranges (mAP ≈ 0.70–0.91), within which our YOLOv8s model performs competitively (mAP@0.5 = 0.95 in-domain; ≈ 0.78 after class reduction). The added paragraph emphasizes that, unlike previous architecture-centric efforts, our contribution lies in quantifying how dataset diversity, annotation consistency, and class structuring affect domain generalization between South African and Indian contexts. This addition clarifies both the comparative standing and the unique novelty of the current work.
Comment 8: Lines 562~563, there is an abnormal display here.
Response 8: The abnormal display has been fixed.
Comment 9: Artificial intelligence algorithms used in engineering are always expected to have a certain degree of interpretability. Is there any consideration of interpretability in this paper? It is suggested to add some discussion and perspective on this point. Some relevant literature can be used as positive and negative support for the discussion, such as: https://doi.org/10.1016/j.engappai.2025.112069
Response 9: We thank the reviewer for highlighting the importance of interpretability in engineering-oriented AI systems. While the primary focus of this study was evaluating cross-domain generalization and transfer learning performance, we agree that model interpretability is essential for practical adoption in infrastructure management. To address this, we have added a statement in the "Conclusion" acknowledging the need for explainable AI (XAI) techniques such as class-activation mapping and attention visualization to support more transparent decision-making in future deployments. We have also cited the suggested literature to contextualize this direction. This addition clarifies our position and outlines interpretability as an important avenue for future work.
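To indicate what such an analysis could look like, the sketch below applies Grad-CAM-style class-activation mapping to a generic torchvision classifier as a stand-in; adapting it to the YOLOv8s detection head is part of the future work described above, not something performed in this study.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Stand-in classifier (random weights; load trained weights in practice)
model = resnet18(weights=None).eval()
acts, grads = {}, {}

# Hook the last residual block to capture activations and their gradients
layer = model.layer4[-1]
layer.register_forward_hook(lambda m, i, o: acts.update(value=o.detach()))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(value=go[0].detach()))

x = torch.randn(1, 3, 224, 224)      # stand-in for a road-image tensor
scores = model(x)
cls = scores.argmax(dim=1).item()    # class whose evidence we visualize
model.zero_grad()
scores[0, cls].backward()

# Grad-CAM: channel weights are the spatially pooled gradients
weights = grads["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * acts["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```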
Comment 10: The language of the paper needs to be improved.
Response 10: We will make use of the MDPI professional language-editing service.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Accept.
Author Response
Comment 1: Accept.
Response 1: Thank you.
Reviewer 2 Report
Comments and Suggestions for Authors
There are some 'Error! Reference source not found.' problems that remain to be resolved.
Author Response
Comment 1: There are some 'Error! Reference source not found.' problems that remain to be resolved.
Response 1: We have resolved the 'Reference source not found' issues. Thank you.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have addressed most of the reviewer's comments. A minor revision needs to be done before recommending publication:
- Page 23, there are still many abnormal displays here due to citations. The citation of the whole text needs to be carefully checked.
- The robustness of artificial intelligence algorithms in engineering applications has long been a widely discussed concern among scholars. Does the method proposed by the authors include a robustness-improvement strategy for interference such as shadows and lighting changes? It is suggested to add some discussion on this point. Some literature can be used as support for the discussion, such as: https://doi.org/10.1109/TIM.2023.3343742 , https://doi.org/10.1109/JSEN.2023.3294912
The English of this paper needs a minor revision.
Author Response
Comment 1: Page 23, there are still many abnormal displays here due to citations. The citations throughout the text need to be carefully checked.
Response 1: Thank you; the issues with the references and citations have been resolved.
Comment 2: The robustness of artificial intelligence algorithms in engineering applications has long been a widely discussed concern among scholars. Does the method proposed by the authors include a robustness-improvement strategy for interference such as shadows and lighting changes? It is suggested to add some discussion on this point. Some literature can be used as support for the discussion, such as: https://doi.org/10.1109/TIM.2023.3343742 , https://doi.org/10.1109/JSEN.2023.3294912
Response 2: We appreciate the reviewer’s valuable observation regarding robustness to shadows, lighting variations, and other environmental interference. The focus of the present study was to assess cross-domain transferability between South African and Indian datasets, rather than to develop or evaluate robustness-enhancing mechanisms. Accordingly, we used the standard YOLOv8s augmentation and preprocessing pipeline without introducing specialized illumination-invariant or shadow-resistant strategies.
We acknowledge that robustness to environmental variability is an important consideration for future development of practical road-damage detection systems. While this falls outside the scope of the present work, we agree that integrating shadow-aware augmentation, adaptive normalization, or illumination-invariant feature extraction represents a meaningful direction for future research. We thank the reviewer for highlighting this point.
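As a pointer for that future direction (not part of the present pipeline), a minimal shadow- and illumination-oriented augmentation sketch using the Albumentations library might look as follows; all probabilities are illustrative.

```python
import albumentations as A

# Illumination/shadow perturbations applied alongside YOLO-format boxes
transform = A.Compose(
    [
        A.RandomShadow(p=0.3),              # synthetic cast shadows
        A.RandomBrightnessContrast(p=0.5),  # exposure and contrast shifts
        A.HueSaturationValue(p=0.3),        # colour-temperature drift
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# image: HxWx3 uint8 array; bboxes: [(x, y, w, h), ...] normalized YOLO boxes
# out = transform(image=image, bboxes=bboxes, class_labels=class_labels)
```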

