1. Introduction
Since the early 1980s, road maintenance has become a focus for many municipalities around the world. With proper and frequent maintenance, the design life of roads may be significantly extended. Typically, the cost of reconstruction of a deteriorated road, due to lack of maintenance, may be more than three times the cost of preserving a road through frequent maintenance [
1]. Most road damage recognition strategies for rural roads are based on onerous visual inspections. Visual inspection is the primary method for evaluating the physical and operational state of road infrastructures [
2]. These may involve closing off the inspected road or lane from traffic, thereby diverting traffic to alternative routes, which may result in traffic congestion on those routes. Properly maintained road networks are one of the central goals of the United Nations’ ninth sustainable development goal (UN-SDG 9), which aims to build resilient infrastructure [
3]. In fact, a good road network ensures accessibility to schools, health centers, markets, industry, urban centers, and other places that are key to good livelihoods. Therefore, just as sustainable transport is mainstreamed across all the United Nations SDGs, properly maintained road networks are the lifeblood of the economy [
4]. Road damage, such as potholes and cracks, is one of the causes of traffic accidents in the United States and around the world and may lead to deaths and loss of well-being. It is further reported that road surfaces that are properly maintained and monitored improve comfort, fuel efficiency, and user safety [
5].
Preventive maintenance is crucial for aged pavements (over 15 years old), as they tend to degrade much more quickly than new pavements. At the same time, routine preventive maintenance may be too costly for road agencies in low- and medium-income economies, such as South Africa and India. In these economies, it is therefore common for road agencies to prioritize strategic national road networks, at the expense of the local and feeder road networks that connect communities to essential amenities. These local and feeder road networks tend to be subjected to some form of breakdown maintenance, where repair actions are triggered by the occurrence of major road damage. In addition to the deprivation that societies suffer due to a poor road network, this results in heavily patched-up roads that are uncomfortable to use and, eventually, in the need for total rehabilitation of the road, which is extremely expensive. However, it is expected that these challenges can be put under control with early detection of road damage and timely maintenance actions [
6]. Early detection of damage requiring maintenance action can be achieved through the implementation of a condition-triggered strategy. For roads, roughness has traditionally been used as a primary indicator of pavement condition, as evidenced by the wide usage of different forms of roughness indicators in road maintenance decision-making. Road roughness includes everything from cracks to potholes to random deviations in a profile. As disintegration progresses down the pavement layers over time, these cracks and potholes can become much worse. The most common road distresses are divided into four major classes: cracks, potholes and patches, surface defects, and surface deformations [
7]. Each of these distress types requires a different repair action. Thus, knowledge of the type of distress is useful for deciding which repair action should be deployed on the affected road network. However, the challenge is that road agencies around the world must keep their road transport infrastructure operational through constant maintenance operations [
8,
9]. Early detection of road damage and timely maintenance actions are expected to keep these maintenance demands under control [
6]. The goals of road maintenance and rehabilitation are to maintain the pavement network in the best possible condition while lowering costs, improving service levels, lowering greenhouse gas emissions, and enhancing road user safety [
8].
The traditional methods for identifying road damage and distress are deemed intricate and inadequate when handling a substantial volume of images. For assessing road damage distress, human or semi-automated data collection is presently one of the most utilized approaches [
However, while methods such as human visual inspection are a straightforward way to monitor a road's condition, they are impractical at scale because they are costly, time-consuming, and labor-intensive. More efficient road condition monitoring systems are therefore needed to overcome the challenges faced by these traditional methods, which are still in common use [
11].
Numerous efforts have been made to create a system that combines in-car camera recordings with image processing technology to analyze road properties and examine road surfaces more effectively [
12]. Image-based analysis and exploitation remains one of the most studied approaches to monitoring roads and determining their defects. The advancement of image processing techniques and the availability of inexpensive camera equipment have prompted the creation of model-based pothole detection algorithms. Different types of cameras, such as smartphone, rangefinder, mirrorless, and single-lens reflex cameras, provide a large source of road data [
13].
Deep learning, in particular, convolutional neural networks (CNNs), has demonstrated significant potential among artificial intelligence (AI) techniques for identifying road defects like potholes, cracks, rutting, and patches. High-accuracy real-time analysis is now feasible thanks to object detection algorithms like YOLO (You Only Look Once), which empowers municipal planners and infrastructure managers to make well-informed maintenance decisions. Despite these developments, a gap still exists in evaluating the generalizability of modern algorithms such as YOLOv8 when used on geographically and visually distinct datasets. Few studies have specifically looked at how models trained on datasets obtained in one geographical and climatic region might perform when applied to a different geographical and climatic region.
In this study, two objectives have been identified: to implement the YOLOv8s algorithm for automated detection and classification of road damage using smartphone images and to evaluate the model’s performance across five distinct training–testing scenarios involving datasets obtained from Pretoria and Johannesburg in South Africa, and India.
4. CNN Algorithm-Based Classification
One of the key elements in the testing process is the convolutional neural network (CNN). CNNs are known for their superiority over many other artificial neural networks in processing visual, textual, and audio data. The primary benefit of a CNN is the automatic detection of significant features without human guidance. The CNN architecture is made up of three main layer types, namely: convolutional layers, pooling layers, and fully connected (FC) layers.
The convolutional and pooling layers of the CNN model perform feature learning/extraction, while the fully connected layers perform classification. In the feature-extraction stage, the convolution operation attempts to recognize and isolate distinct aspects of an image for analysis. The extracted features are then passed from the convolutional layer to the pooling layer, which reduces the spatial size of the convolved feature map and thereby the computing requirements, allowing the data to be processed with a significant reduction in dimension. There are two types of pooling, namely, max pooling and average pooling. In max pooling, the largest element within a predefined window of the feature map is retained, while in average pooling, the average of the elements within the window is computed. The fully connected layers, which produce the output of the convolution process, predict the image's class using the previously extracted features.
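The pooling step described above can be sketched in plain Python for intuition. This is a minimal sketch under assumed toy values (the 4×4 feature map, 2×2 window, and stride 2 are made up for the example); deep learning frameworks provide these operations as optimized layers.

```python
# Toy illustration of max pooling vs. average pooling on a 2D feature
# map, using non-overlapping square windows (stride = window size).

def pool(feature_map, window, mode):
    """Downsample a 2D feature map window by window."""
    out = []
    for i in range(0, len(feature_map), window):
        row = []
        for j in range(0, len(feature_map[0]), window):
            patch = [feature_map[i + di][j + dj]
                     for di in range(window)
                     for dj in range(window)]
            if mode == "max":
                row.append(max(patch))               # keep the largest element
            else:
                row.append(sum(patch) / len(patch))  # keep the window average
        out.append(row)
    return out

# A made-up 4x4 feature map:
fmap = [[1, 3, 2, 0],
        [5, 6, 1, 2],
        [0, 2, 4, 4],
        [3, 1, 0, 8]]

max_pooled = pool(fmap, 2, "max")  # [[6, 2], [3, 8]]
avg_pooled = pool(fmap, 2, "avg")  # [[3.75, 1.25], [1.5, 4.0]]
```

Both operations quarter the spatial size of the map, which is the source of the reduced computing needs noted above.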
4.1. YOLOv8 Model Overview
YOLOv8 is a computer vision model architecture developed by Ultralytics that is used for object detection, image classification, and instance segmentation tasks. YOLO is a series of models that has become famous in the computer vision world and has grown tremendously since the first version was released in 2015; subsequent releases have continued up to the 11 series.
The 8 series comprises five model sizes: nano (8n), small (8s), medium (8m), large (8l), and extra-large (8x). In this study, the YOLOv8s model is utilized. YOLOv8s was adopted because of its favorable accuracy–latency balance, enhanced detection head, and improved training pipeline relative to earlier YOLO versions, together with its reliable open tooling, which collectively support reproducible experiments on resource-constrained setups. As our focus is on domain transferability, we treat YOLOv8s as a strong baseline.
The algorithm structure consists of four main components: the input layer, the backbone, the neck, and the detection head. The backbone is the deep learning architecture that acts as a feature extractor for the road dataset images. The neck combines the features gathered from the different backbone module layers. The head, which produces the output of the object detection model, predicts the classes and bounding boxes of the objects.
Figure 2 shows the architecture of the YOLOv8 model.
4.2. Performance Metrics
Three metrics have been used to measure the performance of the various damage classification algorithms [
36]. These metrics are recall, precision, and mean Average Precision (mAP). The mathematical expressions defining each metric are given below.
Recall assesses the capability of the model to detect all relevant instances; it examines the completeness of the positive predictions. TP (true positives) denotes the number of correctly predicted positive cases, and FN (false negatives) denotes the number of actual positive cases that were incorrectly identified as negative. Recall is computed as Recall = TP/(TP + FN).
Precision assesses how well the model predicts positive outcomes; it evaluates how accurate the positive predictions are. FP (false positives) denotes the number of actual negative cases that were incorrectly reported as positive. Precision is computed as Precision = TP/(TP + FP).
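Using these definitions, both metrics follow directly from the raw counts. A minimal sketch in Python, with hypothetical TP/FP/FN counts chosen only to illustrate the arithmetic:

```python
# Precision and recall from raw detection counts for one damage class.

def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of actual positives that the model detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts for a single class (not measured values):
tp, fp, fn = 89, 3, 31
p = round(precision(tp, fp), 3)  # 0.967
r = round(recall(tp, fn), 3)     # 0.742
```

The guard against zero denominators matters in practice: a class with no predictions at all would otherwise raise a division error rather than report zero precision.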
Mean Average Precision (mAP) represents the mean value of the Average Precision (AP) scores computed across all different categories. The AP quantifies the area under the Precision–Recall (P–R) curve for each class, describing the relationship between precision and recall at different confidence thresholds. A higher mAP indicates superior overall detection performance. The standard evaluation practice compares AP values at an Intersection-over-Union (IoU) threshold of 0.5. The metric is expressed mathematically as:

mAP = (1/n) Σ_{k=1}^{n} AP_k

where AP_k is the average precision for the kth class, and n is the total number of classes.
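The mAP definition and the IoU matching criterion can be sketched as follows. The box coordinates are illustrative, and the per-class AP values are taken from the four-class Scenario 3 results purely for illustration.

```python
# Sketch of the metric above: mAP is the mean of the per-class AP
# values, and a prediction is matched to a ground-truth box when their
# Intersection-over-Union (IoU) meets the 0.5 threshold.
# Boxes are (x1, y1, x2, y2).

def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def mean_average_precision(ap_per_class):
    """mAP = (1/n) * sum(AP_k) over the n classes."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# A predicted box overlapping a ground-truth box by one third fails the
# IoU >= 0.5 criterion and would count as a false positive:
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))  # 1/3

# Per-class AP values from the four-class Scenario 3 configuration:
ap = {"D00": 0.759, "D20": 0.824, "D40": 0.732, "D44": 0.787}
map_50 = mean_average_precision(ap)  # 0.7755 (to 4 d.p.)
```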
Statistical uncertainty: For each metric, performance variability will be reported as mean ± standard deviation across repeated training seeds, together with bootstrap 95% confidence intervals obtained by resampling test images. Owing to the pilot scale of the RDD2024_SA dataset and computational limitations, the current results are based on single-run estimates and should therefore be interpreted as point values. This limitation is acknowledged, and future work will incorporate repeated-seed experiments to strengthen statistical reliability.
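As a sketch of the bootstrap procedure outlined above, the percentile method needs only the standard library. The per-image AP values here are invented placeholders, not results from the study.

```python
import random

# Percentile bootstrap: resample the per-image scores with replacement,
# recompute the mean each time, and read the 95% CI off the 2.5th and
# 97.5th percentiles of the resampled means.

def bootstrap_ci(scores, n_resamples=2000, seed=0):
    """95% confidence interval for the mean score."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

# Invented per-image AP@0.5 values standing in for a small test set:
per_image_ap = [0.91, 0.88, 0.95, 0.72, 0.83, 0.90, 0.67, 0.94, 0.86, 0.79]
lo, hi = bootstrap_ci(per_image_ap)  # interval brackets the sample mean
```

Resampling at the image level, as planned here, captures variability due to test-set composition; seed-to-seed training variability would additionally require the repeated-seed runs mentioned above.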
7. Discussion
Scenario 1: The object detection model's performance evaluation across six road damage types, namely, longitudinal cracks (D00), transverse/linear cracks (D01), alligator cracks (D20), pothole, separation and rutting defects (D40), white line blur (D43), and crosswalk blur (D44), shows consistently high accuracy, with the mean Average Precision (mAP) at an IoU threshold of 0.5 approaching perfect scores for most classes and precision and recall values well above 0.90. Notably, the model accurately recognized all real cases of transverse/linear cracks, alligator cracks, white line blur, and crosswalk blur, achieving perfect recall of 1.0 for these four of the six damage types. The pothole, separation and rutting class and the longitudinal crack class recorded the highest precision values of 1.0 and 0.998, respectively, with the fewest false positives. For the four damage types with perfect recall, the mAP values, which account for both false positives and false negatives and thus reflect the model's overall classification performance, exceeded 0.99, showing almost faultless localization and classification accuracy.
Although the model's precision was excellent, it had a comparatively poor recall of 0.742 for the pothole, separation and rutting class, suggesting that it missed several cases of this damage type despite its excellent overall performance. This implies that even if the model has a high degree of confidence in detecting potholes, separation, and rutting, it might need more training data or architectural adjustments to generalize more effectively to all instances of this distress class. In addition, longitudinal cracks showed a somewhat poorer recall value of 0.938 compared with their extremely high precision, suggesting that recall tuning is also necessary. Overall, however, the model shows strong and consistent road surface distress identification capabilities, especially for transverse/linear cracks, alligator cracks, white line blur, and crosswalk blur, which makes it feasible for practical implementation in pavement condition monitoring systems.
Scenario 2: Using the publicly accessible RDD2022_India dataset for training and the locally measured RDD2024_SA dataset from the Pretoria and Johannesburg areas for testing, Scenario 2 assesses the performance of the YOLOv8s model when big data acquired in one region is used for training and validation and the trained model is then applied for defect identification on small data from another region. Significant differences in detection performance between the damage classes are tabulated in
Table 5. Some of the distress classes, such as alligator cracks, failed to register any precision or recall at all, whereas white line blur and crosswalk blur had the greatest mAPs, at 0.605 and 0.408, respectively. Common longitudinal and transverse cracks were not well detected, as evidenced by the comparatively low mAP scores for D00 and D01 equal to 0.283 and 0.293, respectively. This is probably because different datasets have different image resolutions, lighting circumstances, and annotation styles. More unique features or comparable representations in both datasets might have benefited the best-performing distress classes, such as white line blur and crosswalk blur. All things considered, Scenario 2 draws attention to a key issue in this study concerning generalization: models developed using big data acquired in one region might not yield good results when applied to small data for damage identification in another region without specific modification or adjustment to fit certain conditions in the application region.
Scenario 3: Scenario 3 is an intra-dataset evaluation where the RDD2022_India dataset is the only dataset used to train and test the model. Unlike earlier experiments involving six classes, this configuration focuses on four dominant damage categories, D00, D20, D40, and D44, based on clearer class representation and annotation quality. The model performs better in this refined setup, achieving mAP@0.5 = 0.7755, Precision = 0.859, and Recall = 0.718 (
Table 8), a substantial improvement over the mAP of 0.156 obtained when all six classes were used (
Table 6).
The performance breakdown reveals mAP values of 0.759 for longitudinal cracks, 0.824 for alligator cracks, 0.732 for pothole/separation/rutting, and 0.787 for crosswalk blur. These results indicate more stable and reliable predictions across the retained classes. Removing D01 and D43 helped reduce annotation inconsistencies and intra-class variability that previously introduced noise during training. By simplifying the class taxonomy, the model was able to focus on clearer visual patterns and achieve more stable convergence.
These findings emphasize that refining class definitions and limiting training to well-annotated, consistently labeled categories can improve model generalizability in single-domain learning. Thoughtful class curation, therefore, plays a crucial role in enhancing feature learning and ensuring reliable performance in practical road-damage detection systems.
Scenario 4: Scenario 4, which involved training and validation exclusively on the RDD2024_SA dataset and testing on the RDD2022_India dataset, resulted in complete failure across all damage classes, recording zero or near-zero values for precision, recall, and mAP. This outcome is indicative of a significant domain shift between the two datasets. Key differences, such as road surface textures, damage patterns, lighting conditions, image resolution, and annotation styles, likely hindered the model's ability to generalize learned features from the small data to the big data environment. Another contributing factor is the limited scale and variability of the small data, which may not have captured a sufficient range of road damage scenarios to train a robust and generalizable feature extractor. The model may have overfitted to specific characteristics unique to the South African context, resulting in an inability to detect visually and structurally different damage types present in RDD2022_India. This scenario strongly reinforces the importance of training on diverse, representative, and sufficiently large datasets when building deep learning models for global or cross-regional deployment.
Scenario 5: The transfer-learning experiment applied fine-tuning to the YOLOv8s model originally trained on the RDD2022_India dataset using a 20% subset of RDD2024_SA images. This adaptation led to a substantial recovery in performance, with mAP@0.5 = 0.862, precision = 1.00, and recall = 0.88, compared with the direct cross-domain baseline of Scenario 2 (mAP = 0.32). The precision–recall curves (
Figure 11a) demonstrate stable class-level behavior, with consistently high reliability for D20, D40, D43, and D44. The normalized confusion matrix (
Figure 11b) further confirms reduced misclassification and improved inter-class separability after fine-tuning. However, D01 remains under-detected, which aligns with its sparse representation in both the Indian and South-African datasets. These results verify that even limited exposure to localized data allows the pretrained model to adjust its learned feature space, thereby closing a large portion of the performance gap created by domain shift.
From an engineering perspective, this outcome demonstrates the practical value of lightweight model adaptation for real-world deployment. Rather than retraining large models from scratch, municipalities and contractors can reuse pretrained networks from related environments and fine-tune them with a small, representative sample of local road imagery. This approach minimizes annotation cost, reduces computational demand, and ensures rapid scaling of AI-based pavement-inspection systems to new regions with distinct lighting, pavement materials, or camera configurations. The success of Scenario 5, therefore, highlights transfer learning as a sustainable and resource-efficient pathway for achieving robust cross-regional road-damage detection.
Collectively, the outcomes of Scenarios 1–5 demonstrate that dataset domain alignment, class balance, and moderate fine-tuning are decisive factors for achieving stable and transferable road-defect detection performance across heterogeneous environments.
From an engineering standpoint, these results emphasize that the deployment of road-defect detection systems must account for domain-specific conditions and data characteristics. Localized training delivers high accuracy within the region of acquisition, but performance can degrade substantially when applied to different environments without adaptation. In practice, modest fine-tuning using a small subset of local data, improved class balancing, and illumination normalization can significantly enhance reliability across diverse road and climatic conditions. These insights provide actionable guidance for transportation engineers and municipal maintenance teams integrating AI-based pavement inspection into real-world operations.
Recent studies using earlier YOLO frameworks have demonstrated the potential of deep learning for automatic pavement-distress detection. For instance, Jiang et al. [
39] proposed RDD-YOLOv5, a transformer-enhanced architecture with Gaussian Error Linear Units, achieving a mAP of 91.48%, about 2.5% higher than the baseline YOLOv5 in UAV-based inspection tasks. Similarly, Pham et al. [
40], using YOLOv7 augmented with coordinate attention, label smoothing, and ensemble techniques, obtained F1-scores of 81.7% (U.S. subset) and 74.1% (overall dataset) on Google Street View imagery. Earlier, Sarmiento [
41] presented YOLOv4-based detection and segmentation work on Philippine road images, which showed that even small, manually collected datasets can yield effective distress identification.
In comparison, the present YOLOv8s-based study achieved mAP@0.5 = 0.95 in the in-domain (RDD2024_SA) scenario and ≈ 0.78 after class reduction on RDD2022_India, performing within or above the ranges reported for earlier YOLO versions. More importantly, the Scenario 5 transfer-learning experiment demonstrated that fine-tuning the India-trained model using only 20% of the South-African dataset restored cross-domain performance to mAP@0.5 = 0.862, substantially higher than the direct cross-domain baseline (mAP = 0.32).
Unlike prior studies that focus primarily on architectural improvements, this work provides a cross-domain and transferability analysis, showing how dataset diversity, annotation style, class structuring, and limited fine-tuning govern generalization across regions. Hence, the study complements architecture-oriented research by offering an evidence-based understanding of domain alignment requirements for real-world, AI-enabled road-maintenance systems.
8. Conclusions
Comparative results across all five scenarios clearly demonstrate the importance of dataset–domain alignment, class structuring, and limited fine-tuning in determining the performance of deep-learning models for road-damage detection. With mAP@0.5 values ranging from 0.877 to 0.995 across all damage classes, Scenario 1, in which the YOLOv8s model was trained, validated, and tested solely on the Pretoria/Johannesburg (RDD2024_SA) dataset, achieved excellent accuracy. This confirms that context-specific localized datasets provide strong feature representations that support reliable detection of road defects. Conversely, Scenarios 2 and 4, which involved cross-domain testing, revealed severe degradation or complete detection failure, highlighting the difficulty of applying models trained in one region to another without adaptation.
Observed cross-domain degradation is primarily linked to variations in illumination and weather conditions, camera viewpoint, road pavement materials, annotation style, and inter-dataset class-distribution differences, all of which affect feature transferability. Scenario 3, which reduced class complexity from six to four dominant categories (D00, D20, D40, and D44), achieved improved performance (mAP@0.5 = 0.776, precision = 0.858, recall = 0.718), demonstrating that curated and consistent class definitions enhance model stability and generalization. Scenario 5, a transfer-learning configuration in which the India-trained model was fine-tuned on only 20% of the South-African dataset, achieved mAP@0.5 = 0.862, precision = 1.00, and recall = 0.88. This significant recovery underscores the engineering practicality of lightweight model adaptation for bridging domain gaps.
Overall, the findings establish that domain-aligned data preparation, targeted fine-tuning, and balanced class representation are decisive for developing robust, transferable, and resource-efficient AI systems for pavement-condition monitoring and sustainable road-maintenance management. Future work should incorporate interpretability methods such as class-activation mapping and feature-attribution analysis to ensure that YOLO-based road damage detectors operate transparently and provide engineering practitioners with insight into the visual cues and decision patterns that underlie each prediction.