1. Introduction
Efficient pavement distress assessment plays a vital role in modern road asset management, enabling informed infrastructure preservation, enhanced traffic safety, and cost-effective maintenance planning. Traditionally, pavement condition evaluation has relied on manual visual inspections, where trained personnel identify surface defects such as cracks, potholes, rut depths, and patches directly on site. While this method provides detailed observations, it suffers from several limitations: it is labor-intensive, time-consuming, subjective, and inconsistent, especially when applied to large-scale road networks [1,2]. Moreover, manual surveys expose inspectors to hazardous traffic conditions, making frequent, network-wide assessments impractical under limited resources [3,4].
To overcome these challenges, automated pavement condition survey systems have been introduced, integrating high-resolution imaging sensors, onboard computing, distance measurement tools, and Global Positioning System (GPS) modules to improve efficiency, accuracy, and consistency in pavement data collection [5,6]. Automated pavement inspection generally involves multiple stages, including pavement surface classification, distress detection and quantification, and condition rating based on the type and severity of detected distress [2]. In Cambodia, for example, the Road Measurement Data Acquisition System (ROMDAS) is employed by the Ministry of Public Works and Transport (MPWT) for pavement monitoring, as shown in Figure 1. ROMDAS utilizes downward-facing high-resolution cameras mounted on survey vehicles to continuously capture pavement images at driving speeds, as illustrated in Figure 2. The collected imagery is synchronized with GPS and distance measurements, allowing accurate georeferencing of pavement distress such as cracks, potholes, and patches. This approach has significantly enhanced Cambodia’s capability to maintain comprehensive and up-to-date pavement condition records, facilitating data-driven maintenance planning.
Although data acquisition technologies have advanced, the automated interpretation of captured images into pavement condition assessments remains crucial for effective maintenance planning and optimal repair strategies. Instead of relying on labor-intensive measurements, image processing methods enable the automated quantification of pavement distress severity. These quantified features can then be systematically converted into standardized evaluation indices such as the Pavement Condition Index (PCI) or the Maintenance Control Index (MCI), thereby supporting data-driven maintenance intervention [7,8]. In Japan, the MCI is widely adopted by road administrators to guide maintenance interventions according to pavement condition severity. This index is primarily derived from distress densities, such as the crack ratio, which require careful measurement to ensure accuracy. The definition and calculation of the MCI can be found in relevant references [9,10,11,12,13]. In recent years, advances in image processing have further improved this process by reducing subjectivity and enhancing consistency. However, early approaches that depended on handcrafted features and heuristic algorithms (e.g., thresholding, region-based segmentation) often failed to generalize across diverse pavement surface conditions and struggled to detect low-contrast features, such as fine cracks, against complex backgrounds [14,15,16,17].
Recent advances in deep learning have overcome many limitations of conventional methods in pavement distress detection by enabling models to learn discriminative features automatically from large annotated datasets [18,19,20,21]. Among deep learning-based detectors, two major approaches dominate the literature: region-based convolutional neural network (CNN) detectors and one-stage detectors such as You Only Look Once (YOLO). Region-based convolutional architectures, such as Faster R-CNN, have been widely applied to pavement distress detection, often incorporating ResNet and Feature Pyramid Network (FPN) backbones to improve feature extraction and enhance detection accuracy [22]. However, this approach still requires intensive feature computation when handling numerous proposals [23]. On the other hand, the YOLO model has gained popularity for its high inference speed, real-time processing capability, and robust performance in complex environments [3,24,25]. Additional improvements have been proposed by integrating sub-networks and advanced IoU-based loss functions to enhance detection precision [19,26]. Despite these advancements, such detection models typically yield bounding boxes rather than precise distress quantification, as they cannot provide pixel-level measurements of critical metrics such as crack width, length, or area, which are essential for severity assessment and maintenance decision-making [27,28].
To overcome this shortcoming, segmentation-based methods have been introduced to provide pixel-level contours of pavement distress. A key advantage over detection methods is their capability to obtain the actual area of distress with pixel-level precision [29]. This is attributed to the higher model complexity and richer parameterization of segmentation networks [30]. By distinguishing between distressed and non-distressed pixels, these methods enable fine-grained analysis, allowing a prediction label to be assigned to each individual pixel.
Among the most widely used deep learning architectures for pavement distress segmentation are convolutional neural network (CNN)-based models such as U-Net, DeepLabV3+, Mask R-CNN, SegNet, and domain-specific frameworks like CrackNet. U-Net, a classic U-shaped encoder–decoder architecture, efficiently localizes features using skip connections that fuse feature maps between the downsampling (encoder) and upsampling (decoder) paths, enabling faster processing and refined segmentation, as utilized by Li et al. [31] and adapted by Lau et al. [32] with a pre-trained ResNet-34 encoder to better distinguish distress regions from intact pavement. Another state-of-the-art approach is DeepLabV3+, an encoder–decoder model for pavement crack segmentation that commonly incorporates a pre-trained CNN encoder and an Atrous Spatial Pyramid Pooling (ASPP) module to capture multi-scale context, often enhanced with an attention mechanism to handle diverse environmental conditions, as demonstrated by Li et al. [33] and Sun et al. [34]. For instance segmentation, Mask R-CNN, an improved R-CNN variant, integrates image data with pixel-segmented annotations to precisely identify and position cracks, a method applied by Wang et al. [35] and modified by Dong et al. [36] to extract detailed crack properties. Alternatively, SegNet provides an efficient encoder–decoder structure for scene understanding tasks and crack segmentation in concrete and asphalt pavement, with Chen et al. [37] presenting a modified version capable of end-to-end, pixel-by-pixel training for arbitrary image sizes. Finally, models like CrackNet developed by Kyem et al. [38] integrate specialized modules and refinement operations to address the challenge of accurately segmenting tiny and subtle cracks across varying scales in high-resolution images.
Despite their documented success, a critical limitation shared by these traditional supervised models is their reliance on large volumes of meticulously annotated ground-truth data. Their generalization to real-world conditions is directly dependent on the quality and diversity of the training datasets. The necessity for extensive labeling introduces a significant bottleneck in practical deep learning applications for remote sensing image segmentation [13]. Xu et al. [39] proposed an end-to-end framework for the automatic detection and segmentation of tunnel cracks, which balances annotation cost with performance by enhancing YOLOv8. Although the model effectively detects and segments thin cracks and ignores complex backgrounds, the annotation process remains labor-intensive. Specifically, the pixel-level labeling required for segmentation, which involves drawing precise polygons, can still take several minutes per image. Moreover, the inherent variability, resolution differences, and environmental complexity of remote sensing data make the labeling process even more demanding [40]. Consequently, developing segmentation models that can operate effectively with minimal labeled data offers the potential for reductions in annotation costs and time.
Recent breakthroughs in foundation models introduce a paradigm shift that minimizes the need for extensive manual labeling while ensuring better generalization to unseen images. The Segment Anything Model (SAM), a foundation model from Meta AI Research, significantly advances image segmentation by leveraging generalized learning across diverse, massive datasets. This training allows SAM to establish a robust pre-training objective, covering a wide array of applications and often outperforming other models in complex or noisy environments [41,42]. SAM offers powerful zero-shot segmentation capabilities, generating high-fidelity masks from simple input prompts (e.g., points, bounding boxes) without task-specific training [43]. Nonetheless, SAM is not a complete solution for pavement distress analysis; its performance is contingent on receiving accurate prompts, and it can underperform on slender, low-contrast features like cracks without further refinement [44].
Furthermore, if the image includes relevant metadata, such as the coordinates shown in the ROMDAS image, it is crucial to extract this information in addition to any identified distress. Optical Character Recognition (OCR) technology plays a crucial role in retrieving text from different types of documents, including handwritten content, and converting it into digital text [45]. As an OCR engine, Tesseract has become an efficient and reliable tool for recognizing and extracting text from images [46]. For instance, Tesseract OCR has found extensive use in applications such as reading letters and numbers on vehicle license plates when paired with deep learning detection systems [47].
This study bridges this critical gap by proposing a lightweight framework that synergizes the strengths of object detection, foundation models, and metadata parsing to create an end-to-end solution for pavement distress analysis. Our methodology leverages a YOLOv8 model to automatically generate high-quality bounding box prompts for distress regions. These prompts are then fed into SAM to produce initial pixel-level segmentation masks. To address SAM’s limitations with thin cracks, we introduce a local refinement module to enhance mask accuracy. Finally, to enable geospatial localization, the Tesseract OCR engine is employed to extract embedded GPS coordinates directly from the survey imagery, linking each quantified distress to a precise geographical location. The proposed framework is validated using an open-source pavement image dataset from Yamanashi, Japan. The results demonstrate robust performance in automated detection, precise segmentation, accurate quantification of distress dimensions, and seamless geospatial mapping. By integrating advanced deep learning and foundation segmentation models, this work provides a scalable, efficient, and comprehensive tool for automated pavement condition assessment, contributing to data-driven infrastructure management practices.
2. Methods
2.1. Proposed Framework
The objective of this research is to develop and implement an end-to-end framework for automated pavement distress detection and quantification. In addition to quantitative assessment, this process also aims to localize the distress on a map for improved visualization and informed decision-making by road agencies.
Figure 3 illustrates the process in this study, which begins with data preparation, where images from a vehicle-mounted camera are transformed from a horizontal perspective to a top-down view. This transformation, known as Inverse Perspective Mapping (IPM), uses a homography to map pixels from the camera’s perspective onto a different 2D coordinate frame, creating a bird’s-eye view of the scene [48,49]. After this transformation, the images are cropped to a 1:1 aspect ratio to prepare them for input to the subsequent detection and segmentation models.
A two-step process based on a modified YOLOv8 and the SAM was proposed to handle distress detection and segmentation. The dataset of pavement cracks was divided into training and validation sets, and each image was manually annotated to create ground truth data for both detection and segmentation. The first step involves using YOLOv8 for pavement crack detection, which classifies and localizes cracks by generating bounding boxes around them. These bounding boxes serve as prompts for the second step, which utilizes the SAM for crack segmentation. By using the bounding boxes from the first step as prompts and local refinement, the proposed method can segment the detected pavement cracks with greater purpose and accuracy, isolating the exact pixels that constitute the distress.
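To make the two-step interaction concrete, the following minimal Python sketch shows how detector output can be converted into box prompts for SAM. The weight filenames, the image path, and the assumed class ordering are illustrative placeholders, not the exact artifacts produced in this study; the sketch assumes the Ultralytics and segment_anything packages.

```python
import cv2
import numpy as np
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

# Hypothetical weight and image file names, for illustration only.
detector = YOLO("yolov8n_cracks.pt")                    # YOLOv8n fine-tuned on the crack classes
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("pavement_topdown.jpg"), cv2.COLOR_BGR2RGB)

# Step 1: detect cracks; each predicted bounding box becomes a prompt.
result = detector(image)[0]
boxes = result.boxes.xyxy.cpu().numpy()                 # (N, 4) in pixel coordinates
classes = result.boxes.cls.cpu().numpy().astype(int)

# Step 2: prompt SAM with each box to obtain pixel-level masks.
predictor.set_image(image)
label_mask = np.zeros(image.shape[:2], dtype=np.uint8)
for box, cls in zip(boxes, classes):
    masks, _, _ = predictor.predict(box=box, multimask_output=False)
    label_mask[masks[0]] = cls + 1                      # class-to-index mapping is an assumption
```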
Simultaneously, a region containing pixel-based coordinate digits, typically in a blue header, is extracted using Tesseract OCR. This enables the system to save the latitude and longitude alongside the identified cracks. Since these coordinate digits are standardized and associated with the ROMDAS, the pre-trained Tesseract OCR model can be used directly without retraining, unlike scenarios involving distorted or irregular text (e.g., license plate or traffic signs). Once the entire process is complete, pavement cracks are automatically detected, their severity quantified, and their location mapped onto a geospatial platform. This integrated approach allows road agencies to efficiently manage and implement timely maintenance planning.
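A minimal sketch of the coordinate-extraction step is shown below. It assumes the header occupies a fixed strip at the top of the image and that the latitude and longitude appear as decimal numbers; the strip height and the regular expression are illustrative assumptions rather than the exact ROMDAS layout.

```python
import re
import cv2
import pytesseract

def read_header_coordinates(image_path, header_height=40):
    """Read latitude/longitude digits from the blue header strip of a ROMDAS-style image.
    The header height (in pixels) and the text pattern are assumptions."""
    img = cv2.imread(image_path)
    header = img[:header_height, :]                                  # crop the top strip with the text
    gray = cv2.cvtColor(header, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    text = pytesseract.image_to_string(binary, config="--psm 7")     # treat as a single text line
    numbers = re.findall(r"[-+]?\d+\.\d+", text)                     # e.g., "11.562345" and "104.912345"
    if len(numbers) >= 2:
        return float(numbers[0]), float(numbers[1])
    return None
```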
2.2. Top-Down Perspective Transformation Based on Homography
The transformation from a camera’s perspective view to a metrically accurate top-down (bird’s-eye) view is fundamentally governed by homography, a projective transformation matrix. This process corrects for perspective distortion and yields a rectified image with a uniform scale, making it suitable for subsequent metric analysis, such as measuring actual object sizes or distances. The core of this transformation is the homography matrix, which maps points from one plane to another.
As illustrated in Figure 4, four reference points were manually selected on the pavement region. These points form a trapezoidal region due to perspective distortion caused by the oblique camera angle. Under an ideal bird’s-eye view, however, these same points would form a perfect rectangle, demonstrating the geometric deformation present in the raw images. The homography matrix mathematically maps the pixel coordinates from the original image plane to their corresponding positions on a rectified, top-down plane. The homography matrix H encapsulates this transformation and is defined as:

$$ s \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} $$

where H is the 3 × 3 homography matrix:

$$ H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} $$

H is estimated using the four pairs of corresponding points $(x_i, y_i) \leftrightarrow (x'_i, y'_i)$, $i = 1, \dots, 4$. In this study:
$(x_i, y_i)$ = manually selected source points in the original pavement image (see Figure 4, left).
$(x'_i, y'_i)$ = destination points on the target rectangular plane (see Figure 4, right).
For implementation, the homography matrix H was computed in a Python (3.12) script utilizing cv2.getPerspectiveTransform from the OpenCV library [50]. The four corner points of the trapezoidal pavement region were manually defined. Their corresponding destination points were calculated based on a known physical scale (pixels per meter) to create a rectangular output image of specified real-world dimensions (e.g., 3.0 m × 3.8 m). The perspective warping was then performed using cv2.warpPerspective, generating a rectified pavement image with a uniform scale. This rectified view eliminates perspective distortion, providing a metrically accurate representation of the pavement for further analysis. All the images used in this study, including those for training, validation, and testing, underwent this transformation process prior to being input into the crack detection and segmentation framework.
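The following short sketch illustrates this OpenCV workflow; the source pixel coordinates and the 200 px/m scale are placeholder assumptions rather than the values used in the study.

```python
import cv2
import numpy as np

# Source points: four manually selected corners of the trapezoidal pavement region
# (pixel coordinates below are illustrative placeholders).
src = np.float32([[420, 610], [860, 610], [1180, 980], [80, 980]])

# Destination: a 3.0 m x 3.8 m rectangle at an assumed scale of 200 px per metre.
px_per_m = 200
w, h = int(3.0 * px_per_m), int(3.8 * px_per_m)
dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

H = cv2.getPerspectiveTransform(src, dst)         # 3x3 homography matrix
img = cv2.imread("frame.jpg")
topdown = cv2.warpPerspective(img, H, (w, h))     # rectified bird's-eye view with uniform scale
cv2.imwrite("frame_topdown.jpg", topdown)
```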
2.3. YOLOv8 Model for Pavement Crack Detection
YOLOv8 represents a state-of-the-art advancement in single-stage object detection. While retaining the backbone-neck-head design paradigm established in earlier YOLO models, it introduces several architectural refinements that improve both accuracy and computational efficiency. The model processes an input image in a single forward pass, simultaneously predicting bounding boxes, object classes, and confidence scores.
The backbone is responsible for hierarchical feature extraction. It employs convolutional layers with cross-stage partial (CSP) connections and bottleneck structures, which progressively reduce spatial resolution while expanding channel depth. This design produces a multi-scale feature hierarchy that captures both fine-grained details and high-level semantic information. The neck aggregates these multi-scale features through an enhanced feature fusion strategy. Building upon the Feature Pyramid Network (FPN), YOLOv8 incorporates a Path Aggregation Network (PAN), which adds a complementary bottom-up pathway to the FPN’s top-down flow [51,52]. This bi-directional information exchange enriches the feature maps with both semantic and localization cues, thereby improving detection robustness across objects of different sizes. The head executes the final detection tasks. Unlike earlier coupled designs, YOLOv8 adopts a decoupled head structure that separates classification (object categories) from regression (bounding box coordinates). This reduces interference between the two objectives, stabilizes training, and yields more accurate predictions. The output includes the bounding box coordinates, classes, and confidence scores.
Compared to two-stage detectors such as Faster R-CNN, which rely on a Region Proposal Network (RPN) followed by classification, YOLOv8 performs detection in a single stage [53]. This streamlined design significantly reduces computational overhead and enables real-time inference without sacrificing detection quality. The complete architecture of YOLOv8 is depicted in Figure 5.
2.4. Architecture of SAM for Pavement Crack Segmentation
Traditional segmentation models typically require training for specific tasks and are limited in their applicability across different domains. SAM by Meta AI has gained attention for its impressive zero-shot performance and capability to produce high-quality object masks from diverse input prompts. The primary advantage of SAM over other state-of-the-art segmentation models lies in its ability to generalize across a wide range of tasks without task-specific fine-tuning. This adaptability makes SAM a versatile tool, especially when high accuracy is needed across diverse datasets. SAM functions as a class-agnostic segmentation model, utilizing a Vision Transformer (ViT) for image encoding and a sophisticated two-layer mask decoder. SAM’s architecture features an image encoder with ViT to extract detailed embeddings, a prompt encoder to interpret various user inputs, and a lightweight mask decoder for precise pixel-level segmentation decisions. This design enables SAM to effectively adapt to new segmentation tasks with minimal additional training, ensuring high accuracy.
The architecture of SAM, as shown in Figure 6, primarily consists of three components: the image encoder (a Vision Transformer, ViT), the prompt encoder, and the mask decoder. SAM utilizes the ViT as the image encoder, serving as the backbone network for image feature extraction. The choice of the ViT architecture stems from its ability to capture long-range dependencies and intricate visual information. The prompt encoder processes user inputs (e.g., boxes) to guide the segmentation process. Specifically, it is based on box encoding, where user-drawn bounding boxes are encoded to provide spatial constraints for segmentation. The mask decoder integrates features from both the ViT backbone and the prompt encoder to generate segmentation masks. The decoder comprises feature fusion, upsampling layers, and mask generation.
The pre-training of SAM was performed on a large-scale and diverse dataset (SA-1B), which comprises over one billion finely annotated segmentation masks across roughly eleven million images, covering various image types and segmentation tasks. The pre-training process adopts a multitask learning strategy, combining Cross-Entropy loss and Dice loss to optimize segmentation accuracy. The objective of pre-training was to enable SAM to learn rich feature representations, allowing it to generalize to different segmentation tasks and application scenarios without the need for retraining. However, the conventional SAM does not perform well on pavement distress because such distress is thin and irregular, unlike the natural objects that SAM was trained on. Therefore, SAM is fine-tuned with the bounding box prompt to create a pavement distress-specific segmentation model, called modified SAM. To reduce the computational cost, the image encoder is frozen, while only the prompt encoder and the mask decoder are fine-tuned for pavement distress segmentation. Ground-truth masks are generated for top-down pavement images using polygon-based annotations created in LabelMe. The model was trained for 200 epochs. The loss was the summation of the Dice loss and the cross-entropy loss, which together provide a robust objective for segmentation tasks. The Adam optimizer was used, and several learning rates were evaluated.
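A minimal PyTorch sketch of the combined Dice and cross-entropy objective used for fine-tuning is shown below; the equal weighting of the two terms and the smoothing constant are assumptions. In this setup, the loss would be applied to the mask logits produced by SAM’s mask decoder against the binary ground-truth masks.

```python
import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """Hybrid segmentation loss: binary cross-entropy on logits plus soft Dice.
    A sketch of the combined objective described above; weighting is an assumption."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.smooth = smooth

    def forward(self, logits, target):
        bce = self.bce(logits, target)
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(-2, -1))
        union = prob.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
        dice = 1.0 - (2.0 * inter + self.smooth) / (union + self.smooth)
        return bce + dice.mean()
```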
3. Results
3.1. Dataset Preparation
MPWT has recently introduced the ROMDAS for road network assessment. However, due to limited server capacity and the lack of dedicated software tools for image-based analysis, high-resolution pavement images captured by ROMDAS are generally retained only for a short period before being deleted. In contrast, conventional outputs such as the International Roughness Index (IRI), Rut Depth, and Mean Profile Depth (MPD), which require far less storage space, are systematically preserved. Moreover, the ROMDAS pavement image available from a small number of survey routes in Cambodia does not provide sufficient coverage of surface distress for comprehensive analysis.
To address these limitations, a synthetic dataset was created that emulates the characteristics of ROMDAS imagery. Specifically, pavement surface regions were cropped from Cambodian ROMDAS survey images while retaining the blue header that contains embedded coordinate information. These cropped regions were then replaced with pavement surface images captured in Yamanashi, Japan, using an action camera [54]. This strategy ensured that the resulting dataset maintained the perspective and visual style of ROMDAS data while expanding the availability of distress cases required for training and validation. Nonetheless, as this synthesis primarily demonstrates the feasibility of the proposed framework, variations in pavement texture, illumination, and imaging conditions between the two sources may influence the model’s generalization to purely local datasets. Therefore, retraining or fine-tuning with larger, fully local datasets is recommended for practical deployment.
In total, 1260 images were prepared for model development and randomly divided into two subsets: 80% for training and 20% for validation. Additionally, a separate set of 700 unseen images was used for independent testing to assess model generalization performance. The dataset was annotated into three primary categories of cracks (transverse, longitudinal, and pattern cracks), using the LabelMe tool. Polygonal annotations were subsequently converted into binary ground truth masks to facilitate segmentation tasks.
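The conversion from LabelMe polygon annotations to binary ground-truth masks can be sketched as follows, relying only on the standard LabelMe JSON fields (imageWidth, imageHeight, shapes); rasterizing one crack class at a time is an illustrative choice.

```python
import json
import numpy as np
from PIL import Image, ImageDraw

def labelme_to_mask(json_path, class_name):
    """Rasterize LabelMe polygon annotations of one crack class into a binary mask."""
    with open(json_path) as f:
        ann = json.load(f)
    mask = Image.new("L", (ann["imageWidth"], ann["imageHeight"]), 0)
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        if shape["label"] == class_name and shape["shape_type"] == "polygon":
            points = [tuple(p) for p in shape["points"]]
            draw.polygon(points, outline=1, fill=1)   # mark annotated pixels as 1
    return np.array(mask, dtype=np.uint8)
```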
3.2. Model
The proposed framework integrates two models: YOLOv8 for crack detection and SAM for segmentation. YOLOv8 was chosen for its simplified training and deployment through the Ultralytics library and for its anchor-free detection mechanism, which allows faster inference without compromising accuracy. The YOLOv8n variant, in particular, was selected due to its lightweight architecture, making it suitable for deployment on limited hardware. As reported by Li and Gu [55], comparative experiments were conducted using multiple YOLOv8 versions (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x) for road crack detection. Their results showed that YOLOv8n achieved competitive accuracy with significantly lower inference time and reduced hardware requirements compared with the larger models; therefore, it was selected as the baseline in their study and further modified to improve accuracy and efficiency. Despite its compact design, YOLOv8n achieves higher accuracy than comparable lightweight models such as MobileNet-SSD, which rely on older CNN backbones. Furthermore, YOLOv8 supports deployment formats such as TensorRT, enabling performance optimization on embedded NVIDIA Jetson platforms and ensuring stable operation in continuous field environments.
For segmentation, SAM was incorporated to generate fine-grained crack masks from bounding box prompts. SAM is a transformer-based vision model designed for general-purpose segmentation, and in this study, its ViT-H backbone was fine-tuned using a transfer learning strategy. The image encoder and prompt encoder were frozen to preserve SAM’s feature extraction capability, while the mask decoder was optimized for pavement crack imagery. To improve the accuracy of segmenting thin and irregular crack structures, a hybrid loss function combining Binary Cross-Entropy (BCE) loss and Dice loss was employed. In addition, to refine segmentation quality when bounding box prompts were overly broad, a simple crop-and-refine local refinement was applied to the test dataset, operating on image patches [56,57]. This refinement module crops local regions around the bounding boxes, performs localized corrections based on error maps, and then reinserts the refined patches into the global mask.
3.3. Training Setup
All experiments were conducted on a Windows 11 operating system with PyCharm as the development environment. To ensure reproducibility, the code and experimental configurations used for both training and testing are publicly accessible on GitHub at https://github.com/Nut-Sovanneth/YOLOv8-SAM-OCR.git (accessed on 11 November 2025). Model training and validation were performed on an NVIDIA GeForce RTX 4070 Ti GPU. The optimizer was set to automatic selection, enabling adaptive tuning of optimization parameters during training. Parameters quantify the learnable elements within a model, reflecting its complexity and capacity. The detailed hyperparameter configurations for network training are outlined in Table 1.
YOLOv8 was trained for 200 epochs with an input size of 640 × 640 pixels and a batch size of eight, using pretrained weights (YOLOv8n.pt). SAM was fine-tuned for 200 epochs with a batch size of one image, an input size of 1024 × 1024 pixels, and pretrained weights (sam_vit_h_4b8939.pth). For YOLOv8, the initial learning rate was set to 0.01 and gradually decreased to near zero following a cosine schedule, where an initial warm-up phase stabilizes early training and the non-linear decay gradually reduces the learning rate toward the end of training. For SAM, learning rates of 0.001, 0.0001, and 0.00001 were evaluated based on sensitivity analyses to balance training stability, convergence speed, and computational efficiency, ensuring optimal mask segmentation performance while maintaining feasible resource usage. Both models were trained on the same three classes of cracks: transverse, longitudinal, and pattern cracks.
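A minimal Ultralytics training call reflecting these settings might look like the sketch below; the dataset YAML path is a placeholder assumption.

```python
from ultralytics import YOLO

# Sketch of the detector training run with the hyperparameters listed above.
model = YOLO("yolov8n.pt")
model.train(
    data="pavement_cracks.yaml",   # three classes: transverse, longitudinal, pattern (path is assumed)
    epochs=200,
    imgsz=640,
    batch=8,
    lr0=0.01,                      # initial learning rate
    cos_lr=True,                   # cosine decay toward near zero
)
```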
Hyperparameters such as image size, batch size, and learning rate were carefully selected based on sensitivity analyses to balance training stability, convergence speed, and computational efficiency.
3.4. Evaluation Metrics
The performance of the YOLOv8 detection model was evaluated using Precision, Recall, and mean Average Precision (mAP). Precision measures the proportion of correctly identified cracks among all positive predictions, while Recall reflects the proportion of actual cracks successfully detected by the model. Average Precision (AP) corresponds to the area under the precision–recall curve for each crack type. The mAP at IoU = 0.5 (mAP50) quantifies performance when a 50% overlap is required between predicted and ground truth bounding boxes, whereas mAP50–95 averages AP across IoU thresholds from 0.5 to 0.95 in increments of 0.05, providing a more comprehensive measure of detection accuracy. YOLOv8 was optimized using a composite loss function consisting of box regression loss (box_loss), classification loss (cls_loss), and distribution focal loss (dfl_loss), which together improved bounding box localization, crack classification, and boundary refinement. These evaluation metrics can be calculated as follows:

$$ Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN} $$

$$ AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i $$

Here, TP (true positives) refers to correctly detected cracks, FP (false positives) represents background regions incorrectly identified as cracks, and FN (false negatives) denotes missed cracks. N represents the number of crack classes (e.g., transverse, longitudinal, and pattern cracks).
During the YOLOv8 training, key evaluation metrics, including Precision, Recall, mAP50, and mAP50–95, were monitored at each epoch, as shown in Figure 7. All metrics exhibited a gradual improvement with increasing iterations. By the end of training, Precision, Recall, mAP50, and mAP50–95 reached 0.704, 0.694, 0.733, and 0.468, respectively, indicating that YOLOv8 achieves strong performance in detecting pavement cracks.
Figure 8 illustrates the Precision–Recall curves for YOLOv8. The horizontal axis represents Recall, and the vertical axis corresponds to Precision. Figure 8 also reports the AP values for each crack class as well as the mAP across all classes. Specifically, the areas under the curves for transverse, longitudinal, and pattern cracks are 0.75, 0.763, and 0.712, respectively, demonstrating that YOLOv8 effectively identifies different types of pavement cracks.
The performance of SAM was evaluated using Precision, Recall, F1-score, and Intersection over Union (IoU). The F1-score, defined as the harmonic mean of Precision and Recall, offers a balanced indicator of segmentation accuracy and robustness, especially in the case of imbalanced datasets. IoU quantifies the geometric overlap between the predicted mask ($P$) and the ground truth mask ($G$), thereby evaluating the fidelity of predicted crack boundaries:

$$ F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}, \qquad IoU = \frac{|P \cap G|}{|P \cup G|} $$
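For reference, these pixel-level metrics can be computed directly from binary masks, as in the following NumPy sketch (the small epsilon guarding against division by zero is an implementation detail, not part of the metric definitions).

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Pixel-level Precision, Recall, F1, and IoU for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    iou = tp / (tp + fp + fn + 1e-9)
    return precision, recall, f1, iou
```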
For the SAM training process, metrics such as Precision, Recall, F1-score, and IoU were tracked over the same 200 epochs using three distinct learning rates: 0.001, 0.0001, and 0.00001. As depicted in Figure 9, all metrics improved progressively as the number of iterations increased. Among the tested learning rates, 0.0001 yielded the best performance, with Precision, Recall, F1-score, and IoU values of 0.938, 0.947, 0.942, and 0.891, respectively, suggesting that this configuration allowed the model to capture crack features more accurately than the other learning rates. Although full convergence was not reached within the 200 epochs, the attained F1-score of 0.942 demonstrates robust segmentation performance, as values above 0.9 are widely regarded as indicative of highly accurate prediction, whereas scores below 0.5 typically reflect substantial misclassification [44]. The implication is that while marginal gains might be realized with extended training, the current performance is operationally sufficient and robust for the application domain.
To assess practical deployment feasibility, we measured model size, computational complexity, and inference performance. The YOLOv8-based detector contains 3.01 million parameters and requires 8.2 GFLOPs per 640 × 640 input image, achieving an average inference time of 3.65 ms per image, corresponding to 274 FPS. In contrast, the fine-tuned SAM with the ViT-H backbone contains 631.6 million parameters and requires 2733.6 GFLOPs for 1024 × 1024 inputs, with an inference time of 431.7 ms per image, corresponding to 2.32 FPS.
For each connected component within the coarse segmented mask, an expanded region of interest (ROI) was extracted from the original input image using direct array slicing. This hard-crop operation supplies the refinement process with a high-resolution contextual patch centered on the target component, without introducing interpolation artifacts. SAM is then applied to the isolated patch, using a translated bounding-box prompt, to produce a high-fidelity local segmentation. The refined local masks are subsequently mapped back into global coordinates and composited into the final output mask. This local refinement procedure improves boundary precision and detail while maintaining consistency across the entire image domain. The inference results on unseen images from the combined YOLOv8 and SAM framework are visualized in Figure 10, where transverse, longitudinal, and pattern cracks are indicated by blue, green, and red bounding boxes and masks, respectively.
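A simplified sketch of this crop-and-refine step is given below, reusing a SamPredictor as in the earlier pipeline sketch; the fixed margin around each bounding box is an assumption, and the error-map-based correction is omitted for brevity.

```python
import numpy as np

def crop_and_refine(image, coarse_mask, box, predictor, margin=32):
    """Refine one connected component: crop an expanded ROI around its bounding box,
    re-run SAM on the patch with a translated box prompt, and paste the result back."""
    x1, y1, x2, y2 = box
    h, w = image.shape[:2]
    cx1, cy1 = max(0, x1 - margin), max(0, y1 - margin)
    cx2, cy2 = min(w, x2 + margin), min(h, y2 + margin)

    patch = image[cy1:cy2, cx1:cx2]                           # hard crop, no interpolation
    local_box = np.array([x1 - cx1, y1 - cy1, x2 - cx1, y2 - cy1])

    predictor.set_image(patch)
    masks, _, _ = predictor.predict(box=local_box, multimask_output=False)

    refined = coarse_mask.copy()
    refined[cy1:cy2, cx1:cx2] = masks[0].astype(coarse_mask.dtype)  # map back to global coordinates
    return refined
```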
Upon completion of the two-stage process, the segmented mask can be quantified by counting the pixels, allowing computation of distress density, as described in [58]. Simultaneously, a region containing pixel-level coordinates is extracted using Tesseract OCR to enable geospatial mapping of the detected and segmented distress, as illustrated in Figure 11. The OCR reliably recognizes the coordinate digits at the pixel level, producing digital values that accurately correspond to the road alignment when imported into a GIS.
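The quantification and georeferencing step can be summarized by a short sketch such as the following, where the pixel scale comes from the top-down rectification and the coordinates come from the OCR step; the output field names are illustrative.

```python
import numpy as np

def quantify_distress(mask, px_per_m, lat, lon):
    """Convert a binary crack mask into physical quantities and attach the OCR-read coordinates.
    Assumes the mask comes from the rectified top-down image with a known pixel scale."""
    crack_px = int(mask.astype(bool).sum())
    area_m2 = crack_px / (px_per_m ** 2)           # physical area of distressed pixels
    density = crack_px / mask.size                 # crack ratio over the surveyed patch
    return {"latitude": lat, "longitude": lon,
            "crack_area_m2": round(area_m2, 4),
            "crack_ratio": round(density, 4)}
```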
4. Discussion
The cornerstone of the proposed framework lies in the integration of efficient pavement crack detection using YOLOv8n and high-fidelity segmentation using SAM. The choice of the lightweight YOLOv8n model as the baseline detector was deliberate, prioritizing an optimal balance between computational efficiency and detection efficacy. The results presented in the previous section demonstrate that YOLOv8n achieves satisfactory performance given its compact size and high inference speed. This efficiency is particularly critical for large-scale deployment by road agencies.
However, pavement distress datasets inherently pose challenges such as thin cracks or complex pavement textures, which can lead to false positives, missed detections, and reduced accuracy in the baseline YOLOv8n. To address these limitations, potential improvements could involve modifications to the backbone and neck structures to strengthen feature extraction and multi-scale representation. Such architectural refinements would enable the model to better capture fine-grained crack details more effectively, minimize redundant computation, and enhance localization accuracy across varying spatial scales.
For the segmentation task, the SAM component delivered exceptional results following fine-tuning. This high fidelity in mask generation is essential for accurate quantification of crack severity, which underpins efficient maintenance planning. Two complementary loss functions, including BCE and Dice loss, were employed during training. BCE promotes precise pixel-level classification, while Dice loss enhances overlap quality between predicted and ground-truth masks. This hybrid objective facilitates robust boundary delineation and improves the model’s ability to generalize across diverse crack morphologies. The fine-tuned SAM’s mask decoder, optimized under this combined loss, effectively captured both global structure and local boundary details.
As summarized in Table 2, a comparative evaluation on the Crack500 dataset demonstrated that the proposed SAM notably outperformed several state-of-the-art segmentation frameworks. In particular, the SAM attains the highest IoU of 76.69%, demonstrating its strong capability in accurately delineating crack regions. Although its Precision (79.07%) and F1-score (79.62%) are marginally lower than those of W-segnet, which achieved 79.86% and 80.70%, respectively, the SAM maintains highly competitive results across these metrics. Furthermore, its Recall (80.53%) is comparable to that of W-segnet (81.56%) and the modified DeepLabv3+ (80.00%) and remains close to the top-performing DeepCrack (89.82%). These findings confirm that fine-tuning SAM with appropriate loss functions and learning configurations significantly enhances segmentation accuracy for pavement crack analysis.
5. Conclusions
This study introduces a novel, integrated framework for end-to-end pavement crack detection and segmentation. The proposed methodology leverages the YOLOv8 model for efficient crack detection and a modified Segmentation Anything Model (SAM) for high-precision, pixel-level segmentation. This two-step approach addresses the significant challenge of high labeling costs by utilizing bounding box prompts from the detection phase to guide the segmentation process.
The framework’s effectiveness was validated using a real-world, open-source dataset from Yamanashi, Japan. The YOLOv8-based detection model demonstrated robust performance with a Precision of 0.704, a Recall of 0.694, and mean Average Precision (mAP) scores of 0.733 (at an IoU of 0.5) and 0.468 (averaged from 0.5 to 0.95). These metrics confirm the model’s capacity for crack detection on smart road-measuring devices with minimal computational requirements.
A key contribution of this research is the integration of SAM, which offers significant advantages in reducing annotation burdens. Unlike traditional segmentation models that are limited to pre-trained classes and require extensive annotated datasets, the modified SAM approach leverages promptable segmentation: once fine-tuned, it can delineate crack boundaries on new, unseen images guided only by bounding-box prompts, without additional pixel-level annotation, a capability that conventional task-specific segmentation models do not offer. The segmentation results, with a Precision of 0.938, a Recall of 0.947, an F1-score of 0.942, and an IoU of 0.891, underscore the model’s precision in identifying the exact pixels that constitute a crack.
To further illustrate the performance of the proposed framework, a comparative evaluation was conducted on the Crack500 dataset. The SAM-based approach achieved the highest IoU of 76.69%, indicating superior ability in delineating crack regions. While its Precision (79.07%) and F1-score (79.62%) are slightly lower than W-segnet, the SAM remains highly competitive. Its Recall (80.53%) is comparable to other leading methods, such as W-segnet and DeepLabv3+, and approaches the performance of DeepCrack.
Furthermore, the framework integrates Optical Character Recognition (OCR) to automatically extract geospatial coordinates, enabling the automated mapping of detected distress onto real-world road networks. This end-to-end solution provides a quantitative assessment of distress, including its type, severity, and extent, which is essential for calculating the Maintenance Control Index (MCI) and facilitating efficient maintenance planning for road agencies.
Despite the demonstrated effectiveness of the proposed framework, several challenges remain. For instance, some cracks obscured by shadows exhibit reduced identifiability, as illustrated in Figure 10. Environmental factors, including uneven lighting and motion blur caused by vehicle movement, can adversely affect model accuracy. Moreover, the complexity of real-world road scenes, such as the presence of debris or irregular pavement textures, may result in missed detections or incomplete segmentations, even when detection is successful. Future work will aim to enhance the model’s robustness by refining the YOLO and SAM architectures with advanced attention mechanisms to better handle complex and noisy environments, as well as by improving the local refinement strategy, for example, through the incorporation of techniques such as RoIAlign.